Lecture 13 — AI Topic II: Token Economics & Project ROI¶

Session Info

Date: Monday, 2026-06-15 · Time: 16:20–18:10 · Room: YF108 · Instructor: Dr. Zhijiang Chen · Session 13 of 16 (last content lecture)

▶ Open the lecture slides — 110-minute web deck

Using the slide deck

The slides are a self-contained web presentation (Reveal.js, no install). Keyboard shortcuts: F for full-screen, S for speaker view, Esc for slide overview, arrow keys to navigate. Designed for a single 110-minute session. Lectures 14–15 are student presentations; Lecture 16 is the final exam — today is the last new material.

Learning Objectives¶

By the end of this lecture, you should be able to:

Decompose any LLM bill into its three token components — input, output, cached — and explain why output dominates and cache wins.
Read the 2026 pricing landscape for the major frontier models and choose a model by task tier, not by brand.
Quantify agent fan-out: estimate the 6–13 LLM calls behind a single user request and the $0.20–$8 per-task cost band that results.
Apply the three-tier caching stack — prompt cache, semantic cache, KV cache — and estimate hit rate by workload pattern.
Build a payback-period analysis for an AI initiative using industry-median benchmarks (4–14 months by use case).
Synthesize the course into a single decision framework (when to apply FP, COCOMO II, NPV/IRR, sensitivity, and AI cost modelling) and recognise the structure of the upcoming final exam in Lecture 16.

Lecture Map — Six Movements, One Bill, One Course Recap¶

Today is the last new content of the semester. Sections I–V give you the missing piece — the modern AI ledger expressed as dollars per task, not dollars per million tokens. Section VI wraps the whole course in five lines and previews the final exam.

§	Topic	Minutes
I	The token cost stack — input / output / cached	15
II	The 2026 LLM pricing landscape	10
III	Agent cost multipliers — 3–10× a chatbot	15
IV	Caching strategies — where 40–90% savings hide	20
—	Discussion: which cache wins for your project?	10
V	Payback periods by industry — 2026 benchmarks	15
VI	Course wrap-up & final-exam preview	20
—	HW13, Q&A	5

Prerequisite from Lecture 2 and Lecture 12

This lecture assumes you remember the AI-era ledger lines from Lecture 2 (§ VI) and the productivity-adjusted COCOMO from Lecture 12. Today we put dollars on those lines and connect them back to the NPV / payback machinery from Lectures 5–6.

§ I. The Token Cost Stack — Three Components, Three Prices¶

The cost of an LLM call is not a single number. It is a sum of three priced quantities, each measured per million tokens, each behaving differently as your system scales. Engineers who quote "we use Sonnet, it's $3/M" are quoting one of three numbers (the input price, which is a fifth of the output price) and then are surprised at the bill.

The three token types¶

Type	What it is	Relative cost (2026)	Why it costs what it costs
Input tokens	Everything you send to the model — system prompt, user message, retrieved context, tool definitions.	1× base	The model must read every token. Cost scales linearly with context length.
Output tokens	Everything the model generates — response text, tool calls, reasoning traces.	3–5× input	Generation is inherently more compute-intensive than reading — each output token requires a fresh forward pass.
Cached input tokens	Input tokens already seen, served from the provider's prompt cache.	~0.1× input (≈ 90 % cheaper)	The expensive KV-state computation has already been done; the provider replays it.

Why output is 3–5× input — a one-paragraph intuition¶

When the model reads input, it processes the whole sequence in parallel: one forward pass over $n$ tokens. When the model generates output, it must produce each token sequentially: one forward pass per token, and each pass attends back to the entire prior context. The arithmetic is roughly $O(n)$ for input and $O(n \cdot m)$ for output of length $m$ — and at production scale, the GPU time per output token is several times that of an input token. Providers price what they pay for.

Why cache is ~90 % cheaper¶

A prompt cache stores the attention key/value tensors for a long prefix (system prompt, codebase, document) so that the next call with the same prefix skips the expensive prefill. The provider has already paid the compute cost once and amortises it across reuses. For workloads where 80% of every request is the same 30K-token system prompt, this is the single largest line-item optimisation available.
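To make the three-component sum concrete, here is a minimal sketch of a per-call cost function. The `Pricing` class and `call_cost` function are hypothetical names introduced for illustration, and the prices are illustrative values chosen to match the 1× / 5× / 0.1× ratios above, not a quote from any vendor. The example reuses the 30K-token system prompt that makes up 80% of each request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Prices in US dollars per million tokens (MTok)."""
    input_per_mtok: float
    output_per_mtok: float
    cached_per_mtok: float


def call_cost(pricing: Pricing, fresh_input: int, cached_input: int, output: int) -> float:
    """Sum the three priced components of a single LLM call."""
    return (
        fresh_input * pricing.input_per_mtok
        + cached_input * pricing.cached_per_mtok
        + output * pricing.output_per_mtok
    ) / 1_000_000


# Illustrative prices (assumption): output at 5x input, cached input at 0.1x input.
example = Pricing(input_per_mtok=3.00, output_per_mtok=15.00, cached_per_mtok=0.30)

# A 30,000-token stable system prompt plus 7,500 fresh tokens, 500 tokens generated.
cold = call_cost(example, fresh_input=37_500, cached_input=0, output=500)
warm = call_cost(example, fresh_input=7_500, cached_input=30_000, output=500)
print(f"cold call (no cache hit): ${cold:.4f}")
print(f"warm call (prefix cached): ${warm:.4f}")
```

Under these assumed prices the warm call costs roughly a third of the cold one, and almost all of the difference comes from the cached prefix.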

The cost structure rewards three product decisions¶

Design choice	Why it lowers the bill
Short outputs	Output is the expensive multiplier. A 200-token reply carries one tenth of the output charge of a 2,000-token one; the input charge is unchanged.
Structured outputs (JSON schema, function calls)	Forces brevity; eliminates filler tokens; makes downstream parsing free.
Stable, repeated prompts	Triggers prompt cache. A system prompt that changes on every request cannot be cached.

Working principle for this lecture

Your model's pricing is the boundary of your product economics. Design the product to live inside that boundary, or the boundary will design the product for you.

§ II. The 2026 LLM Pricing Landscape — Shop Before You Commit¶

The frontier-model market in 2026 is competitive enough that the cheapest viable option for a given task is usually 10–50× cheaper than the most capable. There is no single right answer; there is a task-tiered answer.

The major models, side by side (mid-2026)¶

Model	Input $/MTok	Output $/MTok	Cached input	Position
Claude Opus 4.7	$15.00	$75.00	$1.50	Frontier reasoning, complex agents
Claude Sonnet 4.6	$3.00	$15.00	$0.30	Workhorse — most production agents
GPT-4o	$2.50	$10.00	$1.25	General-purpose chat, multimodal
GPT-4o mini	$0.15	$0.60	$0.075	High-volume classification, routing
Gemini 2.5 Pro	$3.50	$10.50	$0.50	Long context (1M+), grounded search

Two observations worth memorising.

GPT-4o mini's input price is 1% of Opus 4.7. Routing easy traffic to a small model is the single highest-leverage architectural decision in a 2026 AI product.
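Cached-input prices run from 10% of fresh input (both Claude models) to 50% (GPT-4o and GPT-4o mini), with Gemini 2.5 Pro at about 14%. Where the discount is 90%, the gap between teams that cache well and teams that don't approaches an order of magnitude on the cached share of the bill.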

Mix tiers by task difficulty — the standard production pattern¶

Mature production stacks rarely use one model for everything. The dominant pattern is a three-tier router:

Tier	Model class	Use for	Share of traffic
Cheap	GPT-4o mini, Haiku-class	Classification, routing, simple Q&A, schema extraction	60–80 %
Workhorse	Sonnet 4.6, GPT-4o	Tool use, multi-step reasoning, drafting	15–30 %
Frontier	Opus 4.7, Gemini 2.5 Pro	Hardest 5%, escalations, safety-critical	2–10 %

A team that runs everything on Opus is paying roughly 10× what a tiered stack would cost for the same product: at a 70/25/5 traffic split, the blended input list price is about $1.60 per MTok against $15 for Opus. A team that runs everything on mini is shipping wrong answers on the hard 5%. The competitive engineering choice is the router, not the model.
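The comparison between an all-frontier stack and a tiered one can be checked with a traffic-weighted average. The sketch below is illustrative only: the 70/25/5 split is a hypothetical mix inside the ranges in the table, the tier prices are the input list prices of GPT-4o mini, Sonnet 4.6, and Opus 4.7 from § II, and it assumes every tier sees the same number of tokens per request.

```python
# Input list prices in $/MTok, taken from the pricing table in this section.
INPUT_PRICE = {"cheap": 0.15, "workhorse": 3.00, "frontier": 15.00}


def blended_input_price(traffic_share: dict[str, float]) -> float:
    """Traffic-weighted input price.

    Assumes equal token volume per request in every tier (a simplification).
    """
    total = sum(traffic_share.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError(f"traffic shares must sum to 1, got {total}")
    return sum(INPUT_PRICE[tier] * share for tier, share in traffic_share.items())


mix = {"cheap": 0.70, "workhorse": 0.25, "frontier": 0.05}  # hypothetical split
blended = blended_input_price(mix)
ratio = INPUT_PRICE["frontier"] / blended
print(f"blended input price: ${blended:.3f}/MTok")
print(f"all-frontier costs {ratio:.1f}x the tiered stack")
```

The multiple is sensitive to the frontier share: the small slice of traffic sent to the frontier tier contributes as much to the blended price as the whole workhorse tier in this mix.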

Prices fluctuate

The numbers above are mid-2026 list prices. Vendor pricing has moved at roughly 30–50% per year since 2023, almost always downward. Re-check before committing a procurement; build your cost model with a price_per_mtok variable, not a constant.
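One way to keep price as a variable is to pass it through a small projection function. This is a sketch under a strong assumption: that the historical 30–50% annual decline continues at a constant compound rate, which nothing guarantees. `projected_price` is a hypothetical helper, not part of any vendor SDK.

```python
def projected_price(price_per_mtok: float, annual_decline: float, years: float) -> float:
    """Compound a constant annual price decline over a number of years."""
    if not 0.0 <= annual_decline < 1.0:
        raise ValueError("annual_decline must be in [0, 1)")
    return price_per_mtok * (1.0 - annual_decline) ** years


# Sensitivity band for a $3.00/MTok input price two years out (assumed decline rates).
for decline in (0.30, 0.50):
    future = projected_price(3.00, decline, years=2)
    print(f"decline {decline:.0%}/yr -> ${future:.2f}/MTok after 2 years")
```

Treat the two results as the ends of a sensitivity band for a multi-year cost model, not as a forecast.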

§ III. Agent Cost Multipliers — Why One User Request Becomes Ten LLM Calls¶

A chatbot makes one LLM call per user message. An agent makes 6–13. This is the single largest source of cost surprise in 2026 AI deployments.

The five phases of an agent task¶

Phase	Typical calls	What happens
Planning	1–2	Decompose the user request into sub-steps; choose a strategy.
Tool selection / arg-building	1–3	Pick which APIs or tools to invoke; generate well-formed arguments.
Execution & iteration	2–5	Run tool, inspect result, decide next step. This is where loops happen.
Verification	1–2	Re-check the answer for correctness, format, or policy violations.
Response synthesis	1	Compose the user-facing answer.
Typical total	6–13	Per user task

An unconstrained agent — one with no max-iteration cap, no model routing, and no caching — costs $5–8 per task on a frontier model. Budgets that estimate "1 call × $0.01 = $0.01 per request" are wrong by 500–800×. Most AI cost overruns in 2025–2026 trace back to this single mismodelling.
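The 6–13 range is simply the column sum of the phase table, and the 500–800× figure is the ratio of the two per-task costs. A short sketch that reproduces both (the dictionary and function names are illustrative):

```python
# (min, max) LLM calls per phase, copied from the table above.
PHASE_CALLS = {
    "planning": (1, 2),
    "tool selection / arg-building": (1, 3),
    "execution & iteration": (2, 5),
    "verification": (1, 2),
    "response synthesis": (1, 1),
}

fan_out_low = sum(low for low, _ in PHASE_CALLS.values())
fan_out_high = sum(high for _, high in PHASE_CALLS.values())
print(f"fan-out: {fan_out_low} to {fan_out_high} calls per task")


def underestimate_factor(actual_per_task: float, budgeted_per_task: float) -> float:
    """How many times larger the real per-task cost is than the budgeted one."""
    return actual_per_task / budgeted_per_task


# The $0.01 budget line versus the $5 to $8 unconstrained-agent band.
for actual in (5.00, 8.00):
    print(f"${actual:.2f} actual vs $0.01 budgeted: {underestimate_factor(actual, 0.01):.0f}x")
```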

Worked example — One user task, Claude Sonnet 4.6, full cost¶

A mid-complexity agent task: a developer asks the assistant to "investigate why test X is flaky and propose a fix." The agent reads the test, searches the codebase, runs the test three times, verifies its hypothesis, and writes a response.

Call	Input tokens	Output tokens	Cost (Sonnet 4.6)
Plan	5,000	500	$0.0225
Tool #1 — read test file	12,000	200	$0.0390
Tool #2 — codebase search	8,000	400	$0.0300
Iterate (3 calls)	30,000	1,200	$0.1080
Verify	10,000	300	$0.0345
Synthesize	12,000	800	$0.0480
Total per task	77,000	3,400	$0.282

The arithmetic: input is $77{,}000 \times \$3 / 10^6 = \$0.231$; output is $3{,}400 \times \$15 / 10^6 = \$0.051$; sum $\approx \$0.282$. Note that output is only 4.2% of the token count (3,400 of 80,400) but 18% of the cost, because of the 5× output multiplier.
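The table can be reproduced in a few lines, which is a useful habit for your own project's cost model. The sketch below hard-codes the Sonnet 4.6 list prices and the token counts from the table; the variable and function names are illustrative.

```python
INPUT_PER_MTOK = 3.00    # Sonnet 4.6 input list price, $/MTok
OUTPUT_PER_MTOK = 15.00  # Sonnet 4.6 output list price, $/MTok

# (label, input tokens, output tokens), one row per line of the table.
CALLS = [
    ("Plan", 5_000, 500),
    ("Tool #1: read test file", 12_000, 200),
    ("Tool #2: codebase search", 8_000, 400),
    ("Iterate (3 calls)", 30_000, 1_200),
    ("Verify", 10_000, 300),
    ("Synthesize", 12_000, 800),
]


def cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one row at the list prices above (no caching)."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000


total_in = sum(i for _, i, _ in CALLS)
total_out = sum(o for _, _, o in CALLS)
total_cost = sum(cost(i, o) for _, i, o in CALLS)

output_token_share = total_out / (total_in + total_out)
output_cost_share = cost(0, total_out) / total_cost

for label, i, o in CALLS:
    print(f"{label:<26} ${cost(i, o):.4f}")
print(f"total: {total_in:,} in, {total_out:,} out, ${total_cost:.3f}")
print(f"output share: {output_token_share:.1%} of tokens, {output_cost_share:.1%} of cost")
```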

Scaling to a real product¶

At 100,000 tasks per month (a mid-market SaaS):

\[ \text{Monthly inference} = 100{,}000 \times \$0.282 = \$28{,}200 \]

\[ \text{Annual inference} = 12 \times \$28{,}200 = \$338{,}400 \]

This is inference only — it excludes hosting, observability, vector DB, and human verification. A realistic all-in budget is roughly 1.5–2× this number. And we have not yet turned on caching, which can cut the inference line by 50–70%.

The most common 2026 cost-model error

Per-token cost × tokens-per-call is not per-task cost. Multiply by fan-out (6–13 for agents) and retry rate (1.1–1.4× for production resilience). Skipping fan-out alone produces an estimate that is low by roughly an order of magnitude; skipping retries understates it by a further 10–40%.
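The per-task formula and the scaling step from this section fit in two small functions. This is a sketch: `per_task_cost` and `annual_inference` are hypothetical names, and the average per-call cost is derived from the worked example above ($0.282 spread over its 8 calls).

```python
def per_task_cost(per_call_cost: float, fan_out: float, retry_rate: float = 1.0) -> float:
    """Per-task cost = per-call cost x fan-out x retry rate."""
    return per_call_cost * fan_out * retry_rate


def annual_inference(per_task: float, tasks_per_month: int) -> float:
    """Annual inference spend for a steady monthly task volume."""
    return per_task * tasks_per_month * 12


avg_call = 0.282 / 8  # average per-call cost in the Sonnet 4.6 worked example

for retry in (1.0, 1.1, 1.4):
    task = per_task_cost(avg_call, fan_out=8, retry_rate=retry)
    yearly = annual_inference(task, tasks_per_month=100_000)
    print(f"retry x{retry:.1f}: ${task:.3f} per task, ${yearly:,.0f} per year")
```

With retries ignored (rate 1.0) the function returns the $338,400 annual figure above; the 1.1 and 1.4 rows show how far production retries push the same workload.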

§ IV. Caching Strategies — Where 40–90 % of Savings Hide¶

Caching is the highest-leverage optimisation available to an AI product team. Three distinct tiers exist; they compose; and the gap between teams that stack all three and teams that stack none is roughly 5–10× on the monthly bill.

The three caching tiers¶

Tier	What it reuses	Where it lives	Cost / latency savings
① Prompt cache	Identical input prefixes (system prompt, large context, retrieved docs).	Provider-managed (Anthropic, OpenAI, Google all support it).	80–90 % off cached tokens.
② Semantic cache	Responses for similar (not identical) requests, matched by embedding distance.	Application-side; small vector DB.	40–70 % on cacheable workloads.
③ KV cache reuse	The attention key/value state across decode steps.	Provider-internal; partially exposed via batch APIs.	75 % latency cut; modest cost savings.

They compose — prompt cache reduces fresh-input cost; semantic cache eliminates the call entirely when an existing response will do; KV reuse cuts latency. A workload routed through all three pays a fraction of the naive bill.
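Of the three tiers, the semantic cache is the one you build yourself, so a toy version may help fix the idea. The sketch below is not a production design: `toy_embed` is a bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector DB, and the 0.8 similarity threshold is an arbitrary assumption you would tune against real traffic.

```python
import math
from collections import Counter
from typing import Optional


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    """Return a stored response when a new query is similar enough to an old one."""

    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries: list[tuple[Counter, str]] = []

    def get(self, query: str) -> Optional[str]:
        q = toy_embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            return best[1]  # cache hit: no LLM call needed
        return None         # cache miss: caller invokes the model, then put()s

    def put(self, query: str, response: str) -> None:
        self.entries.append((toy_embed(query), response))


cache = SemanticCache(threshold=0.8)
cache.put("how do i reset my password", "Use the 'Forgot password' link on the sign-in page.")
hit = cache.get("how do i reset my password please")
miss = cache.get("what is your refund policy")
print("similar query:", hit)
print("unrelated query:", miss)
```

The threshold is the design decision that matters: set it too low and users receive answers to questions they did not ask, set it too high and the hit rate collapses.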

Hit rate by workload pattern¶

Caching savings depend on cache hit rate, which depends on the shape of your traffic. Below are typical ranges from 2026 production deployments.

Workload pattern	Typical cache hit %	Resulting cost reduction
Customer support — repeated FAQs	60–80 %	50–70 %
Code assistant — repeated codebase context	70–90 %	60–80 %
Search / chat — unique queries	15–30 %	10–25 %
Document analysis — long shared prefix, varying tail	85–95 %	70–85 %
Agentic workflow — repeated planning prompts	60–80 %	50–70 %

Read this table conservatively. Use the lower bound for budgeting. If you happen to land at the upper bound in production, that is a pleasant variance; if you assumed the upper bound and land at the lower bound, you have under-quoted the project.

A simple model — what caching does to the worked example¶

Recall the $0.282 per-task cost from § III. Suppose we add prompt caching with a 70% hit rate on input tokens. The new input cost per task is:

\[ 0.30 \times 77{,}000 \times \$3/10^6 + 0.70 \times 77{,}000 \times \$0.30/10^6 \approx \$0.0693 + \$0.0162 \approx \$0.0855 \]

Output cost is unchanged at $0.051. New per-task total: $0.137, a 51% reduction from $0.282. At 100K tasks/month, annual savings ≈ $174K. One engineer-month of prompt-cache engineering routinely pays back in under 30 days.
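The same calculation as a function of hit rate, so you can substitute the lower bound from the hit-rate table for your own workload. This is a sketch using the Sonnet 4.6 list prices; `cached_task_cost` is a hypothetical name, and the model assumes the hit rate applies uniformly to all input tokens.

```python
def cached_task_cost(
    input_tokens: int,
    output_tokens: int,
    hit_rate: float,
    input_price: float = 3.00,    # $/MTok, fresh input
    cached_price: float = 0.30,   # $/MTok, cached input
    output_price: float = 15.00,  # $/MTok, output
) -> float:
    """Per-task cost when a fraction hit_rate of input tokens is served from cache."""
    fresh = (1.0 - hit_rate) * input_tokens * input_price
    cached = hit_rate * input_tokens * cached_price
    out = output_tokens * output_price
    return (fresh + cached + out) / 1_000_000


baseline = cached_task_cost(77_000, 3_400, hit_rate=0.0)
with_cache = cached_task_cost(77_000, 3_400, hit_rate=0.70)
reduction = 1.0 - with_cache / baseline
annual_savings = (baseline - with_cache) * 100_000 * 12

print(f"baseline ${baseline:.3f}, with cache ${with_cache:.4f}")
print(f"reduction {reduction:.1%}, annual savings ${annual_savings:,.0f}")
```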

The single most under-used optimisation in 2026

If your AI product does not use prompt caching, that is almost certainly the highest-ROI engineering ticket on your backlog. Estimate the savings, write the ticket, ship it this quarter.

Class Discussion — Which Caching Tier Wins for Your Project?¶

Open floor, ~10 minutes. In pairs (4 min), then share.

For your group-project AI feature:

Categorise your workload — which of the five patterns in § IV does it most resemble? FAQs? Code context? Unique queries? Long shared prefix? Agentic?
Estimate hit rate — pick a defensible number from the table's range, and write down why you picked the low end vs. the high end.
Compute monthly savings — using your token-spend estimate from Homework 12, what would each caching tier save?
Three sanity questions the slide deck poses:
- What fraction of your requests share a long input prefix? That is your prompt-cache opportunity.
- Could two requests safely share the same response? That is your semantic-cache opportunity.
- What experiment would prove your hit-rate assumption before committing engineering effort? (Hint: a one-day log analysis usually settles it.)
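A one-day log analysis of the kind hinted at can be very small. The sketch below estimates the prompt-cache opportunity as the share of tokens that repeat the leading prefix of the previous request. It rests on several assumptions: whitespace splitting stands in for a real tokenizer, requests are compared only with their immediate predecessor, and the sample prompts are invented. Real provider caches have minimum prefix lengths and expiry windows that this ignores.

```python
def shared_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading tokens two token lists have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def estimate_prefix_hit_rate(prompts: list[str]) -> float:
    """Fraction of all input tokens that repeat the previous request's prefix."""
    total = 0
    reusable = 0
    previous = None
    for prompt in prompts:
        tokens = prompt.split()  # crude stand-in for a real tokenizer
        total += len(tokens)
        if previous is not None:
            reusable += shared_prefix_len(previous, tokens)
        previous = tokens
    return reusable / total if total else 0.0


# Invented sample log: a fixed system prompt followed by a varying user message.
logs = [
    "SYSTEM you are a support bot USER where is my order",
    "SYSTEM you are a support bot USER reset my password",
    "SYSTEM you are a support bot USER cancel my plan",
]
rate = estimate_prefix_hit_rate(logs)
print(f"estimated prefix hit rate: {rate:.1%}")
```

Run something like this on a day of real prompts before picking a number from the table; the first request in any log never hits, so short samples understate the steady-state rate.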

Three pairs will share which tier they'd implement first and why.

§ V. Payback Periods by Industry — 2026 Benchmarks¶

The single most important number in any AI business case is the payback period: how many months of net savings cover the build cost. Industry data from 2026 surveys (VentureBeat's 1,100-engineer-and-CTO study; Digital Applied's 100+ ROI data points; McKinsey's AI ROI tracker) shows striking variation by use case.

Median payback by use case¶

Use case	Median payback	Why fast / slow
Customer support	4.1 months	High labour-cost displacement, narrow domain, easy to measure deflection.
Marketing operations	6.7 months	High volume of low-criticality work; easy to A/B test.
Sales enablement	7.5 months	Conversion lift offsets cost; harder attribution.
Engineering productivity	9.3 months	Senior-engineer verification overhead eats a big share of the gain.
Compliance / legal	14 months	High accuracy bar; many false positives require human triage.

The shape of the table is the lesson. Payback is fastest where (a) AI displaces high-marginal-cost human labour, (b) the domain is narrow enough to verify cheaply, and (c) errors are tolerable. Payback is slowest where AI must clear a high accuracy bar in a broad domain — exactly where the human verification cost dominates.

The 18-month rule of thumb

For your group project: an AI feature with payback greater than 18 months is rarely funded in the current capital environment. If your model lands beyond 18 months, you have two levers — halve the cost (cheaper model, caching, smaller scope) or double the value (broader rollout, higher-leverage use case). Pick one and rebuild.

Worked mid-market case — A 2.5-month payback¶

A hypothetical SaaS company: 8,000 monthly support tickets, $18.40 average resolution cost (loaded agent salary, tooling, overhead). The company deploys an AI support agent that deflects 34% of tickets (industry-typical for a well-trained domain-tuned bot).

Item	Monthly amount
Tickets deflected ($8{,}000 \times 0.34$)	2,720
Resolution cost saved ($2{,}720 \times \$12.20$ net)	+$33,184
AI infrastructure (tokens + observability + LLM ops)	−$3,800
Monthly net benefit	+$29,384
One-time build cost	$72,000
Simple payback = $72{,}000 / \$29{,}384$	≈ 2.5 months
Discounted payback @ MARR = 12 % (monthly $i = 0.95\%$)	≈ 2.5 months

The arithmetic uses Lecture 6's payback definition. Note that the per-ticket saving is about a third less than the headline $18.40: the $12.20 figure assumes a partial deflection (some tickets still escalate to a human, who now starts mid-conversation), which is the honest way to model it. Optimists pencil in $18.40 and quote a 1.6-month payback ($72,000 ÷ ($50,048 − $3,800)); the project ships and they discover the real number is 2.5–3 months.
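Both payback figures can be computed from the table's inputs. The sketch below is one possible implementation, not the Lecture 6 reference solution: it assumes end-of-month cash flows, an effective monthly rate derived from the annual MARR, and linear interpolation inside the recovery month. The function names are illustrative.

```python
def simple_payback(build_cost: float, monthly_net: float) -> float:
    """Months of undiscounted net benefit needed to recover the build cost."""
    return build_cost / monthly_net


def discounted_payback(build_cost: float, monthly_net: float, annual_marr: float) -> float:
    """Months needed when each month's benefit is discounted at the monthly MARR."""
    if monthly_net <= 0:
        raise ValueError("monthly net benefit must be positive")
    i = (1 + annual_marr) ** (1 / 12) - 1  # effective monthly rate
    if i > 0 and monthly_net / i <= build_cost:
        raise ValueError("discounted benefits never recover the build cost")
    cumulative, month = 0.0, 0
    while True:
        month += 1
        pv = monthly_net / (1 + i) ** month  # present value of this month's benefit
        if cumulative + pv >= build_cost:
            # Interpolate linearly within the month in which recovery happens.
            return (month - 1) + (build_cost - cumulative) / pv
        cumulative += pv


tickets, deflection = 8_000, 0.34
net_saving_per_ticket, infra = 12.20, 3_800
monthly_net = tickets * deflection * net_saving_per_ticket - infra

simple = simple_payback(72_000, monthly_net)
discounted = discounted_payback(72_000, monthly_net, annual_marr=0.12)
print(f"monthly net benefit: ${monthly_net:,.0f}")
print(f"simple payback: {simple:.2f} months, discounted: {discounted:.2f} months")
```

With a payback this short, discounting moves the answer by only a few hundredths of a month; the gap widens quickly for the 9–14 month use cases in the benchmark table.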

Why this case pays back so fast — it satisfies all three conditions: AI displaces high-cost labour ($18.40/ticket); the domain is narrow (this product's support); errors are recoverable (human escalation path exists). A compliance-screening AI with the same build cost would pay back 5–8× more slowly because verification swallows the gain.

Working principle for AI ROI

When AI pays back fast, it is almost always because it displaces high-marginal-cost labour at high volume. When it pays back slowly, it is almost always because a human still has to verify each output.

§ VI. Course Wrap-Up — Thirteen Sessions in Five Lines¶

Today is the last new content of the semester. Lectures 14–15 are student presentations; Lecture 16 is the final exam. So this is the right moment to compress thirteen lectures into something portable.

The whole course in five lines¶

Every software decision is an economic decision. Boehm's seven-step framework applies at any scale — from a one-week feature to a five-year platform.
Cost lives in a four-axis taxonomy (fixed/variable × direct/indirect × non-recurring/recurring × capex/opex) across five lifecycle phases. 60–80% of lifetime cost lives after launch — Boehm's iceberg, restated every year since 1981.
Time value of money is the calculus of comparison. NPV is the verdict; IRR, payback, and PI are the footnotes. Equivalence is the operational definition under all of them.
Sensitivity tells you which assumption to research; risk analysis tells you the probability of being wrong. Tornado diagrams, Monte Carlo, decision trees — different tools for the same question.
The AI era changes the coefficients, not the equations. Same Boehm framework, recalibrated for tokens, agent fan-out, caching, and a re-allocated productivity curve (Lecture 12).

Working principle for the rest of your career

If you walk away with only those five lines, you can defend any software decision in any room you'll enter for the next decade.

Final exam preview — Lecture 16, Thursday 18 June¶

Section	Points	Material
Multiple choice / short answer	20	Definitions, frameworks, intuitions from all 13 lectures. ~20 questions, 1 pt each.
Computation	50	PV/FV, NPV/IRR, equivalence, sensitivity, FP, COCOMO II, token-cost arithmetic. ~5 problems.
Integrative case	30	A realistic project — full economic analysis including an AI cost dimension. One long question.

Logistics. Closed book. One A4 cheat sheet, single-sided, handwritten, permitted. Non-programmable calculator. 110 minutes, room TBA. Bring a pencil and a backup pen.

What I will not test. Memorisation of factor-table values (you'll be given any non-trivial factor). Memorisation of 2026 vendor pricing (the test gives the prices). The names of every paper in the references. Concepts and computation are the test.

How to study. Re-do Homework 1, 4, 6, and 12 from scratch on a blank sheet. If you can finish each in 30 minutes with the right answer and a clean CFD, you are ready. The integrative case is built from those four shapes plus the AI-cost overlay from today.

Homework 13 — Final Project Polish¶

Due: Tuesday 2026-06-16, before Lecture 14 (your team's presentation). Late: not accepted — presentation slot is fixed.

Finalise your group-project presentation deck. Required slides:
- Scope — what the project is, who the user is, the decision being made.
- FP / COCOMO II estimate — effort, duration, cost; reference Lectures 9–10.
- Cash flow + two alternatives — CFD for the recommended option and one realistic comparator; Lecture 4.
- NPV / IRR / Payback / PI — at your stated MARR; Lectures 5–6.
- Sensitivity analysis — tornado diagram on the three biggest assumptions; Lecture 7.
- AI-cost dimension (bonus, +5 pts) — token spend, fan-out, caching savings, AI-feature payback; today's lecture.
- Recommendation — one slide, one sentence at the top.
Rehearse the talk to the clock: 12 min presentation + 5 min Q&A + 3 min peer evaluation. Going long is the most common preventable failure.
Submit the final project report PDF to submissions/PROJECT/<team-name>/final-report.pdf before the start of Lecture 14.

Office hours. Tomorrow 10:00–11:30 (final pre-presentation slot) and Wednesday 13:00–14:00 (after Lecture 15, for exam questions).

Class Discussion Prompts — Wrap-Up¶

Open floor, ~10 minutes (overflow from the cache discussion).

Pick one AI-era cost line from today (tokens, fan-out, cache, verification, payback gap). Argue in one paragraph whether your group project's recommendation survives that cost line being 2× worse than your central estimate.
Re-read the five-line course summary above. Which line do you find most useful — and which do you find least convincing? Defend in plain English.
Looking forward five years: which AI-cost line do you expect to dominate in production by 2031 — inference, verification, evaluation, compliance, or something not yet on the ledger? Make a falsifiable prediction.

Key Vocabulary — Quick Reference¶

Term	Plain-language meaning
Input tokens	Tokens the model reads (prompt + context). Priced at the base rate.
Output tokens	Tokens the model generates. Priced at 3–5× input.
Cached input tokens	Input tokens replayed from a prompt cache. Priced at ~10% of input.
Prompt cache	Provider-managed cache of input prefixes; 80–90% off cached tokens.
Semantic cache	Application-side cache that reuses responses for similar requests.
KV cache	Provider-internal reuse of attention state; primarily a latency win.
Agent fan-out	Number of LLM calls per user task — typically 6–13 for an agent.
Per-task cost	Per-call cost × fan-out × retry rate. The only honest unit of AI cost.
Cache hit rate	Fraction of input tokens served from cache. Depends on workload shape.
Deflection rate	Fraction of human-bound work absorbed by the AI. Drives the savings line.
Verification overhead	Engineer or analyst time spent reviewing AI output. The hidden cost killer.
Payback period	Months of net benefit needed to recover the build cost; Lecture 6 definition.
Three-tier router	Cheap / workhorse / frontier model selection by task difficulty.

What Today Bought You¶

Token cost = input + output (3–5× input) + cache (~10% of input). Output dominates; cache wins.
Agents fan out 6–13×. Per-call cost is not per-task cost. Most cost overruns trace to this confusion.
Caching is the highest-ROI optimisation available. 40–90% savings in production is standard, not exceptional.
Payback depends on how much human labour you displace and how cheaply. 4–14 months by industry is the 2026 band; >18 months rarely funds.
The whole course compresses to five lines. Boehm still holds; only the coefficients have moved.

This is the last content lecture of Software Engineering Economics, Summer 2026. The vocabulary you now share — NPV, IRR, sensitivity, payback, tokens, fan-out, cache — is the lingua franca of every engineering economics discussion you will join for the next decade. Use it well.

Prepared by Dr. Zhijiang Chen — Frostburg State University, Summer 2026.
