Lecture 13 — AI Topic II: Token Economics & Project ROI¶
Session Info
Date: Monday, 2026-06-15 · Time: 16:20–18:10 · Room: YF108 Instructor: Dr. Zhijiang Chen · Session 13 of 16 — last content lecture
▶ Open the lecture slides — 110-minute web deck
Using the slide deck
The slides are a self-contained web presentation (Reveal.js, no install). Keyboard shortcuts: F for full-screen, S for speaker view, Esc for slide overview, arrow keys to navigate. Designed for a single 110-minute session. Lectures 14–15 are student presentations; Lecture 16 is the final exam — today is the last new material.
Learning Objectives¶
By the end of this lecture, you should be able to:
- Decompose any LLM bill into its three token components — input, output, cached — and explain why output dominates and cache wins.
- Read the 2026 pricing landscape for the major frontier models and choose a model by task tier, not by brand.
- Quantify agent fan-out: estimate the 6–13 LLM calls behind a single user request and the \(0.20–\)8 per-task cost band that results.
- Apply the three-tier caching stack — prompt cache, semantic cache, KV cache — and estimate hit rate by workload pattern.
- Build a payback-period analysis for an AI initiative using industry-median benchmarks (4–14 months by use case).
- Synthesize the course into a single decision framework: when to apply FP, COCOMO II, NPV/IRR, sensitivity, and AI cost modelling — and recognise the structure of next week's final exam.
Lecture Map — Six Movements, One Bill, One Course Recap¶
Today is the last new content of the semester. Sections I–V give you the missing piece — the modern AI ledger expressed as dollars per task, not dollars per million tokens. Section VI wraps the whole course in five lines and previews the final exam.
| § | Topic | Minutes |
|---|---|---|
| I | The token cost stack — input / output / cached | 15 |
| II | The 2026 LLM pricing landscape | 10 |
| III | Agent cost multipliers — 3–10× a chatbot | 15 |
| IV | Caching strategies — where 40–90% savings hide | 20 |
| — | Discussion: which cache wins for your project? | 10 |
| V | Payback periods by industry — 2026 benchmarks | 15 |
| VI | Course wrap-up & final-exam preview | 20 |
| — | HW13, Q&A | 5 |
Prerequisite from Lecture 2 and Lecture 12
This lecture assumes you remember the AI-era ledger lines from Lecture 2 (§ VI) and the productivity-adjusted COCOMO from Lecture 12. Today we put dollars on those lines and connect them back to the NPV / payback machinery from Lectures 5–6.
§ I. The Token Cost Stack — Three Components, Three Prices¶
The cost of an LLM call is not a single number. It is a sum of three priced quantities, each measured per million tokens, each behaving differently as your system scales. Engineers who quote "we use Sonnet, it's $3/M" are quoting one of three numbers — usually the cheapest one — and then are surprised at the bill.
The three token types¶
| Type | What it is | Relative cost (2026) | Why it costs what it costs |
|---|---|---|---|
| Input tokens | Everything you send to the model — system prompt, user message, retrieved context, tool definitions. | 1× base | The model must read every token. Cost scales linearly with context length. |
| Output tokens | Everything the model generates — response text, tool calls, reasoning traces. | 3–5× input | Generation is inherently more compute-intensive than reading — each output token requires a fresh forward pass. |
| Cached input tokens | Input tokens already seen, served from the provider's prompt cache. | ~0.1× input (≈ 90 % cheaper) | The expensive KV-state computation has already been done; the provider replays it. |
Why output is 3–5× input — a one-paragraph intuition¶
When the model reads input, it processes the whole sequence in parallel: one forward pass over \(n\) tokens. When the model generates output, it must produce each token sequentially: one forward pass per token, and each pass attends back to the entire prior context. The arithmetic is roughly \(O(n)\) for input and \(O(n \cdot m)\) for output of length \(m\) — and at production scale, the GPU time per output token is several times that of an input token. Providers price what they pay for.
Why cache is ~90 % cheaper¶
A prompt cache stores the attention key/value tensors for a long prefix (system prompt, codebase, document) so that the next call with the same prefix skips the expensive prefill. The provider has already paid the compute cost once and amortises it across reuses. For workloads where 80% of every request is the same 30K-token system prompt, this is the single largest line-item optimisation available.
The cost structure rewards three product decisions¶
| Design choice | Why it lowers the bill |
|---|---|
| Short outputs | Output is the expensive multiplier. A 200-token reply costs less than a 2,000-token one — by exactly 10×. |
| Structured outputs (JSON schema, function calls) | Forces brevity; eliminates filler tokens; makes downstream parsing free. |
| Stable, repeated prompts | Triggers prompt cache. A system prompt that changes on every request cannot be cached. |
Working principle for this lecture
Your model's pricing is the boundary of your product economics. Design the product to live inside that boundary, or the boundary will design the product for you.
§ II. The 2026 LLM Pricing Landscape — Shop Before You Commit¶
The frontier-model market in 2026 is competitive enough that the cheapest viable option for a given task is usually 10–50× cheaper than the most capable. There is no single right answer; there is a task-tiered answer.
The major models, side by side (mid-2026)¶
| Model | Input $/MTok | Output $/MTok | Cached input | Position |
|---|---|---|---|---|
| Claude Opus 4.7 | $15.00 | $75.00 | $1.50 | Frontier reasoning, complex agents |
| Claude Sonnet 4.6 | $3.00 | $15.00 | $0.30 | Workhorse — most production agents |
| GPT-4o | $2.50 | $10.00 | $1.25 | General-purpose chat, multimodal |
| GPT-4o mini | $0.15 | $0.60 | $0.075 | High-volume classification, routing |
| Gemini 2.5 Pro | $3.50 | $10.50 | $0.50 | Long context (1M+), grounded search |
Two observations worth memorising.
- GPT-4o mini's input price is 1% of Opus 4.7. Routing easy traffic to a small model is the single highest-leverage architectural decision in a 2026 AI product.
- Cached-input prices are 2–10% of fresh input. The gap between teams that cache well and teams that don't is roughly an order of magnitude on the monthly bill.
Mix tiers by task difficulty — the standard production pattern¶
Mature production stacks rarely use one model for everything. The dominant pattern is a three-tier router:
| Tier | Model class | Use for | Share of traffic |
|---|---|---|---|
| Cheap | GPT-4o mini, Haiku-class | Classification, routing, simple Q&A, schema extraction | 60–80 % |
| Workhorse | Sonnet 4.6, GPT-4o | Tool use, multi-step reasoning, drafting | 15–30 % |
| Frontier | Opus 4.7, Gemini 2.5 Pro | Hardest 5%, escalations, safety-critical | 2–10 % |
A team that runs everything on Opus is paying roughly 50× what a tiered stack would cost for the same product. A team that runs everything on mini is shipping wrong answers on the hard 5%. The competitive engineering choice is the router, not the model.
Prices fluctuate
The numbers above are mid-2026 list prices. Vendor pricing has moved at roughly 30–50% per year since 2023, almost always downward. Re-check before committing a procurement; build your cost model with a price_per_mtok variable, not a constant.
§ III. Agent Cost Multipliers — Why One User Request Becomes Ten LLM Calls¶
A chatbot makes one LLM call per user message. An agent makes 6–13. This is the single largest source of cost surprise in 2026 AI deployments.
The five phases of an agent task¶
| Phase | Typical calls | What happens |
|---|---|---|
| Planning | 1–2 | Decompose the user request into sub-steps; choose a strategy. |
| Tool selection / arg-building | 1–3 | Pick which APIs or tools to invoke; generate well-formed arguments. |
| Execution & iteration | 2–5 | Run tool, inspect result, decide next step. This is where loops happen. |
| Verification | 1–2 | Re-check the answer for correctness, format, or policy violations. |
| Response synthesis | 1 | Compose the user-facing answer. |
| Typical total | 6–13 | Per user task |
An unconstrained agent — one with no max-iteration cap, no model routing, and no caching — costs $5–8 per task on a frontier model. Budgets that estimate "1 call × $0.01 = $0.01 per request" are wrong by 500–800×. Most AI cost overruns in 2025–2026 trace back to this single mismodelling.
Worked example — One user task, Claude Sonnet 4.6, full cost¶
A mid-complexity agent task: a developer asks the assistant to "investigate why test X is flaky and propose a fix." The agent reads the test, searches the codebase, runs the test three times, verifies its hypothesis, and writes a response.
| Call | Input tokens | Output tokens | Cost (Sonnet 4.6) |
|---|---|---|---|
| Plan | 5,000 | 500 | $0.0225 |
| Tool #1 — read test file | 12,000 | 200 | $0.0390 |
| Tool #2 — codebase search | 8,000 | 400 | $0.0300 |
| Iterate (3 calls) | 30,000 | 1,200 | $0.1080 |
| Verify | 10,000 | 300 | $0.0345 |
| Synthesize | 12,000 | 800 | $0.0480 |
| Total per task | 77,000 | 3,400 | $0.282 |
The arithmetic: input is \(77{,}000 \times \$3 / 10^6 = \$0.231\); output is \(3{,}400 \times \$15 / 10^6 = \$0.051\); sum \(\approx \$0.282\). Note that output is only 4.4% of token count but 18% of cost — because of the 5× multiplier.
Scaling to a real product¶
At 100,000 tasks per month (a mid-market SaaS):
This is inference only — it excludes hosting, observability, vector DB, and human verification. A realistic all-in budget is roughly 1.5–2× this number. And we have not yet turned on caching, which can cut the inference line by 50–70%.
The most common 2026 cost-model error
Per-token cost × tokens-per-call is not per-task cost. Multiply by fan-out (6–13 for agents) and retry rate (1.1–1.4× for production resilience). Skipping either factor produces an estimate that is wrong by an order of magnitude.
§ IV. Caching Strategies — Where 40–90 % of Savings Hide¶
Caching is the highest-leverage optimisation available to an AI product team. Three distinct tiers exist; they compose; and the gap between teams that stack all three and teams that stack none is roughly 5–10× on the monthly bill.
The three caching tiers¶
| Tier | What it reuses | Where it lives | Cost / latency savings |
|---|---|---|---|
| ① Prompt cache | Identical input prefixes (system prompt, large context, retrieved docs). | Provider-managed (Anthropic, OpenAI, Google all support it). | 80–90 % off cached tokens. |
| ② Semantic cache | Responses for similar (not identical) requests, matched by embedding distance. | Application-side; small vector DB. | 40–70 % on cacheable workloads. |
| ③ KV cache reuse | The attention key/value state across decode steps. | Provider-internal; partially exposed via batch APIs. | 75 % latency cut; modest cost savings. |
They compose — prompt cache reduces fresh-input cost; semantic cache eliminates the call entirely when an existing response will do; KV reuse cuts latency. A workload routed through all three pays a fraction of the naive bill.
Hit rate by workload pattern¶
Caching savings depend on cache hit rate, which depends on the shape of your traffic. Below are typical ranges from 2026 production deployments.
| Workload pattern | Typical cache hit % | Resulting cost reduction |
|---|---|---|
| Customer support — repeated FAQs | 60–80 % | 50–70 % |
| Code assistant — repeated codebase context | 70–90 % | 60–80 % |
| Search / chat — unique queries | 15–30 % | 10–25 % |
| Document analysis — long shared prefix, varying tail | 85–95 % | 70–85 % |
| Agentic workflow — repeated planning prompts | 60–80 % | 50–70 % |
Read this table conservatively. Use the lower bound for budgeting. If you happen to land at the upper bound in production, that is a pleasant variance; if you assumed the upper bound and land at the lower bound, you have under-quoted the project.
A simple model — what caching does to the worked example¶
Recall the $0.282 per-task cost from § III. Suppose we add prompt caching with a 70% hit rate on input tokens. The new input cost per task is:
Output cost is unchanged at \(0.051. New per-task total: **\)0.137, a 51% reduction from \(0.282. At 100K tasks/month, annual savings ≈ **\)174K. One engineer-month of prompt-cache engineering routinely pays back in under 30 days.
The single most under-used optimisation in 2026
If your AI product does not use prompt caching, that is almost certainly the highest-ROI engineering ticket on your backlog. Estimate the savings, write the ticket, ship it this quarter.
Class Discussion — Which Caching Tier Wins for Your Project?¶
Open floor, ~10 minutes. In pairs (4 min), then share.
For your group-project AI feature:
- Categorise your workload — which of the five patterns in § IV does it most resemble? FAQs? Code context? Unique queries? Long shared prefix? Agentic?
- Estimate hit rate — pick a defensible number from the table's range, and write down why you picked the low end vs. the high end.
- Compute monthly savings — using your token-spend estimate from Homework 12, what would each caching tier save?
- Three sanity questions the slide deck poses:
- What fraction of your requests share a long input prefix? That is your prompt-cache opportunity.
- Could two requests safely share the same response? That is your semantic-cache opportunity.
- What experiment would prove your hit-rate assumption before committing engineering effort? (Hint: a one-day log analysis usually settles it.)
Three pairs will share which tier they'd implement first and why.
§ V. Payback Periods by Industry — 2026 Benchmarks¶
The single most important number in any AI business case is the payback period: how many months of net savings cover the build cost. Industry data from 2026 surveys (VentureBeat's 1,100-engineer-and-CTO study; Digital Applied's 100+ ROI data points; McKinsey's AI ROI tracker) shows striking variation by use case.
Median payback by use case¶
| Use case | Median payback | Why fast / slow |
|---|---|---|
| Customer support | 4.1 months | High labour-cost displacement, narrow domain, easy to measure deflection. |
| Marketing operations | 6.7 months | High volume of low-criticality work; easy to A/B test. |
| Sales enablement | 7.5 months | Conversion lift offsets cost; harder attribution. |
| Engineering productivity | 9.3 months | Senior-engineer verification overhead eats a big share of the gain. |
| Compliance / legal | 14 months | High accuracy bar; many false positives require human triage. |
The shape of the table is the lesson. Payback is fastest where (a) AI displaces high-marginal-cost human labour, (b) the domain is narrow enough to verify cheaply, and (c) errors are tolerable. Payback is slowest where AI must clear a high accuracy bar in a broad domain — exactly where the human verification cost dominates.
The 18-month rule of thumb
For your group project: an AI feature with payback greater than 18 months is rarely funded in the current capital environment. If your model lands beyond 18 months, you have two levers — halve the cost (cheaper model, caching, smaller scope) or double the value (broader rollout, higher-leverage use case). Pick one and rebuild.
Worked mid-market case — A 2.5-month payback¶
A hypothetical SaaS company: 8,000 monthly support tickets, $18.40 average resolution cost (loaded agent salary, tooling, overhead). The company deploys an AI support agent that deflects 34% of tickets (industry-typical for a well-trained domain-tuned bot).
| Item | Monthly amount |
|---|---|
| Tickets deflected (\(8{,}000 \times 0.34\)) | 2,720 |
| Resolution cost saved (\(2{,}720 \times \$12.20\) net) | +$33,184 |
| AI infrastructure (tokens + observability + LLM ops) | −$3,800 |
| Monthly net benefit | +$29,384 |
| One-time build cost | $72,000 |
| Simple payback = \(72{,}000 / \$29{,}384\) | ≈ 2.5 months |
| Discounted payback @ MARR = 12 % (monthly \(i = 0.95\%\)) | ≈ 2.6 months |
The arithmetic uses Lecture 6's payback definition. Note that deflection-cost savings are slightly less than the headline $18.40 — the $12.20 figure assumes a partial deflection (some tickets still escalate to a human, who now starts mid-conversation), which is the honest way to model it. Optimists pencil in $18.40 and quote a 1.7-month payback; the project ships and they discover the real number is 2.5–3 months.
Why this case pays back so fast — it satisfies all three conditions: AI displaces high-cost labour ($18.40/ticket); the domain is narrow (this product's support); errors are recoverable (human escalation path exists). A compliance-screening AI with the same build cost would pay back 5–8× more slowly because verification swallows the gain.
Working principle for AI ROI
When AI pays back fast, it is almost always because it displaces high-marginal-cost labour at high volume. When it pays back slowly, it is almost always because a human still has to verify each output.
§ VI. Course Wrap-Up — Thirteen Sessions in Five Lines¶
Today is the last new content of the semester. Lectures 14–15 are student presentations; Lecture 16 is the final exam. So this is the right moment to compress thirteen lectures into something portable.
The whole course in five lines¶
- Every software decision is an economic decision. Boehm's seven-step framework applies at any scale — from a one-week feature to a five-year platform.
- Cost lives in a four-axis taxonomy (fixed/variable × direct/indirect × non-recurring/recurring × capex/opex) across five lifecycle phases. 60–80% of lifetime cost lives after launch — Boehm's iceberg, restated every year since 1981.
- Time value of money is the calculus of comparison. NPV is the verdict; IRR, payback, and PI are the footnotes. Equivalence is the operational definition under all of them.
- Sensitivity tells you which assumption to research; risk analysis tells you the probability of being wrong. Tornado diagrams, Monte Carlo, decision trees — different tools for the same question.
- The AI era changes the coefficients, not the equations. Same Boehm framework, recalibrated for tokens, agent fan-out, caching, and a re-allocated productivity curve (Lecture 12).
Working principle for the rest of your career
If you walk away with only those five lines, you can defend any software decision in any room you'll enter for the next decade.
Final exam preview — Lecture 16, Thursday 18 June¶
| Section | Points | Material |
|---|---|---|
| Multiple choice / short answer | 20 | Definitions, frameworks, intuitions from all 13 lectures. ~20 questions, 1 pt each. |
| Computation | 50 | PV/FV, NPV/IRR, equivalence, sensitivity, FP, COCOMO II, token-cost arithmetic. ~5 problems. |
| Integrative case | 30 | A realistic project — full economic analysis including an AI cost dimension. One long question. |
Logistics. Closed book. One A4 cheat sheet, single-sided, handwritten, permitted. Non-programmable calculator. 110 minutes, room TBA. Bring a pencil and a backup pen.
What I will not test. Memorisation of factor-table values (you'll be given any non-trivial factor). Memorisation of 2026 vendor pricing (the test gives the prices). The names of every paper in the references. Concepts and computation are the test.
How to study. Re-do Homework 1, 4, 6, and 12 from scratch on a blank sheet. If you can finish each in 30 minutes with the right answer and a clean CFD, you are ready. The integrative case is built from those four shapes plus the AI-cost overlay from today.
Homework 13 — Final Project Polish¶
Due: Tuesday 2026-06-16, before Lecture 14 (your team's presentation). Late: not accepted — presentation slot is fixed.
- Finalise your group-project presentation deck. Required slides:
- Scope — what the project is, who the user is, the decision being made.
- FP / COCOMO II estimate — effort, duration, cost; reference Lecture 9–10.
- Cash flow + two alternatives — CFD for the recommended option and one realistic comparator; Lecture 4.
- NPV / IRR / Payback / PI — at your stated MARR; Lectures 5–6.
- Sensitivity analysis — tornado diagram on the three biggest assumptions; Lecture 7.
- AI-cost dimension (bonus, +5 pts) — token spend, fan-out, caching savings, AI-feature payback; today's lecture.
- Recommendation — one slide, one sentence at the top.
- Rehearse the talk to the clock: 12 min presentation + 5 min Q&A + 3 min peer evaluation. Going long is the most common preventable failure.
- Submit the final project report PDF to
submissions/PROJECT/<team-name>/final-report.pdfbefore the start of Lecture 14.
Office hours. Tomorrow 10:00–11:30 (final pre-presentation slot) and Wednesday 13:00–14:00 (after Lecture 15, for exam questions).
Class Discussion Prompts — Wrap-Up¶
Open floor, ~10 minutes (overflow from the cache discussion).
- Pick one AI-era cost line from today (tokens, fan-out, cache, verification, payback gap). Argue in one paragraph whether your group project's recommendation survives that cost line being 2× worse than your central estimate.
- Re-read the five-line course summary above. Which line do you find most useful — and which do you find least convincing? Defend in plain English.
- Looking forward five years: which AI-cost line do you expect to dominate in production by 2031 — inference, verification, evaluation, compliance, or something not yet on the ledger? Make a falsifiable prediction.
Further Reading — 2026 industry reports¶
- VentureBeat (2026). "The Hidden Economics of AI Agents: 1,100 Engineers and CTOs Surveyed" — empirical fan-out and cost data underlying § III.
- Stevens Online (2026). "Token Cost Stack: A Production Engineer's Guide" — the canonical 2026 reference for input/output/cache decomposition.
- Silicon Data (2026). "LLM Cost Per Token: 2026 Practical Guide" — month-by-month price history for the major vendors.
- Prem AI (2026). "LLM Cost Optimization: 8 Strategies That Compound" — extends the three-tier caching framework with batching, distillation, and routing.
- Digital Applied (2026). "AI Agent Productivity & ROI Statistics — 2026 Edition" — the 100+ ROI data points behind the payback table in § V.
- McKinsey Global Institute (2026). "The State of AI: 2026 Survey" — payback-period medians by industry and function.
- Anthropic Engineering Blog. "Prompt Caching for Production Workloads" — the canonical engineering reference for tier ① caching.
- (Back-pointer) Boehm (1981). Software Engineering Economics, Chapters 3 and 5 — the framework whose coefficients we just recalibrated.
Key Vocabulary — Quick Reference¶
| Term | Plain-language meaning |
|---|---|
| Input tokens | Tokens the model reads (prompt + context). Priced at the base rate. |
| Output tokens | Tokens the model generates. Priced at 3–5× input. |
| Cached input tokens | Input tokens replayed from a prompt cache. Priced at ~10% of input. |
| Prompt cache | Provider-managed cache of input prefixes; 80–90% off cached tokens. |
| Semantic cache | Application-side cache that reuses responses for similar requests. |
| KV cache | Provider-internal reuse of attention state; primarily a latency win. |
| Agent fan-out | Number of LLM calls per user task — typically 6–13 for an agent. |
| Per-task cost | Per-call cost × fan-out × retry rate. The only honest unit of AI cost. |
| Cache hit rate | Fraction of input tokens served from cache. Depends on workload shape. |
| Deflection rate | Fraction of human-bound work absorbed by the AI. Drives the savings line. |
| Verification overhead | Engineer or analyst time spent reviewing AI output. The hidden cost killer. |
| Payback period | Months of net benefit needed to recover the build cost; Lecture 6 definition. |
| Three-tier router | Cheap / workhorse / frontier model selection by task difficulty. |
What Today Bought You¶
- Token cost = input + output (3–5× input) + cache (~10% of input). Output dominates; cache wins.
- Agents fan out 6–13×. Per-call cost is not per-task cost. Most cost overruns trace to this confusion.
- Caching is the highest-ROI optimisation available. 40–90% savings in production is standard, not exceptional.
- Payback depends on how much human labour you displace and how cheaply. 4–14 months by industry is the 2026 band; >18 months rarely funds.
- The whole course compresses to five lines. Boehm still holds; only the coefficients have moved.
This is the last content lecture of Software Engineering Economics, Summer 2026. The vocabulary you now share — NPV, IRR, sensitivity, payback, tokens, fan-out, cache — is the lingua franca of every engineering economics discussion you will join for the next decade. Use it well.
Prepared by Dr. Zhijiang Chen — Frostburg State University, Summer 2026.