Lecture Thirteen · 15 June 2026

AI Topic II —
Token Economics
& Project ROI

When the cost of running software is dominated by the cost of thinking. Tokens, caching, agents, and the new unit economics.

Instructor: Dr. Zhijiang Chen
Session: No. 13 of 16
Date: 15 June 2026
Room: YF108
Duration: 110 minutes
Format: Lecture + Course recap
Last content lecture before student presentations. The final exam is built from these 13 sessions.
Lecture XIII · Agenda
Today's Plan

The cost stack of an AI product.

| § | Topic | Minutes |
|---|---|---|
| I. | The token cost stack: input / output / cached | 15 |
| II. | The 2026 LLM pricing landscape | 10 |
| III. | Agent cost multipliers (3–10× a chatbot) | 15 |
| IV. | Caching strategies (40–90% savings) | 20 |
| | Discussion: which caching tier wins for your project? | 10 |
| V. | Payback periods by industry (4.1 mo / 6.7 mo / 9.3 mo) | 15 |
| VI. | Course wrap-up & final-exam preview | 20 |
| | HW13, questions | 5 |
Part — One
I

Token economics —
per-call cost as
the new unit of analysis.

§ I · Token anatomy
Three token types, three prices

A single LLM call has three cost components.

| Type | What it is | Relative cost (2026) |
|---|---|---|
| Input tokens | Everything you send to the model: system prompt, user message, context, tool definitions. | 1× base |
| Output tokens | Everything the model generates: response, tool calls, reasoning traces. | 3–5× input |
| Cached input tokens | Input tokens already seen, served from the model's cache. | 0.1× input |

Output tokens cost 3–5× as much as input tokens because generation runs one token at a time, while the input is processed in a single parallel pass. Cached input is up to ~90% cheaper than fresh input; the exact discount varies by vendor.

The cost structure rewards short outputs, structured outputs, and repeated prompts. Your model's pricing is the boundary of your product economics.
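The three components can be combined into one per-call formula. The sketch below is illustrative only: `call_cost` is a hypothetical helper, the default $3/MTok base price is borrowed from the Claude Sonnet 4.6 row of the pricing table, and the 5× and 0.1× ratios are the upper and typical figures quoted above, not a vendor API.

```python
def call_cost(input_tokens, output_tokens, cached_tokens=0,
              input_price=3.00, output_ratio=5.0, cached_ratio=0.1):
    """Dollar cost of one LLM call.

    input_price is $ per million fresh input tokens (assumed default).
    output_ratio and cached_ratio are multiples of the input price.
    """
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    per_token = input_price / 1_000_000
    fresh_tokens = input_tokens - cached_tokens
    return (fresh_tokens * per_token                      # fresh input
            + cached_tokens * per_token * cached_ratio    # cached input
            + output_tokens * per_token * output_ratio)   # output


uncached = call_cost(5_000, 500)
mostly_cached = call_cost(5_000, 500, cached_tokens=4_000)
print(f"uncached ${uncached:.4f}, 80% of input cached ${mostly_cached:.4f}")
```

With 4,000 of the 5,000 input tokens served from cache, the call drops from about $0.0225 to about $0.0117, which is why repeated prompts are rewarded.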

Part — Two
II

The 2026 pricing landscape —
shop before
you commit.

§ II · Vendor pricing
$ per million tokens, mid-2026

The major models, side by side.

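| Model | Input $/MTok | Output $/MTok | Cached input $/MTok |
|---|---|---|---|
| Claude Opus 4.7 | $15 | $75 | $1.50 |
| Claude Sonnet 4.6 | $3 | $15 | $0.30 |
| GPT-4o | $2.50 | $10 | $1.25 |
| GPT-4o mini | $0.15 | $0.60 | $0.075 |
| Gemini 2.5 Pro | $3.50 | $10.50 | $0.50 |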

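Within a vendor family, the cheapest model can be only ~6% the price of the flagship (GPT-4o mini vs. GPT-4o), and about 1% of the most expensive model in the table. Choose the model by task difficulty, not brand; many production agents mix tiers.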
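To make the tier gap concrete, the sketch below prices one and the same call on every model in the table. The prices are copied from the table; the `cost` helper and the 10,000-in / 500-out call size are assumptions chosen for illustration.

```python
# (input $/MTok, output $/MTok), copied from the pricing table above.
PRICES = {
    "Claude Opus 4.7": (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Gemini 2.5 Pro": (3.50, 10.50),
}


def cost(model, input_tokens, output_tokens):
    """Dollar cost of one uncached call on the given model."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical call size: 10,000 input tokens, 500 output tokens.
ranked = sorted(PRICES, key=lambda m: cost(m, 10_000, 500))
for model in ranked:
    print(f"{model:<18} ${cost(model, 10_000, 500):.4f}")

ratio = cost("GPT-4o mini", 10_000, 500) / cost("GPT-4o", 10_000, 500)
print(f"GPT-4o mini / GPT-4o = {ratio:.0%}")
```

Because GPT-4o mini is 6% of GPT-4o on both input and output, the ratio holds for any mix of tokens; for other model pairs it depends on the input/output split.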

Part — Three
III

Agent multipliers —
why one user request
becomes ten LLM calls.

§ III · Agent fan-out
Why agents cost what they cost

One user task → many LLM calls, at 3–10× the cost of a chatbot turn.

| Task phase | Typical calls | Why |
|---|---|---|
| Planning | 1–2 | Decompose user request into sub-steps. |
| Tool selection / arg-building | 1–3 | Choose APIs, generate parameters. |
| Execution & iteration | 2–5 | Run tool, inspect result, decide next step. |
| Verification | 1–2 | Re-check the answer; self-correct. |
| Response synthesis | 1 | Generate user-facing answer. |
| Typical total | 6–13 | Per user task. |

An unconstrained agent task can cost $5–8. A budget that quotes per-token cost without accounting for fan-out can be off by an order of magnitude.
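The "Typical total" row is just the sum of the per-phase ranges, and the same sum turns a per-call cost into a per-task range. A minimal sketch, where the $0.03 average per-call cost is a made-up figure for illustration only:

```python
# (min calls, max calls) per phase, copied from the fan-out table.
PHASES = {
    "Planning": (1, 2),
    "Tool selection / arg-building": (1, 3),
    "Execution & iteration": (2, 5),
    "Verification": (1, 2),
    "Response synthesis": (1, 1),
}


def call_range(phases=PHASES):
    """Total (min, max) LLM calls for one user task."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


def task_cost_range(avg_call_cost, phases=PHASES):
    """Per-task cost range, assuming every call costs avg_call_cost dollars."""
    low, high = call_range(phases)
    return low * avg_call_cost, high * avg_call_cost


print(call_range())            # (6, 13)
print(task_cost_range(0.03))   # hypothetical $0.03 per call
```

A flat per-call cost is a simplification: in practice, later calls carry more context and cost more than early ones.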

§ III · Cost build-up
A worked example

One user task — Claude Sonnet 4.6 — full cost.

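| Call | Input tokens | Output tokens | Cost |
|---|---|---|---|
| Plan | 5,000 | 500 | $0.0225 |
| Tool #1 (read docs) | 12,000 | 200 | $0.0390 |
| Tool #2 (search) | 8,000 | 400 | $0.0300 |
| Iterate (3 calls) | 30,000 | 1,200 | $0.1080 |
| Verify | 10,000 | 300 | $0.0345 |
| Synthesize | 12,000 | 800 | $0.0480 |
| Total per task | 77,000 | 3,400 | $0.282 |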

At 100,000 tasks/month that is $28,200 per month, or about $338K per year on inference alone. Caching can cut this by 50–70% for agentic workloads (§ IV).
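The table can be reproduced in a few lines, which is a useful sanity check before putting such numbers in a budget. The token counts come from the table and the $3 / $15 per MTok prices from the Claude Sonnet 4.6 row; the variable names are illustrative.

```python
INPUT_PRICE, OUTPUT_PRICE = 3.00, 15.00   # $/MTok, Claude Sonnet 4.6 row

# (call, input tokens, output tokens), copied from the worked example.
CALLS = [
    ("Plan", 5_000, 500),
    ("Tool #1 (read docs)", 12_000, 200),
    ("Tool #2 (search)", 8_000, 400),
    ("Iterate (3 calls)", 30_000, 1_200),
    ("Verify", 10_000, 300),
    ("Synthesize", 12_000, 800),
]


def cost(input_tokens, output_tokens):
    """Dollar cost of one uncached call."""
    return (input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE) / 1_000_000


total_in = sum(i for _, i, _ in CALLS)
total_out = sum(o for _, _, o in CALLS)
per_task = sum(cost(i, o) for _, i, o in CALLS)
monthly = per_task * 100_000          # 100,000 tasks per month
annual = monthly * 12
input_share = cost(total_in, 0) / per_task

print(f"per task ${per_task:.3f}, monthly ${monthly:,.0f}, annual ${annual:,.0f}")
print(f"input tokens account for {input_share:.0%} of the bill")
```

Note what the last line shows: although output tokens are 5× as expensive per token here, about four fifths of this task's cost is input, because an agent re-sends a large context on every call. That is why input caching matters so much for agents.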

Part — Four
IV

Caching —
where 40–90%
of savings hide.

§ IV · Three cache tiers
Different ideas, different savings

The three caching tiers in production.

① Prompt cache

Reuse identical input prefixes (system prompt + context). 80–90% cost cut on cached tokens. Provider-managed.

② Semantic cache

Reuse responses for similar requests. Embedding-based matching. 40–70% savings on cacheable workloads.

③ KV cache reuse

Reuse the attention key/value state across decode steps. Provider-internal. 75% latency cut.

Stack them — they compose. Teams that implement all three see 70–90% production cost reduction relative to a naive implementation.
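Of the three tiers, the semantic cache is the one you build yourself, so a toy version may help fix the idea. The sketch below is not a production design: `toy_embed` is a bag-of-words stand-in for a real embedding model, `SemanticCache` is a hypothetical class, and the 0.85 similarity threshold is an arbitrary assumption that would need tuning against real traffic.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in for an embedding model: lowercase word counts."""
    return Counter(re.findall(r"\w+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, threshold=0.85):
        self.threshold = threshold   # assumed value; tune per workload
        self.entries = []            # list of (vector, response)

    def get(self, query):
        """Return the best cached response above the threshold, else None."""
        q = toy_embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((toy_embed(query), response))


cache = SemanticCache()
cache.put("How do I reset my password?", "Use the 'Forgot password' link.")
print(cache.get("how do i reset my password please"))   # similar: cache hit
print(cache.get("What is your refund policy?"))         # unrelated: None
```

The threshold is the design choice that matters: set it too low and users receive answers to questions they did not ask; set it too high and the hit rate collapses.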

§ IV · Cache hit rates
What hit rate can you achieve?

Hit rate by workload pattern.

| Workload | Typical cache hit % | Cost reduction |
|---|---|---|
| Customer support: repeated FAQs | 60–80% | 50–70% |
| Code assistant: repeated codebase context | 70–90% | 60–80% |
| Search / chat: unique queries | 15–30% | 10–25% |
| Document analysis: long shared prefix | 85–95% | 70–85% |
| Agentic workflow: repeated planning prompts | 60–80% | 50–70% |

An honest budget assumes the lower bound of the hit-rate range. If your workload turns out to be highly cacheable, you'll be pleasantly surprised; if it isn't, you won't have over-promised.
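A rough way to turn a hit-rate range into a budget range is sketched below. It assumes the hit rate is the fraction of input tokens served from cache and that cached tokens cost 0.1× the fresh price, as in the token-anatomy table; both function names are hypothetical. The $23,100 example is the monthly input spend implied by the worked example (77,000 input tokens × $3/MTok × 100,000 tasks).

```python
def input_cost_reduction(hit_rate, cached_ratio=0.1):
    """Fractional saving on input spend for a given token-level hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * (1.0 - cached_ratio)


def budget_range(monthly_input_spend, hit_low, hit_high, cached_ratio=0.1):
    """(pessimistic, optimistic) monthly input spend after caching."""
    pessimistic = monthly_input_spend * (1 - input_cost_reduction(hit_low, cached_ratio))
    optimistic = monthly_input_spend * (1 - input_cost_reduction(hit_high, cached_ratio))
    return pessimistic, optimistic


# Agentic workflow row: 60-80% hit rate.
pessimistic, optimistic = budget_range(23_100, 0.60, 0.80)
print(f"budget ${pessimistic:,.0f}; best case ${optimistic:,.0f}")
```

Under these assumptions a 60–80% hit rate gives a 54–72% saving on input spend, in line with the 50–70% row in the table. The pessimistic figure is the one to put in the budget.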

Discussion · 10 minutes
Which cache for your project?

Which caching tier would give your project the largest savings, and how would you measure it?

In pairs (4 min), categorise your project's workload. Estimate hit rate. Compute monthly savings if implemented.

  • What fraction of your requests share a long input prefix? → prompt cache opportunity.
  • Could two requests safely share the same response? → semantic cache opportunity.
  • What experiment would prove the hit-rate assumption before committing engineering effort?
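One cheap experiment for the first and last questions is to replay a log of real prompts and count how many begin with a prefix already seen. The sketch below is a rough proxy only: `prefix_hit_rate` is a hypothetical helper, it compares characters rather than tokens, and it ignores the minimum prefix lengths and expiry windows that provider caches typically impose.

```python
def prefix_hit_rate(prompts, prefix_len=200):
    """Fraction of prompts whose first prefix_len characters were
    already seen in an earlier prompt. Shorter prompts count as misses."""
    if not prompts:
        return 0.0
    seen = set()
    hits = 0
    for prompt in prompts:
        if len(prompt) < prefix_len:
            continue                 # too short to share a long prefix
        key = prompt[:prefix_len]
        if key in seen:
            hits += 1
        else:
            seen.add(key)
    return hits / len(prompts)


# Hypothetical log: three requests share a long system prompt, one does not.
system_prompt = "You are a support assistant. " * 10
log = [
    system_prompt + "Where is my order?",
    system_prompt + "How do I change my email?",
    "hi",
    system_prompt + "Cancel my subscription.",
]
print(f"estimated prompt-cache hit rate: {prefix_hit_rate(log):.0%}")
```

Run on a week of production logs, a number like this tests the hit-rate assumption before any engineering effort is committed.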
Part — Five
V

Payback periods —
industry data,
2026.

§ V · Industry benchmarks
Where AI pays back fastest

Median payback period by use case.

| Use case | Median payback | Why fast / slow |
|---|---|---|
| Customer support | 4.1 mo | High labor-cost displacement, narrow domain. |
| Marketing operations | 6.7 mo | Volume work, low criticality. |
| Sales enablement | 7.5 mo | Conversion lift offsets cost. |
| Engineering productivity | 9.3 mo | Senior-engineer verification overhead. |
| Compliance / legal | 14 mo | High accuracy bar; many false positives to triage. |

Source: 2026 cross-industry surveys (VentureBeat 1,100-engineer-and-CTO study; Digital Applied 100+ ROI data points).

For your group project: an AI feature with payback > 18 months is rarely funded. If yours lands there, find ways to halve the cost or double the value.

§ V · Mid-market case
A concrete payback computation

A 2.5-month payback example.

Hypothetical SaaS company, 8,000 monthly support tickets, $18.40 average resolution cost. Deploy an AI support agent.

| Item | Monthly |
|---|---|
| Tickets deflected (34%) | 2,720 |
| Resolution cost saved (2,720 × $12.20) | +$33,184 |
| AI infrastructure (tokens + observability) | −$3,800 |
| Monthly net | +$29,384 |
| One-time build cost | $72,000 |
| Simple payback | 2.5 months |
| Discounted payback @ 12% | 2.5 months |

When AI pays back fast, it's almost always because it displaces high-marginal-cost labour at high volume. When it pays back slowly, it's usually because a human still has to verify each output.
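The payback figures can be checked with a short script. The inputs are taken from the example as given, including the $12.20 saved per deflected ticket; the function names are illustrative, and the discounted version assumes monthly compounding at 12%/12 = 1% per month with linear interpolation inside the payback month. Other conventions shift the result slightly.

```python
def simple_payback(build_cost, monthly_net):
    """Months of undiscounted net savings needed to recover build_cost."""
    return build_cost / monthly_net


def discounted_payback(build_cost, monthly_net, annual_rate=0.12, max_months=600):
    """Months to recover build_cost when each month's net saving is
    discounted at annual_rate / 12 per month. Returns None if the cost
    is not recovered within max_months."""
    r = annual_rate / 12
    cumulative = 0.0
    for month in range(1, max_months + 1):
        pv = monthly_net / (1 + r) ** month
        if cumulative + pv >= build_cost:
            # Interpolate linearly within the month that crosses break-even.
            return (month - 1) + (build_cost - cumulative) / pv
        cumulative += pv
    return None


tickets_deflected = round(8_000 * 0.34)        # 2,720
monthly_net = tickets_deflected * 12.20 - 3_800
print(f"monthly net        ${monthly_net:,.0f}")
print(f"simple payback     {simple_payback(72_000, monthly_net):.2f} months")
print(f"discounted payback {discounted_payback(72_000, monthly_net):.2f} months")
```

Under these assumptions the two results are roughly 2.45 and 2.49 months. At so short a horizon, discounting barely moves the answer; it matters far more for the 14-month cases in the benchmark table.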

Part — Six
VI

Course wrap-up —
thirteen sessions
in five lines.

§ VI · Course recap
What you now know

The course, in five lines.

  1. Every software decision is an economic decision. Boehm's seven steps apply at any scale.
  2. Cost lives in a fixed/variable × direct/indirect × non-recurring/recurring cell, across five lifecycle phases. 60–80% lives after launch.
  3. Time value of money is the calculus of comparison. NPV is the verdict; IRR, payback, PI are the footnotes.
  4. Sensitivity tells you which assumption to research; risk analysis tells you the probability of being wrong.
  5. The AI era changes the coefficients, not the equations. Same Boehm framework, recalibrated for tokens, verification, and a re-allocated productivity curve.

If you walk away with only those five lines, you can defend any software decision in any room you'll enter for the next decade.

§ VI · Final exam preview
Lecture 16, Thursday 18 June

What's on the exam.

| Section | Points | Material |
|---|---|---|
| Multiple choice / short answer | 20 | Definitions, frameworks, intuitions from all 13 lectures. |
| Computation | 50 | PV/FV, NPV/IRR, equivalence, sensitivity, FP, COCOMO II. |
| Integrative case | 30 | A realistic project: full economic analysis, including AI cost. |

Closed book; one A4 cheat sheet (single-sided) permitted; non-programmable calculator.

Homework · Final preparation
Homework 13 — final group-project polish

Get your slides ready.

  1. Finalise group-project presentation. Required slides: scope, FP/COCOMO estimate, cash flow + 2 alternatives, NPV/IRR/payback/PI, sensitivity, AI-cost dimension (bonus), recommendation.
  2. Rehearse — 12 min talk + 5 min Q&A + 3 min peer evaluation.
  3. Submit final project report PDF to /submissions/PROJECT/<team-name>/ before tomorrow's class.
Recap · What to remember

What today bought you.

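  1. Token cost = input + output (3–5× input per token) + cached input (~0.1× input). Output is the most expensive per token, but in agent workloads input volume can still dominate the bill ($0.231 of $0.282 in the worked example).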
  2. Agents fan out 3–10×. Per-call cost ≠ per-task cost.
  3. Caching is the highest-ROI optimisation. Savings of 40–90% in production are standard, not exceptional.
  4. Payback depends on how much human labour you displace, and how cheaply.
End · Lecture Thirteen · End of Content

Questions & conversation.

Dr. Zhijiang Chen
Software Engineering Economics · Summer 2026
frostburg-state-university.github.io/bju