Lecture Thirteen · 15 June 2026

AI Topic II —
Token Economics
& Project ROI

When the cost of running software is dominated by the cost of thinking. Tokens, caching, agents, and the new unit economics.

Instructor: Dr. Zhijiang Chen
Session: No. 13 of 16
Date: 15 June 2026
Room: YF108
Duration: 110 minutes
Format: Lecture + Course recap

Last content lecture before student presentations. The final exam is built from these 13 sessions.

Lecture XIII · Agenda

Today's Plan

The cost stack of an AI product.

§	Topic	Minutes
I.	The token cost stack — input / output / cached	15
II.	The 2026 LLM pricing landscape	10
III.	Agent cost multipliers (3–10× a chatbot)	15
IV.	Caching strategies (40–90% savings)	20
—	Discussion: which caching tier wins for your project?	10
V.	Payback periods by industry (4.1mo / 6.7mo / 9.3mo)	15
VI.	Course wrap-up & final-exam preview	20
	HW13, questions	5

Part — One

Token economics —
per-call cost as
the new unit of analysis.

§ I · Token anatomy

Three token types, three prices

A single LLM call has three cost components.

Type	What it is	Relative cost (2026)
Input tokens	Everything you send to the model: system prompt, user message, context, tool definitions.	1× base
Output tokens	Everything the model generates: response, tool calls, reasoning traces.	3–5× input
Cached input tokens	Input tokens already seen, served from the model's cache.	0.1–0.5× input, by vendor

Output tokens cost 3–5× as much as input tokens because generation is sequential: each output token needs its own forward pass, whereas the whole input is read in one parallel pass. Cached input is 50–90% cheaper than fresh input, depending on the vendor.

The cost structure rewards short outputs, structured outputs, and repeated prompts. Your model's pricing is the boundary of your product economics.
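
The three-price structure can be sketched as a small cost function. This is a minimal illustration, not a vendor API: the `Price` class and `call_cost` function are hypothetical names, and the rates are the Claude Sonnet 4.6 figures from this lecture's pricing table.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Prices in dollars per million tokens (MTok)."""
    input: float
    output: float
    cached_input: float


def call_cost(price: Price, fresh_in: int, out: int, cached_in: int = 0) -> float:
    """Dollar cost of one LLM call: fresh input + output + cached input."""
    total = (fresh_in * price.input
             + out * price.output
             + cached_in * price.cached_input)
    return total / 1_000_000


# Illustrative rates taken from the lecture's pricing table.
SONNET = Price(input=3.00, output=15.00, cached_input=0.30)

uncached = call_cost(SONNET, fresh_in=5_000, out=500)
# Same call, but 4,000 of the 5,000 input tokens are served from cache.
cached = call_cost(SONNET, fresh_in=1_000, out=500, cached_in=4_000)
print(f"uncached ${uncached:.4f}, cached ${cached:.4f}")
```

In this example the 500 output tokens cost half as much as the 5,000 uncached input tokens, and caching most of the prefix roughly halves the call.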

Part — Two

The 2026 pricing landscape —
shop before
you commit.

§ II · Vendor pricing

$ per million tokens, mid-2026

The major models, side by side.

Model	Input $/MTok	Output $/MTok	Cached input
Claude Opus 4.7	$15	$75	$1.50
Claude Sonnet 4.6	$3	$15	$0.30
GPT-4o	$2.50	$10	$1.25
GPT-4o mini	$0.15	$0.60	$0.075
Gemini 2.5 Pro	$3.50	$10.50	$0.50

Within one vendor's line-up, the cheapest model can cost only ~6% of the flagship (GPT-4o mini vs GPT-4o: $0.15 vs $2.50 per MTok of input). Choose the model by task difficulty, not brand; many production agents mix tiers.
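
A short sketch of what mixing tiers does to the effective price. The prices are copied from the table above; `price_ratio`, `blended_price`, and the 70/30 traffic split are hypothetical, chosen only for illustration, with shares measured in token volume.

```python
# $/MTok as (input, output, cached input), copied from the table above.
PRICES = {
    "Claude Opus 4.7": (15.00, 75.00, 1.50),
    "Claude Sonnet 4.6": (3.00, 15.00, 0.30),
    "GPT-4o": (2.50, 10.00, 1.25),
    "GPT-4o mini": (0.15, 0.60, 0.075),
    "Gemini 2.5 Pro": (3.50, 10.50, 0.50),
}


def price_ratio(cheap: str, flagship: str) -> float:
    """Input price of the cheap model as a fraction of the flagship's."""
    return PRICES[cheap][0] / PRICES[flagship][0]


def blended_price(mix: dict[str, float]) -> tuple[float, float]:
    """Blended (input, output) $/MTok for a traffic mix.

    `mix` maps model name to its share of token volume; shares must sum to 1.
    """
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("traffic shares must sum to 1")
    blended_in = sum(share * PRICES[m][0] for m, share in mix.items())
    blended_out = sum(share * PRICES[m][1] for m, share in mix.items())
    return blended_in, blended_out


ratio = price_ratio("GPT-4o mini", "GPT-4o")
# Assumed split: 70% of tokens go to the small model, 30% to the large one.
mix_in, mix_out = blended_price({"GPT-4o mini": 0.7, "GPT-4o": 0.3})
print(f"ratio {ratio:.0%}, blended input ${mix_in:.3f}, output ${mix_out:.2f}")
```

Under that assumed split, the blended price is about a third of sending everything to the larger model.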

Part — Three

Agent multipliers —
why one user request
becomes ten LLM calls.

§ III · Agent fan-out

Why agents cost what they cost

One user task → 6 to 13 LLM calls, or 3–10× the cost of a chatbot turn.

Task phase	Typical calls	Why
Planning	1–2	Decompose user request into sub-steps.
Tool selection / arg-building	1–3	Choose APIs, generate parameters.
Execution & iteration	2–5	Run tool, inspect result, decide next step.
Verification	1–2	Re-check the answer; self-correct.
Response synthesis	1	Generate user-facing answer.
Typical total	6–13	Per user task.

An unconstrained agent task can cost $5–8. Budgets that quote per-token cost without accounting for fan-out are wrong by an order of magnitude.
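
The fan-out table can be turned into a per-task cost range. A minimal sketch: the call ranges come from the table, while `total_calls`, `task_cost_range`, and the $0.03 average per-call cost are hypothetical placeholders.

```python
# (min calls, max calls) per phase, copied from the fan-out table.
PHASES = {
    "Planning": (1, 2),
    "Tool selection / arg-building": (1, 3),
    "Execution & iteration": (2, 5),
    "Verification": (1, 2),
    "Response synthesis": (1, 1),
}


def total_calls(phases: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the per-phase minimums and maximums."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


def task_cost_range(per_call_cost: float) -> tuple[float, float]:
    """Per-task cost range, given an average cost per LLM call."""
    low, high = total_calls(PHASES)
    return low * per_call_cost, high * per_call_cost


calls_low, calls_high = total_calls(PHASES)
# Assumed average cost of one call; replace with a measured value.
cost_low, cost_high = task_cost_range(per_call_cost=0.03)
print(f"{calls_low}-{calls_high} calls, ${cost_low:.2f}-${cost_high:.2f} per task")
```

A budget built on the single-call figure would understate the per-task cost by the same 6× to 13× factor.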

§ III · Cost build-up

A worked example

One user task — Claude Sonnet 4.6 — full cost.

Call	Input tokens	Output tokens	Cost
Plan	5,000	500	$0.0225
Tool #1 (read docs)	12,000	200	$0.0390
Tool #2 (search)	8,000	400	$0.0300
Iterate (3 calls)	30,000	1,200	$0.1080
Verify	10,000	300	$0.0345
Synthesize	12,000	800	$0.0480
Total per task	77,000	3,400	$0.282

For 100,000 tasks/month: $28,200. Multiply by 12 months: $338K/year, just on inference. Caching can cut this by 70%.
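
The table above can be reproduced in a few lines, which makes it easy to swap in another model's prices or your own token counts. The token counts and the Sonnet rates are the lecture's; the variable and function names are illustrative.

```python
IN_PRICE, OUT_PRICE = 3.00, 15.00  # Claude Sonnet 4.6, $/MTok, from the pricing table

# (call, input tokens, output tokens), copied from the worked example.
CALLS = [
    ("Plan", 5_000, 500),
    ("Tool #1 (read docs)", 12_000, 200),
    ("Tool #2 (search)", 8_000, 400),
    ("Iterate (3 calls)", 30_000, 1_200),
    ("Verify", 10_000, 300),
    ("Synthesize", 12_000, 800),
]


def row_cost(tokens_in: int, tokens_out: int) -> float:
    """Dollar cost of one row at the prices above (no caching)."""
    return (tokens_in * IN_PRICE + tokens_out * OUT_PRICE) / 1_000_000


per_task = sum(row_cost(i, o) for _, i, o in CALLS)
input_cost = sum(i for _, i, _ in CALLS) * IN_PRICE / 1_000_000
input_share = input_cost / per_task  # fraction of the bill due to input tokens
monthly = per_task * 100_000         # 100,000 tasks per month
annual = monthly * 12

for name, i, o in CALLS:
    print(f"{name:<22}{i:>8,}{o:>8,}  ${row_cost(i, o):.4f}")
print(f"per task ${per_task:.3f}, monthly ${monthly:,.0f}, annual ${annual:,.0f}")
print(f"input share of cost: {input_share:.0%}")
```

Input tokens account for roughly 82% of the bill in this example, even though each output token costs 5× as much. That is why caching repeated input is such a large lever for agent workloads.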

Part — Four

Caching —
where 40–90%
of savings hide.

§ IV · Three cache tiers

Different ideas, different savings

The three caching tiers in production.

① Prompt cache

Reuse identical input prefixes (system prompt + context). 80–90% cost cut on cached tokens. Provider-managed.

② Semantic cache

Reuse responses for similar requests. Embedding-based matching. 40–70% savings on cacheable workloads.

③ KV cache reuse

Reuse the attention key/value state across decode steps. Provider-internal. 75% latency cut.

Stack them — they compose. Teams that implement all three see 70–90% production cost reduction relative to a naive implementation.
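
Of the three tiers, the semantic cache is the one you build yourself. Below is a toy sketch of the idea, assuming a pluggable embedding function and a cosine-similarity threshold. `SemanticCache`, `toy_embed`, and the 0.9 threshold are all hypothetical; a real system would use a proper embedding model and a vector index.

```python
import math
import re
from typing import Callable, Optional, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Returns a stored response when a new query is similar enough."""

    def __init__(self, embed: Callable[[str], Vector], threshold: float = 0.9):
        self.embed = embed
        self.threshold = threshold
        self.entries: list[tuple[Vector, str]] = []
        self.hits = 0
        self.lookups = 0

    def get(self, query: str) -> Optional[str]:
        self.lookups += 1
        q = self.embed(query)
        best = max(self.entries, key=lambda e: cosine(q, e[0]), default=None)
        if best is not None and cosine(q, best[0]) >= self.threshold:
            self.hits += 1
            return best[1]
        return None  # miss: caller pays for a real LLM call, then calls put()

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))

    @property
    def hit_rate(self) -> float:
        return self.hits / self.lookups if self.lookups else 0.0


def toy_embed(text: str) -> Vector:
    """Stand-in embedding: keyword counts over a tiny fixed vocabulary."""
    words = re.findall(r"[a-z]+", text.lower())
    vocab = ["reset", "password", "refund", "invoice"]
    return [float(words.count(w)) for w in vocab]


cache = SemanticCache(toy_embed)
cache.put("How do I reset my password?", "Use the reset link on the login page.")
hit = cache.get("reset password please")
miss = cache.get("Where is my invoice?")
print(hit, miss, cache.hit_rate)
```

The threshold is the key design choice: set it too low and users receive answers to questions they did not ask, set it too high and the hit rate collapses.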

§ IV · Cache hit rates

What hit rate can you achieve?

Hit rate by workload pattern.

Workload	Typical cache hit %	Cost reduction
Customer support — repeated FAQs	60–80%	50–70%
Code assistant — repeated codebase context	70–90%	60–80%
Search / chat — unique queries	15–30%	10–25%
Document analysis — long shared prefix	85–95%	70–85%
Agentic workflow — repeated planning prompts	60–80%	50–70%

An honest budget pessimistically assumes the lower bound. If your workload happens to be highly cacheable, you'll be pleasantly surprised; if it isn't, you won't have over-promised.
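
The lower-bound rule can be applied mechanically. A minimal sketch using the cost-reduction ranges from the table above; the `budget_range` helper is hypothetical, and the $28,200 baseline is the uncached monthly figure from the worked example.

```python
# (low, high) cost-reduction fraction per workload, copied from the table.
REDUCTION = {
    "Customer support": (0.50, 0.70),
    "Code assistant": (0.60, 0.80),
    "Search / chat": (0.10, 0.25),
    "Document analysis": (0.70, 0.85),
    "Agentic workflow": (0.50, 0.70),
}


def budget_range(baseline: float, workload: str) -> tuple[float, float]:
    """Return (pessimistic, optimistic) monthly spend after caching.

    The pessimistic figure uses the LOWER bound of the reduction range,
    which is the number an honest budget should quote.
    """
    low, high = REDUCTION[workload]
    return baseline * (1 - low), baseline * (1 - high)


pessimistic, optimistic = budget_range(28_200, "Agentic workflow")
print(f"budget ${pessimistic:,.0f}/month; best case ${optimistic:,.0f}/month")
```

Quote the first number in the plan and treat the gap to the second as upside.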

Discussion · 10 minutes

Which cache for your project?

Which caching tier would give your project the largest savings — and how would you measure?

In pairs (4 min), categorise your project's workload. Estimate hit rate. Compute monthly savings if implemented.

What fraction of your requests share a long input prefix? → prompt cache opportunity.
Could two requests safely share the same response? → semantic cache opportunity.
What experiment would prove the hit-rate assumption before committing engineering effort?
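
One cheap experiment for the last question is to replay a sample of logged prompts and measure how much of each one repeats a prefix already seen. The sketch below is only a rough proxy: it counts whitespace-separated words rather than real tokens, and it ignores provider-side constraints such as minimum cacheable prefix length and cache expiry. The function names and sample prompts are hypothetical.

```python
def common_prefix_len(a: list[str], b: list[str]) -> int:
    """Number of leading items that two token lists share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_hit_rate(prompts: list[str]) -> float:
    """Fraction of all prompt words covered by a previously seen prefix.

    Prompts are replayed in order; each is compared with every earlier
    prompt (quadratic, which is fine for a sample of a few thousand).
    """
    seen: list[list[str]] = []
    cached = total = 0
    for prompt in prompts:
        tokens = prompt.split()
        total += len(tokens)
        cached += max((common_prefix_len(tokens, s) for s in seen), default=0)
        seen.append(tokens)
    return cached / total if total else 0.0


sample_log = [
    "SYSTEM rules apply. Q: reset password",
    "SYSTEM rules apply. Q: refund status",
    "Unrelated question entirely",
]
rate = prefix_hit_rate(sample_log)
print(f"estimated prompt-cache hit rate: {rate:.0%}")
```

Run this on a week of real traffic before building anything. If the estimate falls below the lower bound you budgeted, the caching plan needs rethinking.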

Part — Five

Payback periods —
industry data,
2026.

§ V · Industry benchmarks

Where AI pays back fastest

Median payback period by use case.

Use case	Median payback	Why fast / slow
Customer support	4.1 mo	High labor-cost displacement, narrow domain.
Marketing operations	6.7 mo	Volume work, low criticality.
Sales enablement	7.5 mo	Conversion lift offsets cost.
Engineering productivity	9.3 mo	Senior-engineer verification overhead.
Compliance / legal	14 mo	High accuracy bar; many false positives to triage.

Source: 2026 cross-industry surveys (VentureBeat 1,100-engineer-and-CTO study; Digital Applied 100+ ROI data points).

For your group project: an AI feature with payback > 18 months is rarely funded. If yours lands there, find ways to halve the cost or double the value.

§ V · Mid-market case

A concrete payback computation

A 2.5-month payback example.

Hypothetical SaaS company, 8,000 monthly support tickets, $18.40 average resolution cost. Deploy an AI support agent.

Item	Monthly
Tickets deflected (34%)	2,720
Resolution cost saved (2,720 × $12.20)	+$33,184
AI infrastructure (tokens + observability)	−$3,800
Monthly net	+$29,384
One-time build cost	$72,000
Simple payback	2.5 months
Discounted payback @ 12%	2.6 months

When AI pays back fast, it's almost always because it displaces high-marginal-cost labour at high volume. When it pays back slowly, it's usually because a human still has to verify each output.
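
The payback arithmetic in the table can be checked with a short script. The inputs are the table's figures. The function names are illustrative, and the discounting convention (nominal 12% per year compounded monthly, with linear interpolation inside the payback month) is an assumption, so the discounted result may differ slightly from the slide's rounding.

```python
from typing import Optional


def simple_payback(build_cost: float, monthly_net: float) -> float:
    """Months needed for undiscounted net savings to cover the build cost."""
    return build_cost / monthly_net


def discounted_payback(build_cost: float, monthly_net: float,
                       annual_rate: float, max_months: int = 600) -> Optional[float]:
    """Months until the present value of net savings covers the build cost.

    Assumes a monthly rate of annual_rate / 12 and interpolates linearly
    within the month in which payback occurs. Returns None if the cost is
    not recovered within max_months.
    """
    r = annual_rate / 12
    cumulative = 0.0
    for month in range(1, max_months + 1):
        pv = monthly_net / (1 + r) ** month
        if cumulative + pv >= build_cost:
            return (month - 1) + (build_cost - cumulative) / pv
        cumulative += pv
    return None


deflected = 8_000 * 0.34          # tickets deflected per month
saved = deflected * 12.20         # resolution cost saved
monthly_net = saved - 3_800       # minus AI infrastructure
simple = simple_payback(72_000, monthly_net)
discounted = discounted_payback(72_000, monthly_net, annual_rate=0.12)
print(f"net ${monthly_net:,.0f}/month, simple payback {simple:.2f} months")
print(f"discounted payback {discounted:.2f} months")
```

With a payback this short, discounting barely moves the answer. It matters far more for the 14-month compliance and legal cases in the benchmark table.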

Part — Six

Course wrap-up —
thirteen sessions
in five lines.

§ VI · Course recap

What you now know

The course, in five lines.

Every software decision is an economic decision. Boehm's seven steps apply at any scale.
Cost lives in a fixed/variable × direct/indirect × non-recurring/recurring cell, across five lifecycle phases. 60–80% lives after launch.
Time value of money is the calculus of comparison. NPV is the verdict; IRR, payback, PI are the footnotes.
Sensitivity tells you which assumption to research; risk analysis tells you the probability of being wrong.
The AI era changes the coefficients, not the equations. Same Boehm framework, recalibrated for tokens, verification, and a re-allocated productivity curve.

If you walk away with only those five lines, you can defend any software decision in any room you'll enter for the next decade.

§ VI · Final exam preview

Lecture 16, Thursday 18 June

What's on the exam.

Section	Points	Material
Multiple choice / short answer	20	Definitions, frameworks, intuitions from all 13 lectures.
Computation	50	PV/FV, NPV/IRR, equivalence, sensitivity, FP, COCOMO II.
Integrative case	30	A realistic project — full economic analysis, including AI cost.

Closed book; one A4 cheat sheet (single-sided) permitted; non-programmable calculator.

Homework · Final preparation

Homework 13 — final group-project polish

Get your slides ready.

Finalise group-project presentation. Required slides: scope, FP/COCOMO estimate, cash flow + 2 alternatives, NPV/IRR/payback/PI, sensitivity, AI-cost dimension (bonus), recommendation.
Rehearse — 12 min talk + 5 min Q&A + 3 min peer evaluation.
Submit final project report PDF to /submissions/PROJECT/<team-name>/ before tomorrow's class.

Recap · What to remember

What today bought you.

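Token cost = input + output (3–5× the input price) + cached input (at a discount). Output is the priciest per token, but in agent workloads input volume can dominate the bill (82% in the worked example).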
Agents fan out 3–10×. Per-call cost ≠ per-task cost.
Caching is the highest-ROI optimisation. 40–90% in production is standard, not exceptional.
Payback depends on how much human labour you displace, and how cheaply.

End · Lecture Thirteen · End of Content

Questions & conversation.

Dr. Zhijiang Chen
Software Engineering Economics · Summer 2026
frostburg-state-university.github.io/bju
