When the cost of running software is dominated by the cost of thinking. Tokens, caching, agents, and the new unit economics.
| § | Topic | Minutes |
|---|---|---|
| I. | The token cost stack — input / output / cached | 15 |
| II. | The 2026 LLM pricing landscape | 10 |
| III. | Agent cost multipliers (3–10× a chatbot) | 15 |
| IV. | Caching strategies (40–90% savings) | 20 |
| — | Discussion: which caching tier wins for your project? | 10 |
| V. | Payback periods by industry (4.1mo / 6.7mo / 9.3mo) | 15 |
| VI. | Course wrap-up & final-exam preview | 20 |
|  | HW13, questions | 5 |
| Type | What it is | Relative cost (2026) |
|---|---|---|
| Input tokens | Everything you send to the model: system prompt, user message, context, tool definitions. | 1× base |
| Output tokens | Everything the model generates: response, tool calls, reasoning traces. | 3–5× input |
| Cached input tokens | Input tokens already seen, served from the model's cache. | 0.1× input |
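Output tokens cost 3–5× as much as input tokens because generation is more compute-intensive than reading: each output token needs its own sequential decoding step, while input tokens are processed in parallel. Cached input can be up to ~90% cheaper than fresh input, though the discount varies by provider.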
The cost structure rewards short outputs, structured outputs, and repeated prompts. Your model's pricing is the boundary of your product economics.
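As a minimal sketch of the three-part cost stack, the function below prices a single call from its fresh input, cached input, and output tokens. The `Price` class, the `call_cost` helper, and the 10,000/500/8,000 token counts are illustrative assumptions; the relative prices (1×, 5×, 0.1×) come from the table above, taking the upper end of the 3–5× output range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Prices per million tokens (MTok); units can be dollars or relative."""
    input: float
    output: float
    cached_input: float


def call_cost(price: Price, input_tokens: int, output_tokens: int,
              cached_tokens: int = 0) -> float:
    """Cost of one call; cached_tokens is the part of the input served from cache."""
    if not 0 <= cached_tokens <= input_tokens:
        raise ValueError("cached_tokens must be between 0 and input_tokens")
    fresh_tokens = input_tokens - cached_tokens
    return (
        fresh_tokens * price.input
        + cached_tokens * price.cached_input
        + output_tokens * price.output
    ) / 1_000_000


# Relative prices from the table: input 1x, output 5x, cached input 0.1x.
relative = Price(input=1.0, output=5.0, cached_input=0.1)

# Hypothetical call: 10,000 input tokens and 500 output tokens.
uncached = call_cost(relative, input_tokens=10_000, output_tokens=500)
cached = call_cost(relative, input_tokens=10_000, output_tokens=500,
                   cached_tokens=8_000)
print(f"uncached: {uncached:.4f}  with 80% of input cached: {cached:.4f}")
```

In this toy call the 500 output tokens cost as much as 2,500 fresh input tokens, which is why short, structured outputs and repeated prompts are what the pricing rewards.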
| Model | Input $/MTok | Output $/MTok | Cached input |
|---|---|---|---|
| Claude Opus 4.7 | $15 | $75 | $1.50 |
| Claude Sonnet 4.6 | $3 | $15 | $0.30 |
| GPT-4o | $2.50 | $10 | $1.25 |
| GPT-4o mini | $0.15 | $0.60 | $0.075 |
| Gemini 2.5 Pro | $3.50 | $10.50 | $0.50 |
The cheapest model is often only ~6% of the flagship's price (for example, GPT-4o mini vs. GPT-4o). Choose a model by task difficulty, not brand; many production agents mix tiers.
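The ~6% figure and the effect of mixing tiers can be checked with a short sketch. The prices are copied from the table above; the workload size (1M input, 200K output tokens) and the 70/30 routing split are hypothetical values chosen only for illustration.

```python
# (input, output) prices in $/MTok, copied from the pricing table above.
PRICES = {
    "Claude Opus 4.7": (15.00, 75.00),
    "Claude Sonnet 4.6": (3.00, 15.00),
    "GPT-4o": (2.50, 10.00),
    "GPT-4o mini": (0.15, 0.60),
    "Gemini 2.5 Pro": (3.50, 10.50),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a workload on one model, ignoring caching."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


# Hypothetical workload: 1M input tokens and 200K output tokens.
INPUT_TOKENS, OUTPUT_TOKENS = 1_000_000, 200_000

small = workload_cost("GPT-4o mini", INPUT_TOKENS, OUTPUT_TOKENS)
large = workload_cost("GPT-4o", INPUT_TOKENS, OUTPUT_TOKENS)
ratio = small / large

# Hypothetical routing split: 70% of traffic to the small model, 30% to the large one.
SMALL_SHARE = 0.70
blended = SMALL_SHARE * small + (1 - SMALL_SHARE) * large

print(f"small ${small:.2f}, large ${large:.2f}, ratio {ratio:.0%}")
print(f"blended cost at a {SMALL_SHARE:.0%} small-model share: ${blended:.3f}")
```

Under these assumptions the routed mix costs roughly a third of sending everything to the larger model, provided the easy 70% of tasks really can be handled by the smaller one.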
| Task phase | Typical calls | Why |
|---|---|---|
| Planning | 1–2 | Decompose user request into sub-steps. |
| Tool selection / arg-building | 1–3 | Choose APIs, generate parameters. |
| Execution & iteration | 2–5 | Run tool, inspect result, decide next step. |
| Verification | 1–2 | Re-check the answer; self-correct. |
| Response synthesis | 1 | Generate user-facing answer. |
| Typical total | 6–13 | Per user task. |
An unconstrained agent task can cost $5–8. Budgets that quote per-token cost without accounting for fan-out are wrong by an order of magnitude.
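The 6–13 total in the table is just the sum of the per-phase ranges, and the same sum shows how a single-call estimate understates the cost of a task. In this sketch the phase ranges come from the table; the `PER_CALL_COST` of $0.03 is a hypothetical figure, not a measured one.

```python
# (min, max) model calls per phase, copied from the table above.
PHASES = {
    "Planning": (1, 2),
    "Tool selection / arg-building": (1, 3),
    "Execution & iteration": (2, 5),
    "Verification": (1, 2),
    "Response synthesis": (1, 1),
}

min_calls = sum(low for low, _ in PHASES.values())
max_calls = sum(high for _, high in PHASES.values())

# Hypothetical average cost of one model call, in dollars.
PER_CALL_COST = 0.03

print(f"calls per task: {min_calls}-{max_calls}")
print(f"cost per task: ${PER_CALL_COST * min_calls:.2f}-${PER_CALL_COST * max_calls:.2f} "
      f"(a single-call estimate would say ${PER_CALL_COST:.2f})")
```

A budget built on the single-call figure is low by the fan-out factor before any growth in context size per call is considered.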
| Call | Input tokens | Output tokens | Cost |
|---|---|---|---|
| Plan | 5,000 | 500 | $0.0225 |
| Tool #1 (read docs) | 12,000 | 200 | $0.0390 |
| Tool #2 (search) | 8,000 | 400 | $0.0300 |
| Iterate (3 calls) | 30,000 | 1,200 | $0.1080 |
| Verify | 10,000 | 300 | $0.0345 |
| Synthesize | 12,000 | 800 | $0.0480 |
| Total per task | 77,000 | 3,400 | $0.282 |
At 100,000 tasks/month that is $28,200, or about $338K/year on inference alone. (The per-call costs above correspond to $3 input and $15 output per MTok.) Caching can cut this by roughly 70%.
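The per-task total and the monthly and annual figures can be reproduced from the token counts in the table. The rates of $3 input and $15 output per MTok are an inference: they are the rates that make every row of the table add up, and they match the Claude Sonnet 4.6 row of the pricing table, but the slide does not name the model.

```python
# Assumed rates in $/MTok; these reproduce every row of the table above.
INPUT_PER_MTOK = 3.00
OUTPUT_PER_MTOK = 15.00

# (call, input tokens, output tokens), copied from the table above.
CALLS = [
    ("Plan", 5_000, 500),
    ("Tool #1 (read docs)", 12_000, 200),
    ("Tool #2 (search)", 8_000, 400),
    ("Iterate (3 calls)", 30_000, 1_200),
    ("Verify", 10_000, 300),
    ("Synthesize", 12_000, 800),
]


def cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one call at the assumed rates, with no caching."""
    return (input_tokens * INPUT_PER_MTOK + output_tokens * OUTPUT_PER_MTOK) / 1_000_000


for name, input_tokens, output_tokens in CALLS:
    print(f"{name:<22} ${cost(input_tokens, output_tokens):.4f}")

per_task = sum(cost(i, o) for _, i, o in CALLS)
monthly = per_task * 100_000   # 100,000 tasks per month
annual = monthly * 12

print(f"per task ${per_task:.3f}, monthly ${monthly:,.0f}, annual ${annual:,.0f}")
```

Input tokens account for most of the bill here (77,000 input vs. 3,400 output tokens per task), which is why caching repeated input is the first lever to pull.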
Prompt (prefix) caching: reuse identical input prefixes (system prompt + context). 80–90% cost cut on cached tokens. Provider-managed.
Semantic caching: reuse responses for similar requests, matched by embedding similarity. 40–70% savings on cacheable workloads.
KV caching: reuse the attention key/value state across decode steps. Provider-internal. 75% latency cut.
Stack them — they compose. Teams that implement all three see 70–90% production cost reduction relative to a naive implementation.
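To see how far prefix caching alone can take the 77,000-input / 3,400-output task from the earlier cost table, the sketch below sweeps the share of input tokens served from cache. It assumes $3 / $0.30 / $15 per MTok for fresh input, cached input, and output (the Claude Sonnet 4.6 row), and it ignores any cache-write fee a provider may charge. The hit rates are hypothetical.

```python
# Assumed rates in $/MTok (Claude Sonnet 4.6 row of the pricing table).
INPUT_PER_MTOK = 3.00
CACHED_PER_MTOK = 0.30
OUTPUT_PER_MTOK = 15.00

# Per-task token totals from the earlier cost table.
TASK_INPUT_TOKENS = 77_000
TASK_OUTPUT_TOKENS = 3_400


def task_cost(hit_rate: float) -> float:
    """Dollar cost of one task when hit_rate of its input tokens come from cache."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    cached_tokens = TASK_INPUT_TOKENS * hit_rate
    fresh_tokens = TASK_INPUT_TOKENS - cached_tokens
    return (
        fresh_tokens * INPUT_PER_MTOK
        + cached_tokens * CACHED_PER_MTOK
        + TASK_OUTPUT_TOKENS * OUTPUT_PER_MTOK
    ) / 1_000_000


baseline = task_cost(0.0)
for hit_rate in (0.0, 0.5, 0.8, 0.95):  # hypothetical hit rates
    cost = task_cost(hit_rate)
    print(f"hit rate {hit_rate:.0%}: ${cost:.4f} per task, "
          f"{1 - cost / baseline:.0%} cheaper than uncached")
```

Output tokens are never cached in this model, so even a very high hit rate leaves the output cost untouched. Reductions beyond that ceiling have to come from the other tiers, for example by skipping the call entirely with a response cache.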
| Workload | Typical cache hit % | Cost reduction |
|---|---|---|
| Customer support — repeated FAQs | 60–80% | 50–70% |
| Code assistant — repeated codebase context | 70–90% | 60–80% |
| Search / chat — unique queries | 15–30% | 10–25% |
| Document analysis — long shared prefix | 85–95% | 70–85% |
| Agentic workflow — repeated planning prompts | 60–80% | 50–70% |
An honest budget pessimistically assumes the lower bound. If your workload happens to be highly cacheable, you'll be pleasantly surprised; if it isn't, you won't have over-promised.
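Budgeting on the lower bound can be sketched as follows. The reduction ranges are copied from the table above; the workload keys, the `budget_range` helper, and the $28,200 baseline (the uncached monthly figure from the earlier worked example) are illustrative choices.

```python
# (low, high) fractional cost reduction by workload, copied from the table above.
COST_REDUCTION = {
    "customer_support": (0.50, 0.70),
    "code_assistant": (0.60, 0.80),
    "search_chat": (0.10, 0.25),
    "document_analysis": (0.70, 0.85),
    "agentic_workflow": (0.50, 0.70),
}


def budget_range(baseline_monthly: float, workload: str) -> tuple[float, float]:
    """Return (pessimistic, optimistic) monthly spend after caching."""
    low, high = COST_REDUCTION[workload]
    pessimistic = baseline_monthly * (1 - low)   # budget on this number
    optimistic = baseline_monthly * (1 - high)   # treat this as upside
    return pessimistic, optimistic


# Baseline: the uncached monthly spend from the earlier worked example.
pessimistic, optimistic = budget_range(28_200, "agentic_workflow")
print(f"budget ${pessimistic:,.0f}/month; best case ${optimistic:,.0f}/month")
```

Committing to the pessimistic figure means any extra cache hits show up as savings against budget rather than as a missed promise.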
In pairs (4 min), categorise your project's workload using the table above, estimate its cache hit rate, and compute the monthly savings if caching were implemented.
| Use case | Median payback | Why fast / slow |
|---|---|---|
| Customer support | 4.1 mo | High labor-cost displacement, narrow domain. |
| Marketing operations | 6.7 mo | Volume work, low criticality. |
| Sales enablement | 7.5 mo | Conversion lift offsets cost. |
| Engineering productivity | 9.3 mo | Senior-engineer verification overhead. |
| Compliance / legal | 14 mo | High accuracy bar; many false positives to triage. |
Source: 2026 cross-industry surveys (VentureBeat 1,100-engineer-and-CTO study; Digital Applied 100+ ROI data points).
For your group project: an AI feature with payback > 18 months is rarely funded. If yours lands there, find ways to halve the cost or double the value.
Hypothetical SaaS company, 8,000 monthly support tickets, $18.40 average resolution cost. Deploy an AI support agent.
| Item | Monthly |
|---|---|
| Tickets deflected (34%) | 2,720 |
| Resolution cost saved (2,720 × $12.20) | +$33,184 |
| AI infrastructure (tokens + observability) | −$3,800 |
| Monthly net | +$29,384 |
| One-time build cost | $72,000 |
| Simple payback | 2.5 months |
| Discounted payback @ 12% | 2.6 months |
When AI pays back fast, it's almost always because it displaces high-marginal-cost labour at high volume. When it pays back slowly, it's usually because a human still has to verify each output.
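The worked example above can be recomputed with a short sketch. The inputs come from the table. The discounting convention is an assumption (monthly rate of 12%/12, end-of-month cash flows, linear interpolation within the payback month), so the discounted result need not match the table's rounded figure exactly.

```python
def simple_payback(build_cost: float, monthly_net: float) -> float:
    """Months to recover build_cost, ignoring the time value of money."""
    return build_cost / monthly_net


def discounted_payback(build_cost: float, monthly_net: float,
                       annual_rate: float, max_months: int = 600):
    """Months until cumulative discounted cash flow covers build_cost.

    Assumes end-of-month cash flows, a monthly rate of annual_rate / 12,
    and linear interpolation inside the month in which payback occurs.
    Returns None if the project does not pay back within max_months.
    """
    monthly_rate = annual_rate / 12
    cumulative = 0.0
    for month in range(1, max_months + 1):
        discounted = monthly_net / (1 + monthly_rate) ** month
        if cumulative + discounted >= build_cost:
            return (month - 1) + (build_cost - cumulative) / discounted
        cumulative += discounted
    return None


# Inputs from the worked example above.
tickets_deflected = round(8_000 * 0.34)         # 2,720 tickets
monthly_saved = tickets_deflected * 12.20       # resolution cost saved
monthly_net = monthly_saved - 3_800             # minus AI infrastructure
build_cost = 72_000

simple = simple_payback(build_cost, monthly_net)
discounted = discounted_payback(build_cost, monthly_net, annual_rate=0.12)
print(f"monthly net ${monthly_net:,.0f}")
print(f"simple payback {simple:.2f} months, discounted {discounted:.2f} months")
```

With a payback this short, discounting barely moves the answer; it matters far more for the slower use cases in the payback table, where cash flows arrive over a year or more.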
If you walk away with only those five lines, you can defend any software decision in any room you'll enter for the next decade.
| Section | Points | Material |
|---|---|---|
| Multiple choice / short answer | 20 | Definitions, frameworks, intuitions from all 13 lectures. |
| Computation | 50 | PV/FV, NPV/IRR, equivalence, sensitivity, FP, COCOMO II. |
| Integrative case | 30 | A realistic project — full economic analysis, including AI cost. |
Closed book; one A4 cheat sheet (single-sided) permitted; non-programmable calculator.
/submissions/PROJECT/<team-name>/ before tomorrow's class.
Dr. Zhijiang Chen
Software Engineering Economics · Summer 2026
frostburg-state-university.github.io/bju