The new LLM pricing math: how to cut your API bill without changing models
Token prices dropped again, but the real savings are in routing, caching, and context discipline. A practical framework for lowering cost per task.
The fastest way to cut your LLM bill is not switching to a cheaper model. It is routing simple tasks to small models, caching repeated context, and trimming prompt bloat. Together these can cut cost per task by half or more, with little or no drop in output quality.
- Cost per task, not price per token, is the number that matters.
- Route easy requests to a small model and reserve the frontier model for hard ones.
- Prompt caching bills repeated context at a fraction of the normal input rate.
- Context discipline (sending only what the task needs) is the cheapest optimisation available.
Every few months a lab cuts token prices and the internet celebrates. But if your bill is climbing, the price per token is rarely the problem — how you spend those tokens is. The teams with the lowest cost per task are not always on the cheapest model; they are the ones being deliberate about routing, caching, and context.
Stop optimising the wrong number
Price per token is a vendor metric. The number that drives your invoice is cost per task: the tokens a completed unit of work consumes, retries included, multiplied by the price of those tokens. A model half the price that needs three retries and a bloated prompt can cost more than a pricier one that gets it right the first time.
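To make the retry arithmetic concrete, here is a minimal sketch of a cost-per-task calculation. The prices, token counts, and retry rates are hypothetical figures chosen for illustration, and `ModelProfile` and `cost_per_task` are names invented for this example rather than part of any vendor SDK.

```python
from dataclasses import dataclass


@dataclass
class ModelProfile:
    # All figures are hypothetical and only illustrate the arithmetic.
    input_price_per_mtok: float   # USD per million input tokens
    output_price_per_mtok: float  # USD per million output tokens
    avg_attempts: float           # average calls needed to finish one task


def cost_per_task(model: ModelProfile, input_tokens: int, output_tokens: int) -> float:
    """Expected USD cost of one completed task, counting retries."""
    per_call = (
        input_tokens * model.input_price_per_mtok
        + output_tokens * model.output_price_per_mtok
    ) / 1_000_000
    return per_call * model.avg_attempts


# A model at half the price that needs three attempts and a bloated prompt...
cheap = ModelProfile(input_price_per_mtok=1.0, output_price_per_mtok=4.0, avg_attempts=3.0)
# ...versus a pricier model that succeeds first time with a tighter prompt.
pricey = ModelProfile(input_price_per_mtok=2.0, output_price_per_mtok=8.0, avg_attempts=1.0)

cheap_cost = cost_per_task(cheap, input_tokens=6000, output_tokens=500)
pricey_cost = cost_per_task(pricey, input_tokens=2000, output_tokens=500)

print(f"cheap model:  ${cheap_cost:.4f} per task")
print(f"pricey model: ${pricey_cost:.4f} per task")
```

Under these assumed numbers the half-price model costs three times as much per completed task. In practice you would replace the hypothetical figures with token counts and retry rates measured from your own logs.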
Route, don’t default
Most applications send everything to their best model out of habit. A large share of real requests (classification, extraction, short rewrites) are handled well by a small, cheap model. Routing those away from the frontier tier is usually the single biggest saving available, and users rarely notice the difference.
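One way to picture routing is a small rule-based function that runs before each call. The sketch below is an assumption-laden illustration: the model names, the task labels, and the 400-word threshold are all hypothetical, and a production router might use a classifier or confidence signal instead of fixed rules.

```python
SMALL_MODEL = "small-model"        # hypothetical model name
FRONTIER_MODEL = "frontier-model"  # hypothetical model name

# Task types the article names as a good fit for a small model.
SIMPLE_TASKS = {"classification", "extraction", "short_rewrite"}


def choose_model(task_type: str, prompt: str, max_simple_words: int = 400) -> str:
    """Pick a model tier from the task type and a rough prompt-length check."""
    word_count = len(prompt.split())
    if task_type in SIMPLE_TASKS and word_count <= max_simple_words:
        return SMALL_MODEL
    # Unknown task types and long prompts fall back to the stronger model.
    return FRONTIER_MODEL


print(choose_model("classification", "Is this email spam?"))
print(choose_model("multi_step_reasoning", "Plan a database migration."))
```

Defaulting to the frontier model for anything unrecognised keeps the failure mode conservative: a routing mistake costs money rather than quality.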
The cheapest token is the one you never send.
Cache what repeats
If your prompts share a long fixed preamble (a system prompt, a document, a set of examples), prompt caching lets you pay full price for it once and a fraction of that on later calls that reuse the same prefix. The size of the discount and how long a cached prefix stays valid vary by provider. For chat apps and document workflows with heavy repeated context, this alone can reshape a bill.
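The effect on input cost can be estimated with simple arithmetic. The sketch below assumes a hypothetical provider that bills cached prefix tokens at 10% of the normal input rate and charges no premium for writing the cache. Both multipliers are assumptions to replace with your provider's published rates, and the model also assumes the cache never expires between calls.

```python
def input_cost_without_cache(
    prefix_tokens: int, suffix_tokens: int, calls: int, price_per_mtok: float
) -> float:
    """USD input cost when the full prompt is billed at the normal rate every call."""
    return (prefix_tokens + suffix_tokens) * calls * price_per_mtok / 1_000_000


def input_cost_with_cache(
    prefix_tokens: int,
    suffix_tokens: int,
    calls: int,
    price_per_mtok: float,
    cached_read_multiplier: float = 0.1,  # assumption: cached tokens cost 10% of normal
    cache_write_multiplier: float = 1.0,  # assumption: no premium on the first write
) -> float:
    """USD input cost when the fixed prefix is cached after the first call."""
    if calls <= 0:
        return 0.0
    first_call = prefix_tokens * cache_write_multiplier + suffix_tokens
    later_calls = (calls - 1) * (prefix_tokens * cached_read_multiplier + suffix_tokens)
    return (first_call + later_calls) * price_per_mtok / 1_000_000


# Hypothetical workload: 10,000-token fixed preamble, 200 new tokens per call, 100 calls.
uncached = input_cost_without_cache(10_000, 200, calls=100, price_per_mtok=2.0)
cached = input_cost_with_cache(10_000, 200, calls=100, price_per_mtok=2.0)

print(f"without caching: ${uncached:.3f}")
print(f"with caching:    ${cached:.3f}")
```

The longer the fixed preamble is relative to the part that changes, the larger the saving, which is why document workflows tend to benefit most.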
Trim the context
Sending an entire document when a model needs two paragraphs is the most common form of waste. Retrieve and pass only what the task requires. Context discipline needs no new model or vendor, and the saving repeats on every single call.
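As a rough illustration of passing only what the task requires, the sketch below keeps the paragraphs that share the most words with the question. Keyword overlap is a deliberately crude stand-in chosen for this example. Real systems often use embedding-based retrieval, and the function name and sample text are hypothetical.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_context(document: str, question: str, max_paragraphs: int = 2) -> str:
    """Keep the paragraphs that share the most words with the question."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    query_words = _words(question)
    scored = [(len(query_words & _words(p)), i, p) for i, p in enumerate(paragraphs)]
    # Drop paragraphs with no overlap, then keep the highest-scoring ones.
    relevant = [s for s in scored if s[0] > 0]
    top = sorted(relevant, key=lambda s: -s[0])[:max_paragraphs]
    # Restore original document order so the excerpt reads naturally.
    top.sort(key=lambda s: s[1])
    return "\n\n".join(p for _, _, p in top)


document = (
    "A refund is issued within 14 days of purchase.\n\n"
    "Our office is closed on public holidays.\n\n"
    "To request a refund, email support with your order number.\n\n"
    "Shipping takes three to five business days."
)
trimmed = select_context(document, "How do I request a refund for my order?")
print(trimmed)
```

Here only the two refund paragraphs are sent to the model, and the unrelated office-hours and shipping paragraphs never become billable tokens.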
We track every price change and tier shift as it happens in LLM Launches & Updates.
Frequently asked questions
Should I always use the cheapest model?
No. Match the model to the task: small models for simple work, frontier models for genuinely hard reasoning. Optimise cost per completed task, not price per token.
What is prompt caching?
A feature that lets you reuse a fixed portion of a prompt across calls at a reduced rate, cutting the cost of repeated context like system prompts or documents.
How much can these techniques save?
It varies with your workload, but routing and caching together commonly cut cost per task by half or more without reducing output quality on the tasks they are applied to.