
The new LLM pricing math: how to cut your API bill without changing models

Token prices dropped again, but the real savings are in routing, caching, and context discipline. A practical framework for lowering cost per task.

Illustration: Comparison chart of LLM API pricing tiers and cost-per-task
Quick Answer

The fastest way to cut your LLM bill is not switching to a cheaper model; it’s routing simple tasks to small models, caching repeated context, and trimming prompt bloat. Together these commonly cut cost per task by half or more with no noticeable drop in output quality.

Key Takeaways
  • 01 Cost per task, not price per token, is the number that matters.
  • 02 Route easy requests to a small model and reserve the frontier model for hard ones.
  • 03 Prompt caching can slash repeated-context costs dramatically.
  • 04 Context discipline (sending only what’s needed) is the cheapest optimisation available.
Table of Contents
  1. Stop optimising the wrong number
  2. Route, don’t default
  3. Cache what repeats
  4. Trim the context

Every few months a lab cuts token prices and the internet celebrates. But if your bill is climbing, the price per token is rarely the problem — how you spend those tokens is. The teams with the lowest cost per task are not always on the cheapest model; they are the ones being deliberate about routing, caching, and context.

Stop optimising the wrong number

Price per token is a vendor metric. The number that drives your invoice is cost per task: the tokens a completed unit of work consumes, retries included, multiplied by what those tokens cost. A model half the price that needs three retries and a bloated prompt can cost more than a pricier one that gets it right the first time.
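To make the arithmetic concrete, here is a minimal sketch of a cost-per-task calculation. The prices, token counts, and retry rates are invented for illustration and are not real vendor rates; `ModelPricing` and `cost_per_task` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPricing:
    """Hypothetical prices in USD per million tokens (not real vendor rates)."""
    input_per_mtok: float
    output_per_mtok: float


def cost_per_task(pricing: ModelPricing, input_tokens: int, output_tokens: int,
                  attempts: float = 1.0) -> float:
    """Expected USD cost of one completed task, counting every attempt."""
    per_attempt = (input_tokens * pricing.input_per_mtok
                   + output_tokens * pricing.output_per_mtok) / 1_000_000
    return per_attempt * attempts


# Illustrative numbers only: the "cheap" model is half the price per token.
cheap = ModelPricing(input_per_mtok=1.0, output_per_mtok=4.0)
pricey = ModelPricing(input_per_mtok=2.0, output_per_mtok=8.0)

# Cheap model: bloated 6,000-token prompt, three attempts on average.
cheap_cost = cost_per_task(cheap, input_tokens=6_000, output_tokens=500, attempts=3)
# Pricier model: lean 2,000-token prompt, right the first time.
pricey_cost = cost_per_task(pricey, input_tokens=2_000, output_tokens=500, attempts=1)

print(f"cheap: ${cheap_cost:.4f} per task, pricey: ${pricey_cost:.4f} per task")
# cheap: $0.0240 per task, pricey: $0.0080 per task
```

Under these assumed numbers the model with half the token price ends up three times as expensive per completed task, which is the effect the paragraph above describes.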

Route, don’t default

Most applications send everything to their best model out of habit. Yet a large share of real requests (classification, extraction, short rewrites) are handled just as well by a small, cheap model. Routing those away from the frontier tier is usually the single biggest saving available, and users rarely notice the difference.
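A routing rule can start out very simple. The sketch below assumes the application already labels each request with a task type; the model identifiers, the task labels, and the length threshold are all hypothetical placeholders, not real model names or recommended values.

```python
# Hypothetical model identifiers; substitute whatever your provider offers.
SMALL_MODEL = "small-model"
FRONTIER_MODEL = "frontier-model"

# Task types assumed to be safe for the small model (taken from the examples
# in the text: classification, extraction, short rewrites).
SIMPLE_TASKS = {"classification", "extraction", "short_rewrite"}


def choose_model(task_type: str, prompt: str, max_simple_chars: int = 4_000) -> str:
    """Send simple, short requests to the small model and the rest to the frontier model."""
    if task_type in SIMPLE_TASKS and len(prompt) <= max_simple_chars:
        return SMALL_MODEL
    return FRONTIER_MODEL


print(choose_model("classification", "Is this email spam? ..."))
print(choose_model("multi_step_reasoning", "Plan a database migration ..."))
```

In practice a router like this is worth checking against a sample of real traffic, so that any task type moved to the small tier is confirmed to hold up before the change ships.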

The cheapest token is the one you never send.

Cache what repeats

If your prompts share a long fixed preamble (a system prompt, a document, a set of examples), prompt caching lets you pay full price once and a fraction thereafter, for as long as the cache entry stays alive. Exact discounts and cache lifetimes vary by provider. For chat apps and document workflows with heavy repeated context, this alone can reshape a bill.
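The effect of caching on input cost can be estimated before touching any API. The sketch below is a back-of-the-envelope model, not any provider's billing formula: `cached_rate` (the fraction of full price paid for cached tokens) and `write_rate` (the multiplier on the first, cache-filling call) are assumed parameters, and the model assumes the cache never expires between calls.

```python
def repeated_context_cost(prefix_tokens: int, suffix_tokens: int, calls: int,
                          price_per_mtok: float, cached_rate: float = 0.1,
                          write_rate: float = 1.0) -> tuple[float, float]:
    """Return (uncached, cached) input cost in USD for `calls` requests
    that share the same fixed prefix and each add their own suffix."""
    if calls < 1:
        raise ValueError("calls must be at least 1")
    per_token = price_per_mtok / 1_000_000
    uncached = calls * (prefix_tokens + suffix_tokens) * per_token
    cached = (
        prefix_tokens * write_rate                     # first call fills the cache
        + (calls - 1) * prefix_tokens * cached_rate    # later calls read it cheaply
        + calls * suffix_tokens                        # the new part is always full price
    ) * per_token
    return uncached, cached


# Illustrative numbers: a 10,000-token preamble, 200 new tokens per call,
# 100 calls, at a hypothetical $2 per million input tokens.
uncached, cached = repeated_context_cost(10_000, 200, calls=100, price_per_mtok=2.0)
print(f"uncached: ${uncached:.3f}, cached: ${cached:.3f}")
# uncached: $2.040, cached: $0.258
```

With these assumed rates the repeated preamble dominates the uncached bill, so most of the saving comes from the prefix rather than from anything that changes per call.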

Trim the context

Sending an entire document when a model needs two paragraphs is the most common form of waste. Retrieve and pass only what the task requires. Context discipline costs nothing to adopt and compounds on every single call.
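As a rough illustration of passing only what the task requires, the sketch below keeps the paragraphs that share the most words with the question. Word overlap is a deliberately naive stand-in for a real retrieval step (embeddings, a search index), and `select_context` is a hypothetical helper, not a library function.

```python
import re


def _words(text: str) -> set[str]:
    """Lowercased alphanumeric words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def select_context(document: str, question: str, top_k: int = 2) -> str:
    """Keep the top_k paragraphs with the most word overlap with the question,
    returned in their original order."""
    paragraphs = [p.strip() for p in document.split("\n\n") if p.strip()]
    question_words = _words(question)
    ranked = sorted(
        enumerate(paragraphs),
        key=lambda item: len(question_words & _words(item[1])),
        reverse=True,
    )
    kept = sorted(ranked[:top_k])  # restore document order by paragraph index
    return "\n\n".join(paragraph for _, paragraph in kept)


document = (
    "Our company was founded in 2015 and has offices in three cities.\n\n"
    "Refund requests for annual plans must be filed within the refund window of 30 days.\n\n"
    "The support team answers email on weekdays.\n\n"
    "Annual plans renew automatically unless cancelled."
)
context = select_context(document, "What is the refund window for annual plans?")
print(context)
```

Here only the two paragraphs about annual plans reach the prompt; the company history and support hours are never sent, so they are never billed.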

We track every price change and tier shift as it happens in LLM Launches & Updates.

Frequently asked questions

Should I always use the cheapest model?

No. Match the model to the task: small models for simple work, frontier models for genuinely hard reasoning. Optimise cost per completed task, not price per token.

What is prompt caching?

A feature that lets you reuse a fixed portion of a prompt across calls at a reduced rate, cutting the cost of repeated context like system prompts or documents.

How much can these techniques save?

It varies, but routing and caching together commonly cut cost per task by half or more without reducing output quality.

Written by
Priya Nandakumar

Priya covers the frontier-model market and previously worked in ML infrastructure. She builds the cost models she writes about and checks vendor pricing pages weekly.
