Cost, Latency, and Production Reliability

Intermediate 🕐 24 min Lesson 13 of 14
What you'll learn
  • Apply output length constraints and max_tokens to control API costs
  • Implement prompt caching correctly to reduce costs on stable system prompts
  • Apply model routing to match model capability to task complexity
  • Build retry logic with exponential backoff and jitter for LLM API calls

The Production Reality Check

Development environments are forgiving. You make 20 test calls, they cost a few cents, they respond in a few seconds, and if one fails you just retry. Production is different. You make 50,000 calls a day, costs compound, latency affects user experience, and failures cascade if you have not built for them. Most of the surprises developers encounter when shipping AI features to production fall into four categories: cost, latency, rate limits, and error handling. This lesson covers each.

Cost: The Output Token Problem

The pricing model for AI APIs has an important asymmetry: output tokens cost significantly more than input tokens — typically 3 to 5 times more. When you write a prompt that asks the model to "explain in detail" or "provide a comprehensive response," you are paying that premium for every extra output token the model generates. When you do not set a ceiling on output length, you are effectively leaving an open tab.

Two habits prevent most unexpected cost issues:

  • Constrain output length in the system prompt: "Respond in under 200 words." "Return only the JSON object — no explanation or commentary." "Use bullet points, maximum 5 bullets." These instructions are respected reliably and dramatically reduce output token counts for features where verbose responses add no value.
  • Set max_tokens on every call: This is a hard ceiling that prevents runaway responses regardless of what the model decides to generate. It also protects against denial-of-wallet attacks, where an adversary crafts inputs designed to maximise your output token consumption. Both habits are combined in the sketch after this list.
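
A minimal sketch, assuming the Anthropic Python SDK and an illustrative model name: the soft limit lives in the system prompt, while max_tokens enforces the hard ceiling.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-3-5-haiku-latest",   # illustrative model name
    max_tokens=300,                    # hard ceiling: generation stops here no matter what
    system=(
        "You are a support assistant. Respond in under 200 words. "
        "Use bullet points, maximum 5 bullets."
    ),
    messages=[{"role": "user", "content": "Summarise my last invoice."}],
)
print(response.content[0].text)
```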

Prompt Caching: The Cheapest Optimisation You Are Probably Not Using

If your system prompt is 2,000 tokens and you make 10,000 API calls per day, you are spending 20 million tokens per day just on the system prompt. Anthropic's prompt caching reduces the cost of cached tokens by up to 90% and latency by up to 85% on cache hits. Gemini 2.5 introduced implicit caching. OpenAI offers caching on select models.

Prompt caching works on stable content at the beginning of the prompt. The implementation requirement is structural: the content you want cached — your system prompt, reference documents, few-shot examples — must come first in the prompt, before the variable user content. If you structure your prompts this way, caching happens automatically or with minimal configuration. For features with large, stable system prompts and high call volume, this is typically the single highest-ROI optimisation available.
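
A sketch of that structure, assuming the Anthropic Python SDK's cache_control marker (check the current prompt-caching documentation for exact field names and minimum cacheable prompt lengths): the stable block comes first and is marked for caching, and the variable user content follows.

```python
import anthropic

client = anthropic.Anthropic()

STABLE_SYSTEM_PROMPT = "..."  # your large, stable system prompt and reference documents

def answer(user_question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",   # illustrative model name
        max_tokens=500,
        system=[
            {
                "type": "text",
                "text": STABLE_SYSTEM_PROMPT,
                "cache_control": {"type": "ephemeral"},  # cache everything up to this block
            }
        ],
        messages=[{"role": "user", "content": user_question}],  # variable content comes last
    )
    # On cache hits, response.usage should report cache_read_input_tokens > 0.
    return response.content[0].text
```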

Model Routing: Right-Sizing for the Task

Frontier models — Claude Opus, GPT-4o, Gemini Pro — are the most capable but also the most expensive, often 20–30 times more expensive per token than smaller alternatives. For tasks that do not require frontier capability, using a smaller model is not a trade-off — it is the right choice. Classification, entity extraction, and simple summarisation run well on smaller, faster, cheaper models. Complex reasoning, code generation, and nuanced judgment benefit from larger models.

A practical routing strategy: define task tiers. Tier 1 (classification, extraction, simple Q&A) routes to Claude Haiku, GPT-4o mini, or Gemini Flash Lite. Tier 2 (summarisation, drafting, structured analysis) routes to a mid-tier model. Tier 3 (complex reasoning, code review, multi-step analysis) routes to the frontier model. Teams that implement this explicitly report 50–70% cost reductions at the same output quality.
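
One way to express those tiers in code. This is a hypothetical routing table: the task names are placeholders and the model identifiers are illustrative, not a recommendation of specific versions.

```python
# Hypothetical tier definitions; swap in whichever providers and models you actually use.
MODEL_TIERS = {
    "tier_1": "claude-3-5-haiku-latest",    # classification, extraction, simple Q&A
    "tier_2": "claude-3-5-sonnet-latest",   # summarisation, drafting, structured analysis
    "tier_3": "claude-3-opus-latest",       # complex reasoning, code review, multi-step analysis
}

TASK_TIER = {
    "classify_ticket": "tier_1",
    "extract_entities": "tier_1",
    "summarise_thread": "tier_2",
    "review_pull_request": "tier_3",
}

def model_for(task: str) -> str:
    """Route each task to the cheapest tier that handles it well."""
    return MODEL_TIERS[TASK_TIER.get(task, "tier_3")]  # default to the frontier tier when unsure
```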

Rate Limits Are Token-Based, Not Just Request-Based

Every AI API provider enforces two types of rate limits simultaneously: requests per minute (RPM) and tokens per minute (TPM). A single large-context request can exhaust your TPM budget while barely moving your RPM counter. If you are building a feature that processes large documents and hitting rate limit errors despite not hitting the RPM ceiling, TPM is the constraint to investigate.

The correct retry pattern is exponential backoff with jitter. On a rate limit error (HTTP 429), wait for an increasing delay before retrying: first retry after 1 second, then 2, then 4, then 8. The jitter (a small random offset) is critical — without it, all clients that hit the rate limit simultaneously will all retry at the same time, creating a thundering herd that hits the rate limit again immediately.
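
A sketch of the pattern in Python, assuming the Anthropic SDK's exception classes (any provider's equivalents slot in the same way):

```python
import random
import time

import anthropic

# Errors worth retrying: rate limits (429) and transient server errors (5xx).
RETRYABLE = (anthropic.RateLimitError, anthropic.InternalServerError)

def call_with_backoff(make_request, max_retries=5, base_delay=1.0):
    """Run `make_request` (a zero-argument callable wrapping the API call),
    retrying retryable errors with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return make_request()
        except RETRYABLE:
            if attempt == max_retries - 1:
                raise  # out of retries: surface the error to the caller
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)  # 1s, 2s, 4s, 8s... plus jitter
            time.sleep(delay)
```

Calling it looks like `call_with_backoff(lambda: client.messages.create(...))`, which keeps the retry logic in one place instead of scattered across call sites.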

Error Handling: Treat It Like Any External Dependency

LLM APIs are external network dependencies. They have planned and unplanned maintenance, transient errors, version changes, and rate limits. The same patterns you apply to any external service dependency apply here:

  • Wrap in try/catch with specific error handling: 429 (rate limit) should retry with backoff. 500/503 (server error) should retry with backoff. 400 (bad request) should not retry — fix the prompt or input. A sketch of this classification follows the list.
  • Set request timeouts: LLM calls can take 10–30 seconds for long outputs. Without a timeout, a slow response can block your request handler indefinitely.
  • Design a fallback: What does your feature do if the API is unavailable? For non-critical features, returning a graceful degradation message is correct. For critical features, having a fallback provider (if Claude is down, route to OpenAI) is worth the complexity.
  • Circuit-break after repeated failures: If the API has been failing consistently, stop hammering it and return the fallback immediately rather than queuing up a backlog of requests to retry all at once when it recovers.
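
A sketch covering the first three bullets, assuming the Anthropic Python SDK's exception classes, an illustrative model name, and a per-request timeout parameter; the circuit breaker and the retry loop shown earlier are left out to keep it short.

```python
import anthropic

client = anthropic.Anthropic()
FALLBACK_MESSAGE = "This feature is temporarily unavailable. Please try again later."

def answer(user_input: str) -> str:
    try:
        response = client.messages.create(
            model="claude-3-5-haiku-latest",   # illustrative model name
            max_tokens=300,
            messages=[{"role": "user", "content": user_input}],
            timeout=30.0,                      # seconds; long outputs can take 10-30s
        )
        return response.content[0].text
    except anthropic.BadRequestError:
        raise  # 400: retrying will not help; fix the prompt or input
    except (anthropic.RateLimitError, anthropic.InternalServerError):
        return FALLBACK_MESSAGE  # 429 / 5xx: retry with backoff first, then degrade gracefully
    except anthropic.APIConnectionError:
        return FALLBACK_MESSAGE  # network failure or timeout: degrade gracefully
```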

Logging every API call with the prompt hash, model version, input token count, output token count, latency, and error code gives you the observability to diagnose patterns — cost spikes, latency regressions, specific input types that fail — before they become user-facing issues.
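
One possible shape for that log record, with hypothetical field names and a truncated SHA-256 of the prompt standing in for the prompt hash:

```python
import hashlib
import json
import logging

logger = logging.getLogger("llm_calls")

def log_call(prompt: str, model: str, usage, latency_ms: int, error_code: str | None = None) -> None:
    """Emit one structured record per API call; the field names here are illustrative."""
    logger.info(json.dumps({
        "prompt_hash": hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:16],
        "model": model,
        "input_tokens": getattr(usage, "input_tokens", None),
        "output_tokens": getattr(usage, "output_tokens", None),
        "latency_ms": latency_ms,
        "error_code": error_code,
    }))
```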

Key takeaways
  • Output tokens cost 3–5x more than input tokens — constrain output length in the system prompt and set max_tokens on every call
  • Prompt caching can reduce costs by up to 90% for stable system prompts — structure your prompts so stable content comes first
  • Model routing (small fast models for classification, larger for reasoning) typically achieves 50–70% cost reduction at the same quality
  • Rate limits are token-based, not just request-based — TPM exhaustion can hit even when RPM headroom remains
  • Exponential backoff with jitter is the correct retry pattern — jitter prevents the thundering herd that re-triggers the rate limit