The Rate Limit Reality
Most LLM APIs enforce rate limits on two axes: requests per minute (RPM) and tokens per minute (TPM). At scale, you will hit both. The naive approach, retrying immediately on every 429, causes a thundering herd: all failed requests retry at the same moment and hit the limit again.
```python
import random
import time

import httpx


def call_with_backoff(fn, max_retries: int = 5):
    """Call fn(), retrying on 429 and transient 5xx errors.

    fn must raise httpx.HTTPStatusError on error responses,
    for example by calling response.raise_for_status().
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except httpx.HTTPStatusError as e:
            last_attempt = attempt == max_retries - 1
            if e.response.status_code == 429:
                if last_attempt:
                    raise
                # Respect the Retry-After header if present. It can also be an
                # HTTP date, which this sketch treats as "no hint".
                try:
                    retry_after = float(e.response.headers.get("retry-after", 0))
                except ValueError:
                    retry_after = 0.0
                # Exponential backoff + jitter to prevent thundering herd
                backoff = max(retry_after, 2 ** attempt + random.uniform(0, 1))
                time.sleep(min(backoff, 60))  # cap at 60s
            elif e.response.status_code in (500, 502, 503):
                if last_attempt:
                    raise
                # Jitter here too, so server-error retries don't synchronize
                time.sleep(2 ** attempt + random.uniform(0, 1))
            else:
                raise  # 400, 401, 404: don't retry
```

Important: 400 errors (bad request) and 401 errors (invalid API key) should never be retried, because they will never succeed regardless of waiting. Only 429 (rate limit) and 5xx (server errors) are retryable.
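As a rough illustration of the retry rules above, the delay calculation can be pulled into a pure function so the schedule is easy to inspect and unit-test. `is_retryable` and `backoff_delay` are hypothetical helper names, not part of any library. The sketch assumes the same 60-second cap and 0 to 1 second jitter used in the retry loop.

```python
import random

# Same statuses the retry loop handles; everything else fails fast.
RETRYABLE_STATUSES = {429, 500, 502, 503}


def is_retryable(status_code: int) -> bool:
    """Return True only for statuses where waiting can help."""
    return status_code in RETRYABLE_STATUSES


def backoff_delay(attempt: int, retry_after: float = 0.0, cap: float = 60.0) -> float:
    """Seconds to sleep before retrying after failed attempt `attempt` (0-based).

    Takes the larger of the server's Retry-After hint and an exponential
    delay with up to 1s of random jitter, then caps the result.
    """
    exponential = 2 ** attempt + random.uniform(0, 1)
    return min(max(retry_after, exponential), cap)
```

Because the jitter is random, two clients that fail at the same instant will usually pick different delays, which is what spreads the retries out.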
Prompt Caching: The Cost Reduction Nobody Talks About
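Both OpenAI and Anthropic offer prompt caching: if the same prompt prefix is sent again within a short window (on the order of 5 minutes), the cached portion is charged at a reduced rate, as low as ~10% of the normal input token cost. The exact discount, cache lifetime, and minimum prompt length vary by provider and model, so check current pricing. For applications with a long, fixed system prompt (common in chatbots), this can mean 80–90% of input tokens in a conversation are served from cache after the first message.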
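To make the savings concrete, here is a back-of-the-envelope sketch using the ~10% cached rate quoted above. The function name, token counts, and price are illustrative assumptions. It ignores conversation history growing over time and any surcharge a provider applies when writing to the cache.

```python
def conversation_input_cost(
    system_tokens: int,
    turn_tokens: int,
    turns: int,
    price_per_1k: float,
    cached_rate: float = 1.0,
) -> float:
    """Input cost of a conversation whose system prompt is resent every turn.

    cached_rate is the fraction of the normal price charged for the system
    prompt on turns after the first (1.0 means no caching).
    """
    first_turn = (system_tokens + turn_tokens) / 1000 * price_per_1k
    later_turn = (system_tokens * cached_rate + turn_tokens) / 1000 * price_per_1k
    return first_turn + later_turn * (turns - 1)


# Hypothetical workload: 5,000-token system prompt, 200 tokens per turn, 10 turns.
no_cache = conversation_input_cost(5000, 200, 10, price_per_1k=0.003)
cached = conversation_input_cost(5000, 200, 10, price_per_1k=0.003, cached_rate=0.1)
```

Under these assumed numbers the input cost falls from about $0.156 to about $0.035, a reduction of roughly 78%. The longer the stable prefix is relative to the per-turn content, the closer the saving gets to the 90% ceiling.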
```python
from datetime import datetime

# question, long_stable_context, and dynamic_question are defined elsewhere.

# Maximize cache hits: long, stable content first; dynamic content last

# BAD: dynamic content in system prompt (busts cache every request)
system = f"You are a helpful assistant. Today's date is {datetime.now()}."

# GOOD: stable system prompt; dynamic context in user message
system = """You are a helpful assistant for Acme Corp. [500 lines of stable context...]"""
user = f"Today is {datetime.now().strftime('%Y-%m-%d')}. User question: {question}"

# Anthropic: cache_control marks what to cache
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": long_stable_context,
        "cache_control": {"type": "ephemeral"}  # cache this prefix
    }, {
        "type": "text",
        "text": dynamic_question  # this changes per request
    }]
}]
```

Cost Budgeting and Per-Request Guardrails
LLM costs scale with usage in ways that surprise teams not used to consumption-based billing. A bug that causes a loop sending large prompts can generate thousands of dollars of API calls in minutes.
```python
import tiktoken

MAX_INPUT_TOKENS = 4000        # hard limit per request
MAX_OUTPUT_TOKENS = 1000       # max completion length
COST_PER_1K_INPUT = 0.00015    # gpt-4o-mini input rate
COST_PER_1K_OUTPUT = 0.00060   # gpt-4o-mini output rate
USER_DAILY_BUDGET_USD = 1.00   # example value; tune per product

# get_user_spend_today, increment_user_spend, and call_openai are
# application helpers assumed to be defined elsewhere.


class BudgetExceededError(Exception):
    """Raised when a user has used up their daily LLM budget."""


def safe_llm_call(messages: list, user_id: str) -> str:
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
    # Approximate count: assumes string content, ignores per-message overhead
    input_tokens = sum(len(enc.encode(m["content"])) for m in messages)
    if input_tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"Prompt too long: {input_tokens} tokens (max {MAX_INPUT_TOKENS})")

    # Check per-user daily budget
    daily_spend = get_user_spend_today(user_id)  # from Redis/DB
    if daily_spend >= USER_DAILY_BUDGET_USD:
        raise BudgetExceededError(f"Daily LLM budget exceeded for user {user_id}")

    response = call_openai(messages, max_tokens=MAX_OUTPUT_TOKENS)
    usage = response["usage"]
    cost = (usage["prompt_tokens"] / 1000 * COST_PER_1K_INPUT +
            usage["completion_tokens"] / 1000 * COST_PER_1K_OUTPUT)
    increment_user_spend(user_id, cost)
    return response["choices"][0]["message"]["content"]
```

Fallback Model Strategy
Single-provider LLM applications have a fragile dependency. LLM APIs have experienced outages — OpenAI's November 2023 outage lasted hours and affected thousands of production applications. A fallback strategy:
- Primary: Best quality model (gpt-4o, claude-3-5-sonnet)
- Degraded: Cheaper/faster model on same provider (gpt-4o-mini) on 429/5xx
- Emergency: Alternative provider (if primary is down entirely) or cached responses
```python
# call_provider, log_fallback, and the exception classes are application
# code assumed to be defined elsewhere.
FALLBACK_CHAIN = [
    {"model": "gpt-4o", "provider": "openai"},
    {"model": "gpt-4o-mini", "provider": "openai"},
    {"model": "mistral-large", "provider": "mistral"},
]


def resilient_call(messages: list) -> str:
    for config in FALLBACK_CHAIN:
        try:
            return call_provider(config, messages)
        except (RateLimitError, ServiceUnavailableError):
            log_fallback(config["model"])
            continue  # try the next model in the chain
    raise AllProvidersFailedError()
```

Latency: Where the Time Goes
| Phase | Typical Duration | Optimization |
|---|---|---|
| Network round-trip (request) | 20–80ms | Use nearest API region; keep connections warm |
| Prefill (processing input tokens) | 50–500ms | Shorten prompts; use prompt caching |
| Time to first token (TTFT) | 200ms–2s | Stream responses; use faster models for first token |
| Decode (generating output) | 0.5–10s | Limit max_tokens; use smaller models; speculative decoding |
| Network round-trip (response) | 20–80ms | Stream instead of waiting for full response |
The single most impactful latency optimization for user-facing applications is streaming. Instead of waiting for the full response (3–10s) before showing anything, stream tokens as they are generated. The first visible output then appears at the time to first token, often 200–500ms on a fast model. Total latency is the same, but perceived latency is roughly 10× lower.
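The following sketch shows the shape of a streaming consumer that forwards tokens as they arrive and records both time to first token and total latency. `fake_token_stream` is a hypothetical stand-in for a provider's streaming response; in a real application the iterator would come from the provider SDK's streaming mode.

```python
import time
from typing import Callable, Iterable, Iterator


def fake_token_stream(tokens: Iterable[str], delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a provider's streaming response (hypothetical)."""
    for token in tokens:
        time.sleep(delay_s)  # simulate per-token decode time
        yield token


def consume_stream(stream: Iterator[str], on_token: Callable[[str], None]) -> dict:
    """Forward tokens as they arrive and record TTFT and total latency."""
    start = time.perf_counter()
    ttft_ms = None
    parts = []
    for token in stream:
        if ttft_ms is None:
            # First token seen: this is what the user perceives as "response started"
            ttft_ms = (time.perf_counter() - start) * 1000
        on_token(token)  # e.g. write to the HTTP response or UI
        parts.append(token)
    total_ms = (time.perf_counter() - start) * 1000
    return {"text": "".join(parts), "ttft_ms": ttft_ms, "latency_ms": total_ms}


seen = []
result = consume_stream(fake_token_stream(["Hel", "lo", "!"], delay_s=0.005), seen.append)
```

The returned `ttft_ms` and `latency_ms` values are the two numbers worth logging separately, since streaming improves the first without changing the second.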
Observability: What to Log
```python
from datetime import datetime, timezone
from uuid import uuid4

# user_id, model_used, usage, and the timing values come from the request handler.
trace = {
    "trace_id": str(uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),  # timezone-aware UTC
    "user_id": user_id,              # for per-user cost tracking
    "feature": "summarizer",         # which product feature
    "model": model_used,
    "prompt_version": "v2.3.1",
    "input_tokens": usage.prompt_tokens,
    "output_tokens": usage.completion_tokens,
    "cost_usd": calculated_cost,
    "latency_ms": elapsed,
    "ttft_ms": time_to_first_token,
    "finish_reason": finish_reason,  # "stop" | "length" | "tool_calls"
    "fallback_used": fallback_model or None,
    "error": error_type or None,
}
```

Ship this trace to your logging system (Datadog, Grafana, CloudWatch). Build dashboards on: total cost/hour by feature, p95 latency by model, error rate by error type, and the rate of finish_reason == "length" (which indicates max_tokens is truncating responses).
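In practice these aggregations are usually done by the logging backend's query language. As a minimal sketch of what the dashboards compute, the function below summarizes a list of trace dicts in plain Python. `p95` and `summarize` are hypothetical names, and the nearest-rank percentile is one of several common definitions.

```python
import math
from collections import defaultdict


def p95(values: list[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]


def summarize(traces: list[dict]) -> dict:
    """Aggregate cost by feature, p95 latency by model, and truncation rate."""
    cost_by_feature = defaultdict(float)
    latency_by_model = defaultdict(list)
    truncated = 0
    for t in traces:
        cost_by_feature[t["feature"]] += t["cost_usd"]
        latency_by_model[t["model"]].append(t["latency_ms"])
        if t["finish_reason"] == "length":
            truncated += 1  # response was cut off by max_tokens
    return {
        "cost_by_feature": dict(cost_by_feature),
        "p95_latency_by_model": {m: p95(v) for m, v in latency_by_model.items()},
        "length_rate": truncated / len(traces) if traces else 0.0,
    }
```

A rising `length_rate` for one feature is usually the first sign that its max_tokens setting is too low for the prompts it now receives.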
Before shipping an LLM feature, confirm that you have:
1. Retry with exponential backoff + jitter
2. Per-request token limits
3. Cost tracking per user/feature
4. Streaming enabled
5. At least one fallback model
6. Structured logging with trace IDs
7. Alerts on error rate > 1% and cost spike > 2× baseline
8. An eval suite in CI that runs on every prompt change
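The two alert thresholds in the checklist (error rate above 1%, cost above 2× baseline) can be expressed as a simple check. This is only a sketch: `check_alerts` is a hypothetical function, and in most setups the same conditions would be configured as monitors in the observability platform, not written as application code.

```python
def check_alerts(
    error_count: int,
    request_count: int,
    cost_last_hour: float,
    baseline_cost_per_hour: float,
    max_error_rate: float = 0.01,
    max_cost_multiple: float = 2.0,
) -> list[str]:
    """Return the names of the alert conditions that are currently firing."""
    alerts = []
    # Error rate strictly above the threshold (default 1%)
    if request_count > 0 and error_count / request_count > max_error_rate:
        alerts.append("error_rate")
    # Hourly cost strictly above a multiple of the baseline (default 2x)
    if baseline_cost_per_hour > 0 and cost_last_hour > max_cost_multiple * baseline_cost_per_hour:
        alerts.append("cost_spike")
    return alerts
```

The baseline is best taken from a trailing window (for example the same hour over the previous week) so that normal daily traffic patterns do not trigger the cost alert.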
The Context Window Is Not a Magic Solution
128k-token context windows tempt developers to stuff everything into the prompt. Resist this for latency and cost reasons, but also for quality: model performance degrades measurably on tasks requiring precise recall from very long contexts. A 128k-token context is not a database. The model's attention is spread thinly over a very long document, so recall of any single detail becomes less reliable. Use RAG to retrieve precisely what's needed rather than hoping the model will find it in a 100k-token haystack.
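To illustrate the retrieve-then-prompt idea, here is a deliberately simplified sketch that picks the most relevant chunks that fit a token budget. Everything in it is an assumption made for illustration: relevance is naive word overlap and token counts are approximated by whitespace-separated words. A real RAG system would use embedding search and the model's tokenizer.

```python
def select_context(question: str, chunks: list[str], max_tokens: int) -> list[str]:
    """Pick the most relevant chunks that fit within a token budget.

    Relevance is naive word overlap and token counts are approximated by
    word counts; both are placeholders for an embedding search and a
    real tokenizer.
    """
    query_words = set(question.lower().split())

    def score(chunk: str) -> int:
        return len(query_words & set(chunk.lower().split()))

    selected, used = [], 0
    # Highest-scoring chunks first; ties keep their original order
    for chunk in sorted(chunks, key=score, reverse=True):
        if score(chunk) == 0:
            break  # remaining chunks share no words with the question
        cost = len(chunk.split())
        if used + cost <= max_tokens:
            selected.append(chunk)
            used += cost
    return selected
```

The point of the budget is that the prompt size stays bounded no matter how large the document collection grows, which keeps both cost and latency predictable.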