The Rate Limit Reality

Every LLM API has rate limits on two axes: requests per minute (RPM) and tokens per minute (TPM). At scale, you will hit both. The naive approach, retrying immediately on every 429, causes a thundering herd: all throttled requests retry at the same moment and hit the limit again.

import random
import time

import httpx

RETRYABLE_5XX = {500, 502, 503, 504}

def call_with_backoff(fn, max_retries: int = 5):
    """Call fn(), retrying on 429 and transient 5xx responses.

    fn must raise httpx.HTTPStatusError on error responses,
    for example by calling response.raise_for_status().
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except httpx.HTTPStatusError as e:
            status = e.response.status_code
            if status != 429 and status not in RETRYABLE_5XX:
                raise  # 400, 401, 404: waiting will not help
            if attempt == max_retries - 1:
                raise  # out of retries
            # Respect Retry-After when it is given in seconds; ignore HTTP-date values
            try:
                retry_after = float(e.response.headers.get("retry-after", 0))
            except ValueError:
                retry_after = 0.0
            # Exponential backoff + jitter to prevent thundering herd
            backoff = max(retry_after, 2 ** attempt + random.uniform(0, 1))
            time.sleep(min(backoff, 60))  # cap at 60s

Important: 400 (bad request) and 401 (invalid API key) errors should never be retried, because the request itself is wrong and no amount of waiting will fix it. Only 429 (rate limit) and transient 5xx server errors (such as 500, 502, and 503) are worth retrying.
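To make the backoff arithmetic concrete, here is a minimal sketch that isolates the delay formula used above (exponential base, up to one second of jitter, Retry-After as a floor, 60-second cap). The function name backoff_delay and the injectable rng parameter are assumptions added for illustration so the schedule can be inspected without sleeping.

```python
import random
from typing import Optional

def backoff_delay(attempt: int, retry_after: float = 0.0,
                  cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retrying after failed attempt number `attempt` (0-based)."""
    rng = rng or random.Random()
    # Exponential base (1, 2, 4, 8, ...) plus up to 1s of random jitter
    jittered = 2 ** attempt + rng.uniform(0, 1)
    # A server-provided Retry-After acts as a floor; the cap bounds the wait
    return min(max(retry_after, jittered), cap)

if __name__ == "__main__":
    rng = random.Random()
    for attempt in range(7):
        print(f"attempt {attempt}: sleep {backoff_delay(attempt, rng=rng):.2f}s")
```

Because each client draws its own jitter, retries that started in the same instant spread out over a one-second band instead of landing together.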

Prompt Caching: The Cost Reduction Nobody Talks About

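Both OpenAI and Anthropic offer prompt caching: when a request repeats a prompt prefix that was sent recently (the default cache lifetime is on the order of five minutes), the cached portion of the input is billed at a discount. Anthropic charges cache reads at roughly 10% of the normal input price and adds a surcharge on the request that writes the cache; OpenAI caches sufficiently long prompts automatically, with a discount that depends on the model. For applications with a long, fixed system prompt (common in chatbots), this can mean that 80–90% of the input tokens in a conversation are served from cache after the first message.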

from datetime import datetime

# Maximize cache hits: long, stable content first; dynamic content last

# BAD: dynamic content in system prompt (busts cache every request)
system = f"You are a helpful assistant. Today's date is {datetime.now()}."

# GOOD: stable system prompt; dynamic context in user message
system = """You are a helpful assistant for Acme Corp. [500 lines of stable context...]"""
user = f"Today is {datetime.now().strftime('%Y-%m-%d')}. User question: {question}"

# Anthropic: cache_control marks what to cache
messages = [{
    "role": "user",
    "content": [{
        "type": "text",
        "text": long_stable_context,
        "cache_control": {"type": "ephemeral"}  # cache this prefix
    }, {
        "type": "text",
        "text": dynamic_question  # this changes per request
    }]
}]
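A rough way to see what caching is worth is to work the arithmetic for one conversation. The sketch below is an estimate under stated assumptions, not a pricing calculator: it assumes only the fixed system prompt is cached, that cached tokens are billed at 10% of the normal input rate, that cache-write surcharges are ignored, and that each turn adds a fixed number of tokens to the history. The function name and the example price are hypothetical.

```python
def conversation_input_cost(system_tokens: int, tokens_per_turn: int, turns: int,
                            price_per_1k: float, cached_rate: float = 0.10):
    """Return (cost_without_cache, cost_with_cache, cached_fraction) for input tokens."""
    # Turn t resends the system prompt plus t turns' worth of history
    total_tokens = sum(system_tokens + t * tokens_per_turn
                       for t in range(1, turns + 1))
    # The system prompt is a cache hit on every turn after the first
    cached_tokens = system_tokens * max(turns - 1, 0)
    uncached_tokens = total_tokens - cached_tokens

    full_cost = total_tokens / 1000 * price_per_1k
    cached_cost = (uncached_tokens + cached_tokens * cached_rate) / 1000 * price_per_1k
    cached_fraction = cached_tokens / total_tokens if total_tokens else 0.0
    return full_cost, cached_cost, cached_fraction

if __name__ == "__main__":
    # Illustrative numbers only: 5,000-token system prompt, 10 turns of 200 tokens
    full, cached, frac = conversation_input_cost(5000, 200, 10, price_per_1k=0.003)
    print(f"no cache: ${full:.4f}  with cache: ${cached:.4f}  cached share: {frac:.0%}")
```

The longer the stable prefix is relative to the per-turn content, the closer the cached share gets to the upper end of the range; a short system prompt with long user messages benefits much less.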

Cost Budgeting and Per-Request Guardrails

LLM costs scale with usage in ways that surprise teams not used to consumption-based billing. A bug that causes a loop sending large prompts can generate thousands of dollars of API calls in minutes.

import tiktoken

MAX_INPUT_TOKENS = 4000   # hard limit per request
MAX_OUTPUT_TOKENS = 1000  # max completion length
COST_PER_1K_INPUT = 0.00015   # gpt-4o-mini input rate (USD)
COST_PER_1K_OUTPUT = 0.00060  # gpt-4o-mini output rate (USD)
USER_DAILY_BUDGET_USD = 1.00  # placeholder value; set per product

class BudgetExceededError(Exception):
    """Raised when a user has used up their daily LLM budget."""

# get_user_spend_today, increment_user_spend, and call_openai are application
# helpers assumed to exist elsewhere (Redis/DB access and the API client).

def safe_llm_call(messages: list, user_id: str) -> str:
    enc = tiktoken.encoding_for_model("gpt-4o-mini")
    # Approximate count: assumes string content and ignores per-message overhead
    input_tokens = sum(len(enc.encode(m["content"])) for m in messages)

    if input_tokens > MAX_INPUT_TOKENS:
        raise ValueError(f"Prompt too long: {input_tokens} tokens (max {MAX_INPUT_TOKENS})")

    # Check per-user daily budget before spending anything
    daily_spend = get_user_spend_today(user_id)  # from Redis/DB
    if daily_spend >= USER_DAILY_BUDGET_USD:
        raise BudgetExceededError(f"Daily LLM budget exceeded for user {user_id}")

    response = call_openai(messages, max_tokens=MAX_OUTPUT_TOKENS)
    usage = response["usage"]
    cost = (usage["prompt_tokens"] / 1000 * COST_PER_1K_INPUT +
            usage["completion_tokens"] / 1000 * COST_PER_1K_OUTPUT)
    increment_user_spend(user_id, cost)
    return response["choices"][0]["message"]["content"]
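The spend helpers above are left to the application. As one possible shape for them, here is a minimal in-memory sketch of a per-user daily spend tracker. The class name InMemorySpendTracker and its methods are hypothetical; a production version would presumably keep the same interface but store the counters in Redis or a database so they survive restarts and are shared across workers.

```python
from datetime import date, datetime, timezone
from typing import Dict, Optional, Tuple

class InMemorySpendTracker:
    """Tracks USD spend per (user, UTC day). Illustrative stand-in for Redis/DB."""

    def __init__(self, daily_budget_usd: float):
        self.daily_budget_usd = daily_budget_usd
        self._spend: Dict[Tuple[str, date], float] = {}

    @staticmethod
    def _today() -> date:
        return datetime.now(timezone.utc).date()

    def spend_today(self, user_id: str, day: Optional[date] = None) -> float:
        return self._spend.get((user_id, day or self._today()), 0.0)

    def add(self, user_id: str, cost_usd: float, day: Optional[date] = None) -> None:
        key = (user_id, day or self._today())
        self._spend[key] = self._spend.get(key, 0.0) + cost_usd

    def has_budget(self, user_id: str, day: Optional[date] = None) -> bool:
        # Keying by day means the budget resets when the UTC date changes
        return self.spend_today(user_id, day) < self.daily_budget_usd

if __name__ == "__main__":
    tracker = InMemorySpendTracker(daily_budget_usd=0.01)
    tracker.add("user-1", 0.006)
    print(tracker.has_budget("user-1"))  # still under budget
    tracker.add("user-1", 0.006)
    print(tracker.has_budget("user-1"))  # now over budget
```

Note that a check-then-spend pattern like this can overshoot the budget when several requests for the same user are in flight at once; if the limit must be strict, the check and increment need to be atomic in the backing store.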

Fallback Model Strategy

Single-provider LLM applications have a fragile dependency. LLM APIs do go down: OpenAI's November 2023 outage lasted hours and affected thousands of production applications. A typical fallback strategy has three tiers:

  1. Primary: Best quality model (gpt-4o, claude-3-5-sonnet)
  2. Degraded: Cheaper/faster model on same provider (gpt-4o-mini) on 429/5xx
  3. Emergency: Alternative provider (if primary is down entirely) or cached responses

FALLBACK_CHAIN = [
    {"model": "gpt-4o",           "provider": "openai"},
    {"model": "gpt-4o-mini",      "provider": "openai"},
    {"model": "mistral-large",    "provider": "mistral"},
]

def resilient_call(messages: list) -> str:
    for config in FALLBACK_CHAIN:
        try:
            return call_provider(config, messages)
        except (RateLimitError, ServiceUnavailableError):
            log_fallback(config["model"])
            continue
    raise AllProvidersFailedError()
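The snippet above depends on exception classes and a call_provider function that are not shown. The following self-contained variation is a sketch of how the pieces might fit together: the exception classes are defined locally, call_provider is injected as a parameter so a stub can stand in for real API clients, and the function also returns which model answered so it can be logged. All of these choices are assumptions for illustration.

```python
from typing import Callable, Dict, List, Tuple

class RateLimitError(Exception): pass
class ServiceUnavailableError(Exception): pass
class AllProvidersFailedError(Exception): pass

def resilient_call(messages: List[dict], chain: List[Dict[str, str]],
                   call_provider: Callable[[Dict[str, str], List[dict]], str]
                   ) -> Tuple[str, str]:
    """Try each model in order; return (response_text, model_that_answered)."""
    failures = []
    for config in chain:
        try:
            return call_provider(config, messages), config["model"]
        except (RateLimitError, ServiceUnavailableError) as exc:
            # Only availability errors trigger fallback; anything else propagates
            failures.append(f"{config['model']}: {type(exc).__name__}")
    raise AllProvidersFailedError("; ".join(failures))

def stub_provider(config: Dict[str, str], messages: List[dict]) -> str:
    """Stand-in for real clients: pretends the primary model is rate limited."""
    if config["model"] == "gpt-4o":
        raise RateLimitError("429")
    return f"answer from {config['model']}"

if __name__ == "__main__":
    chain = [
        {"model": "gpt-4o", "provider": "openai"},
        {"model": "gpt-4o-mini", "provider": "openai"},
        {"model": "mistral-large", "provider": "mistral"},
    ]
    text, model = resilient_call([{"role": "user", "content": "hi"}], chain, stub_provider)
    print(model, "->", text)
```

Returning the model name alongside the text makes it straightforward to record when a degraded tier served the request.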

Latency: Where the Time Goes

| Phase | Typical Duration | Optimization |
|---|---|---|
| Network round-trip (request) | 20–80ms | Use nearest API region; keep connections warm |
| Prefill (processing input tokens) | 50–500ms | Shorten prompts; use prompt caching |
| Time to first token (TTFT) | 200ms–2s | Stream responses; use faster models for first token |
| Decode (generating output) | 0.5–10s | Limit max_tokens; use smaller models; speculative decoding |
| Network round-trip (response) | 20–80ms | Stream instead of waiting for full response |

The single most impactful latency optimization for user-facing applications is streaming. Instead of waiting for the full response (3–10s) before showing anything, stream tokens as they're generated. The first visible output then appears at roughly the time to first token, often within a few hundred milliseconds. Total latency is the same, but perceived latency drops sharply because the user waits for the first token rather than the last.
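The difference between time to first token and total latency is easy to measure once the response is consumed as a stream. The sketch below uses a fake token generator in place of a real streaming API call (fake_token_stream and consume_stream are hypothetical names); with a real client, the same loop would iterate over the provider's streamed chunks.

```python
import time
from typing import Callable, Iterable, Iterator, Optional, Tuple

def fake_token_stream(tokens: Iterable[str], first_delay: float = 0.05,
                      per_token_delay: float = 0.01) -> Iterator[str]:
    """Simulates a model: a pause before the first token, then steady decoding."""
    time.sleep(first_delay)
    for i, token in enumerate(tokens):
        if i:
            time.sleep(per_token_delay)
        yield token

def consume_stream(stream: Iterable[str], on_token: Callable[[str], None]
                   ) -> Tuple[str, Optional[float], float]:
    """Forward tokens as they arrive; return (full_text, ttft_seconds, total_seconds)."""
    start = time.perf_counter()
    ttft = None
    parts = []
    for token in stream:
        if ttft is None:
            ttft = time.perf_counter() - start  # first visible output
        parts.append(token)
        on_token(token)  # e.g. push to the UI or an SSE connection
    total = time.perf_counter() - start
    return "".join(parts), ttft, total

if __name__ == "__main__":
    text, ttft, total = consume_stream(
        fake_token_stream(["Hel", "lo", ",", " world"]),
        on_token=lambda t: print(t, end="", flush=True),
    )
    print(f"\nttft={ttft * 1000:.0f}ms total={total * 1000:.0f}ms")
```

Both numbers are worth recording per request, since TTFT tracks what the user perceives while total latency tracks what the backend spends.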

Observability: What to Log

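from datetime import datetime, timezone
from uuid import uuid4

trace = {
    "trace_id": str(uuid4()),
    "timestamp": datetime.now(timezone.utc).isoformat(),  # utcnow() is deprecated since Python 3.12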
    "user_id": user_id,               # for per-user cost tracking
    "feature": "summarizer",          # which product feature
    "model": model_used,
    "prompt_version": "v2.3.1",
    "input_tokens": usage.prompt_tokens,
    "output_tokens": usage.completion_tokens,
    "cost_usd": calculated_cost,
    "latency_ms": elapsed,
    "ttft_ms": time_to_first_token,
    "finish_reason": finish_reason,    # "stop" | "length" | "tool_calls"
    "fallback_used": fallback_model or None,
    "error": error_type or None
}

Ship this trace to your logging system (Datadog, Grafana, CloudWatch). Build dashboards on: total cost/hour by feature, p95 latency by model, error rate by error type, and finish_reason == "length" rate (indicates max_tokens is truncating responses).
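In practice these aggregations would be written in the logging system's own query language. As a sketch of what the dashboards compute, the following plain-Python version derives cost by feature, p95 latency by model, and the truncation rate from a list of trace dicts shaped like the one above. The function names and the nearest-rank percentile method are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List

def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile of a non-empty list."""
    ordered = sorted(values)
    rank = max(1, (95 * len(ordered) + 99) // 100)  # ceil(0.95 * n) in integer math
    return ordered[rank - 1]

def summarize(traces: List[dict]) -> Dict[str, object]:
    latency_by_model = defaultdict(list)
    cost_by_feature = defaultdict(float)
    truncated = 0
    for t in traces:
        latency_by_model[t["model"]].append(t["latency_ms"])
        cost_by_feature[t["feature"]] += t["cost_usd"]
        if t["finish_reason"] == "length":
            truncated += 1  # response was cut off by max_tokens
    return {
        "p95_latency_ms": {m: p95(v) for m, v in latency_by_model.items()},
        "cost_usd_by_feature": dict(cost_by_feature),
        "length_rate": truncated / len(traces) if traces else 0.0,
    }

if __name__ == "__main__":
    sample = [
        {"model": "gpt-4o-mini", "feature": "summarizer", "cost_usd": 0.002,
         "latency_ms": 900, "finish_reason": "stop"},
        {"model": "gpt-4o-mini", "feature": "summarizer", "cost_usd": 0.004,
         "latency_ms": 2400, "finish_reason": "length"},
    ]
    print(summarize(sample))
```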

✅ The Production Checklist

Before shipping an LLM feature: (1) retry with exponential backoff + jitter, (2) per-request token limits, (3) cost tracking per user/feature, (4) streaming enabled, (5) at least one fallback model, (6) structured logging with trace IDs, (7) alerts on error rate > 1% and cost spike > 2× baseline, (8) eval suite in CI that runs on every prompt change.
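The two alert thresholds in item (7) can be expressed as a small check. This is a sketch of the rule logic only, assuming the request, error, and cost figures for the current window have already been pulled from the metrics store; the function name and the alert labels are hypothetical.

```python
from typing import List

def check_alerts(errors: int, requests: int,
                 cost_last_hour: float, baseline_cost_per_hour: float,
                 max_error_rate: float = 0.01,
                 max_cost_multiple: float = 2.0) -> List[str]:
    """Return the names of the alerts that should fire for this window."""
    alerts = []
    # Error rate above 1% of requests in the window
    if requests > 0 and errors / requests > max_error_rate:
        alerts.append("error_rate")
    # Hourly cost above 2x the established baseline
    if baseline_cost_per_hour > 0 and cost_last_hour > max_cost_multiple * baseline_cost_per_hour:
        alerts.append("cost_spike")
    return alerts

if __name__ == "__main__":
    print(check_alerts(errors=20, requests=1000,
                       cost_last_hour=2.5, baseline_cost_per_hour=1.0))
```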

The Context Window Is Not a Magic Solution

128k token context windows tempt developers to stuff everything into the prompt. Resist this for latency and cost reasons, but also for quality: model performance degrades measurably on tasks requiring precise recall from very long contexts. A 128k-token context is not the same as a database — it's a blurry attention over a very long document. Use RAG to retrieve precisely what's needed rather than hoping the model will find it in a 100k-token haystack.
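To illustrate the retrieval step, here is a deliberately simple sketch that scores document chunks by word overlap with the question and keeps only the best few for the prompt. Word overlap is a toy stand-in for the embedding similarity a real RAG pipeline would typically use, and the function names are hypothetical; the point is only that the prompt receives a handful of relevant chunks rather than the whole corpus.

```python
import re
from collections import Counter
from typing import List

def tokenize(text: str) -> List[str]:
    return re.findall(r"[a-z0-9]+", text.lower())

def top_k_chunks(question: str, chunks: List[str], k: int = 2) -> List[str]:
    """Return up to k chunks that share the most words with the question."""
    q_counts = Counter(tokenize(question))
    scored = []
    for index, chunk in enumerate(chunks):
        c_counts = Counter(tokenize(chunk))
        overlap = sum(min(q_counts[w], c_counts[w]) for w in q_counts)
        scored.append((overlap, -index, chunk))  # -index keeps earlier chunks first on ties
    scored.sort(reverse=True)
    return [chunk for overlap, _, chunk in scored[:k] if overlap > 0]

if __name__ == "__main__":
    docs = [
        "Refunds are processed within 14 days of the return.",
        "Our office is closed on public holidays.",
        "Shipping to Canada takes 5 to 7 business days.",
    ]
    context = top_k_chunks("How long do refunds take?", docs)
    prompt = "Answer using only this context:\n" + "\n".join(context)
    print(prompt)
```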