How Autoregressive Generation Works

Decoder-only models (GPT, Llama, Mistral) generate text one token at a time. To generate token N+1, the model needs the full attention output over tokens 1 through N. Naively, this means re-running the entire attention computation over the full sequence for every new token.

For a sequence of length N, generating one token requires:

  • Computing Q, K, V projections for all N tokens
  • Computing the N×N attention score matrix
  • Weighted sum over N value vectors

Generating a 500-token response with a 1000-token context means 500 forward passes, each over at least 1000 tokens. That's 500,000 token-computations just for one response. Without optimization, this is prohibitively slow.

The KV Cache: Reuse Don't Recompute

The key insight: for a given input prefix, the Key and Value matrices for all past tokens never change. When generating token N+1, the only new computation needed is the Q/K/V for the new token — then attend to the cached K and V from all previous tokens.

# Without KV cache (naive): full recomputation every step
for step in range(output_length):
    all_tokens = input_tokens + generated_so_far
    output = model.forward(all_tokens)   # O(N²) attention every step
    next_token = sample(output[-1])
    generated_so_far.append(next_token)

# With KV cache: only compute for new token, attend to cached K/V
kv_cache = {}
for step in range(output_length):
    new_token = input_tokens[-1] if step == 0 else generated_so_far[-1]
    output, kv_cache = model.forward_with_cache(new_token, kv_cache)
    next_token = sample(output)          # O(N) per step — attend to cache
    generated_so_far.append(next_token)

With the KV cache, each generation step is O(N) in the sequence length rather than O(N²). The N tokens in the cache are attended to, but the K and V matrices are not recomputed — they're read from GPU memory.

KV Cache Memory: The Hard Math

For each transformer layer and each attention head, the cache stores K and V matrices for every token. The memory per token is:

# Memory formula for KV cache
bytes_per_token = (
    2            # K and V (two matrices)
    * num_layers
    * num_heads
    * head_dim    # d_model / num_heads
    * bytes_per_element  # 2 for fp16, 4 for fp32
)

# Llama 3 8B: 32 layers, 8 KV heads (GQA), head_dim=128, fp16
bytes_per_token = 2 * 32 * 8 * 128 * 2  # = 131,072 bytes = 128 KB per token

# At 8192 token context: 128 KB × 8192 = 1 GB of KV cache
# At 128k token context: 128 KB × 131072 = 16 GB — more than the model weights!

This is why long contexts are expensive. A Llama 3 8B model weighs ~16GB in fp16. At a 128k context window, the KV cache alone requires another 16GB — you need 32GB of GPU memory just to serve a single request at maximum context length.

Grouped Query Attention (GQA) and Multi-Query Attention (MQA)

The standard multi-head attention has one K and V head per Q head. This means the KV cache scales with the number of attention heads. Newer architectures reduce this:

VariantK/V HeadsKV Cache SizeQuality vs MHAUsed By
Multi-Head Attention (MHA)= num_heads (32+)Baseline (100%)BaselineOriginal GPT models, BERT
Multi-Query Attention (MQA)1 shared KV head~3% of MHASlight degradationFalcon, early Gemini
Grouped Query Attention (GQA)Groups (e.g., 8 KV heads for 32 Q heads)~25% of MHANear-identical to MHALlama 3, Mistral, Gemma

GQA is now the default for efficient modern models. Llama 3 8B uses 8 KV heads for 32 Q heads — a 4× memory reduction vs standard MHA with negligible quality loss.

PagedAttention: Fixing Memory Fragmentation

The KV cache grows dynamically as tokens are generated, but GPU memory must be pre-allocated in contiguous blocks. Traditional inference servers reserve the maximum context length upfront, wasting memory on short requests and preventing batching of many requests simultaneously.

PagedAttention (vLLM, 2023) applies the virtual memory paging concept to KV caches: split the cache into fixed-size "pages" and maintain a page table mapping logical positions to physical GPU memory blocks. Short requests use few pages; long requests use many. Different requests share the GPU memory pool, and common prompt prefixes can be physically shared across requests (prefix caching).

This allows vLLM to serve 20–30× more concurrent requests on the same GPU compared to naive allocation, dramatically increasing throughput for production APIs.

The Prefill vs Decode Bottleneck

LLM inference has two distinct phases with different performance characteristics:

  • Prefill: Process the entire input prompt in parallel. This is compute-bound — all tokens run through all layers simultaneously. Fast on modern GPUs.
  • Decode: Generate output tokens one at a time. This is memory-bandwidth-bound — each step reads the full KV cache from GPU HBM. Speed is limited by how fast the GPU can read memory, not how fast it can compute.

A 100-token prompt with a 500-token output spends ~80% of wall-clock time in the decode phase. This is why Time to First Token (TTFT) is fast but tokens per second is limited by GPU memory bandwidth (~3 TB/s on H100, 2 TB/s on A100).

Speculative Decoding

Since decode is bottlenecked by memory bandwidth, the GPU is underutilized computationally. Speculative decoding exploits this: a small "draft" model (3B parameters) generates K candidate tokens in parallel, then the large model (70B) verifies all K tokens in a single forward pass.

If the large model agrees with the draft (likely ~70–80% of the time per token), you've generated K tokens at the latency of 1 verification step. When disagreement occurs, you revert to the largest agreed token and continue. Net result: 2–3× throughput improvement on the large model with identical output quality.

✅ FlashAttention in One Sentence

Standard attention materializes the full N×N score matrix in GPU SRAM, which overflows for long sequences. FlashAttention computes attention in tiles that each fit in SRAM, eliminating the memory overflow and reducing the N² SRAM reads/writes to O(N) — enabling 128k+ context lengths that would otherwise OOM on any GPU.

Practical Latency Budgeting

Context LengthKV Cache (Llama 3 8B)First Token LatencyDecode Speed
2,048 tokens256 MB~50ms~40 tok/s (A10G)
8,192 tokens1 GB~200ms~35 tok/s
32,768 tokens4 GB~800ms~25 tok/s
131,072 tokens16 GB~3s~10 tok/s

For latency-sensitive applications, keep prompts short. For throughput-sensitive applications (batch processing), maximize GPU memory utilization with continuous batching — filling decode idle time with other requests' prefill phases.