Why RNNs Had to Go

Recurrent networks process sequences token by token. To compute the hidden state at position t, you need the hidden state from position t-1. That dependency chain rules out parallelism across positions: you cannot process a 1,000-token sequence on 1,000 cores simultaneously. Transformers eliminate this by treating the entire sequence as a set and computing all token relationships in one matrix operation. As a rough illustration, GPU utilization during training can go from around 10% (RNN sequential bottleneck) to near 100%.

The second problem RNNs have is gradient flow. In a 512-token sequence, a gradient from position 512 must travel back through 511 recurrent steps to reach position 1. At each step it is multiplied by a Jacobian that contains the recurrent weight matrix. If the largest eigenvalue magnitude of that matrix is below 1, the gradient shrinks exponentially and vanishes; if it is above 1, the gradient can grow exponentially and explode. Attention connects any two positions in a single step, so gradient paths are O(1) regardless of sequence length.
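The effect of repeated multiplication can be checked with a small numerical sketch. It assumes a purely linear recurrence (no nonlinearity) and a toy 16-dimensional recurrent matrix rescaled to a chosen spectral radius; the helper names `scaled_recurrent_matrix` and `backprop_norm` are illustrative and not from any library.

```python
import numpy as np

def scaled_recurrent_matrix(dim, spectral_radius, seed=0):
    """Random matrix rescaled so its largest |eigenvalue| equals spectral_radius."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((dim, dim))
    current_radius = np.max(np.abs(np.linalg.eigvals(W)))
    return W * (spectral_radius / current_radius)

def backprop_norm(W, steps):
    """Norm of a gradient vector after being multiplied by W.T `steps` times."""
    grad = np.ones(W.shape[0]) / np.sqrt(W.shape[0])  # unit-norm starting gradient
    for _ in range(steps):
        grad = W.T @ grad
    return np.linalg.norm(grad)

shrinking = backprop_norm(scaled_recurrent_matrix(16, 0.9), steps=512)
growing = backprop_norm(scaled_recurrent_matrix(16, 1.1), steps=512)
print(f"spectral radius 0.9 -> gradient norm {shrinking:.3e}")
print(f"spectral radius 1.1 -> gradient norm {growing:.3e}")
```

With a spectral radius only 10% away from 1 in either direction, 512 steps are enough to move the gradient norm by many orders of magnitude. Real RNNs add nonlinearities and gating, so this is only the simplest version of the argument.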

From Token to Vector: Input Embeddings

Before attention can run, each token is converted to a dense vector. An embedding table maps each token ID to a vector of dimension d_model. For the smallest GPT-2 model, d_model = 768. For the largest GPT-3 model, d_model = 12288. The full input to a transformer layer is a matrix of shape [seq_len, d_model], with one row per token.
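As a minimal sketch, the lookup is plain row indexing into the table. The vocabulary size, d_model, and token IDs below are made-up toy values, and the table is random here, whereas a real model learns it during training.

```python
import numpy as np

vocab_size, d_model = 1000, 16            # toy sizes (assumed for illustration)
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((vocab_size, d_model))  # learned in a real model

token_ids = np.array([5, 42, 7, 42])      # hypothetical IDs for a 4-token input
X = embedding_table[token_ids]            # integer indexing selects one row per token

print(X.shape)                            # (4, 16), i.e. [seq_len, d_model]
print(np.array_equal(X[1], X[3]))         # True: the same ID always maps to the same row
```

Note that the two occurrences of token 42 get identical rows, which is exactly why position information has to be added separately.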

Attention without positional information is permutation-equivariant: if you shuffle the rows of the input, you get the same attention outputs, just shuffled in the same way. On its own, the model therefore cannot tell word order. This is why positional encodings are added: they break the symmetry by injecting position information into each row. The original paper used sinusoidal encodings; modern models mostly use learned positional embeddings or RoPE (Rotary Position Embedding).
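The shuffling property can be verified directly. The sketch below uses a single-head `self_attention` helper with random toy weights and a random stand-in for positional encodings (`pos`); all names and sizes are assumptions for illustration.

```python
import numpy as np

def self_attention(X, W_Q, W_K, W_V):
    """Single-head self-attention with no positional information."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 5, 8, 4
X = rng.standard_normal((seq_len, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_k)) for _ in range(3))
perm = np.array([2, 0, 4, 1, 3])          # a fixed, non-trivial shuffle of the rows

# Without positions: shuffling the input just shuffles the output.
same_without_pos = np.allclose(
    self_attention(X, W_Q, W_K, W_V)[perm],
    self_attention(X[perm], W_Q, W_K, W_V),
)

# With a position-dependent vector added to each row, the symmetry is broken.
pos = rng.standard_normal((seq_len, d_model))  # stand-in for positional encodings
same_with_pos = np.allclose(
    self_attention(X + pos, W_Q, W_K, W_V)[perm],
    self_attention(X[perm] + pos, W_Q, W_K, W_V),
)

print(same_without_pos)  # True
print(same_with_pos)     # False
```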

The Q/K/V Framework

Every token in the sequence simultaneously plays three roles, controlled by three learned weight matrices:

  • Query (Q): "What information am I looking for?"
  • Key (K): "What information do I offer to others?"
  • Value (V): "What do I actually return when selected?"

Given input matrix X of shape [seq_len, d_model] and weight matrices W_Q, W_K, W_V each of shape [d_model, d_k]:

# Shape: [seq_len, d_k]  (typically d_k = d_model / num_heads)
Q = X @ W_Q
K = X @ W_K
V = X @ W_V

These projections allow the model to use different "views" of the same token for different roles. A token can ask a different question (Q) than what it advertises as its key (K).

Scaled Dot-Product Attention

The full computation is one formula:

Attention(Q, K, V) = softmax( Q @ K.T / sqrt(d_k) ) @ V

Step by step:

  1. Q @ K.T: dot product of every query against every key. Shape: [seq_len, seq_len]. Each entry is a raw attention score: how relevant is token j to token i?
  2. / sqrt(d_k): scaling. If the components of q and k are roughly independent with zero mean and unit variance, their dot product has variance d_k. When d_k is large (e.g., 64), raw dot products therefore become large in magnitude, pushing softmax into its saturation region where gradients vanish. Dividing by sqrt(d_k) brings the variance back to about 1 regardless of dimensionality.
  3. softmax(...): normalizes each row to a probability distribution. Each row now sums to 1 and represents "how much of each token to attend to."
  4. @ V: weighted sum of value vectors. Each output row is a convex combination of all value vectors, weighted by the attention weights from step 3.

import numpy as np

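def scaled_dot_product_attention(Q, K, V, mask=None):
    # Q, K, V: [seq, d_k]; mask: [seq, seq] with 1 where attention is blocked
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # [seq, seq]
    if mask is not None:
        scores = scores + mask * -1e9  # large negative → ~0 after softmax
    scores = scores - scores.max(axis=-1, keepdims=True)  # avoid overflow in exp
    exp_scores = np.exp(scores)
    weights = exp_scores / exp_scores.sum(axis=-1, keepdims=True)
    return weights @ V, weights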

# Example: seq_len=4, d_k=8
np.random.seed(42)
Q = np.random.randn(4, 8)
K = np.random.randn(4, 8)
V = np.random.randn(4, 8)
out, attn_weights = scaled_dot_product_attention(Q, K, V)
# out.shape → (4, 8): each token now has context from all others
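
The variance argument behind the sqrt(d_k) scaling can be checked empirically. This is a sketch under the assumption that query and key components are independent standard normal values, which trained models only approximate; the sample count and the `variances` dictionary are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 5_000
variances = {}  # d_k -> (raw variance, variance after scaling)

for d_k in (8, 64, 512):
    q = rng.standard_normal((n_samples, d_k))
    k = rng.standard_normal((n_samples, d_k))
    dots = np.sum(q * k, axis=-1)                 # one dot product per sample
    variances[d_k] = (dots.var(), (dots / np.sqrt(d_k)).var())
    raw, scaled = variances[d_k]
    print(f"d_k={d_k:4d}  raw variance={raw:7.1f}  scaled variance={scaled:.2f}")
```

The raw variance grows roughly in proportion to d_k, while the scaled variance stays close to 1 for every dimensionality.
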
⚠️ The Causal Mask in Decoder Models

Decoder-only models (like GPT) must not let token i attend to token j > i, because that would be cheating by looking at future tokens during generation. The causal mask sets the entries strictly above the diagonal of the score matrix to -∞ (or a very large negative number) before softmax, making those weights zero. Encoder models (like BERT) use no causal mask, so every token attends to every other token bidirectionally.
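A minimal sketch of the mask, using the same convention as the attention function above (1 marks a blocked position, and blocked scores get a large negative offset instead of a literal -∞). The helper names `causal_mask` and `masked_softmax` are illustrative.

```python
import numpy as np

def causal_mask(seq_len):
    """1.0 where attention is blocked (column j > row i), 0.0 elsewhere."""
    return np.triu(np.ones((seq_len, seq_len)), k=1)

def masked_softmax(scores, mask):
    scores = scores + mask * -1e9                          # blocked entries become very negative
    scores = scores - scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
scores = rng.standard_normal((4, 4))       # stand-in for Q @ K.T / sqrt(d_k)
weights = masked_softmax(scores, causal_mask(4))

print(causal_mask(4))
print(np.round(weights, 2))                # lower-triangular; each row still sums to 1
```

The first row can only attend to itself, so its single weight is 1. The array returned by `causal_mask` can be passed as the `mask` argument of `scaled_dot_product_attention`.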

Multi-Head Attention: Running Attention in Parallel Subspaces

A single attention head produces only one attention distribution per token, so it has to average together every kind of relationship that token participates in. Multi-head attention runs h independent attention heads simultaneously, each with its own W_Q, W_K, W_V projections into a lower-dimensional subspace (d_k = d_model / h). The intuition: one head might learn syntactic relationships (subject-verb), another might learn coreference (pronoun-antecedent), another might learn positional proximity.

def multi_head_attention(X, W_Qs, W_Ks, W_Vs, W_O):
    # W_Qs/W_Ks/W_Vs: list of [d_model, d_k] weight matrices, one per head
    # W_O: [h * d_k, d_model] output projection
    heads = []
    for W_Q, W_K, W_V in zip(W_Qs, W_Ks, W_Vs):
        Q, K, V = X @ W_Q, X @ W_K, X @ W_V
        head_out, _ = scaled_dot_product_attention(Q, K, V)
        heads.append(head_out)
    concat = np.concatenate(heads, axis=-1)  # [seq, h * d_k]
    return concat @ W_O                         # [seq, d_model]

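GPT-3 uses 96 heads with d_k = 128 and d_model = 12288 (96 × 128 = 12288). Ignoring biases, the total attention parameter count per layer is 4 × d_model²: three input projections (W_Q, W_K, W_V, each d_model × d_model once all heads are stacked) plus the output projection W_O.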
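The arithmetic can be worked through explicitly. This sketch counts weights only and ignores bias terms; the variable names are illustrative.

```python
d_model = 12288
num_heads = 96
d_k = d_model // num_heads                    # per-head dimension

per_projection = num_heads * d_model * d_k    # W_Q, W_K or W_V stacked over all heads
output_projection = num_heads * d_k * d_model # W_O: [h * d_k, d_model]
attention_params = 3 * per_projection + output_projection

print(d_k)                                    # 128
print(f"{attention_params:,}")                # 603,979,776
print(attention_params == 4 * d_model ** 2)   # True
```

That is roughly 604 million attention weights in each layer at this width.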

The Full Transformer Block

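One transformer "layer" applies two sub-layers in sequence, each preceded by layer normalization and wrapped with a residual connection. This is the pre-LN arrangement used by GPT-2 and later models; the original Transformer applied layer normalization after the residual addition instead.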

# Sub-layer 1: Multi-Head Self-Attention
x = x + MultiHeadAttention(LayerNorm(x))

# Sub-layer 2: Position-wise Feed-Forward Network
x = x + FFN(LayerNorm(x))
# FFN = Linear(d_model → 4*d_model) → GELU → Linear(4*d_model → d_model)

The residual connections (x + ...) are critical. They create "highways" that let gradients flow directly from output to input without passing through the attention or FFN operations. Without them, deep stacks of transformer layers would not train stably.
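
The block above can be turned into a runnable NumPy sketch. Several simplifications are assumptions made here for brevity: layer normalization has no learned scale or shift, GELU uses its tanh approximation, there is no mask or dropout, and the parameter dictionary `params` with keys such as `W_Q` and `W_1` is a hypothetical layout.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale/shift)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def gelu(x):
    """Tanh approximation of GELU."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, p, num_heads):
    seq_len, d_model = x.shape
    d_k = d_model // num_heads

    def split(t):
        # [seq_len, d_model] -> [num_heads, seq_len, d_k]
        return t.reshape(seq_len, num_heads, d_k).transpose(1, 0, 2)

    Q, K, V = split(x @ p["W_Q"]), split(x @ p["W_K"]), split(x @ p["W_V"])
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_k)    # [num_heads, seq_len, seq_len]
    heads = softmax(scores) @ V                         # [num_heads, seq_len, d_k]
    concat = heads.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ p["W_O"]

def ffn(x, p):
    # Linear(d_model -> 4*d_model) -> GELU -> Linear(4*d_model -> d_model)
    return gelu(x @ p["W_1"] + p["b_1"]) @ p["W_2"] + p["b_2"]

def transformer_block(x, p, num_heads):
    x = x + multi_head_attention(layer_norm(x), p, num_heads)  # sub-layer 1
    x = x + ffn(layer_norm(x), p)                              # sub-layer 2
    return x

rng = np.random.default_rng(0)
seq_len, d_model, num_heads = 6, 32, 4
params = {
    "W_Q": rng.standard_normal((d_model, d_model)) * 0.02,
    "W_K": rng.standard_normal((d_model, d_model)) * 0.02,
    "W_V": rng.standard_normal((d_model, d_model)) * 0.02,
    "W_O": rng.standard_normal((d_model, d_model)) * 0.02,
    "W_1": rng.standard_normal((d_model, 4 * d_model)) * 0.02,
    "b_1": np.zeros(4 * d_model),
    "W_2": rng.standard_normal((4 * d_model, d_model)) * 0.02,
    "b_2": np.zeros(d_model),
}

x = rng.standard_normal((seq_len, d_model))
out = transformer_block(x, params, num_heads)
print(out.shape)  # (6, 32): same shape in and out, so blocks can be stacked
```

Because each sub-layer only adds to x, setting every weight to zero makes the block an exact identity function. That is the "highway" in its simplest form: the input always has an unobstructed path to the output.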

Encoder-Only vs Decoder-Only vs Encoder-Decoder

Architecture | Attention Type | Trained For | Examples
--- | --- | --- | ---
Encoder-only | Bidirectional (sees full sequence) | Classification, NER, embeddings | BERT, RoBERTa, DeBERTa
Decoder-only | Causal (sees only past tokens) | Text generation, completion | GPT-4, Llama 3, Mistral
Encoder-decoder | Bidirectional encoder + causal decoder | Translation, summarization | T5, BART, Flan-T5

Most modern language models are decoder-only. The architecture's autoregressive nature (predict next token given past tokens) is a natural fit for both pretraining (next-token prediction) and inference (generation).

Computational Complexity

The quadratic term in transformers is real: attention computes an [N×N] score matrix where N is the sequence length. At N=8192 tokens, that's 67M scores per head per layer. For a 32-layer, 32-head model, that's about 69B score computations per forward pass, just for attention. This is why context window extension (from 2048 to 128k+ tokens) relied on algorithmic innovations like FlashAttention, which computes attention in tiles that fit in fast on-chip SRAM so the full score matrix never has to be materialized in the GPU's main memory (HBM).
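The figures above follow from simple arithmetic. The memory estimate in the last line assumes 2 bytes per score (fp16), which is an assumption added here for illustration and not something the text specifies.

```python
N = 8192
layers, heads = 32, 32

scores_per_head = N * N
total_scores = scores_per_head * layers * heads
fp16_bytes_per_matrix = scores_per_head * 2    # assumes 2 bytes per score (fp16)

print(f"{scores_per_head:,}")                  # 67,108,864
print(f"{total_scores:,}")                     # 68,719,476,736
print(f"{fp16_bytes_per_matrix / 2**20:.0f} MiB per head per layer")  # 128 MiB
```

Doubling N quadruples every one of these numbers, which is the practical meaning of quadratic scaling.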

✅ What to Read Next

Understanding attention is the foundation. The next logical step is understanding what tokens actually are — before any embedding happens. The tokenization article explains BPE from first principles and shows why " hello" is two tokens, not one.