Deep Dive Series

Transformers: The Complete Story

Attention, Multi-Head Attention, Complexity, KV Caching — broken down and explained from the basics

📐 Notation Reference — Every Variable Used In This Post

Defined once here. Refer back whenever a symbol appears and you're unsure what it means.

| Symbol | What It Means | Typical Value (GPT-2 Small) |
|---|---|---|
| $B$ | Batch size — number of sequences processed simultaneously | 32 |
| $S$ | Sequence length — number of tokens in one sequence | 1024 |
| $d_{model}$ | Model dimension — size of each token's embedding vector | 768 |
| $n_{heads}$ | Number of attention heads | 12 |
| $d_{head}$ | Dimension per head — always $d_{model} / n_{heads}$ | 64 |
| $L$ | Number of transformer layers (depth) | 12 |
| $X$ | Input matrix — shape $(S, d_{model})$, all tokens stacked | |
| $Q$ | Query matrix — shape $(S, d_{head})$ per head | |
| $K$ | Key matrix — shape $(S, d_{head})$ per head | |
| $V$ | Value matrix — shape $(S, d_{head})$ per head | |
| $W_Q, W_K, W_V$ | Learned projection weight matrices — shape $(d_{model}, d_{model})$ each | |
| $W_O$ | Output projection weight matrix — shape $(d_{model}, d_{model})$ | |
| $t$ | Current generation step (which token is being produced) | |
| FLOPs | Floating point operations — how we measure compute cost | |

Contents

Part I

Why Transformers?

1 The Problem with RNNs

Before transformers, sequence modelling meant RNNs and LSTMs. They process one token at a time, left to right, updating a hidden state $h_t$ at each step:

```
h_0 → h_1 → h_2 → h_3 → h_4 → h_5
 ↑     ↑     ↑     ↑     ↑     ↑
"The" "cat" "sat" "on"  "the" "mat"
```

To understand "mat" using "The", information about "The" must survive 5 hidden-state compressions. By step 5 it's a faint echo.
Problem 1 — Vanishing Gradients: Backpropagating through 500 timesteps means gradients shrink exponentially. Early tokens contribute almost nothing to learning.
Problem 2 — Sequential Processing: Step $t$ depends on step $t-1$. Can't parallelise. Slow on long sequences.
Problem 3 — Fixed Bottleneck: All information must pass through one $d_{model}$-dimensional vector. Lossy by design.
What if every token could directly look at every other token?

2 The Core Idea: Attention

One Line Intuition

Instead of compressing the whole sequence into a hidden state, let every token directly query every other token and decide how much to borrow from it.

For "The cat sat on the mat because it was tired" — when processing "it", attention scores every token directly:

Attention weights computed by "it":

```
 The    cat    sat    on     the    mat    because  it     was    tired
 0.02   0.71   0.05   0.02   0.03   0.08   0.04     0.01   0.02   0.02
         ↑
```

"cat" gets 71% of the attention weight. Coreference resolved directly. No compression needed. No forgetting.

The output for "it" = $0.71 \times \text{rep}(\text{"cat"}) + 0.08 \times \text{rep}(\text{"mat"}) + \ldots$ — a context-aware blend.

Part II

The Attention Mechanism

3 Q, K, V — What They Are

Every token produces three vectors via learned linear projections $W_Q, W_K, W_V$:

Q, K, V Intuition

| Vector | Database Analogy | Role in Attention |
|---|---|---|
| Query $Q$ | "What am I looking for?" | Current token's search request |
| Key $K$ | "What do I contain?" | Each token advertising its content |
| Value $V$ | "What do I actually give you?" | Information to aggregate if selected |

```
Resolving "it":

Query from "it":  "find me: animate subject noun, introduced earlier"

Key from "cat":   "I am: animate, subject, early position"  → HIGH match
Key from "mat":   "I am: inanimate, location object"        → low match
Key from "sat":   "I am: a verb"                            → low match

match = Q_it · K_cat = high → attend more
match = Q_it · K_mat = low  → attend less

Output = 0.71 × V_cat + 0.08 × V_mat + ... = mostly "cat"'s information
```

In matrix form (all $S$ tokens at once):

$$Q = X W_Q, \quad K = X W_K, \quad V = X W_V$$ $$\text{where } X \in \mathbb{R}^{S \times d_{model}}, \quad W_Q, W_K, W_V \in \mathbb{R}^{d_{model} \times d_{model}}$$

4 Scaled Dot-Product Attention

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_{head}}}\right) V$$
Step 1 — Raw scores: QKᵀ → shape (S, S)
$(S, d_{head}) \times (d_{head}, S) = (S, S)$. Entry $(i, j)$ = dot product of token $i$'s query and token $j$'s key = how much token $i$ wants to attend to $j$.
Step 2 — Scale by √d_head
Divide every score by $\sqrt{d_{head}}$. Prevents large values from breaking softmax. Chapter 5 explains why with actual numbers.
Step 3 — Softmax (row-wise)
Each row (token $i$'s scores over all $S$ tokens) becomes a probability distribution summing to 1. These are the attention weights.
Step 4 — Weighted sum: weights × V → shape (S, d_head)
$(S, S) \times (S, d_{head}) = (S, d_{head})$. Each token gets a weighted mix of all value vectors. Token representations now contain context.

5 Why Scale by √d_head? (With Numbers)

🤔 Question

"Why √d_head? Why not d_head? Why not nothing?"

Assume $Q$ and $K$ are initialised from $\mathcal{N}(0,1)$. Each element has mean 0, variance 1. The dot product $Q_i \cdot K_j = \sum_{l=1}^{d_{head}} Q_{il} \cdot K_{jl}$ is a sum of $d_{head}$ products.

📊 Worked Example — Variance of one dot product

If $a \sim \mathcal{N}(0,1)$ and $b \sim \mathcal{N}(0,1)$ independently: $\text{Var}(ab) = 1$.

Sum of $d_{head}$ such independent terms has variance $= d_{head}$. Standard deviation $= \sqrt{d_{head}}$.

With $d_{head}=64$: std dev = 8, so dot products are typically in $[-24, +24]$.

Feed $[-24, +24]$ into softmax:

softmax([24, 23, 0, -1]) ≈ [0.73, 0.27, 0.00, 0.00]

One token absorbs nearly all attention weight. The softmax gradient is ≈ 0, and the model stops learning.

Divide by $\sqrt{d_{head}}$ first:

$$\text{Var}\!\left(\frac{Q_i \cdot K_j}{\sqrt{d_{head}}}\right) = \frac{d_{head}}{d_{head}} = 1$$
📊 After Scaling — d_head=64, divide by √64=8

Scores in $[-24,+24]$ become $[-3,+3]$.

softmax([3, 2, 0, -1]) ≈ [0.70, 0.26, 0.03, 0.01] — every entry keeps non-zero weight, so gradients stay healthy.

$\sqrt{d_{head}}$ is exactly the standard deviation of the unscaled dot product — dividing by it restores unit variance.
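A quick simulation confirms the variance argument. The seed and sample count below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(42)
d_head, n = 64, 100_000

# n independent query/key pairs, each entry drawn from N(0, 1)
q = rng.standard_normal((n, d_head))
k = rng.standard_normal((n, d_head))

raw = np.einsum("nd,nd->n", q, k)   # n unscaled dot products
scaled = raw / np.sqrt(d_head)      # divide by sqrt(64) = 8

print(raw.var())     # ≈ 64: variance grows linearly with d_head
print(scaled.var())  # ≈ 1:  unit variance restored
```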

Part III

Multi-Head Attention

6 Why One Head Isn't Enough

🤔 Question

"Why split into n_heads heads? Why not just use one big attention with full d_model?"

"The cat sat on the mat because it was tired"

Relationships present simultaneously:

- Coreference: "it" → "cat"
- Syntactic: "cat" is the subject of "sat"
- Positional: "on" relates "sat" to "mat"
- Semantic: "tired" is a state of "cat"

One head: a single weight matrix W_Q captures one mixture of all these types. It learns a compromise, mediocre at everything.

n_heads = 12: each head independently learns one type (head 1 → coreference, head 2 → syntactic roles, head 3 → positional, ...). Concatenating all heads gives a rich, multi-perspective representation.
Key Insight — Same Parameters, Better Structure

Splitting $d_{model}$ into $n_{heads}$ heads of $d_{head} = d_{model}/n_{heads}$ each uses the same number of parameters. You're not adding capacity — you're restructuring it to encourage specialisation.

7 The Full Dimension Flow

Tracing exact shapes through GPT-2 Small: $B=1$, $S=1024$, $d_{model}=768$, $n_{heads}=12$, $d_{head}=64$.

```
Input X:                          (1, 1024, 768)   = (B, S, d_model)

Project with W_Q, W_K, W_V (each 768×768):
  Q = X @ W_Q:                    (1, 1024, 768)
  K = X @ W_K:                    (1, 1024, 768)
  V = X @ W_V:                    (1, 1024, 768)

Reshape to split into heads:
  Q:                              (1, 1024, 12, 64) = (B, S, n_heads, d_head)

Transpose for batched matmul:
  Q:                              (1, 12, 1024, 64) = (B, n_heads, S, d_head)
  K, V: same

Attention scores Q @ K.T:         (1, 12, 1024, 1024) = (B, n_heads, S, S)
                                   ↑ THIS is the S² matrix — 12.5M numbers per batch

Divide by √64 = 8, softmax row-wise.

Weighted sum @ V:                 (1, 12, 1024, 64) = (B, n_heads, S, d_head)

Transpose + reshape (concatenate heads):
                                  (1, 1024, 768)   = (B, S, d_model)

Output projection W_O (768×768):  (1, 1024, 768)   = same shape as input ✓
```
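The same flow can be traced with NumPy shape assertions. Random weights, Q and K only, purely a sketch:

```python
import numpy as np

B, S, d_model, n_heads = 1, 1024, 768, 12
d_head = d_model // n_heads  # 64

rng = np.random.default_rng(0)
X = rng.standard_normal((B, S, d_model))
W_Q = rng.standard_normal((d_model, d_model)) * 0.02
W_K = rng.standard_normal((d_model, d_model)) * 0.02

def split_heads(M):
    """(B, S, d_model) -> (B, n_heads, S, d_head)"""
    return M.reshape(B, S, n_heads, d_head).transpose(0, 2, 1, 3)

Q = split_heads(X @ W_Q)
K = split_heads(X @ W_K)
assert Q.shape == (1, 12, 1024, 64)

scores = Q @ K.transpose(0, 1, 3, 2) / np.sqrt(d_head)
assert scores.shape == (1, 12, 1024, 1024)   # the S² matrix
print(scores.size)                           # 12,582,912 numbers per batch
```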

8 Weight Matrices: Still d_model × d_model

🤔 Question

"If there are 12 heads, are there 12 separate W_Q matrices or one shared one?"

✓ One matrix (d_model × d_model) that implicitly contains all heads

$W_Q$ is $(768, 768)$. After the projection we reshape to $(S, n_{heads}, d_{head})$ to split by head. Conceptually it's $n_{heads}$ stacked projections of $(768, 64)$. One matrix for GPU efficiency; split after multiplication.

📊 Parameter Count — One Attention Layer, GPT-2 Small

$d_{model}=768$, $n_{heads}=12$, $d_{head}=64$.

$W_Q$: $768 \times 768 = 589{,}824$  |  $W_K$: same  |  $W_V$: same  |  $W_O$: same

Total per layer: $\approx 2.36M$ parameters — identical for 1 head or 12 heads.
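The count is quick to verify, and it makes the head-independence obvious since $n_{heads}$ never appears:

```python
d_model = 768
per_matrix = d_model * d_model   # one of W_Q, W_K, W_V, W_O
total = 4 * per_matrix           # all four projection matrices

print(per_matrix)  # 589,824
print(total)       # 2,359,296 ≈ 2.36M, regardless of the number of heads
```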

Part IV

Complexity — Where O(S²·d_model) Comes From

9 How Matrix Multiply Cost Works (From Scratch)

🤔 Starting from scratch

"I've heard O(S²·d_model) but I don't know how to count the cost of multiplying two matrices."

Dot product of two length-$k$ vectors: $k$ multiplications + $(k-1)$ additions $\approx 2k$ operations.

📊 Dot Product Cost

$[1,2,3]\cdot[4,5,6] = 4+10+18=32$  →  3 multiplications + 2 additions = 5 ops $\approx 2k$ for $k=3$.

Matrix multiply $(A: m \times k)$ by $(B: k \times p)$: result is $(m \times p)$. Each of the $m \times p$ entries is a length-$k$ dot product.

$$\text{Cost of } (m,k) \times (k,p) = O(m \cdot k \cdot p)$$
📊 Matrix Multiply Cost

$(4\times3)\times(3\times5)$ → result $(4,5)$. Number of dot products: $4\times5=20$. Each length 3.

Total: $4\times5\times3=60$ ops = $O(m\cdot k\cdot p)$.

Rule: multiply all three dimensions of the two matrices together.

10 Compute: Why O(S²·d_model)

Operation 1 — Linear Projections (Q, K, V, O)

$X W_Q$: $(S, d_{model}) \times (d_{model}, d_{model}) \to (S, d_{model})$. Cost $= O(S \cdot d_{model}^2)$. Four projections total: $O(S \cdot d_{model}^2)$.

Operation 2 — Attention Scores QKᵀ

Per head: $Q$ is $(S, d_{head})$, $K^\top$ is $(d_{head}, S)$. Result: $(S, S)$.

Cost per head $= O(S \cdot d_{head} \cdot S) = O(S^2 d_{head})$.

All $n_{heads}$ heads: $O(n_{heads} \cdot S^2 d_{head}) = O(S^2 \cdot n_{heads} \cdot d_{head}) = O(S^2 \cdot d_{model})$ since $n_{heads} \cdot d_{head} = d_{model}$.

📊 QKᵀ Cost — GPT-2 Small

$S=1024$, $d_{head}=64$, $n_{heads}=12$.

Per head: $1024 \times 1024 \times 64 = 67M$ FLOPs.

All 12 heads: $12 \times 67M = 805M$ FLOPs for attention scores alone.

Projection cost: $1024 \times 768 \times 768 \approx 604M$ FLOPs. Both terms are the same order of magnitude.
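Using the rule from the previous chapter, both costs can be tallied in a few lines. The sketch drops the ~2× multiply-add factor, matching the $O(m \cdot k \cdot p)$ convention above:

```python
def matmul_cost(m, k, p):
    """Cost of (m, k) @ (k, p): m*p output entries, each a
    length-k dot product -> O(m * k * p) operations."""
    return m * k * p

S, d_model, d_head, n_heads = 1024, 768, 64, 12

one_projection = matmul_cost(S, d_model, d_model)       # X @ W_Q
scores_all_heads = n_heads * matmul_cost(S, d_head, S)  # Q K^T, 12 heads

print(one_projection)    # 603,979,776 (~604M)
print(scores_all_heads)  # 805,306,368 (~805M)
```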

Operation 3 — Weighted Sum (weights × V)

$(S, S) \times (S, d_{head})$ per head. Cost $= O(S^2 d_{head})$ per head $= O(S^2 d_{model})$ total. Same as $QK^\top$.

Total

$$\text{Total Compute} = O(S \cdot d_{model}^2) + O(S^2 \cdot d_{model})$$
Which term dominates?

- S ≪ d_model (short sequences, large model): S·d_model² dominates → projections are the bottleneck
- S ≫ d_model (very long sequences): S²·d_model dominates → attention scores are the bottleneck
- S ~ d_model (typical): both matter. Written as O(S²·d_model).
Why the S² Hurts

Double the sequence length → 4× the compute and memory for attention. This is why 100k-token contexts are expensive, and why FlashAttention, sparse attention, and linear attention were invented.

11 Memory: Why O(S²)

🤔 Question

"Compute is O(S²·d_model). Why is memory only O(S²)? Shouldn't memory scale with d_model too?"

Compute counts operations performed. Memory counts values that must be stored simultaneously.

The attention score matrix has shape $(B, n_{heads}, S, S)$. That's $B \times n_{heads} \times S^2$ numbers. The $d_{head}$ dimension was consumed by the dot product — it doesn't appear in the output matrix. So storing the score matrix costs $O(S^2)$ per batch element per head.

📊 Attention Score Memory — GPT-2 Small

$S=1024$, $n_{heads}=12$, float32 (4 bytes each).

Score matrix: $12 \times 1024 \times 1024 = 12{,}582{,}912$ floats $\approx$ 48 MB per layer.

During training with backprop ($L=12$ layers): $12 \times 48 = 576$ MB just for attention scores.

📊 Why Long Contexts Hit a Wall

Score matrix scales as $S^2$.

$S=1024$ → 48 MB/layer.   $S=4096$ → 768 MB/layer.   $S=32{,}768$ → 48 GB/layer.

At $S=32k$, storing attention scores is infeasible on most GPUs. This is the exact problem FlashAttention solves by recomputing scores on the fly rather than storing them.
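A small helper makes the wall obvious. MiB units here, float32 (4 bytes) assumed, matching the figures above:

```python
def score_matrix_mib(S, n_heads=12, bytes_per_float=4):
    """One layer's (n_heads, S, S) attention-score matrix, in MiB."""
    return n_heads * S * S * bytes_per_float / 2**20

print(score_matrix_mib(1024))    # 48.0 MiB/layer
print(score_matrix_mib(4096))    # 768.0 MiB/layer
print(score_matrix_mib(32_768))  # 49152.0 MiB = 48 GiB/layer
```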

Part V

KV Caching — Full Deep Dive

12 The Problem KV Caching Solves

KV caching is an inference-time optimisation only — it doesn't apply during training.

During autoregressive generation, tokens are produced one at a time:

```
Step 1: Input = ["The", "cat"]                → generate "sat"
Step 2: Input = ["The", "cat", "sat"]         → generate "on"
Step 3: Input = ["The", "cat", "sat", "on"]   → generate "the"
...
```

At each step, in every transformer layer, the model computes $K$ and $V$ for every token in the current input.

📊 Redundant Computation Without Caching

Generating the 100th token. Context = 99 previous tokens.

Model computes $K$ and $V$ for all 99 tokens (including "The", "cat", etc.).

Generating the 101st token. Context = 100 tokens.

Model computes $K$ and $V$ for all 100 tokens — including the same 99 from before.

$K$ of "The" at step 100 = $K$ of "The" at step 1 = identical. Recomputed for no reason.

For a 1000-token generation: token 1000 triggers recomputation of K,V for 999 prior tokens. Total redundant token projections: $1+2+\ldots+999 \approx 500{,}000$ — waste that grows as $O(S^2)$.

13 Why K and V but not Q?

🤔 Question

"Why cache K and V specifically? What's special about them vs Q?"

```
Generating token t (the NEW token):

Q of new token:
  = "What does THIS token need from the context?"
  → Computed from the new token's embedding.
  → DIFFERENT every step — new token, new Q.
  → Nothing to cache. Always fresh.

K of token 1 ("The"):
  = "What does 'The' contain?" = X_The @ W_K
  → Computed from 'The's embedding, which never changes.
  → K of "The" at step 100 = K of "The" at step 1. IDENTICAL.
  → CACHE IT.

V of token 1 ("The"):
  = "What does 'The' contribute if attended to?" = X_The @ W_V
  → Same logic. Cache it.

At step t:
  1. Retrieve K, V for tokens 1..t-1 from the cache (free)
  2. Compute K, V only for new token t (one vector each)
  3. Compute Q only for new token t (one vector)
  4. Run attention: Q_t over all K_1..K_t (cheap)
  5. Append K_t, V_t to the cache (for the next step)
```
Key Insight

$K$ and $V$ are deterministic functions of their token — same token always produces the same $K$ and $V$. $Q$ is the current token's fresh question to the world — always new. Cache the answers, not the questions.
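A toy single-head sketch (so $d_{head} = d_{model}$ here; random weights, hypothetical names) shows the cached path producing exactly the same output as full recomputation:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(3))

def attend(q, K, V):
    """One new token's query q (d,) over all cached keys/values (t, d)."""
    w = np.exp(q @ K.T / np.sqrt(d_model))
    return (w / w.sum()) @ V

tokens = rng.standard_normal((5, d_model))  # embeddings of a 5-token context

# Without cache: re-project K, V for the whole context at the current step.
full_K, full_V = tokens @ W_K, tokens @ W_V
out_full = attend(tokens[-1] @ W_Q, full_K, full_V)

# With cache: K, V rows were appended one per step; only the newest row
# is ever freshly computed.
K_cache, V_cache = [], []
for x in tokens:
    K_cache.append(x @ W_K)   # one (d_model,) vector per step
    V_cache.append(x @ W_V)
out_cached = attend(tokens[-1] @ W_Q, np.stack(K_cache), np.stack(V_cache))

assert np.allclose(out_full, out_cached)  # identical output, far fewer FLOPs
```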

14 Cache Size: Where 2·S·L·d_model Comes From

🤔 Question

"I've seen KV cache size quoted as 2·S·L·d_model (roughly). Where does each term come from?"

Term: d_head
One $K$ vector for one token in one head has $d_{head}$ numbers. One $V$ vector also has $d_{head}$. Per token, per head: $2 \times d_{head}$ numbers.
Term: n_heads
There are $n_{heads}$ heads. Each has its own $K$ and $V$. Per token, per layer: $2 \times n_{heads} \times d_{head} = 2 \times d_{model}$ numbers (since $n_{heads} \times d_{head} = d_{model}$).
Term: L
There are $L$ transformer layers. Each layer has its own $K$ and $V$ cache. Per token: $2 \times L \times d_{model}$ numbers.
Term: S
We cache up to $S$ tokens (full context). Total: $2 \times S \times L \times d_{model}$ numbers.
$$\text{KV Cache} = 2 \cdot S \cdot L \cdot d_{model} \text{ numbers}$$ $$\text{In bytes (float16, 2 bytes per number):} \quad 4 \cdot S \cdot L \cdot d_{model} \text{ bytes}$$
📊 GPT-2 Small KV Cache

$S=1024$, $L=12$, $d_{model}=768$, float16.

Numbers: $2 \times 1024 \times 12 \times 768 = 18{,}874{,}368$

Bytes: $\times 2 = 37{,}748{,}736 \approx$ 36 MB. Tiny.

📊 LLaMA 70B KV Cache — Why This Matters

$S=4096$, $L=80$, $d_{model}=8192$, float16.

Bytes: $4 \times 4096 \times 80 \times 8192 = 10{,}737{,}418{,}240 \approx$ 10 GB just for KV cache.

Model weights: ~130 GB additional. For $S=32{,}768$ (long context): KV cache alone → ~80 GB.
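All three examples reduce to one formula; a sketch in Python, float16 assumed (2 bytes per number):

```python
def kv_cache_bytes(S, L, d_model, bytes_per_number=2):
    """2 (K and V) x S tokens x L layers x d_model numbers each."""
    return 2 * S * L * d_model * bytes_per_number

print(kv_cache_bytes(1024, 12, 768) / 2**20)     # 36.0 MiB  (GPT-2 Small)
print(kv_cache_bytes(4096, 80, 8192) / 2**30)    # 10.0 GiB  (LLaMA-70B scale)
print(kv_cache_bytes(32_768, 80, 8192) / 2**30)  # 80.0 GiB  (long context)
```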

15 Does It Actually Save That Much? (With Numbers)

🤔 Question

"K and V are just two matrix multiplies. How much can reusing them actually save?"

At generation step $t$, the input context has $t-1$ tokens.

Without KV cache: compute $K$ and $V$ for all $t-1$ tokens per layer. Each is $(t-1, d_{model}) \times (d_{model}, d_{model})$: cost $= O((t-1) \cdot d_{model}^2)$ per layer.

With KV cache: compute $K$ and $V$ only for the new token. That's $(1, d_{model}) \times (d_{model}, d_{model})$: cost $= O(d_{model}^2)$ per layer — independent of $t$.

📊 Operations Saved at Step t=500, GPT-2 Small

$t=500$, $d_{model}=768$, $L=12$.

Without cache — K,V projections, all layers: $499 \times 768 \times 768 \times 12 \times 2 \approx 7.1B$ FLOPs

With cache — K,V for 1 new token, all layers: $1 \times 768 \times 768 \times 12 \times 2 \approx 14M$ FLOPs

Savings: ~500× fewer FLOPs for K,V projections at step 500.
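The step-500 arithmetic is easy to check directly (counting one operation per weight visited, as in the formulas above):

```python
d_model, L, t = 768, 12, 500

per_token = 2 * d_model * d_model * L   # K and V projections across all L layers
without_cache = (t - 1) * per_token     # re-project the whole 499-token context
with_cache = per_token                  # project only the one new token

print(round(without_cache / 1e9, 2))  # ≈ 7.06 billion ops
print(round(with_cache / 1e6, 2))     # ≈ 14.16 million ops
print(without_cache // with_cache)    # 499 → ~500x fewer ops at step 500
```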

Total savings across all $S$ generation steps:

```
Total K,V projection FLOPs across generating S tokens:

Without KV cache:
  Step 1: K,V for 1 token  → O(d_model²)
  Step 2: K,V for 2 tokens → O(2·d_model²)
  ...
  Step S: K,V for S tokens → O(S·d_model²)
  Total: (1+2+...+S)·d_model² = O(S²·d_model²)   ← quadratic in S

With KV cache:
  Every step: K,V for 1 token → O(d_model²)
  Total: S·d_model² = O(S·d_model²)              ← linear in S

Speedup ratio: O(S²·d_model²) / O(S·d_model²) = O(S)
At S=10,000: projections are ~5,000× cheaper
(the speedup at step t is ~t, so it averages S/2 over the run).
```
🤔 Follow-up

"But we still run attention over the full cached K and V every step. Doesn't that stay O(S²)?"

✓ Yes — and this is the remaining bottleneck

Correct. At step $t$, the new token's $Q$ attends over all $t-1$ cached $K$ vectors: $(1, d_{head}) \times (d_{head}, t-1) = O(t \cdot d_{head})$ per head. Summed across all steps and heads: $O(S^2 \cdot d_{model})$ — still quadratic. KV caching eliminates the redundant projection cost. The attention over the growing context remains $O(S^2)$.

Part VI

Masking

16 The Three Types of Masking

Masking = set certain attention scores to $-\infty$ before softmax so they get zero weight. Three distinct situations need it:

| Mask | Problem It Solves | Where Applied | Implementation |
|---|---|---|---|
| Padding mask | Batch sequences are padded to equal length — PAD tokens shouldn't influence attention | Attention scores | attention_mask (1=real, 0=pad) |
| Loss mask | PAD tokens shouldn't contribute to the loss | Cross-entropy loss | ignore_index=0 |
| Causal mask | Autoregressive models can't see future tokens during training | Attention scores | Upper triangular $-\infty$, built into the model |
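A NumPy sketch of how the causal and padding masks combine before softmax. Toy sizes, and −1e9 stands in for −∞:

```python
import numpy as np

S = 5
NEG_INF = -1e9  # large negative stand-in for -inf

# Causal mask: upper triangle (j > i) blocked, so token i can't see the future.
causal = np.triu(np.full((S, S), NEG_INF), k=1)

# Padding mask: suppose the last 2 positions are PAD (attention_mask: 1=real, 0=pad).
attention_mask = np.array([1, 1, 1, 0, 0])
padding = np.where(attention_mask == 0, NEG_INF, 0.0)  # broadcast over key columns

scores = np.zeros((S, S))            # stand-in for Q K^T / sqrt(d_head)
masked = scores + causal + padding   # both masks added before softmax

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
assert np.allclose(weights[0], [1, 0, 0, 0, 0])  # token 0 sees only itself
```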

17 The Padding Bug

```
"hello world"    → [5, 3, 0, 0, 0]   2 real + 3 PADs
"good morning"   → [2, 7, 1, 4, 0]   4 real + 1 PAD
"I love pandas"  → [6, 3, 9, 2, 1]   5 real + 0 PADs
```

Bug 1 — No attention mask: Real tokens attend to PAD tokens, pulling garbage into their representations.

```python
# Broken ❌ — real tokens attend to PAD tokens
output = model(x)

# Fixed ✅ — PAD positions get -inf before softmax → 0 weight
output = model(x, attention_mask=mask)  # mask: 1=real, 0=pad
```

Bug 2 — No loss mask: Model penalised for predicting PAD token positions — meaningless noise.

```python
# Broken ❌
loss = nn.CrossEntropyLoss()(logits, targets)

# Fixed ✅
loss = nn.CrossEntropyLoss(ignore_index=0)(logits, targets)
```
🤔 Question

"Is ignore_index=0 the position index or the token value?"

✓ Token value — not position

"Skip any position where the target tensor value equals 0." If your PAD token ID is 99, use ignore_index=99. Think of it as ignore_token_id.
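A minimal NumPy illustration of the token-value semantics (hypothetical 10-word vocabulary, PAD id 0):

```python
import numpy as np

PAD_ID = 0  # the token VALUE that ignore_index skips — not a position index
targets = np.array([5, 3, PAD_ID, PAD_ID])  # last two positions are padding

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 10))       # (positions, vocab size)

# Manual cross-entropy: log-softmax, then pick out each target's log-prob.
log_probs = logits - np.log(np.exp(logits).sum(-1, keepdims=True))
per_token_loss = -log_probs[np.arange(4), targets]

keep = targets != PAD_ID            # mask by token value, like ignore_index
loss = per_token_loss[keep].mean()  # averaged over the 2 real tokens only
print(keep)                         # [ True  True False False]
```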

Symptom of both bugs: model performs poorly on short sequences (more padding = more noise), well on long ones. Length-correlated degradation is the tell.

Part VII

Training

18 The Correct Training Loop

```python
import torch
import torch.nn as nn
from transformers import get_cosine_schedule_with_warmup

model = TransformerModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=1000, num_training_steps=100_000
)

for epoch in range(num_epochs):
    # ── Training ──────────────────────────────────────────────────
    model.train()                                       # (1) train mode
    for x, y, mask in train_loader:
        optimizer.zero_grad()                           # (2) clear gradients
        output = model(x, attention_mask=mask)          # (3) pass padding mask
        loss = nn.CrossEntropyLoss(ignore_index=0)(output, y)  # (4) mask loss
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # (5) clip gradients
        optimizer.step()
        scheduler.step()                                # (6) update LR

    # ── Validation ────────────────────────────────────────────────
    model.eval()                                        # (7) eval mode
    with torch.no_grad():
        for x, y, mask in val_loader:
            output = model(x, attention_mask=mask)
            val_loss = nn.CrossEntropyLoss(ignore_index=0)(output, y)
    model.train()                                       # (8) switch back!
```
| # | What | Effect if Missing | Severity |
|---|---|---|---|
| 2 | zero_grad() | Gradients accumulate — wrong update every step | 🔴 Immediate failure |
| 4 | ignore_index | PAD tokens add noise to loss — silent degradation | 🟠 Silent |
| 3 | attention_mask | PAD pollutes representations — silent degradation | 🟠 Silent |
| 8 | model.train() after eval | Dropout/BN in wrong mode — conditional on having called eval() | 🟡 Conditional |
| 5 | grad clipping | Gradient explosion on long sequences | 🟡 Rare for transformers |

19 Why Warmup LR for Transformers?

🤔 Question

"Why do transformers need LR warmup? CNNs work fine without it."

Reason 1 — Adam's early estimates are unreliable. Adam maintains $m_t$ (EMA of gradients) and $v_t$ (EMA of squared gradients). At step 1, $m_1 = 0.1 \times g_1$ — based on one sample, highly noisy. A large LR multiplies this noise into a large bad update.

📊 Typical Warmup Schedule

100k total steps, target LR $= 1\times10^{-4}$:

Steps 0–1000: LR increases linearly $0 \to 1\times10^{-4}$ (warmup — 1% of training).

Steps 1000–100k: cosine decay $1\times10^{-4} \to 1\times10^{-5}$.

The 1000-step warmup often makes the difference between stable and divergent early training.
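The schedule itself is a few lines. This is a sketch matching the numbers above, not the exact Hugging Face implementation:

```python
import math

def lr_at(step, warmup=1_000, total=100_000, peak=1e-4, floor=1e-5):
    """Linear warmup to `peak`, then cosine decay down to `floor`."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / (total - warmup)   # 0 → 1 over decay phase
    return floor + 0.5 * (peak - floor) * (1 + math.cos(math.pi * progress))

print(lr_at(0))        # 0.0     (start of warmup)
print(lr_at(500))      # ≈ 5e-5  (halfway through warmup)
print(lr_at(1_000))    # ≈ 1e-4  (peak)
print(lr_at(100_000))  # ≈ 1e-5  (floor at the end of training)
```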

Reason 2 — Transformers amplify instability. Multi-head attention + layer norm + residual connections across many layers means a bad early update propagates through all $L$ layers. CNNs are locally connected and more forgiving.

Part VIII

What Came After — The Timeline

20 Every Problem the Vanilla Transformer Had, and Who Fixed It

The 2017 "Attention is All You Need" transformer was a breakthrough — but it had real problems. Here's how each was addressed, in order. Each entry below will become its own deep-dive post.

2017

Vanilla Transformer

Introduced self-attention, multi-head attention, positional encoding. Encoder-decoder architecture for translation. Replaced RNNs entirely for sequence-to-sequence tasks.

Problems: O(S²) memory, fixed positional encoding doesn't generalise, no way to run long contexts.
2018

GPT-1 & BERT — The Pretraining Era Begins

GPT: decoder-only, causal masking, autoregressive pretraining. BERT: encoder-only, masked language modelling, bidirectional context. Both showed that large-scale pretraining + fine-tuning transfers to nearly any NLP task.

Problem: Still O(S²), fixed context windows (512 tokens for both), expensive to train.
2019

Transformer-XL — Breaking the Fixed Context Window

Problem: Transformer processes each context window independently. No memory across windows. Long-range dependencies get truncated.
Fix: Segment-level recurrence. Cache hidden states from previous segments; attend to them in the current segment. Also introduced relative positional encodings.

First real attempt at extending context without quadratic memory blowup.

2020

Longformer & BigBird — Sparse Attention

Problem: Full attention is O(S²). At S=4096, each head's score matrix already holds 16.8M entries (64 MB in float32), multiplied across every head and layer. Infeasible for long documents.
Fix: Replace full attention with sparse patterns. Longformer: each token attends to a local window + a few global tokens (e.g. [CLS]). BigBird: local + global + random attention. Reduces to O(S) or O(S·window_size).

Enabled processing of long documents (legal, scientific papers) that were previously impossible.

2021

RoPE — Rotary Positional Embeddings

Problem: Absolute positional encodings (sinusoidal or learned) don't generalise to sequence lengths longer than seen during training.
Fix: Encode position as a rotation in the Q and K vectors. Relative position between tokens is naturally captured in the dot product. Generalises to longer sequences than training length.

RoPE is now the standard in most modern LLMs — LLaMA and Mistral both use it.

2022

FlashAttention — Fixing the Memory Wall

Problem: The $(S, S)$ attention score matrix must be written to and read from GPU HBM (slow memory). At S=4096 this is 64MB per head per layer — a massive memory-bandwidth bottleneck even when compute is fine.
Fix: Tiled computation. Split Q, K, V into blocks; compute attention scores, softmax, and weighted sum block-by-block within fast SRAM (on-chip memory). Never materialise the full $(S, S)$ matrix in HBM. Recompute during backprop instead of storing.

Same O(S²) compute — but 2–4× faster in practice due to memory access patterns. Became the standard attention implementation. FlashAttention-2 (2023) and FlashAttention-3 (2024) pushed further.

2022

Multi-Query Attention (MQA) & Grouped-Query Attention (GQA)

Problem: KV cache grows as $2 \cdot S \cdot L \cdot d_{model}$ bytes — at 10GB+ for large models, inference is memory-bound. Can't serve many users simultaneously.
MQA: All $n_{heads}$ query heads share a single K and V head. KV cache shrinks by $n_{heads}\times$. GQA (compromise): group $n_{heads}$ into $G$ groups; each group shares one K,V. LLaMA 2/3 uses GQA with $G=8$.

GQA reduces KV cache by 8× vs multi-head attention with minimal quality loss. Now standard in production LLMs.

2023

Sliding Window Attention & Mixture of Experts — Mistral

Problem: Full attention over long contexts is expensive even with FlashAttention. Most tokens don't need to attend to tokens far away.
Sliding window attention: each token attends only to the $W$ most recent tokens (e.g. $W=4096$). Combined with GQA and RoPE. Mixtral adds sparse MoE layers — only 2 of 8 expert FFN layers active per token, cutting compute while keeping parameters high.
2024

Linear Attention & State Space Models — Mamba

Problem: Attention is fundamentally O(S²). Even with all the tricks, processing 1M-token contexts remains expensive. Can we get rid of the quadratic entirely?
Mamba: replaces attention with selective state space models (SSMs). Key innovation: input-dependent (selective) state transitions. Achieves O(S) compute and O(1) memory per step during inference. No attention matrix at all.

Competitive with transformers on language tasks at medium scale. Not yet clearly dominant — hybrid Mamba+attention models (Jamba) show promise. Active research area.

2025

DeepSeek-V3 & Multi-Head Latent Attention (MLA)

Problem: Even GQA has a large KV cache. MQA sacrifices quality. Can we get small KV cache without quality loss?
Multi-Head Latent Attention (MLA): compress K and V into a low-rank latent vector $c \in \mathbb{R}^{d_c}$ where $d_c \ll d_{model}$. Cache only $c$ per token instead of full K,V. Reconstruct K,V from $c$ at attention time via learned up-projection. KV cache shrinks by $(2 \cdot n_{heads} \cdot d_{head}) / d_c \approx 5$–$10\times$ vs MHA with no quality loss.

Combined with MoE and FP8 training, DeepSeek-V3 reported GPT-4-level performance at a fraction of the usual training cost, drawing significant industry attention.

The Through-Line

Every innovation above is attacking one of three constraints: (1) O(S²) compute (sparse attention, linear attention, Mamba), (2) O(S²) memory (FlashAttention, KV caching, MQA/GQA, MLA), or (3) fixed context/positional generalisation (Transformer-XL, RoPE). Each post in this series will go deep on one of these rows.

21 Interview Summary

🎯 The Full Narrative

"RNNs process sequentially and compress everything into a hidden state — causing vanishing gradients and preventing parallelisation.

Transformers replace this with attention: $\text{softmax}(QK^\top/\sqrt{d_{head}})V$. Every token directly queries every other. We scale by $\sqrt{d_{head}}$ because dot product variance grows as $d_{head}$, and large values saturate softmax — dividing restores unit variance.

Multi-head attention runs $n_{heads}$ independent attention operations in parallel, each specialising in different relationship types, using the same total parameters as single-head attention ($n_{heads} \times d_{head} = d_{model}$).

Attention is $O(S^2 \cdot d_{model})$ compute because $QK^\top$ is $(S, d_{head}) \times (d_{head}, S) = O(S^2 d_{head})$ per head, times $n_{heads}$ gives $O(S^2 d_{model})$. Memory is $O(S^2)$ because the score matrix is $(S, S)$ — $d_{head}$ was consumed by the dot product.

KV caching saves inference compute by storing previous tokens' $K$ and $V$ — they're deterministic functions of those tokens and never change. $Q$ is always the fresh new token's question, so it's never cached. Without caching, projection cost across a full generation is $O(S^2 \cdot d_{model}^2)$. With caching, $O(S \cdot d_{model}^2)$ — linear in $S$. Cache size is $2SLd_{model}$ numbers = $4SLd_{model}$ bytes in float16.

Three masking types: attention mask for padding, ignore_index for loss, causal mask for autoregressive models. Forgetting the loss mask causes silent length-correlated degradation. LR warmup stabilises early Adam estimates and prevents cascading instability in deep networks."

TL;DR Cheatsheet
| Concept | One Line | Key Number |
|---|---|---|
| Attention formula | $\text{softmax}(QK^\top/\sqrt{d_{head}})V$ | |
| Why $\sqrt{d_{head}}$ | Dot product std dev = $\sqrt{d_{head}}$; divide to restore unit variance | $d_{head}=64$ → divide by 8 |
| Multi-head | Same params ($n_{heads} \times d_{head} = d_{model}$), $n_{heads}$ specialised patterns | GPT-2: 12 heads × 64 = 768 |
| Matrix multiply cost | $(m,k)\times(k,p) = O(m\cdot k\cdot p)$ — multiply all three dims | Always |
| Attention compute | $O(S^2 \cdot d_{model})$ from $QK^\top$ being $(S,S)$ times $n_{heads}$ heads | Double $S$ → 4× FLOPs |
| Attention memory | $O(S^2)$ — score matrix is $(S,S)$, $d_{head}$ consumed by dot product | $S=32k$ → 48 GB/layer |
| KV cache what | Store K,V of prev tokens; Q is always fresh so never cached | |
| KV cache saves | Projection cost: $O(S^2 d_{model}^2) \to O(S \cdot d_{model}^2)$ across all steps | $S=10k$ → ~5,000× cheaper projections |
| KV cache size | $2SLd_{model}$ numbers, $4SLd_{model}$ bytes float16 | LLaMA 70B, $S=4k$ → 10 GB |
| Padding mask | Add $-\infty$ to PAD positions before softmax | attention_mask |
| Loss mask | Skip PAD token IDs in cross-entropy | ignore_index=0 |
| Causal mask | Upper triangular $-\infty$ prevents future token leakage | GPT-style only |

Papers referenced: Attention Is All You Need (2017) · FlashAttention (2022) · GQA (2023) · Mamba (2023) · DeepSeek-V3 (2024)