Companion Volume – Deep Dives & Implementation
ML First Principles, Part II:
From Equations to Working Code
Extended worked examples, implementation patterns, the training pipeline in practice,
inference mechanics, BERT vs GPT, and the topics that bridge theory to engineering.
We will trace every single computation through a small network, forward and backward, with exact numbers. No hand-waving. When you finish this section, you should be able to compute gradients for any network on paper.
Network Setup
INPUT              HIDDEN LAYER        OUTPUT
                   (ReLU)              (Sigmoid + BCE)

x₁ = 1.0 ──┐   ┌─→ h₁ ──┐
           ├───┤        ├─→ o₁ ──→ ŷ ──→ Loss
x₂ = 0.5 ──┘   └─→ h₂ ──┘
Weights:
W¹ = [[0.15, 0.20],     b¹ = [0.35, 0.35]
      [0.25, 0.30]]
W² = [[0.40, 0.45]]     b² = [0.60]
True label: y = 1
Step 1: Forward Pass – Hidden Layer
Compute pre-activation \(\mathbf{z}^{[1]} = \mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}\):
$$z_1^{[1]} = 0.15(1.0) + 0.20(0.5) + 0.35 = 0.15 + 0.10 + 0.35 = 0.60$$
$$z_2^{[1]} = 0.25(1.0) + 0.30(0.5) + 0.35 = 0.25 + 0.15 + 0.35 = 0.75$$
Apply ReLU activation \(\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]})\):
$$a_1^{[1]} = \max(0, 0.60) = 0.60 \qquad a_2^{[1]} = \max(0, 0.75) = 0.75$$
Step 2: Forward Pass – Output Layer
Compute pre-activation and sigmoid:
$$z^{[2]} = 0.40(0.60) + 0.45(0.75) + 0.60 = 0.24 + 0.3375 + 0.60 = 1.1775$$
$$\hat{y} = a^{[2]} = \sigma(1.1775) = \frac{1}{1 + e^{-1.1775}} = \frac{1}{1 + 0.3083} = 0.7644$$
Step 3: Compute Loss
Binary cross-entropy with \(y = 1\):
$$\mathcal{L} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})] = -\log(0.7644) = 0.2686$$
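These numbers are easy to check mechanically. A few lines of NumPy reproducing the forward pass and the loss (a sketch using the exact setup values above):

```python
import numpy as np

x = np.array([1.0, 0.5])
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = np.array([0.35, 0.35])
W2 = np.array([[0.40, 0.45]]); b2 = np.array([0.60])

z1 = W1 @ x + b1               # pre-activations: [0.60, 0.75]
a1 = np.maximum(0, z1)         # ReLU: [0.60, 0.75]
z2 = W2 @ a1 + b2              # [1.1775]
y_hat = 1 / (1 + np.exp(-z2))  # sigmoid: ~0.7644
loss = -np.log(y_hat)          # BCE with y = 1: ~0.2686
print(y_hat, loss)
```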
Step 4: Backward Pass – Output Layer Error
The key identity for sigmoid + BCE: \(\delta^{[2]} = \hat{y} - y\):
Output error signal
$$\delta^{[2]} = \frac{\partial \mathcal{L}}{\partial z^{[2]}} = \hat{y} - y = 0.7644 - 1 = -0.2356$$
Let's prove the elegant identity \(\delta = \hat{y} - y\) from scratch.
We need \(\frac{\partial \mathcal{L}}{\partial z}\). By the chain rule:
\(\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}\)
Term 1: \(\frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}}[-y\log\hat{y} - (1-y)\log(1-\hat{y})] = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\)
Term 2: \(\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y})\)
Combined:
\(\frac{\partial \mathcal{L}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right) \cdot \hat{y}(1-\hat{y})\)
\(= -y(1-\hat{y}) + (1-y)\hat{y}\)
\(= -y + y\hat{y} + \hat{y} - y\hat{y}\)
\(= \hat{y} - y \quad \blacksquare\)
This cancellation is not a coincidence. It arises because the sigmoid is the natural link function for the Bernoulli distribution; this follows from the theory of exponential family distributions and generalized linear models.
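The proof can also be verified numerically with a central-difference check (a sketch):

```python
import numpy as np

def bce_of_z(z, y):
    # BCE loss as a function of the pre-activation z
    y_hat = 1 / (1 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 1.1775, 1.0, 1e-6
numeric = (bce_of_z(z + eps, y) - bce_of_z(z - eps, y)) / (2 * eps)
analytic = 1 / (1 + np.exp(-z)) - y   # the identity: y_hat - y
print(numeric, analytic)              # the two values agree
```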
Step 5: Gradients for Output Weights
Output weight gradients
$$\frac{\partial \mathcal{L}}{\partial W_{1}^{[2]}} = \delta^{[2]} \cdot a_1^{[1]} = (-0.2356)(0.60) = -0.1414$$
$$\frac{\partial \mathcal{L}}{\partial W_{2}^{[2]}} = \delta^{[2]} \cdot a_2^{[1]} = (-0.2356)(0.75) = -0.1767$$
$$\frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]} = -0.2356$$
Step 6: Propagate Error to Hidden Layer
Error propagation through weights
$$\delta_1^{[1]\text{(pre-ReLU)}} = W_1^{[2]} \cdot \delta^{[2]} = 0.40 \times (-0.2356) = -0.0943$$
$$\delta_2^{[1]\text{(pre-ReLU)}} = W_2^{[2]} \cdot \delta^{[2]} = 0.45 \times (-0.2356) = -0.1060$$
Apply ReLU derivative: since both \(z_1^{[1]} = 0.60 > 0\) and \(z_2^{[1]} = 0.75 > 0\), the ReLU derivative is 1 for both:
$$\delta_1^{[1]} = -0.0943 \times 1 = -0.0943$$
$$\delta_2^{[1]} = -0.1060 \times 1 = -0.1060$$
Step 7: Gradients for Hidden Weights
Hidden weight gradients
$$\frac{\partial \mathcal{L}}{\partial W_{11}^{[1]}} = \delta_1^{[1]} \cdot x_1 = (-0.0943)(1.0) = -0.0943$$
$$\frac{\partial \mathcal{L}}{\partial W_{12}^{[1]}} = \delta_1^{[1]} \cdot x_2 = (-0.0943)(0.5) = -0.0471$$
$$\frac{\partial \mathcal{L}}{\partial W_{21}^{[1]}} = \delta_2^{[1]} \cdot x_1 = (-0.1060)(1.0) = -0.1060$$
$$\frac{\partial \mathcal{L}}{\partial W_{22}^{[1]}} = \delta_2^{[1]} \cdot x_2 = (-0.1060)(0.5) = -0.0530$$
Step 8: Update All Weights (\(\eta = 0.5\))
Since \(\delta\) was negative (the model under-predicted: \(\hat{y} = 0.76 < 1\)), all gradients are negative, so all weights increase. This makes the output larger, pushing \(\hat{y}\) closer to 1.
Updated weights
$$W_1^{[2]}: \; 0.40 - 0.5(-0.1414) = 0.40 + 0.0707 = 0.4707$$
$$W_2^{[2]}: \; 0.45 - 0.5(-0.1767) = 0.45 + 0.0884 = 0.5384$$
$$b^{[2]}: \; 0.60 - 0.5(-0.2356) = 0.60 + 0.1178 = 0.7178$$
$$W_{11}^{[1]}: \; 0.15 - 0.5(-0.0943) = 0.15 + 0.0471 = 0.1971$$
$$\text{...and similarly for all hidden weights}$$
The backprop recipe, summarized:
1. Forward pass: compute all \(\mathbf{z}^{[l]}\) and \(\mathbf{a}^{[l]}\), caching each one.
2. Output error: \(\delta^{[L]} = \hat{y} - y\) (for sigmoid + BCE).
3. For each layer \(l = L-1, L-2, \ldots, 1\):
– Propagate: \(\boldsymbol{\delta}^{[l]} = (\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]} \odot \phi'(\mathbf{z}^{[l]})\)
– Gradients: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T\)
4. Update: \(\mathbf{W}^{[l]} \leftarrow \mathbf{W}^{[l]} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}}\)
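Applied to the worked example, one full pass of this recipe can be checked in NumPy (a sketch reusing the setup values: it recomputes the forward pass, applies one gradient step, and confirms the loss drops):

```python
import numpy as np

def forward_loss(W1, b1, W2, b2, x, y):
    a1 = np.maximum(0, W1 @ x + b1)
    y_hat = (1 / (1 + np.exp(-(W2 @ a1 + b2))))[0]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

x, y, lr = np.array([1.0, 0.5]), 1.0, 0.5
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = np.array([0.35, 0.35])
W2 = np.array([[0.40, 0.45]]); b2 = np.array([0.60])

before = forward_loss(W1, b1, W2, b2, x, y)   # ~0.2686

# Recompute the cached values, then apply one gradient step
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)
y_hat = (1 / (1 + np.exp(-(W2 @ a1 + b2))))[0]
d2 = y_hat - y                    # delta^[2] ~ -0.2356
d1 = (W2[0] * d2) * (z1 > 0)      # delta^[1]: propagate, then ReLU gate
W2 = W2 - lr * d2 * a1            # output-layer gradient: delta * a1
b2 = b2 - lr * d2
W1 = W1 - lr * np.outer(d1, x)    # hidden-layer gradient: delta x^T
b1 = b1 - lr * d1

after = forward_loss(W1, b1, W2, b2, x, y)
print(before, after)              # the loss drops after the update
```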
· · ·
Let's trace a minimal transformer decoder through one forward pass, computing every matrix multiplication. We'll use a vocabulary of 5 tokens, embedding dimension \(d = 4\), one attention head, and a sequence of 3 tokens.
Setup
Vocabulary: {0: "the", 1: "cat", 2: "sat", 3: "on", 4: "mat"}.
Input sequence: ["the", "cat", "sat"] → token IDs [0, 1, 2].
Target: predict the next token at each position.
Step 1: Token Embedding
Embedding matrix \(\mathbf{E} \in \mathbb{R}^{5 \times 4}\) (lookup table):
$$\mathbf{E} = \begin{bmatrix}
0.2 & 0.1 & -0.3 & 0.5 \\
-0.1 & 0.4 & 0.2 & -0.2 \\
0.3 & -0.2 & 0.1 & 0.4 \\
0.1 & 0.3 & -0.1 & 0.2 \\
-0.3 & 0.2 & 0.4 & -0.1
\end{bmatrix}$$
Look up rows 0, 1, 2 to get \(\mathbf{X} \in \mathbb{R}^{3 \times 4}\):
$$\mathbf{X} = \begin{bmatrix}
0.2 & 0.1 & -0.3 & 0.5 \\
-0.1 & 0.4 & 0.2 & -0.2 \\
0.3 & -0.2 & 0.1 & 0.4
\end{bmatrix}$$
Step 2: Positional Encoding (simplified)
Add position vectors (using simple learned encodings for clarity):
$$\mathbf{P} = \begin{bmatrix}
0.0 & 0.0 & 0.1 & 0.0 \\
0.0 & 0.1 & 0.0 & 0.0 \\
0.1 & 0.0 & 0.0 & 0.0
\end{bmatrix} \qquad \mathbf{X}' = \mathbf{X} + \mathbf{P}$$
Step 3: Compute Q, K, V
Using projection matrices \(\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V \in \mathbb{R}^{4 \times 4}\). For simplicity, let them be identity-like with small perturbations (in practice they're learned):
$$\mathbf{Q} = \mathbf{X}'\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}'\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}'\mathbf{W}^V$$
Suppose after projection we get (rounded):
$$\mathbf{Q} = \begin{bmatrix}0.22 & 0.12 & -0.18 & 0.48\\-0.08 & 0.50 & 0.18 & -0.22\\0.38 & -0.18 & 0.12 & 0.42\end{bmatrix}$$
Step 4: Attention Scores
Compute \(\mathbf{S} = \mathbf{Q}\mathbf{K}^T / \sqrt{d_k}\) where \(d_k = 4\), so divide by 2:
$$\mathbf{S}_{\text{raw}} = \mathbf{Q}\mathbf{K}^T = \begin{bmatrix}
s_{00} & s_{01} & s_{02} \\
s_{10} & s_{11} & s_{12} \\
s_{20} & s_{21} & s_{22}
\end{bmatrix}$$
For example, \(s_{00} = \mathbf{q}_0^T \mathbf{k}_0\) measures how much "the" (as query) attends to "the" (as key).
Step 5: Causal Mask
Apply the causal mask: token \(i\) can only attend to positions \(\leq i\):
$$\mathbf{M} = \begin{bmatrix}0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0\end{bmatrix}$$
$$\mathbf{S}_{\text{masked}} = \mathbf{S}/2 + \mathbf{M}$$
After masking, the first row has only one valid score (position 0); second row has two valid scores; third row has all three. Then we apply softmax row-wise:
$$\boldsymbol{\alpha} = \text{softmax}(\mathbf{S}_{\text{masked}}) = \begin{bmatrix}
1.00 & 0.00 & 0.00 \\
0.48 & 0.52 & 0.00 \\
0.30 & 0.28 & 0.42
\end{bmatrix}$$
(Example values.) Row 1: "the" can only see itself, so its weight is 1.0. Row 2: "cat" attends roughly equally to "the" and "cat". Row 3: "sat" attends to all three tokens, with slightly more weight on itself.
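The mask-then-softmax step is easy to check numerically (a sketch with made-up scores):

```python
import numpy as np

S = np.array([[0.3, 0.1, 0.4],
              [0.2, 0.5, 0.1],
              [0.3, 0.2, 0.6]])             # made-up scaled scores
M = np.triu(np.full((3, 3), -np.inf), k=1)  # causal mask: -inf above diagonal

masked = S + M
alpha = np.exp(masked - masked.max(axis=-1, keepdims=True))
alpha = alpha / alpha.sum(axis=-1, keepdims=True)
print(alpha)   # zeros above the diagonal; each row sums to 1
```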
Step 6: Weighted Sum of Values
$$\text{Output} = \boldsymbol{\alpha} \mathbf{V}$$
Each row of the output is a weighted combination of value vectors, where the weights came from the attention computation. This is the "new representation" of each token, enriched by information from tokens it attended to.
Step 7: Residual + LayerNorm + FFN + Residual + LayerNorm
$$\mathbf{H}' = \text{LayerNorm}(\mathbf{X}' + \text{Attention}(\mathbf{X}'))$$
$$\mathbf{H}'' = \text{LayerNorm}(\mathbf{H}' + \text{FFN}(\mathbf{H}'))$$
Step 8: Final Prediction
Project last layer output to vocabulary size and apply softmax:
$$\text{logits} = \mathbf{H}'' \mathbf{E}^T \in \mathbb{R}^{3 \times 5}$$
$$P(\text{next token} \mid \text{context}) = \text{softmax}(\text{logits})$$
For position 2 ("sat"), the model outputs a probability distribution over 5 tokens. If it assigns high probability to token 3 ("on"), that's correct: "the cat sat on."
In practice, the loss is computed at every position simultaneously:
Target tokens are [1, 2, 3] (the input shifted by one position: predict "cat" from "the", "sat" from "the cat", "on" from "the cat sat"). The loss is the average cross-entropy across all positions. This is extremely efficient: one forward pass gives you \(T-1\) training signals from a sequence of length \(T\).
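That per-position loss can be sketched in a few lines (random logits stand in for the model's output):

```python
import numpy as np

def lm_loss(logits, targets):
    # logits: (T, V), one next-token distribution per position
    # targets: (T,), the token IDs shifted by one position
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))        # 3 positions, vocab of 5
loss = lm_loss(logits, np.array([1, 2, 3]))
print(loss)                             # average cross-entropy over positions
```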
· · ·
From-Scratch in NumPy: 2-Layer Network
import numpy as np

# XOR dataset: 4 examples, 2 features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(42)
W1 = np.random.randn(2, 4) * np.sqrt(2.0 / 2)   # He initialization
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * np.sqrt(2.0 / 4)
b2 = np.zeros((1, 1))

def sigmoid(z): return 1 / (1 + np.exp(-z))
def relu(z): return np.maximum(0, z)

lr = 0.5
for epoch in range(10000):
    # Forward pass: cache z1, a1, z2 because backprop needs them
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    loss = -np.mean(y * np.log(a2 + 1e-8) + (1 - y) * np.log(1 - a2 + 1e-8))

    # Backward pass: mirrors the forward pass in reverse
    dz2 = a2 - y                           # sigmoid + BCE identity
    dW2 = a1.T @ dz2 / 4                   # average over the 4 examples
    db2 = np.mean(dz2, axis=0, keepdims=True)
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0).astype(float)     # ReLU derivative
    dW1 = X.T @ dz1 / 4
    db1 = np.mean(dz1, axis=0, keepdims=True)

    # Gradient descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

    if epoch % 2000 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}")

print("Predictions:", a2.round(2).flatten())
Key patterns to notice in the code:
1. Forward pass caches intermediate values (\(z_1, a_1, z_2\)) β backprop needs them.
2. Backward pass mirrors forward pass in reverse order.
3. Matrix shapes tell the story: \(dW = \text{input}^T @ \text{error}\), i.e. the gradient for a weight matrix is always (previous layer activations)ᵀ × (current layer error).
4. He initialization (\(\times \sqrt{2/n}\)) prevents activations from vanishing or exploding at the start.
Self-Attention from Scratch
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, mask=None):
    """
    X: (seq_len, d_model)
    Wq/k/v: (d_model, d_k)
    mask: (seq_len, seq_len) additive causal mask
    """
    Q = X @ Wq
    K = X @ Wk
    V = X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled dot-product scores
    if mask is not None:
        scores = scores + mask        # -inf above the diagonal
    weights = softmax(scores)
    output = weights @ V
    return output, weights

seq_len = 4
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
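To see it run end to end, here is a self-contained usage sketch with random weights (repeating the two definitions above in compact form):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, mask=None):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

out, w = self_attention(X, Wq, Wk, Wv, mask=mask)
print(np.triu(w, k=1).max())   # 0.0: no attention to future positions
print(w.sum(axis=-1))          # each row sums to 1
```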
Common Bugs & Fixes
| Bug | Symptom | Fix |
| --- | --- | --- |
| Forgetting to zero gradients | Loss increases over time | optimizer.zero_grad() before each backward |
| Wrong axis in softmax | Attention weights don't sum to 1 per row | Softmax along the keys axis (last dim) |
| Missing numerical stability | NaN in softmax or log | Subtract max before exp; add \(\epsilon\) inside log |
| Transposed weight matrix | Shape mismatch | Remember: \(\text{output} = \text{input} \times W\), so \(W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}\) |
| Not masking padding tokens | Model attends to garbage | Set padding positions to \(-\infty\) before softmax |
| Learning rate too high | Loss oscillates or explodes | Start with \(3 \times 10^{-4}\) for Adam; use warmup |
· · ·
Raw Data (Web, Books) → Filtering + Dedup + Quality Filtering → Tokenize (BPE) → Pretrain (Causal LM loss) → SFT → RLHF or DPO
Data Preparation
Scale: LLaMA-3 was trained on ~15 trillion tokens. GPT-4 is estimated at 10T+. For context, the English Wikipedia is ~4 billion tokens, a tiny fraction of training data.
Quality filtering matters enormously: the pipeline typically includes language detection, deduplication (exact and fuzzy via MinHash), removal of toxic/low-quality content, URL filtering, and classifier-based quality scoring. Scaling studies in the wake of Chinchilla suggest that data quality shifts the scaling-law constants: better data means fewer tokens are needed for the same loss.
Tokenization in Detail
Byte Pair Encoding (BPE) builds a vocabulary bottom-up:
Starting vocabulary: all individual characters + special tokens.
Corpus: "low low low low low lowest lowest newer newer newer wider wider"
Iteration 1: Most frequent pair: ("l", "o") → merge into "lo"
Iteration 2: Most frequent pair: ("lo", "w") → merge into "low"
Iteration 3: Most frequent pair: ("e", "r") → merge into "er"
Iteration 4: Most frequent pair: ("n", "e") → merge into "ne"
Iteration 5: Most frequent pair: ("ne", "w") → merge into "new"
...
Final vocabulary includes: "low", "er", "est", "new", "wid", etc.
"lowest" → ["low", "est"] (2 tokens instead of 6 characters)
"unprecedented" → ["un", "pre", "ced", "ent", "ed"] (the model handles unseen words)
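The merge loop above can be sketched in a few lines (a simplified BPE trainer for illustration, not the exact tokenizer any model ships with):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Represent each word as a tuple of symbols; start from characters
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge everywhere it occurs
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    merged.append(a + b); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

corpus = "low low low low low lowest lowest newer newer newer wider wider"
merges = bpe_merges(corpus, 5)
print(merges)   # reproduces the iterations above: lo, low, er, ne, new
```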
Typical vocab sizes: GPT-2: 50,257. LLaMA-2: 32,000. GPT-4: ~100,000. Larger vocabularies mean fewer tokens per text (more efficient) but a larger embedding matrix.
Distributed Training
No single GPU has enough memory or compute for modern LLMs. Training is distributed across hundreds or thousands of GPUs using multiple parallelism strategies:
| Strategy | What's split | Communication |
| --- | --- | --- |
| Data Parallel | Each GPU gets a different mini-batch; same model copy | Gradients are averaged across GPUs |
| Tensor Parallel | Weight matrices are split across GPUs | All-reduce within each layer |
| Pipeline Parallel | Different layers on different GPUs | Activations passed between stages |
| ZeRO (DeepSpeed) | Optimizer states, gradients, and/or parameters | Gather when needed for computation |
Mixed Precision Training
Modern training uses BF16 (bfloat16) for forward/backward computation and FP32 for the master copy of weights. BF16 has the same exponent range as FP32 (so no overflow) but lower precision; this is fine for gradients, and it halves memory and doubles throughput.
Memory estimation for a transformer
$$\text{Parameters} \approx 12 \cdot L \cdot d^2 \quad \text{(for } L \text{ layers, dimension } d\text{)}$$
$$\text{Memory (training, BF16 + Adam)} \approx 18 \text{ bytes} \times \text{params}$$
For a 7B-parameter model: ~126 GB just for parameters + optimizer states, requiring at least 2× A100-80GB GPUs.
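A two-line sanity check of this rule of thumb (a sketch; 18 bytes per parameter is the approximation above):

```python
def training_memory_gb(n_params, bytes_per_param=18):
    # BF16 weights + gradients + FP32 master weights + Adam moments
    return n_params * bytes_per_param / 1e9

print(training_memory_gb(7e9))   # 126.0 -- matches the 7B estimate above
```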
(a) A model has \(L = 32\) layers and \(d = 4096\). Estimate the parameter count using the formula above. How does this compare to LLaMA-7B?
(b) If training on 2T tokens with a batch size of 4M tokens, how many gradient updates are performed?
(c) Why is BF16 preferred over FP16 for LLM training? (Hint: think about the exponent range.)
· · ·
Autoregressive Generation
At inference time, a decoder-only LLM generates one token at a time. At each step:
$$x_{t+1} \sim P(x \mid x_1, x_2, \ldots, x_t; \theta)$$
The new token is appended to the context, and the process repeats until an end-of-sequence token is generated or a maximum length is reached.
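The generation loop itself is short. Here is a sketch in which `next_token_logits` is a hypothetical stand-in for a real model's forward pass:

```python
import numpy as np

def generate(next_token_logits, prompt, max_len=10, eos=None, seed=0):
    # next_token_logits: hypothetical callable, token list -> logits over vocab
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    while len(tokens) < max_len:
        logits = next_token_logits(tokens)
        probs = np.exp(logits - logits.max())   # softmax over the vocab
        probs /= probs.sum()
        nxt = int(rng.choice(len(probs), p=probs))
        if nxt == eos:                          # stop on end-of-sequence
            break
        tokens.append(nxt)                      # append and repeat
    return tokens

# Toy "model" that strongly prefers (last token + 1) mod 5
toy = lambda toks: 8.0 * np.eye(5)[(toks[-1] + 1) % 5]
print(generate(toy, [0], max_len=6))
```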
Sampling Strategies
Temperature
Temperature scaling
$$P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
\(T = 1\): standard softmax. \(T \to 0\): becomes argmax (greedy, deterministic). \(T > 1\): flattens the distribution (more random, more "creative"). \(T < 1\): sharpens the distribution (more focused on high-probability tokens).
Logits: \([2.0, 1.0, 0.5]\).
\(T = 1.0\): softmax → \([0.63, 0.23, 0.14]\) (already peaked)
\(T = 0.5\): divide logits by 0.5 → \([4.0, 2.0, 1.0]\) → softmax → \([0.84, 0.11, 0.04]\) (very peaked)
\(T = 2.0\): divide logits by 2 → \([1.0, 0.5, 0.25]\) → softmax → \([0.48, 0.29, 0.23]\) (flatter)
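These three cases can be reproduced directly (a sketch):

```python
import numpy as np

def softmax_T(logits, T):
    z = np.asarray(logits) / T          # temperature scaling
    e = np.exp(z - z.max())             # stable softmax
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
for T in (1.0, 0.5, 2.0):
    print(T, softmax_T(logits, T).round(2))
# T=0.5 sharpens toward [0.84, 0.11, 0.04]; T=2.0 flattens toward [0.48, 0.29, 0.23]
```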
Top-k Sampling
Only consider the \(k\) most probable tokens, redistribute probability among them:
$$P'(x_i) = \begin{cases} P(x_i) / \sum_{j \in \text{top-k}} P(x_j) & \text{if } x_i \in \text{top-k} \\ 0 & \text{otherwise} \end{cases}$$
Top-p (Nucleus) Sampling
Dynamically choose the smallest set of tokens whose cumulative probability exceeds \(p\):
$$\text{Find smallest } k \text{ such that } \sum_{i=1}^{k} P(x_{(i)}) \geq p$$
Where \(x_{(i)}\) are tokens sorted by decreasing probability. Top-p adapts: for confident predictions it considers few tokens; for uncertain ones, many.
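A minimal top-p filter over an already-computed distribution (a sketch):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    # Keep the smallest set of tokens whose cumulative probability >= p
    order = np.argsort(probs)[::-1]            # sort by decreasing probability
    csum = np.cumsum(probs[order])
    k = int(np.searchsorted(csum, p)) + 1      # number of tokens to keep
    keep = order[:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep] / probs[keep].sum()  # renormalize the nucleus
    return out

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
filtered = top_p_filter(probs, p=0.9)
print(filtered)   # keeps the first three tokens, zeroes the rest
```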
The KV Cache – Why Inference Is Fast
Without optimization, generating token \(t\) requires recomputing the attention for all previous \(t-1\) tokens: \(O(t^2)\) per token, \(O(T^3)\) total for a sequence of length \(T\). The KV cache stores the key and value vectors from all previous tokens so we only compute the new token's query, key, and value at each step.
Without KV Cache (recompute everything):
Step 1: compute K,V for [t₁] → generate t₂
Step 2: compute K,V for [t₁, t₂] → generate t₃
Step 3: compute K,V for [t₁, t₂, t₃] → generate t₄
         ↑ redundant!
With KV Cache:
Step 1: compute K₁,V₁ → cache [K₁,V₁] → generate t₂
Step 2: compute K₂,V₂ → cache [K₁,V₁,K₂,V₂] → generate t₃
Step 3: compute K₃,V₃ → cache [K₁..₃, V₁..₃] → generate t₄
         ↑ only the new token!
This reduces per-token computation from \(O(T \cdot d^2)\) to approximately \(O(d^2)\) (plus the attention computation which is \(O(T \cdot d)\)).
Memory cost: For each layer, the KV cache stores \(2 \times T \times d_k \times h\) values. For a 70B model with 128K context: this can be 10+ GB of memory just for the cache.
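A rough sizing sketch (the shapes below are assumptions: FP16 values, 8 KV heads as with grouped-query attention; real deployments vary with quantization):

```python
def kv_cache_gb(layers, seq_len, d_k, kv_heads, bytes_per_val=2):
    # 2 tensors (K and V) per layer, each seq_len x d_k x kv_heads
    return 2 * layers * seq_len * d_k * kv_heads * bytes_per_val / 1e9

# e.g. an 80-layer model, 128K context, 128-dim heads, 8 KV heads
print(f"{kv_cache_gb(80, 128_000, 128, 8):.1f} GB")   # 41.9 GB
```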
· · ·
Devlin et al. (2019) – "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Introduced masked language modeling, achieving state-of-the-art on 11 NLP benchmarks simultaneously. Showed that bidirectional context is crucial for understanding tasks.
| Aspect | BERT (Encoder) | GPT (Decoder) |
| --- | --- | --- |
| Architecture | Transformer encoder | Transformer decoder |
| Attention | Bidirectional (sees all tokens) | Causal (sees only past tokens) |
| Training objective | Masked Language Modeling (MLM) | Causal Language Modeling (CLM) |
| What it predicts | Randomly masked tokens (15%) | Next token at every position |
| Best for | Understanding: classification, NER, QA | Generation: text, code, dialogue |
| Usage pattern | Fine-tune on task-specific data | Prompt engineering / in-context learning |
BERT's Masked Language Modeling
MLM Objective
$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}; \theta)$$
Where \(\mathcal{M}\) is the set of masked positions (randomly chosen 15% of tokens). For each masked position, the model sees all other tokens (bidirectionally) and predicts the original token.
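BERT's masking recipe (15% of positions; of those, 80% become [MASK], 10% a random token, 10% left unchanged) can be sketched as follows. The -100 ignore-index is a common convention for "not predicted here", not part of the paper's notation:

```python
import numpy as np

def mask_for_mlm(tokens, mask_id, vocab_size, rng, p=0.15):
    tokens = np.array(tokens)
    targets = np.full_like(tokens, -100)       # -100 = position not in M
    chosen = rng.random(len(tokens)) < p       # pick ~15% of positions
    targets[chosen] = tokens[chosen]           # remember the original ids
    r = rng.random(len(tokens))
    tokens[chosen & (r < 0.8)] = mask_id       # 80% -> [MASK]
    rand = chosen & (r >= 0.8) & (r < 0.9)     # 10% -> random token
    tokens[rand] = rng.integers(0, vocab_size, rand.sum())
    return tokens, targets                     # remaining 10% left unchanged

rng = np.random.default_rng(0)
toks, tgts = mask_for_mlm(list(range(20)), mask_id=103, vocab_size=1000, rng=rng)
print(toks, tgts)
```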
Analogy: BERT is like a fill-in-the-blank test: it sees the whole sentence with some words hidden and must guess them. GPT is like an autocomplete engine: it sees the beginning and must continue. The fill-in-the-blank approach gives BERT richer bidirectional context, but GPT's left-to-right approach naturally enables generation.
Why GPT Won (for Now)
BERT excels at understanding but struggles with generation (you'd need to iteratively unmask tokens). GPT's autoregressive approach naturally generates text and, with sufficient scale, can also understand through in-context learning. The discovery that scaling decoder-only models leads to emergent capabilities (GPT-3) shifted the field toward the GPT paradigm. However, encoder models remain superior for many specialized tasks (search, classification, retrieval).
The Encoder-Decoder Hybrid: T5
Raffel et al. (2020) – "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (T5). Cast every NLP task as text-to-text: classification, translation, summarization, QA all become "input text → output text." The encoder processes the input, the decoder generates the output.
· · ·
The Core Problem: Overfitting
A model that memorizes the training data (gets 100% training accuracy) but fails on new data is overfitting. It has learned the noise in the training data, not the underlying pattern. Regularization techniques are all ways of saying: "don't learn anything too complex."
L2 Regularization (Weight Decay)
Regularized Loss
$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2}\|\mathbf{w}\|^2 = \mathcal{L}_{\text{data}} + \frac{\lambda}{2}\sum_i w_i^2$$
The gradient update becomes: \(w \leftarrow w - \eta(\nabla_w \mathcal{L} + \lambda w) = w(1 - \eta\lambda) - \eta\nabla_w\mathcal{L}\). The term \((1 - \eta\lambda)\) shrinks the weight by a small fraction each step, hence "weight decay." This penalizes large weights, preferring simpler models.
Probabilistic interpretation: L2 regularization is equivalent to placing a Gaussian prior on the weights: \(P(w) \propto e^{-\lambda w^2/2}\). MAP (Maximum A Posteriori) estimation with this prior gives you the L2-regularized objective. So regularization = incorporating a prior belief that weights should be small.
Dropout
During training, randomly set each neuron's output to 0 with probability \(p\) (typically 0.1β0.5):
$$\tilde{a}_i^{[l]} = \frac{m_i}{1-p} \cdot a_i^{[l]} \quad \text{where } m_i \sim \text{Bernoulli}(1-p)$$
The \(\frac{1}{1-p}\) scaling (inverted dropout) ensures expected values don't change at test time. Why it works: it forces the network to learn redundant representations; no neuron can rely on any other specific neuron being present. This effectively trains an ensemble of \(2^n\) different sub-networks.
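A quick check that the \(1/(1-p)\) scaling preserves expectations (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, size=100_000)   # activations with mean ~1
p = 0.3
keep = rng.random(a.shape) > p          # drop each unit with probability p
dropped = keep * a / (1 - p)            # inverted dropout scaling
print(a.mean(), dropped.mean())         # the two means agree closely
```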
Layer Normalization
$$\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}$$
$$\text{where } \mu = \frac{1}{d}\sum_{i=1}^{d}x_i, \quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2$$
Normalizes across the feature dimension for each individual example. Unlike batch normalization (which normalizes across the batch), LayerNorm doesn't depend on batch size and works well for transformers and variable-length sequences.
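The formula is a few lines of NumPy (a sketch with \(\gamma = 1\), \(\beta = 0\)):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension of each example
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 20.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=-1), out.std(axis=-1))   # ~0 and ~1 per row
```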
Pre-Norm vs Post-Norm
Post-Norm (original transformer): \(\text{LayerNorm}(x + \text{sublayer}(x))\) applies the norm after the residual. Requires careful learning rate warmup.
Pre-Norm (modern standard): \(x + \text{sublayer}(\text{LayerNorm}(x))\) applies the norm before the sublayer. More stable training, dominant in modern LLMs.
· · ·
The Evolution of Gradient Descent
SGD with Momentum
Plain SGD oscillates in narrow valleys (high curvature in one direction, low in another). Momentum adds a "velocity" term that smooths out oscillations, like a ball rolling downhill that maintains its direction.
Momentum update
$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)\nabla_\theta \mathcal{L}$$
$$\theta_{t+1} = \theta_t - \eta \mathbf{v}_t$$
\(\beta \approx 0.9\) means the velocity is an exponential moving average of recent gradients. If the gradient consistently points in one direction, momentum accelerates; if it oscillates, momentum dampens.
RMSProp – Adaptive Learning Rates
$$\mathbf{s}_t = \beta_2 \mathbf{s}_{t-1} + (1-\beta_2) (\nabla\mathcal{L})^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\mathbf{s}_t} + \epsilon} \nabla\mathcal{L}$$
Parameters with large gradients get smaller effective learning rates; parameters with small gradients get larger ones. This is crucial when different parameters have very different gradient magnitudes.
Adam – The Standard
Combines momentum (first moment) with RMSProp (second moment), plus bias correction:
Adam optimizer
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla\mathcal{L} \quad \text{(first moment: the mean)}$$
$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)(\nabla\mathcal{L})^2 \quad \text{(second moment: the variance)}$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \quad \text{(bias correction)}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \hat{\mathbf{m}}_t$$
Default hyperparameters: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\), \(\eta = 3 \times 10^{-4}\).
Why bias correction? Both \(\mathbf{m}\) and \(\mathbf{v}\) are initialized to 0. In early steps, they're biased toward 0 because the exponential average hasn't accumulated enough history. Dividing by \((1-\beta^t)\) corrects for this: when \(t=1\) and \(\beta_1=0.9\), the correction factor is \(1/(1-0.9) = 10\), which exactly compensates for the initialization bias.
AdamW – Decoupled Weight Decay
$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda\theta_t\right)$$
Standard Adam applies L2 regularization to the gradient before the adaptive scaling, which weakens its effect. AdamW applies weight decay directly to the parameters, which is mathematically cleaner and empirically better. AdamW is the standard optimizer for modern LLM training.
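The four Adam equations plus decoupled decay fit in one function (a sketch; the toy gradient below is just for illustration):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=3e-4, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad              # first moment
    v = b2 * v + (1 - b2) * grad**2           # second moment
    m_hat = m / (1 - b1**t)                   # bias correction
    v_hat = v / (1 - b2**t)
    # Weight decay is applied to the parameters, outside the adaptive scaling
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta = np.array([1.0, -1.0])
m = v = np.zeros(2)
for t in range(1, 4):
    grad = 2 * theta                          # gradient of ||theta||^2
    theta, m, v = adamw_step(theta, grad, m, v, t)
print(theta)   # both coordinates move toward 0
```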
Learning Rate Scheduling
Most LLM training uses a combination of warmup and cosine decay:
$$\eta_t = \begin{cases} \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} & t \leq T_{\text{warmup}} \\[6pt] \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi(t - T_{\text{warmup}})}{T - T_{\text{warmup}}}\right) & t > T_{\text{warmup}}\end{cases}$$
Warmup (typically 1β5% of training) gradually increases the learning rate from near-zero to prevent early instability. Cosine decay smoothly reduces it to near-zero by the end of training.
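The piecewise schedule translates directly to code (a sketch with assumed step counts):

```python
import math

def lr_at(t, T, warmup, lr_max, lr_min=0.0):
    if t <= warmup:
        return lr_max * t / warmup            # linear warmup
    progress = (t - warmup) / (T - warmup)    # 0 -> 1 after warmup
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

T, warmup, lr_max = 10_000, 200, 3e-4
print(lr_at(100, T, warmup, lr_max))     # mid-warmup: 1.5e-4
print(lr_at(200, T, warmup, lr_max))     # peak: 3e-4
print(lr_at(10_000, T, warmup, lr_max))  # end of training: ~0
```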
(a) In Adam, what does a large \(\hat{\mathbf{v}}_t\) value for a particular parameter mean? How does this affect the effective learning rate for that parameter?
(b) Why is warmup especially important for transformers but less critical for CNNs?
(c) If you're training a 7B model for 2T tokens with batch size 4M tokens and warmup ratio 2%, how many warmup steps is that?
· · ·
How Word Arithmetic Works
The famous equation: \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\)
The word embedding space has learned that there's a direction that represents "gender." The vector \(\vec{\text{king}} - \vec{\text{man}}\) isolates the "royalty" component by subtracting out "maleness." Adding \(\vec{\text{woman}}\) puts "femaleness" back in, landing near "queen." This works because the embedding space has linear structure: abstract concepts are encoded as consistent directions.
Mathematically, this works because the training objective (predict context words) forces the dot product \(\mathbf{u}_o^T \mathbf{v}_c\) to capture co-occurrence statistics, and co-occurrence patterns reflect semantic and syntactic relationships. The result: semantic relationships become geometric relationships.
Static vs Contextual Embeddings
| Aspect | Static (Word2Vec, GloVe) | Contextual (BERT, GPT) |
| --- | --- | --- |
| One word = | One fixed vector | Different vector per context |
| "bank" in "river bank" | Same vector | Different vector (near "river", "shore") |
| "bank" in "bank account" | Same vector | Different vector (near "money", "finance") |
| Trained on | Co-occurrence statistics | Full language modeling objective |
| Captures | Type-level semantics | Token-level semantics (word sense) |
Contextual embeddings are what make modern NLP so powerful. Each token's representation is computed by the transformer as a function of the entire surrounding context. The same word gets different representations depending on meaning, part of speech, and role in the sentence.
Sentence and Document Embeddings
To get a single vector for an entire sentence, common approaches include: mean-pooling all token embeddings (simple but effective), using the [CLS] token embedding from BERT, or using specialized models trained with contrastive objectives (like Sentence-BERT, E5, or BGE). These are used for semantic search, clustering, and retrieval-augmented generation (RAG).
· · ·
Elhage et al. (2021–2024) – Anthropic's mechanistic interpretability work ("A Mathematical Framework for Transformer Circuits", "Toy Models of Superposition", "Scaling Monosemanticity"). Pioneering work on reverse-engineering what individual neurons and circuits in transformers actually compute.
Attention Visualization
By examining attention weight matrices, we can see what each token "looks at." Early layers tend to show local attention patterns (nearby tokens). Middle layers often show syntactic patterns (subject attending to its verb). Late layers show more diffuse, task-specific patterns.
However, attention weights alone don't tell the full story: what matters is the combination of attention weights and value vectors. A head can have high attention weight on a token but project it through a near-zero value, effectively ignoring it.
Probing Classifiers
To test whether a model has learned a concept (e.g., part of speech, syntactic tree depth, entity type), we freeze the model and train a simple linear classifier on its hidden states. If a linear probe achieves high accuracy, the information is linearly encoded in the representation, meaning the model has learned to organize its internal representations so that this information is easily extractable.
Superposition – The Key Challenge
A model with 4096 dimensions might need to represent far more than 4096 concepts. Superposition is the phenomenon where models encode more features than they have dimensions, by using near-orthogonal directions. If two features rarely occur together, they can share the same dimensions without interfering.
This is why individual neurons are often "polysemantic": a single neuron might activate for both "academic citations" and "references in code." The true "features" are not individual neurons but directions in activation space, and there may be millions of them encoded in thousands of dimensions.
Circuits – How Computation Happens
A circuit is a subgraph of the model that implements a specific behavior. For example, "induction heads" are a two-layer circuit discovered in transformers:
Layer 1 head: Copies previous token information into the current position ("previous-token head").
Layer 2 head: Attends to positions where the previous token matches the current token ("induction head").
Together, they implement the pattern: "if the sequence [A][B]...[A] has occurred, predict [B] will come next." This is a fundamental mechanism for in-context learning.
The Residual Stream View
A powerful way to think about transformers: the residual connections create a "residual stream", a \(d\)-dimensional information highway that flows from input to output. Each attention head and MLP layer reads from and writes to this stream. The final prediction is made by the output logit layer reading from the final residual stream. This view makes it natural to think about which components contribute what information to the final prediction.
Sparse Autoencoders (SAEs) – Finding Features
Recent work has shown that training sparse autoencoders on model activations can decompose polysemantic neurons into interpretable, monosemantic features. The autoencoder learns a dictionary of directions in activation space, where each direction corresponds to an interpretable concept (a specific topic, a syntactic role, a type of reasoning).
(a) If a linear probe for "is this token a verb?" achieves 95% accuracy on BERT's layer 6 representations but only 60% on layer 1, what does this tell you about how linguistic knowledge develops through the layers?
(b) An induction head implements "[A][B]...[A] → predict [B]." How could this mechanism help with in-context learning when the model is given examples like "cat→chat, dog→chien, fish→"?
(c) If a model has 4096 dimensions but uses superposition to encode 100,000 features, approximately how many features can be active simultaneously without interference? (Hint: think about near-orthogonal vectors in high dimensions.)
· · ·
The Grand Picture
THE LLM KNOWLEDGE STACK

ALIGNMENT (RLHF / DPO)
├─ Objective: match human preferences
└─ Math: reward model + KL-constrained RL

ARCHITECTURE (Transformer)
├─ Core: self-attention + FFN + residuals
├─ Math: softmax(QKᵀ/√d)V
└─ Innovations: RoPE, GQA, MoE, SwiGLU

TRAINING OBJECTIVE (Next-token prediction)
├─ Loss: cross-entropy = negative log-likelihood
├─ Optimizer: AdamW + cosine LR schedule
└─ Distributed: data + tensor + pipeline parallel

REPRESENTATIONS (Embeddings)
├─ Tokens → dense vectors
├─ Contextual: each layer refines meaning
└─ Similarity = dot product in embedding space

LEARNING ENGINE (Backpropagation + SGD)
├─ Chain rule through computational graph
└─ Gradient ∝ error × input feature

FOUNDATIONS (Linear Algebra + Probability)
├─ Dot product = similarity
├─ Matrix multiply = learned transformation
└─ MLE = principled loss function derivation
Every layer of this stack builds on the one below. The math of dot products enables attention. Attention enables contextual representations. Contextual representations enable next-token prediction. Next-token prediction at scale produces emergent understanding. Alignment steers that understanding toward being helpful.
Your Path to Reading Research Papers
Now that you have the mathematical foundations, here's how to approach a new research paper:
1. Read the abstract and introduction to understand the problem being solved and the claim being made.
2. Look at figures first: architecture diagrams and result tables give you the high-level picture before you dive into equations.
3. Map new notation to what you know. Most papers use slight variations of the same constructs (attention, FFN, residual connections, loss functions); identify which building blocks they use.
4. Focus on what's new. Skip the background sections and look for the one or two novel equations or architectural choices that distinguish this paper.
5. Check the ablations. The ablation study tells you which parts of the method actually matter; this is often more informative than the main results.
6. Ask "what would I do differently?" This is how you develop research taste. Every paper makes tradeoffs; understanding them is the sign of deep comprehension.
"In mathematics you don't understand things. You just get used to them." – John von Neumann
The same is true for ML papers. Read them repeatedly. Implement the math. The understanding follows.