Companion Volume – Deep Dives & Implementation
ML First Principles, Part II:
From Equations to Working Code
Extended worked examples, implementation patterns, the training pipeline in practice,
inference mechanics, BERT vs GPT, and the topics that bridge theory to engineering.
We will trace every single computation through a small network, forward and backward, with exact numbers. No hand-waving. When you finish this section, you should be able to compute gradients for any network on paper.
Network Setup
INPUT              HIDDEN LAYER        OUTPUT
                   (ReLU)              (Sigmoid + BCE)

x₁ = 1.0 ──┐   ┌─→ h₁ ──┐
           ├───┤        ├─→ o₁ ──→ ŷ ──→ Loss
x₂ = 0.5 ──┘   └─→ h₂ ──┘
Weights:
W¹ = [[0.15, 0.20],     b¹ = [0.35, 0.35]
      [0.25, 0.30]]
W² = [[0.40, 0.45]]     b² = [0.60]
True label: y = 1
Step 1: Forward Pass – Hidden Layer
Compute pre-activation \(\mathbf{z}^{[1]} = \mathbf{W}^{[1]}\mathbf{x} + \mathbf{b}^{[1]}\):
$$z_1^{[1]} = 0.15(1.0) + 0.20(0.5) + 0.35 = 0.15 + 0.10 + 0.35 = 0.60$$
$$z_2^{[1]} = 0.25(1.0) + 0.30(0.5) + 0.35 = 0.25 + 0.15 + 0.35 = 0.75$$
Apply ReLU activation \(\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]})\):
$$a_1^{[1]} = \max(0, 0.60) = 0.60 \qquad a_2^{[1]} = \max(0, 0.75) = 0.75$$
Step 2: Forward Pass – Output Layer
Compute pre-activation and sigmoid:
$$z^{[2]} = 0.40(0.60) + 0.45(0.75) + 0.60 = 0.24 + 0.3375 + 0.60 = 1.1775$$
$$\hat{y} = a^{[2]} = \sigma(1.1775) = \frac{1}{1 + e^{-1.1775}} = \frac{1}{1 + 0.3083} = 0.7644$$
Step 3: Compute Loss
Binary cross-entropy with \(y = 1\):
$$\mathcal{L} = -[y \log(\hat{y}) + (1-y)\log(1-\hat{y})] = -\log(0.7644) = 0.2686$$
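These numbers are easy to check mechanically. A few lines of NumPy reproducing the forward pass and the loss (a sketch using the exact setup values above):

```python
import numpy as np

x = np.array([1.0, 0.5])
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = np.array([0.35, 0.35])
W2 = np.array([[0.40, 0.45]]); b2 = np.array([0.60])

z1 = W1 @ x + b1               # pre-activations: [0.60, 0.75]
a1 = np.maximum(0, z1)         # ReLU: [0.60, 0.75]
z2 = W2 @ a1 + b2              # [1.1775]
y_hat = 1 / (1 + np.exp(-z2))  # sigmoid: ~0.7644
loss = -np.log(y_hat)          # BCE with y = 1: ~0.2686
print(y_hat, loss)
```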
Step 4: Backward Pass – Output Layer Error
The key identity for sigmoid + BCE: \(\delta^{[2]} = \hat{y} - y\):
Output error signal
$$\delta^{[2]} = \frac{\partial \mathcal{L}}{\partial z^{[2]}} = \hat{y} - y = 0.7644 - 1 = -0.2356$$
Let's prove the elegant identity \(\delta = \hat{y} - y\) from scratch.
We need \(\frac{\partial \mathcal{L}}{\partial z}\). By the chain rule:
\(\frac{\partial \mathcal{L}}{\partial z} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z}\)
Term 1: \(\frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}}[-y\log\hat{y} - (1-y)\log(1-\hat{y})] = -\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\)
Term 2: \(\frac{\partial \hat{y}}{\partial z} = \sigma(z)(1 - \sigma(z)) = \hat{y}(1 - \hat{y})\)
Combined:
\(\frac{\partial \mathcal{L}}{\partial z} = \left(-\frac{y}{\hat{y}} + \frac{1-y}{1-\hat{y}}\right) \cdot \hat{y}(1-\hat{y})\)
\(= -y(1-\hat{y}) + (1-y)\hat{y}\)
\(= -y + y\hat{y} + \hat{y} - y\hat{y}\)
\(= \hat{y} - y \quad \blacksquare\)
This cancellation is not a coincidence. It arises because the sigmoid is the natural link function for the Bernoulli distribution; this follows from the theory of exponential family distributions and generalized linear models.
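The proof can also be verified numerically with a central-difference check (a sketch):

```python
import numpy as np

def bce_of_z(z, y):
    # BCE loss as a function of the pre-activation z
    y_hat = 1 / (1 + np.exp(-z))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 1.1775, 1.0, 1e-6
numeric = (bce_of_z(z + eps, y) - bce_of_z(z - eps, y)) / (2 * eps)
analytic = 1 / (1 + np.exp(-z)) - y   # the identity: y_hat - y
print(numeric, analytic)              # the two values agree
```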
Step 5: Gradients for Output Weights
Output weight gradients
$$\frac{\partial \mathcal{L}}{\partial W_{1}^{[2]}} = \delta^{[2]} \cdot a_1^{[1]} = (-0.2356)(0.60) = -0.1414$$
$$\frac{\partial \mathcal{L}}{\partial W_{2}^{[2]}} = \delta^{[2]} \cdot a_2^{[1]} = (-0.2356)(0.75) = -0.1767$$
$$\frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]} = -0.2356$$
Step 6: Propagate Error to Hidden Layer
Error propagation through weights
$$\delta_1^{[1]\text{(pre-ReLU)}} = W_1^{[2]} \cdot \delta^{[2]} = 0.40 \times (-0.2356) = -0.0943$$
$$\delta_2^{[1]\text{(pre-ReLU)}} = W_2^{[2]} \cdot \delta^{[2]} = 0.45 \times (-0.2356) = -0.1060$$
Apply ReLU derivative: since both \(z_1^{[1]} = 0.60 > 0\) and \(z_2^{[1]} = 0.75 > 0\), the ReLU derivative is 1 for both:
$$\delta_1^{[1]} = -0.0943 \times 1 = -0.0943$$
$$\delta_2^{[1]} = -0.1060 \times 1 = -0.1060$$
Step 7: Gradients for Hidden Weights
Hidden weight gradients
$$\frac{\partial \mathcal{L}}{\partial W_{11}^{[1]}} = \delta_1^{[1]} \cdot x_1 = (-0.0943)(1.0) = -0.0943$$
$$\frac{\partial \mathcal{L}}{\partial W_{12}^{[1]}} = \delta_1^{[1]} \cdot x_2 = (-0.0943)(0.5) = -0.0471$$
$$\frac{\partial \mathcal{L}}{\partial W_{21}^{[1]}} = \delta_2^{[1]} \cdot x_1 = (-0.1060)(1.0) = -0.1060$$
$$\frac{\partial \mathcal{L}}{\partial W_{22}^{[1]}} = \delta_2^{[1]} \cdot x_2 = (-0.1060)(0.5) = -0.0530$$
Step 8: Update All Weights (\(\eta = 0.5\))
Since \(\delta\) was negative (the model under-predicted: \(\hat{y} = 0.76 < 1\)), all gradients are negative, so all weights increase. This makes the output larger, pushing \(\hat{y}\) closer to 1.
Updated weights
$$W_1^{[2]}: \; 0.40 - 0.5(-0.1414) = 0.40 + 0.0707 = 0.4707$$
$$W_2^{[2]}: \; 0.45 - 0.5(-0.1767) = 0.45 + 0.0884 = 0.5384$$
$$b^{[2]}: \; 0.60 - 0.5(-0.2356) = 0.60 + 0.1178 = 0.7178$$
$$W_{11}^{[1]}: \; 0.15 - 0.5(-0.0943) = 0.15 + 0.0471 = 0.1971$$
$$\text{...and similarly for all hidden weights}$$
The backprop recipe, summarized:
1. Forward pass: compute all \(\mathbf{z}^{[l]}\) and \(\mathbf{a}^{[l]}\), caching each one.
2. Output error: \(\delta^{[L]} = \hat{y} - y\) (for sigmoid + BCE).
3. For each layer \(l = L-1, L-2, \ldots, 1\):
– Propagate: \(\boldsymbol{\delta}^{[l]} = (\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]} \odot \phi'(\mathbf{z}^{[l]})\)
– Gradients: \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T\)
4. Update: \(\mathbf{W}^{[l]} \leftarrow \mathbf{W}^{[l]} - \eta \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}}\)
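Applied to the worked example, one full pass of this recipe can be checked in NumPy (a sketch reusing the setup values: it recomputes the forward pass, applies one gradient step, and confirms the loss drops):

```python
import numpy as np

def forward_loss(W1, b1, W2, b2, x, y):
    a1 = np.maximum(0, W1 @ x + b1)
    y_hat = (1 / (1 + np.exp(-(W2 @ a1 + b2))))[0]
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

x, y, lr = np.array([1.0, 0.5]), 1.0, 0.5
W1 = np.array([[0.15, 0.20], [0.25, 0.30]]); b1 = np.array([0.35, 0.35])
W2 = np.array([[0.40, 0.45]]); b2 = np.array([0.60])

before = forward_loss(W1, b1, W2, b2, x, y)   # ~0.2686

# Recompute the cached values, then apply one gradient step
z1 = W1 @ x + b1
a1 = np.maximum(0, z1)
y_hat = (1 / (1 + np.exp(-(W2 @ a1 + b2))))[0]
d2 = y_hat - y                    # delta^[2] ~ -0.2356
d1 = (W2[0] * d2) * (z1 > 0)      # delta^[1]: propagate, then ReLU gate
W2 = W2 - lr * d2 * a1            # output-layer gradient: delta * a1
b2 = b2 - lr * d2
W1 = W1 - lr * np.outer(d1, x)    # hidden-layer gradient: delta x^T
b1 = b1 - lr * d1

after = forward_loss(W1, b1, W2, b2, x, y)
print(before, after)              # the loss drops after the update
```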
· · ·
Let's trace a minimal transformer decoder through one forward pass, computing every matrix multiplication. We'll use a vocabulary of 5 tokens, embedding dimension \(d = 4\), one attention head, and a sequence of 3 tokens.
Setup
Vocabulary: {0: "the", 1: "cat", 2: "sat", 3: "on", 4: "mat"}.
Input sequence: ["the", "cat", "sat"] → token IDs [0, 1, 2].
Target: predict the next token at each position.
Step 1: Token Embedding
Embedding matrix \(\mathbf{E} \in \mathbb{R}^{5 \times 4}\) (lookup table):
$$\mathbf{E} = \begin{bmatrix}
0.2 & 0.1 & -0.3 & 0.5 \\
-0.1 & 0.4 & 0.2 & -0.2 \\
0.3 & -0.2 & 0.1 & 0.4 \\
0.1 & 0.3 & -0.1 & 0.2 \\
-0.3 & 0.2 & 0.4 & -0.1
\end{bmatrix}$$
Look up rows 0, 1, 2 to get \(\mathbf{X} \in \mathbb{R}^{3 \times 4}\):
$$\mathbf{X} = \begin{bmatrix}
0.2 & 0.1 & -0.3 & 0.5 \\
-0.1 & 0.4 & 0.2 & -0.2 \\
0.3 & -0.2 & 0.1 & 0.4
\end{bmatrix}$$
Step 2: Positional Encoding (simplified)
Add position vectors (using simple learned encodings for clarity):
$$\mathbf{P} = \begin{bmatrix}
0.0 & 0.0 & 0.1 & 0.0 \\
0.0 & 0.1 & 0.0 & 0.0 \\
0.1 & 0.0 & 0.0 & 0.0
\end{bmatrix} \qquad \mathbf{X}' = \mathbf{X} + \mathbf{P}$$
Step 3: Compute Q, K, V
Using projection matrices \(\mathbf{W}^Q, \mathbf{W}^K, \mathbf{W}^V \in \mathbb{R}^{4 \times 4}\). For simplicity, let them be identity-like with small perturbations (in practice they're learned):
$$\mathbf{Q} = \mathbf{X}'\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}'\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}'\mathbf{W}^V$$
Suppose after projection we get (rounded):
$$\mathbf{Q} = \begin{bmatrix}0.22 & 0.12 & -0.18 & 0.48\\-0.08 & 0.50 & 0.18 & -0.22\\0.38 & -0.18 & 0.12 & 0.42\end{bmatrix}$$
Step 4: Attention Scores
Compute \(\mathbf{S} = \mathbf{Q}\mathbf{K}^T / \sqrt{d_k}\) where \(d_k = 4\), so divide by 2:
$$\mathbf{S}_{\text{raw}} = \mathbf{Q}\mathbf{K}^T = \begin{bmatrix}
s_{00} & s_{01} & s_{02} \\
s_{10} & s_{11} & s_{12} \\
s_{20} & s_{21} & s_{22}
\end{bmatrix}$$
For example, \(s_{00} = \mathbf{q}_0^T \mathbf{k}_0\) measures how much "the" (as query) attends to "the" (as key).
Step 5: Causal Mask
Apply the causal mask: token \(i\) can only attend to positions \(\leq i\):
$$\mathbf{M} = \begin{bmatrix}0 & -\infty & -\infty \\ 0 & 0 & -\infty \\ 0 & 0 & 0\end{bmatrix}$$
$$\mathbf{S}_{\text{masked}} = \mathbf{S}/2 + \mathbf{M}$$
After masking, the first row has only one valid score (position 0); second row has two valid scores; third row has all three. Then we apply softmax row-wise:
$$\boldsymbol{\alpha} = \text{softmax}(\mathbf{S}_{\text{masked}}) = \begin{bmatrix}
1.00 & 0.00 & 0.00 \\
0.48 & 0.52 & 0.00 \\
0.30 & 0.28 & 0.42
\end{bmatrix}$$
(Example values.) Row 1: "the" can only see itself, so its weight is 1.0. Row 2: "cat" attends roughly equally to "the" and "cat". Row 3: "sat" attends to all three tokens, with slightly more weight on itself.
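The mask-then-softmax step is easy to check numerically (a sketch with made-up scores):

```python
import numpy as np

S = np.array([[0.3, 0.1, 0.4],
              [0.2, 0.5, 0.1],
              [0.3, 0.2, 0.6]])             # made-up scaled scores
M = np.triu(np.full((3, 3), -np.inf), k=1)  # causal mask: -inf above diagonal

masked = S + M
alpha = np.exp(masked - masked.max(axis=-1, keepdims=True))
alpha = alpha / alpha.sum(axis=-1, keepdims=True)
print(alpha)   # zeros above the diagonal; each row sums to 1
```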
Step 6: Weighted Sum of Values
$$\text{Output} = \boldsymbol{\alpha} \mathbf{V}$$
Each row of the output is a weighted combination of value vectors, where the weights came from the attention computation. This is the "new representation" of each token, enriched by information from tokens it attended to.
Step 7: Residual + LayerNorm + FFN + Residual + LayerNorm
$$\mathbf{H}' = \text{LayerNorm}(\mathbf{X}' + \text{Attention}(\mathbf{X}'))$$
$$\mathbf{H}'' = \text{LayerNorm}(\mathbf{H}' + \text{FFN}(\mathbf{H}'))$$
Step 8: Final Prediction
Project last layer output to vocabulary size and apply softmax:
$$\text{logits} = \mathbf{H}'' \mathbf{E}^T \in \mathbb{R}^{3 \times 5}$$
$$P(\text{next token} \mid \text{context}) = \text{softmax}(\text{logits})$$
For position 2 ("sat"), the model outputs a probability distribution over 5 tokens. If it assigns high probability to token 3 ("on"), that's correct: "the cat sat on."
In practice, the loss is computed at every position simultaneously:
Target tokens are [1, 2, 3] (the input shifted by one position: predict "cat" from "the", "sat" from "the cat", "on" from "the cat sat"). The loss is the average cross-entropy across all positions. This is extremely efficient: one forward pass gives you \(T-1\) training signals from a sequence of length \(T\).
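That per-position loss can be sketched in a few lines (random logits stand in for the model's output):

```python
import numpy as np

def lm_loss(logits, targets):
    # logits: (T, V), one next-token distribution per position
    # targets: (T,), the token IDs shifted by one position
    logits = logits - logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 5))        # 3 positions, vocab of 5
loss = lm_loss(logits, np.array([1, 2, 3]))
print(loss)                             # average cross-entropy over positions
```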
· · ·
From-Scratch in NumPy: 2-Layer Network
import numpy as np

# XOR dataset: 4 examples, 2 features
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(42)
W1 = np.random.randn(2, 4) * np.sqrt(2.0 / 2)   # He initialization
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1) * np.sqrt(2.0 / 4)
b2 = np.zeros((1, 1))

def sigmoid(z): return 1 / (1 + np.exp(-z))
def relu(z): return np.maximum(0, z)

lr = 0.5
for epoch in range(10000):
    # Forward pass: cache z1, a1, z2 because backprop needs them
    z1 = X @ W1 + b1
    a1 = relu(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)
    loss = -np.mean(y * np.log(a2 + 1e-8) + (1 - y) * np.log(1 - a2 + 1e-8))

    # Backward pass: mirrors the forward pass in reverse
    dz2 = a2 - y                           # sigmoid + BCE identity
    dW2 = a1.T @ dz2 / 4                   # average over the 4 examples
    db2 = np.mean(dz2, axis=0, keepdims=True)
    da1 = dz2 @ W2.T
    dz1 = da1 * (z1 > 0).astype(float)     # ReLU derivative
    dW1 = X.T @ dz1 / 4
    db1 = np.mean(dz1, axis=0, keepdims=True)

    # Gradient descent update
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1

    if epoch % 2000 == 0:
        print(f"Epoch {epoch}: loss = {loss:.4f}")

print("Predictions:", a2.round(2).flatten())
Key patterns to notice in the code:
1. Forward pass caches intermediate values (\(z_1, a_1, z_2\)) β backprop needs them.
2. Backward pass mirrors forward pass in reverse order.
3. Matrix shapes tell the story: \(dW = \text{input}^T @ \text{error}\), i.e. the gradient for a weight matrix is always (previous layer activations)ᵀ × (current layer error).
4. He initialization (\(\times \sqrt{2/n}\)) prevents activations from vanishing or exploding at the start.
Self-Attention from Scratch
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))   # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, mask=None):
    """
    X: (seq_len, d_model)
    Wq/k/v: (d_model, d_k)
    mask: (seq_len, seq_len) additive causal mask
    """
    Q = X @ Wq
    K = X @ Wk
    V = X @ Wv
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaled dot-product scores
    if mask is not None:
        scores = scores + mask        # -inf above the diagonal
    weights = softmax(scores)
    output = weights @ V
    return output, weights

seq_len = 4
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)
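To see it run end to end, here is a self-contained usage sketch with random weights (repeating the two definitions above in compact form):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv, mask=None):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = scores + mask
    weights = softmax(scores)
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
mask = np.triu(np.full((seq_len, seq_len), -np.inf), k=1)

out, w = self_attention(X, Wq, Wk, Wv, mask=mask)
print(np.triu(w, k=1).max())   # 0.0: no attention to future positions
print(w.sum(axis=-1))          # each row sums to 1
```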
Common Bugs & Fixes
| Bug | Symptom | Fix |
| --- | --- | --- |
| Forgetting to zero gradients | Loss increases over time | optimizer.zero_grad() before each backward |
| Wrong axis in softmax | Attention weights don't sum to 1 per row | Softmax along the keys axis (last dim) |
| Missing numerical stability | NaN in softmax or log | Subtract max before exp; add \(\epsilon\) inside log |
| Transposed weight matrix | Shape mismatch | Remember: \(\text{output} = \text{input} \times W\), so \(W \in \mathbb{R}^{d_{\text{in}} \times d_{\text{out}}}\) |
| Not masking padding tokens | Model attends to garbage | Set padding positions to \(-\infty\) before softmax |
| Learning rate too high | Loss oscillates or explodes | Start with \(3 \times 10^{-4}\) for Adam; use warmup |
· · ·
Raw Data (Web, Books) → Filtering + Dedup + Quality Filtering → Tokenize (BPE) → Pretrain (Causal LM loss) → SFT → RLHF or DPO
Data Preparation
Scale: LLaMA-3 was trained on ~15 trillion tokens. GPT-4 is estimated at 10T+. For context, the English Wikipedia is ~4 billion tokens, a tiny fraction of training data.
Quality filtering matters enormously: the pipeline typically includes language detection, deduplication (exact and fuzzy via MinHash), removal of toxic/low-quality content, URL filtering, and classifier-based quality scoring. Scaling studies in the wake of Chinchilla suggest that data quality shifts the scaling-law constants: better data means fewer tokens are needed for the same loss.
Tokenization in Detail
Byte Pair Encoding (BPE) builds a vocabulary bottom-up:
Starting vocabulary: all individual characters + special tokens.
Corpus: "low low low low low lowest lowest newer newer newer wider wider"
Iteration 1: Most frequent pair: ("l", "o") → merge into "lo"
Iteration 2: Most frequent pair: ("lo", "w") → merge into "low"
Iteration 3: Most frequent pair: ("e", "r") → merge into "er"
Iteration 4: Most frequent pair: ("n", "e") → merge into "ne"
Iteration 5: Most frequent pair: ("ne", "w") → merge into "new"
...
Final vocabulary includes: "low", "er", "est", "new", "wid", etc.
"lowest" → ["low", "est"] (2 tokens instead of 6 characters)
"unprecedented" → ["un", "pre", "ced", "ent", "ed"] (the model handles unseen words)
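The merge loop above can be sketched in a few lines (a simplified BPE trainer for illustration, not the exact tokenizer any model ships with):

```python
from collections import Counter

def bpe_merges(corpus, num_merges):
    # Represent each word as a tuple of symbols; start from characters
    words = Counter(tuple(w) for w in corpus.split())
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency
        pairs = Counter()
        for word, freq in words.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        (a, b), _ = pairs.most_common(1)[0]
        merges.append(a + b)
        # Apply the merge everywhere it occurs
        new_words = Counter()
        for word, freq in words.items():
            merged, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                    merged.append(a + b); i += 2
                else:
                    merged.append(word[i]); i += 1
            new_words[tuple(merged)] += freq
        words = new_words
    return merges

corpus = "low low low low low lowest lowest newer newer newer wider wider"
merges = bpe_merges(corpus, 5)
print(merges)   # reproduces the iterations above: lo, low, er, ne, new
```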
Typical vocab sizes: GPT-2: 50,257. LLaMA-2: 32,000. GPT-4: ~100,000. Larger vocabularies mean fewer tokens per text (more efficient) but a larger embedding matrix.
Distributed Training
No single GPU has enough memory or compute for modern LLMs. Training is distributed across hundreds or thousands of GPUs using multiple parallelism strategies:
| Strategy | What's split | Communication |
| --- | --- | --- |
| Data Parallel | Each GPU gets a different mini-batch; same model copy | Gradients are averaged across GPUs |
| Tensor Parallel | Weight matrices are split across GPUs | All-reduce within each layer |
| Pipeline Parallel | Different layers on different GPUs | Activations passed between stages |
| ZeRO (DeepSpeed) | Optimizer states, gradients, and/or parameters | Gather when needed for computation |
Mixed Precision Training
Modern training uses BF16 (bfloat16) for forward/backward computation and FP32 for the master copy of weights. BF16 has the same exponent range as FP32 (so no overflow) but lower precision; this is fine for gradients, and it halves memory and doubles throughput.
Memory estimation for a transformer
$$\text{Parameters} \approx 12 \cdot L \cdot d^2 \quad \text{(for } L \text{ layers, dimension } d\text{)}$$
$$\text{Memory (training, BF16 + Adam)} \approx 18 \text{ bytes} \times \text{params}$$
For a 7B-parameter model: ~126 GB just for parameters + optimizer states, requiring at least 2× A100-80GB GPUs.
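A two-line sanity check of this rule of thumb (a sketch; 18 bytes per parameter is the approximation above):

```python
def training_memory_gb(n_params, bytes_per_param=18):
    # BF16 weights + gradients + FP32 master weights + Adam moments
    return n_params * bytes_per_param / 1e9

print(training_memory_gb(7e9))   # 126.0 -- matches the 7B estimate above
```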
(a) A model has \(L = 32\) layers and \(d = 4096\). Estimate the parameter count using the formula above. How does this compare to LLaMA-7B?
(b) If training on 2T tokens with a batch size of 4M tokens, how many gradient updates are performed?
(c) Why is BF16 preferred over FP16 for LLM training? (Hint: think about the exponent range.)
· · ·
Autoregressive Generation
At inference time, a decoder-only LLM generates one token at a time. At each step:
$$x_{t+1} \sim P(x \mid x_1, x_2, \ldots, x_t; \theta)$$
The new token is appended to the context, and the process repeats until an end-of-sequence token is generated or a maximum length is reached.
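The generation loop itself is short. Here is a sketch in which `next_token_logits` is a hypothetical stand-in for a real model's forward pass:

```python
import numpy as np

def generate(next_token_logits, prompt, max_len=10, eos=None, seed=0):
    # next_token_logits: hypothetical callable, token list -> logits over vocab
    rng = np.random.default_rng(seed)
    tokens = list(prompt)
    while len(tokens) < max_len:
        logits = next_token_logits(tokens)
        probs = np.exp(logits - logits.max())   # softmax over the vocab
        probs /= probs.sum()
        nxt = int(rng.choice(len(probs), p=probs))
        if nxt == eos:                          # stop on end-of-sequence
            break
        tokens.append(nxt)                      # append and repeat
    return tokens

# Toy "model" that strongly prefers (last token + 1) mod 5
toy = lambda toks: 8.0 * np.eye(5)[(toks[-1] + 1) % 5]
print(generate(toy, [0], max_len=6))
```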
Sampling Strategies
Temperature
Temperature scaling
$$P(x_i) = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
\(T = 1\): standard softmax. \(T \to 0\): becomes argmax (greedy, deterministic). \(T > 1\): flattens the distribution (more random, more "creative"). \(T < 1\): sharpens the distribution (more focused on high-probability tokens).
Logits: \([2.0, 1.0, 0.5]\).
\(T = 1.0\): softmax → \([0.63, 0.23, 0.14]\) (already peaked)
\(T = 0.5\): divide logits by 0.5 → \([4.0, 2.0, 1.0]\) → softmax → \([0.84, 0.11, 0.04]\) (very peaked)
\(T = 2.0\): divide logits by 2 → \([1.0, 0.5, 0.25]\) → softmax → \([0.48, 0.29, 0.23]\) (flatter)
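These three cases can be reproduced directly (a sketch):

```python
import numpy as np

def softmax_T(logits, T):
    z = np.asarray(logits) / T          # temperature scaling
    e = np.exp(z - z.max())             # stable softmax
    return e / e.sum()

logits = [2.0, 1.0, 0.5]
for T in (1.0, 0.5, 2.0):
    print(T, softmax_T(logits, T).round(2))
# T=0.5 sharpens toward [0.84, 0.11, 0.04]; T=2.0 flattens toward [0.48, 0.29, 0.23]
```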
Top-k Sampling
Only consider the \(k\) most probable tokens, redistribute probability among them:
$$P'(x_i) = \begin{cases} P(x_i) / \sum_{j \in \text{top-k}} P(x_j) & \text{if } x_i \in \text{top-k} \\ 0 & \text{otherwise} \end{cases}$$
Top-p (Nucleus) Sampling
Dynamically choose the smallest set of tokens whose cumulative probability exceeds \(p\):
$$\text{Find smallest } k \text{ such that } \sum_{i=1}^{k} P(x_{(i)}) \geq p$$
Where \(x_{(i)}\) are tokens sorted by decreasing probability. Top-p adapts: for confident predictions it considers few tokens; for uncertain ones, many.
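A minimal top-p filter over an already-computed distribution (a sketch):

```python
import numpy as np

def top_p_filter(probs, p=0.9):
    # Keep the smallest set of tokens whose cumulative probability >= p
    order = np.argsort(probs)[::-1]            # sort by decreasing probability
    csum = np.cumsum(probs[order])
    k = int(np.searchsorted(csum, p)) + 1      # number of tokens to keep
    keep = order[:k]
    out = np.zeros_like(probs)
    out[keep] = probs[keep] / probs[keep].sum()  # renormalize the nucleus
    return out

probs = np.array([0.5, 0.3, 0.1, 0.06, 0.04])
filtered = top_p_filter(probs, p=0.9)
print(filtered)   # keeps the first three tokens, zeroes the rest
```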
The KV Cache – Why Inference Is Fast
Without optimization, generating token \(t\) requires recomputing the attention for all previous \(t-1\) tokens: \(O(t^2)\) per token, \(O(T^3)\) total for a sequence of length \(T\). The KV cache stores the key and value vectors from all previous tokens so we only compute the new token's query, key, and value at each step.
Without KV Cache (recompute everything):
Step 1: compute K,V for [t₁] → generate t₂
Step 2: compute K,V for [t₁, t₂] → generate t₃
Step 3: compute K,V for [t₁, t₂, t₃] → generate t₄
         ↑ redundant!
With KV Cache:
Step 1: compute K₁,V₁ → cache [K₁,V₁] → generate t₂
Step 2: compute K₂,V₂ → cache [K₁,V₁,K₂,V₂] → generate t₃
Step 3: compute K₃,V₃ → cache [K₁..₃, V₁..₃] → generate t₄
         ↑ only the new token!
This reduces per-token computation from \(O(T \cdot d^2)\) to approximately \(O(d^2)\) (plus the attention computation which is \(O(T \cdot d)\)).
Memory cost: For each layer, the KV cache stores \(2 \times T \times d_k \times h\) values. For a 70B model with 128K context: this can be 10+ GB of memory just for the cache.
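A rough sizing sketch (the shapes below are assumptions: FP16 values, 8 KV heads as with grouped-query attention; real deployments vary with quantization):

```python
def kv_cache_gb(layers, seq_len, d_k, kv_heads, bytes_per_val=2):
    # 2 tensors (K and V) per layer, each seq_len x d_k x kv_heads
    return 2 * layers * seq_len * d_k * kv_heads * bytes_per_val / 1e9

# e.g. an 80-layer model, 128K context, 128-dim heads, 8 KV heads
print(f"{kv_cache_gb(80, 128_000, 128, 8):.1f} GB")   # 41.9 GB
```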
· · ·
Devlin et al. (2019) – "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding." Introduced masked language modeling, achieving state-of-the-art on 11 NLP benchmarks simultaneously. Showed that bidirectional context is crucial for understanding tasks.
| Aspect | BERT (Encoder) | GPT (Decoder) |
| --- | --- | --- |
| Architecture | Transformer encoder | Transformer decoder |
| Attention | Bidirectional (sees all tokens) | Causal (sees only past tokens) |
| Training objective | Masked Language Modeling (MLM) | Causal Language Modeling (CLM) |
| What it predicts | Randomly masked tokens (15%) | Next token at every position |
| Best for | Understanding: classification, NER, QA | Generation: text, code, dialogue |
| Usage pattern | Fine-tune on task-specific data | Prompt engineering / in-context learning |
BERT's Masked Language Modeling
MLM Objective
$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid \mathbf{x}_{\setminus \mathcal{M}}; \theta)$$
Where \(\mathcal{M}\) is the set of masked positions (randomly chosen 15% of tokens). For each masked position, the model sees all other tokens (bidirectionally) and predicts the original token.
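BERT's masking recipe (15% of positions; of those, 80% become [MASK], 10% a random token, 10% left unchanged) can be sketched as follows. The -100 ignore-index is a common convention for "not predicted here", not part of the paper's notation:

```python
import numpy as np

def mask_for_mlm(tokens, mask_id, vocab_size, rng, p=0.15):
    tokens = np.array(tokens)
    targets = np.full_like(tokens, -100)       # -100 = position not in M
    chosen = rng.random(len(tokens)) < p       # pick ~15% of positions
    targets[chosen] = tokens[chosen]           # remember the original ids
    r = rng.random(len(tokens))
    tokens[chosen & (r < 0.8)] = mask_id       # 80% -> [MASK]
    rand = chosen & (r >= 0.8) & (r < 0.9)     # 10% -> random token
    tokens[rand] = rng.integers(0, vocab_size, rand.sum())
    return tokens, targets                     # remaining 10% left unchanged

rng = np.random.default_rng(0)
toks, tgts = mask_for_mlm(list(range(20)), mask_id=103, vocab_size=1000, rng=rng)
print(toks, tgts)
```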
Analogy: BERT is like a fill-in-the-blank test: it sees the whole sentence with some words hidden and must guess them. GPT is like an autocomplete engine: it sees the beginning and must continue. The fill-in-the-blank approach gives BERT richer bidirectional context, but GPT's left-to-right approach naturally enables generation.
Why GPT Won (for Now)
BERT excels at understanding but struggles with generation (you'd need to iteratively unmask tokens). GPT's autoregressive approach naturally generates text and, with sufficient scale, can also understand through in-context learning. The discovery that scaling decoder-only models leads to emergent capabilities (GPT-3) shifted the field toward the GPT paradigm. However, encoder models remain superior for many specialized tasks (search, classification, retrieval).
The Encoder-Decoder Hybrid: T5
Raffel et al. (2020) – "Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer" (T5). Cast every NLP task as text-to-text: classification, translation, summarization, QA all become "input text → output text." The encoder processes the input, the decoder generates the output.
· · ·
The Core Problem: Overfitting
A model that memorizes the training data (gets 100% training accuracy) but fails on new data is overfitting. It has learned the noise in the training data, not the underlying pattern. Regularization techniques are all ways of saying: "don't learn anything too complex."
L2 Regularization (Weight Decay)
Regularized Loss
$$\mathcal{L}_{\text{reg}} = \mathcal{L}_{\text{data}} + \frac{\lambda}{2}\|\mathbf{w}\|^2 = \mathcal{L}_{\text{data}} + \frac{\lambda}{2}\sum_i w_i^2$$
The gradient update becomes: \(w \leftarrow w - \eta(\nabla_w \mathcal{L} + \lambda w) = w(1 - \eta\lambda) - \eta\nabla_w\mathcal{L}\). The term \((1 - \eta\lambda)\) shrinks the weight by a small fraction each step, hence "weight decay." This penalizes large weights, preferring simpler models.
Probabilistic interpretation: L2 regularization is equivalent to placing a Gaussian prior on the weights: \(P(w) \propto e^{-\lambda w^2/2}\). MAP (Maximum A Posteriori) estimation with this prior gives you the L2-regularized objective. So regularization = incorporating a prior belief that weights should be small.
Dropout
During training, randomly set each neuron's output to 0 with probability \(p\) (typically 0.1β0.5):
$$\tilde{a}_i^{[l]} = \frac{m_i}{1-p} \cdot a_i^{[l]} \quad \text{where } m_i \sim \text{Bernoulli}(1-p)$$
The \(\frac{1}{1-p}\) scaling (inverted dropout) ensures expected values don't change at test time. Why it works: it forces the network to learn redundant representations; no neuron can rely on any other specific neuron being present. This effectively trains an ensemble of \(2^n\) different sub-networks.
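A quick check that the \(1/(1-p)\) scaling preserves expectations (a sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(loc=1.0, size=100_000)   # activations with mean ~1
p = 0.3
keep = rng.random(a.shape) > p          # drop each unit with probability p
dropped = keep * a / (1 - p)            # inverted dropout scaling
print(a.mean(), dropped.mean())         # the two means agree closely
```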
Layer Normalization
$$\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu}{\sqrt{\sigma^2 + \epsilon}} \odot \boldsymbol{\gamma} + \boldsymbol{\beta}$$
$$\text{where } \mu = \frac{1}{d}\sum_{i=1}^{d}x_i, \quad \sigma^2 = \frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2$$
Normalizes across the feature dimension for each individual example. Unlike batch normalization (which normalizes across the batch), LayerNorm doesn't depend on batch size and works well for transformers and variable-length sequences.
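The formula is a few lines of NumPy (a sketch with \(\gamma = 1\), \(\beta = 0\)):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Normalize across the feature dimension of each example
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps) * gamma + beta

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 20.0]])
out = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(out.mean(axis=-1), out.std(axis=-1))   # ~0 and ~1 per row
```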
Pre-Norm vs Post-Norm
Post-Norm (original transformer): \(\text{LayerNorm}(x + \text{sublayer}(x))\) applies the norm after the residual. Requires careful learning rate warmup.
Pre-Norm (modern standard): \(x + \text{sublayer}(\text{LayerNorm}(x))\) applies the norm before the sublayer. More stable training, dominant in modern LLMs.
· · ·
The Evolution of Gradient Descent
SGD with Momentum
Plain SGD oscillates in narrow valleys (high curvature in one direction, low in another). Momentum adds a "velocity" term that smooths out oscillations, like a ball rolling downhill that maintains its direction.
Momentum update
$$\mathbf{v}_t = \beta \mathbf{v}_{t-1} + (1-\beta)\nabla_\theta \mathcal{L}$$
$$\theta_{t+1} = \theta_t - \eta \mathbf{v}_t$$
\(\beta \approx 0.9\) means the velocity is an exponential moving average of recent gradients. If the gradient consistently points in one direction, momentum accelerates; if it oscillates, momentum dampens.
RMSProp – Adaptive Learning Rates
$$\mathbf{s}_t = \beta_2 \mathbf{s}_{t-1} + (1-\beta_2) (\nabla\mathcal{L})^2$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\mathbf{s}_t} + \epsilon} \nabla\mathcal{L}$$
Parameters with large gradients get smaller effective learning rates; parameters with small gradients get larger ones. This is crucial when different parameters have very different gradient magnitudes.
Adam – The Standard
Combines momentum (first moment) with RMSProp (second moment), plus bias correction:
Adam optimizer
$$\mathbf{m}_t = \beta_1 \mathbf{m}_{t-1} + (1-\beta_1)\nabla\mathcal{L} \quad \text{(first moment: the mean)}$$
$$\mathbf{v}_t = \beta_2 \mathbf{v}_{t-1} + (1-\beta_2)(\nabla\mathcal{L})^2 \quad \text{(second moment: the variance)}$$
$$\hat{\mathbf{m}}_t = \frac{\mathbf{m}_t}{1-\beta_1^t}, \quad \hat{\mathbf{v}}_t = \frac{\mathbf{v}_t}{1-\beta_2^t} \quad \text{(bias correction)}$$
$$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} \hat{\mathbf{m}}_t$$
Default hyperparameters: \(\beta_1 = 0.9\), \(\beta_2 = 0.999\), \(\epsilon = 10^{-8}\), \(\eta = 3 \times 10^{-4}\).
Why bias correction? Both \(\mathbf{m}\) and \(\mathbf{v}\) are initialized to 0. In early steps, they're biased toward 0 because the exponential average hasn't accumulated enough history. Dividing by \((1-\beta^t)\) corrects for this: when \(t=1\) and \(\beta_1=0.9\), the correction factor is \(1/(1-0.9) = 10\), which exactly compensates for the initialization bias.
AdamW – Decoupled Weight Decay
$$\theta_{t+1} = \theta_t - \eta\left(\frac{\hat{\mathbf{m}}_t}{\sqrt{\hat{\mathbf{v}}_t} + \epsilon} + \lambda\theta_t\right)$$
Standard Adam applies L2 regularization to the gradient before the adaptive scaling, which weakens its effect. AdamW applies weight decay directly to the parameters, which is mathematically cleaner and empirically better. AdamW is the standard optimizer for modern LLM training.
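The four Adam equations plus decoupled decay fit in one function (a sketch; the toy gradient below is just for illustration):

```python
import numpy as np

def adamw_step(theta, grad, m, v, t, lr=3e-4, b1=0.9, b2=0.999,
               eps=1e-8, wd=0.01):
    m = b1 * m + (1 - b1) * grad              # first moment
    v = b2 * v + (1 - b2) * grad**2           # second moment
    m_hat = m / (1 - b1**t)                   # bias correction
    v_hat = v / (1 - b2**t)
    # Weight decay is applied to the parameters, outside the adaptive scaling
    theta = theta - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * theta)
    return theta, m, v

theta = np.array([1.0, -1.0])
m = v = np.zeros(2)
for t in range(1, 4):
    grad = 2 * theta                          # gradient of ||theta||^2
    theta, m, v = adamw_step(theta, grad, m, v, t)
print(theta)   # both coordinates move toward 0
```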
Learning Rate Scheduling
Most LLM training uses a combination of warmup and cosine decay:
$$\eta_t = \begin{cases} \eta_{\max} \cdot \frac{t}{T_{\text{warmup}}} & t \leq T_{\text{warmup}} \\[6pt] \eta_{\min} + \frac{1}{2}(\eta_{\max} - \eta_{\min})\left(1 + \cos\frac{\pi(t - T_{\text{warmup}})}{T - T_{\text{warmup}}}\right) & t > T_{\text{warmup}}\end{cases}$$
Warmup (typically 1β5% of training) gradually increases the learning rate from near-zero to prevent early instability. Cosine decay smoothly reduces it to near-zero by the end of training.
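The piecewise schedule translates directly to code (a sketch with assumed step counts):

```python
import math

def lr_at(t, T, warmup, lr_max, lr_min=0.0):
    if t <= warmup:
        return lr_max * t / warmup            # linear warmup
    progress = (t - warmup) / (T - warmup)    # 0 -> 1 after warmup
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))

T, warmup, lr_max = 10_000, 200, 3e-4
print(lr_at(100, T, warmup, lr_max))     # mid-warmup: 1.5e-4
print(lr_at(200, T, warmup, lr_max))     # peak: 3e-4
print(lr_at(10_000, T, warmup, lr_max))  # end of training: ~0
```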
(a) In Adam, what does a large \(\hat{\mathbf{v}}_t\) value for a particular parameter mean? How does this affect the effective learning rate for that parameter?
(b) Why is warmup especially important for transformers but less critical for CNNs?
(c) If you're training a 7B model for 2T tokens with batch size 4M tokens and warmup ratio 2%, how many warmup steps is that?
· · ·
How Word Arithmetic Works
The famous equation: \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\)
The word embedding space has learned that there's a direction that represents "gender." The vector \(\vec{\text{king}} - \vec{\text{man}}\) isolates the "royalty" component by subtracting out "maleness." Adding \(\vec{\text{woman}}\) puts "femaleness" back in, landing near "queen." This works because the embedding space has linear structure: abstract concepts are encoded as consistent directions.
Mathematically, this works because the training objective (predict context words) forces the dot product \(\mathbf{u}_o^T \mathbf{v}_c\) to capture co-occurrence statistics, and co-occurrence patterns reflect semantic and syntactic relationships. The result: semantic relationships become geometric relationships.
Static vs Contextual Embeddings
| Aspect | Static (Word2Vec, GloVe) | Contextual (BERT, GPT) |
| --- | --- | --- |
| One word = | One fixed vector | Different vector per context |
| "bank" in "river bank" | Same vector | Different vector (near "river", "shore") |
| "bank" in "bank account" | Same vector | Different vector (near "money", "finance") |
| Trained on | Co-occurrence statistics | Full language modeling objective |
| Captures | Type-level semantics | Token-level semantics (word sense) |
Contextual embeddings are what make modern NLP so powerful. Each token's representation is computed by the transformer as a function of the entire surrounding context. The same word gets different representations depending on meaning, part of speech, and role in the sentence.
Sentence and Document Embeddings
To get a single vector for an entire sentence, common approaches include: mean-pooling all token embeddings (simple but effective), using the [CLS] token embedding from BERT, or using specialized models trained with contrastive objectives (like Sentence-BERT, E5, or BGE). These are used for semantic search, clustering, and retrieval-augmented generation (RAG).
· · ·
Elhage et al. (2021–2024) – Anthropic's mechanistic interpretability work ("A Mathematical Framework for Transformer Circuits", "Toy Models of Superposition", "Scaling Monosemanticity"). Pioneering work on reverse-engineering what individual neurons and circuits in transformers actually compute.
Attention Visualization
By examining attention weight matrices, we can see what each token "looks at." Early layers tend to show local attention patterns (nearby tokens). Middle layers often show syntactic patterns (subject attending to its verb). Late layers show more diffuse, task-specific patterns.
However, attention weights alone don't tell the full story: what matters is the combination of attention weights and value vectors. A head can have high attention weight on a token but project it through a near-zero value, effectively ignoring it.
Probing Classifiers
To test whether a model has learned a concept (e.g., part of speech, syntactic tree depth, entity type), we freeze the model and train a simple linear classifier on its hidden states. If a linear probe achieves high accuracy, the information is linearly encoded in the representation, meaning the model has learned to organize its internal representations so that this information is easily extractable.
Superposition – The Key Challenge
A model with 4096 dimensions might need to represent far more than 4096 concepts. Superposition is the phenomenon where models encode more features than they have dimensions, by using near-orthogonal directions. If two features rarely occur together, they can share the same dimensions without interfering.
This is why individual neurons are often "polysemantic": a single neuron might activate for both "academic citations" and "references in code." The true "features" are not individual neurons but directions in activation space, and there may be millions of them encoded in thousands of dimensions.
Circuits – How Computation Happens
A circuit is a subgraph of the model that implements a specific behavior. For example, "induction heads" are a two-layer circuit discovered in transformers:
Layer 1 head: Copies previous token information into the current position ("previous-token head").
Layer 2 head: Attends to positions where the previous token matches the current token ("induction head").
Together, they implement the pattern: "if the sequence [A][B]...[A] has occurred, predict [B] will come next." This is a fundamental mechanism for in-context learning.
The Residual Stream View
A powerful way to think about transformers: the residual connections create a "residual stream", a \(d\)-dimensional information highway that flows from input to output. Each attention head and MLP layer reads from and writes to this stream. The final prediction is made by the output logit layer reading from the final residual stream. This view makes it natural to think about which components contribute what information to the final prediction.
Sparse Autoencoders (SAEs) – Finding Features
Recent work has shown that training sparse autoencoders on model activations can decompose polysemantic neurons into interpretable, monosemantic features. The autoencoder learns a dictionary of directions in activation space, where each direction corresponds to an interpretable concept (a specific topic, a syntactic role, a type of reasoning).
(a) If a linear probe for "is this token a verb?" achieves 95% accuracy on BERT's layer 6 representations but only 60% on layer 1, what does this tell you about how linguistic knowledge develops through the layers?
(b) An induction head implements "[A][B]...[A] → predict [B]." How could this mechanism help with in-context learning when the model is given examples like "cat→chat, dog→chien, fish→"?
(c) If a model has 4096 dimensions but uses superposition to encode 100,000 features, approximately how many features can be active simultaneously without interference? (Hint: think about near-orthogonal vectors in high dimensions.)
· · ·
The Grand Picture
THE LLM KNOWLEDGE STACK

ALIGNMENT (RLHF / DPO)
├─ Objective: match human preferences
└─ Math: reward model + KL-constrained RL

ARCHITECTURE (Transformer)
├─ Core: self-attention + FFN + residuals
├─ Math: softmax(QKᵀ/√d)V
└─ Innovations: RoPE, GQA, MoE, SwiGLU

TRAINING OBJECTIVE (Next-token prediction)
├─ Loss: cross-entropy = negative log-likelihood
├─ Optimizer: AdamW + cosine LR schedule
└─ Distributed: data + tensor + pipeline parallel

REPRESENTATIONS (Embeddings)
├─ Tokens → dense vectors
├─ Contextual: each layer refines meaning
└─ Similarity = dot product in embedding space

LEARNING ENGINE (Backpropagation + SGD)
├─ Chain rule through computational graph
└─ Gradient ∝ error × input feature

FOUNDATIONS (Linear Algebra + Probability)
├─ Dot product = similarity
├─ Matrix multiply = learned transformation
└─ MLE = principled loss function derivation
Every layer of this stack builds on the one below. The math of dot products enables attention. Attention enables contextual representations. Contextual representations enable next-token prediction. Next-token prediction at scale produces emergent understanding. Alignment steers that understanding toward being helpful.
Your Path to Reading Research Papers
Now that you have the mathematical foundations, here's how to approach a new research paper:
1. Read the abstract and introduction to understand the problem being solved and the claim being made.
2. Look at figures first: architecture diagrams and result tables give you the high-level picture before you dive into equations.
3. Map new notation to what you know. Most papers use slight variations of the same constructs (attention, FFN, residual connections, loss functions); identify which building blocks they use.
4. Focus on what's new. Skip the background sections and look for the one or two novel equations or architectural choices that distinguish this paper.
5. Check the ablations. The ablation study tells you which parts of the method actually matter; this is often more informative than the main results.
6. Ask "what would I do differently?" This is how you develop research taste. Every paper makes tradeoffs; understanding them is the sign of deep comprehension.
"In mathematics you don't understand things. You just get used to them." – John von Neumann
The same is true for ML papers. Read them repeatedly. Implement the math. The understanding follows.