Companion Volume III: Frontier Topics & Complete Solutions
ML First Principles, Part III:
Advanced Topics & Practice
Efficient attention, parameter-efficient fine-tuning, quantization, reasoning models,
multi-modal architectures, RAG, evaluation, and solutions to every exercise.
The Quadratic Bottleneck
Standard self-attention computes an \(n \times n\) attention matrix, where \(n\) is the sequence length. Computation is \(O(n^2 d)\), and storing the attention matrix takes \(O(n^2)\) memory per head. For a 128K token context at \(d = 4096\): the attention matrix alone has \(128\text{K}^2 \approx 16.4\) billion entries per head. This is the fundamental scaling challenge.
Flash Attention: Exact Attention, Hardware-Aware
Dao et al. (2022, 2023): "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" and "FlashAttention-2." Reformulated attention to be tiling-friendly, reducing memory from \(O(n^2)\) to \(O(n)\) without any approximation.
GPUs have a hierarchy of memory: HBM (large, slow) and SRAM (small, fast). Standard attention materializes the full \(n \times n\) matrix in HBM, then reads it back for softmax: this memory transfer is the bottleneck, not the arithmetic. Flash Attention computes attention in tiles that fit in SRAM, never materializing the full matrix.
The Key Trick: Online Softmax
The challenge is that softmax requires the maximum value across the entire row (for numerical stability) and the sum of all exponentials (for normalization). Flash Attention uses an online algorithm that processes blocks incrementally:
Online Softmax Accumulation
$$m^{(j)} = \max(m^{(j-1)}, \max(\mathbf{S}^{(j)}))$$
$$\ell^{(j)} = e^{m^{(j-1)} - m^{(j)}} \ell^{(j-1)} + \text{rowsum}(e^{\mathbf{S}^{(j)} - m^{(j)}})$$
$$\mathbf{O}^{(j)} = e^{m^{(j-1)} - m^{(j)}} \mathbf{O}^{(j-1)} + e^{\mathbf{S}^{(j)} - m^{(j)}} \mathbf{V}^{(j)}$$
Here \(m\) tracks the running maximum, \(\ell\) tracks the running softmax denominator, and \(\mathbf{O}\) accumulates the unnormalized output. After processing all blocks, the final output is \(\mathbf{O} / \ell\). The key insight: the correction factor \(e^{m^{(j-1)} - m^{(j)}}\) adjusts previous accumulations when a new block reveals a larger maximum.
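The three update rules above can be sketched as a streaming loop over key/value blocks. This is a minimal NumPy illustration, not the fused CUDA kernel; the function name, block size, and shapes are illustrative:

```python
import numpy as np

def online_softmax_attention(Q, K, V, block_size=2):
    """Flash-Attention-style streaming softmax over K/V blocks.

    Keeps only a running row max (m), running denominator (l), and an
    unnormalized output accumulator (O); the full n x n score matrix
    is never materialized.
    """
    n, d = Q.shape
    m = np.full(n, -np.inf)   # running maximum m^(j)
    l = np.zeros(n)           # running softmax denominator l^(j)
    O = np.zeros((n, d))      # unnormalized output O^(j)

    for start in range(0, K.shape[0], block_size):
        Kb = K[start:start + block_size]
        Vb = V[start:start + block_size]
        S = Q @ Kb.T / np.sqrt(d)              # scores S^(j) for this block
        m_new = np.maximum(m, S.max(axis=1))   # updated row max
        scale = np.exp(m - m_new)              # correction factor e^{m^(j-1) - m^(j)}
        P = np.exp(S - m_new[:, None])         # unnormalized block probabilities
        l = scale * l + P.sum(axis=1)
        O = scale[:, None] * O + P @ Vb
        m = m_new

    return O / l[:, None]                      # final normalization O / l
```

Because the correction factor rescales earlier accumulations whenever a new block raises the maximum, the result matches ordinary softmax attention exactly.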
Result: 2-4× faster, \(O(n)\) memory instead of \(O(n^2)\), and the answer is mathematically identical to standard attention.
Sliding Window Attention (Mistral/Gemma-style)
Instead of attending to all previous tokens, each layer only attends to the most recent \(W\) tokens:
$$\text{Attention}(q_t, K, V) = \text{softmax}\!\left(\frac{q_t K_{[t-W:t]}^T}{\sqrt{d_k}}\right) V_{[t-W:t]}$$
With \(L\) layers and window size \(W\), the effective receptive field is \(L \times W\) tokens, similar to how stacked CNN layers build large receptive fields from small kernels. Mistral 7B uses \(W = 4096\) with \(L = 32\), giving an effective field of 128K tokens.
RoPE Context Extension
Models trained with a maximum sequence length can be extended through RoPE scaling techniques:
Position Interpolation (linear scaling)
$$\theta'_i = \theta_i / s \quad \text{where } s = \frac{L_{\text{target}}}{L_{\text{trained}}}$$
YaRN (Yet another RoPE extensioN) applies different scaling factors to different frequency bands: high-frequency components (which encode fine-grained position) are scaled more aggressively than low-frequency components (which encode coarse position). This preserves local position information while extending the overall range.
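The linear position-interpolation rule above is a one-line rescaling. A minimal sketch, where the function name and the list-of-frequencies representation are illustrative:

```python
def interpolate_rope_freqs(thetas, trained_len, target_len):
    """Position interpolation: divide every RoPE frequency theta_i by
    s = target_len / trained_len, so positions in the extended context
    map back into the range the model was trained on."""
    s = target_len / trained_len
    return [theta / s for theta in thetas]
```

Doubling the context (s = 2) halves every rotation frequency, so the model never sees a rotation angle larger than those encountered during training.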
(a) Standard attention for sequence length 32K with \(d = 128\) per head: how many FLOPs for the \(\mathbf{QK}^T\) multiplication? How many bytes to store the attention matrix in FP16?
(b) With sliding window attention (\(W = 4096\)), what fraction of the full attention FLOPs do you use?
(c) Why does Flash Attention not help with the computational cost of attention, only the memory cost?
· · ·
The Fine-Tuning Landscape
 Full Fine-Tuning       LoRA                  Prompt Tuning
┌────────────────┐   ┌────────────────┐   ┌────────────────┐
│ Update ALL     │   │ Freeze base    │   │ Freeze ALL     │
│ parameters     │   │ Add small      │   │ Prepend soft   │
│                │   │ rank updates   │   │ tokens only    │
│ 7B params      │   │ ~0.1% params   │   │ ~0.001%        │
│ ~28GB VRAM     │   │ ~6GB VRAM      │   │ minimal VRAM   │
└────────────────┘   └────────────────┘   └────────────────┘
 Best quality         Great tradeoff       Limited capacity
LoRA: Low-Rank Adaptation (Full Derivation)
Hu et al. (2021): "LoRA: Low-Rank Adaptation of Large Language Models." Showed that the weight updates during fine-tuning have low intrinsic rank: you can approximate them with tiny low-rank matrices and achieve comparable quality to full fine-tuning.
During fine-tuning, you're changing a huge matrix \(\mathbf{W}\) to \(\mathbf{W} + \Delta\mathbf{W}\). The key insight is that \(\Delta\mathbf{W}\) typically has low rank: it lives in a small subspace. Instead of storing the full \(\Delta\mathbf{W} \in \mathbb{R}^{d \times d}\), we decompose it into two small matrices.
The Math
For a weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{d \times k}\), instead of updating to \(\mathbf{W}_0 + \Delta\mathbf{W}\), we parametrize:
LoRA Decomposition
$$\mathbf{W} = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$
$$\text{where } \mathbf{B} \in \mathbb{R}^{d \times r}, \; \mathbf{A} \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)$$
The forward pass becomes:
$$\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$
Where \(\alpha\) is a scaling factor (controls the magnitude of the LoRA update relative to the pretrained weights).
Parameter Savings
For a model dimension \(d = 4096\), a weight matrix \(\mathbf{W} \in \mathbb{R}^{4096 \times 4096}\):
Full fine-tuning: \(4096 \times 4096 = 16.8M\) parameters per matrix.
LoRA with \(r = 16\): \(\mathbf{B}: 4096 \times 16 = 65K\) + \(\mathbf{A}: 16 \times 4096 = 65K\) = 131K parameters per matrix.
That's \(131K / 16.8M = 0.78\%\) of the original parameters. For all Q, K, V, and output projections across 32 layers: \(131K \times 4 \times 32 = 16.8M\) trainable parameters out of 7B total = 0.24%.
Initialization
\(\mathbf{A}\) is initialized with a random Gaussian, \(\mathbf{B}\) is initialized to zero. This means \(\Delta\mathbf{W} = \mathbf{BA} = 0\) at the start of training: the model begins exactly as the pretrained model, and the LoRA update grows gradually from zero.
Merging
After training, you can merge LoRA weights back into the base model: \(\mathbf{W}_{\text{merged}} = \mathbf{W}_0 + \frac{\alpha}{r}\mathbf{BA}\). The merged model has zero additional inference cost: it's the same size and speed as the original.
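The whole recipe (low-rank decomposition, \(\alpha/r\) scaling, zero-initialized \(\mathbf{B}\), Gaussian \(\mathbf{A}\), and merging) fits in a short sketch. The class name and the initialization scale below are illustrative choices, not the reference implementation:

```python
import numpy as np

class LoRALinear:
    """Frozen base weight W0 plus a trainable low-rank update (alpha/r) * B @ A."""

    def __init__(self, W0, r=16, alpha=32, seed=0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                            # frozen pretrained weight
        self.A = rng.normal(0, 0.01, (r, k))    # trainable, Gaussian init
        self.B = np.zeros((d, r))               # trainable, zero init => BA = 0 at start
        self.scale = alpha / r

    def forward(self, x):
        # h = W0 x + (alpha/r) B A x; the low-rank path costs O((d + k) r)
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

    def merge(self):
        # Fold the adapter into the base weight: zero inference overhead afterward.
        return self.W0 + self.scale * self.B @ self.A
```

At initialization `forward` reproduces the pretrained layer exactly; after training, `merge()` returns a single dense matrix with identical outputs.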
QLoRA: Quantized LoRA
Dettmers et al. (2023): "QLoRA: Efficient Finetuning of Quantized Language Models." Fine-tune a 65B model on a single 48GB GPU by combining 4-bit quantization of the base model with LoRA adapters in 16-bit.
Three innovations: (1) 4-bit NormalFloat (NF4), a quantization type optimal for normally-distributed weights. (2) Double quantization: quantize the quantization constants themselves. (3) Paged optimizers: use CPU memory to handle GPU memory spikes.
Other PEFT Methods
| Method | What's Trainable | % Params | Quality |
| Full fine-tuning | Everything | 100% | Best (if enough data) |
| LoRA | Low-rank updates to attention | 0.1-1% | Near-full quality |
| QLoRA | Same as LoRA, base in 4-bit | 0.1-1% | Slightly below LoRA |
| Prefix tuning | Soft prefix tokens per layer | <0.1% | Good for specific tasks |
| Adapters | Small bottleneck layers inserted | 1โ5% | Good |
| BitFit | Only bias terms | <0.1% | Surprisingly decent |
· · ·
A 70B parameter model in FP16 requires 140GB of memory, far more than any consumer GPU. Quantization reduces each parameter from 16 bits to 8, 4, or even 2 bits, shrinking the model proportionally. The question is: how much quality do you lose?
The Math of Quantization
Linear quantization maps a floating-point value to a discrete integer:
Symmetric Quantization
$$x_q = \text{round}\!\left(\frac{x}{s}\right), \quad s = \frac{\max(|x|)}{2^{b-1} - 1}$$
$$\hat{x} = x_q \cdot s \quad \text{(dequantize for computation)}$$
Where \(s\) is the scale factor and \(b\) is the bit-width. For 8-bit: values map to \([-127, 127]\). For 4-bit: values map to \([-7, 7]\).
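The two equations above translate directly to code. A minimal per-tensor sketch (production kernels quantize per channel or per group, as discussed below):

```python
import numpy as np

def quantize_symmetric(x, bits=8):
    """Symmetric linear quantization: x_q = round(x / s), s = max|x| / (2^(b-1) - 1)."""
    qmax = 2 ** (bits - 1) - 1                 # 127 for 8-bit, 7 for 4-bit
    s = np.abs(x).max() / qmax                 # scale factor
    x_q = np.clip(np.round(x / s), -qmax, qmax).astype(np.int32)
    return x_q, s

def dequantize(x_q, s):
    """Recover an approximation of the original values: x_hat = x_q * s."""
    return x_q * s
```

The round-trip error is bounded by half the scale factor, which is why a single large outlier (forcing a large \(s\)) hurts every other value in the tensor.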
Why LLM Weights Tolerate Quantization
Neural network weights tend to follow a roughly Gaussian distribution centered near zero, with most values small and rare values large. The large outliers are the problem: they force a wide scale factor \(s\), reducing resolution for the many small values. Modern methods handle this with:
Per-channel quantization: separate scale per output channel (row of weight matrix), so one outlier channel doesn't affect others.
GPTQ (Frantar et al., 2022): Post-training quantization that iteratively quantizes weights while compensating for the quantization error using second-order (Hessian) information. Quantizes one column at a time, adjusting remaining columns to minimize the output error.
AWQ (Lin et al., 2023): Observes that 1% of weights (the salient ones corresponding to important activations) matter much more than the rest. Protects these salient channels with per-channel scaling before quantization.
Memory Savings
| Precision | Bits/param | 7B Model | 70B Model |
| FP32 | 32 | 28 GB | 280 GB |
| FP16/BF16 | 16 | 14 GB | 140 GB |
| INT8 | 8 | 7 GB | 70 GB |
| INT4 (GPTQ/AWQ) | 4 | 3.5 GB | 35 GB |
| 2-bit (experimental) | 2 | 1.75 GB | 17.5 GB |
At 4-bit quantization, a 70B model fits on a single A100-80GB with room for KV cache. A 7B model fits on consumer GPUs with 6GB+ VRAM. Quality loss at INT4 is typically 1-3% on benchmarks.
· · ·
Chain-of-Thought Prompting
Wei et al. (2022): "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Showed that simply adding "Let's think step by step" or providing step-by-step examples dramatically improves reasoning performance.
An LLM predicting the answer to "What is 17 × 24?" in a single forward pass must compute the answer in ~100 layer operations. But if it generates intermediate steps ("17 × 20 = 340, 17 × 4 = 68, 340 + 68 = 408"), each step is a much simpler computation, and the answer from each step feeds into the next through the context.
Why CoT works mathematically: Each generated token triggers a full forward pass through the network. Chain-of-thought effectively gives the model \(O(T)\) serial compute steps (where \(T\) is the number of reasoning tokens) instead of just the fixed depth of the network. It transforms a bounded-depth computation into an arbitrarily long sequential one.
Self-Consistency
Wang et al. (2022): "Self-Consistency Improves Chain of Thought Reasoning." Sample multiple reasoning paths and take a majority vote on the final answer. Diverse reasoning paths that converge on the same answer are more likely correct.
Self-Consistency
$$\hat{a} = \arg\max_a \sum_{i=1}^{K} \mathbb{1}[a_i = a]$$
Generate \(K\) different reasoning chains (using temperature sampling), extract the final answer from each, and take the majority vote. This exploits the fact that errors in reasoning tend to be diverse (different chains fail differently), while correct reasoning converges.
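The vote itself is a few lines. In this sketch, `sample_answer` is a hypothetical stand-in for one temperature-sampled LLM call that returns the extracted final answer:

```python
from collections import Counter

def self_consistency_vote(sample_answer, k=5):
    """Sample k reasoning chains and majority-vote their final answers.

    Implements a-hat = argmax_a sum_i 1[a_i = a] from the formula above.
    `sample_answer` is any zero-argument callable returning one chain's answer.
    """
    answers = [sample_answer() for _ in range(k)]
    return Counter(answers).most_common(1)[0][0]
```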
Test-Time Compute Scaling
OpenAI o1/o3 (2024-2025) and DeepSeek-R1 (2025): models trained with reinforcement learning to generate extended internal reasoning chains. These models spend more computation at inference time, trading tokens for accuracy.
The Scaling Law of Thought
Traditional scaling increases training compute (more parameters, more data). Reasoning models introduce a second axis: inference compute. The longer the model "thinks" (generates internal reasoning tokens), the better it performs on hard problems.
Test-Time Compute Scaling (empirical)
$$\text{Accuracy}(C_{\text{test}}) \approx A - B \cdot C_{\text{test}}^{-\alpha}$$
Where \(C_{\text{test}}\) is the number of inference tokens. Like training scaling laws, performance improves as a power law of compute, but now the compute is spent at inference time.
DeepSeek-R1: RL for Reasoning
DeepSeek-R1 trains the model to produce long chains of thought using reinforcement learning with a simple reward: correctness of the final answer. The model learns to decompose problems, verify intermediate steps, backtrack when stuck, and allocate more tokens to harder sub-problems.
Key innovation, emergent behaviors: the model spontaneously learns strategies like self-verification ("let me check this"), exploration of alternatives ("alternatively, I could..."), and reflection ("wait, that doesn't seem right"), all from the simple reward signal of getting the final answer correct. No explicit instruction to reason this way was provided.
Tree of Thought
Yao et al. (2023): "Tree of Thoughts." Generalizes chain-of-thought from a single chain to a tree: the model explores multiple reasoning branches, evaluates them, and can backtrack.
Chain-of-Thought:         Tree of Thoughts:

   Problem                     Problem
      │                           │
      ▼                    ┌──────┼──────┐
   Step 1               Step 1a   1b    1c
      │                    │             │
      ▼                 ┌──┼──┐       ┌──┼──┐
   Step 2               2a    2b      2c    2d
      │                 │     │       │     │
      ▼                 ▼             ▼
   Answer              Ans.          Ans.
                             (vote)
· · ·
Lewis et al. (2020): "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Combined a retriever (finds relevant documents) with a generator (produces the answer), allowing LLMs to access information beyond their training data.
An LLM's knowledge is frozen at training time. RAG gives it a "reference library": before answering, the model retrieves relevant documents and includes them in its context. This reduces hallucinations, enables up-to-date information, and provides citations.
The RAG Pipeline
┌──────────┐    ┌────────────┐    ┌──────────────┐    ┌──────────┐
│  User    │───▶│  Embed     │───▶│  Vector      │───▶│  Top-k   │
│  Query   │    │  query     │    │  similarity  │    │  chunks  │
└──────────┘    └────────────┘    │  search      │    └────┬─────┘
                                  └──────────────┘         │
                                                           ▼
                ┌──────────────┐    ┌──────────┐
                │  LLM         │◀───│ Augment  │
                │  generates   │    │ prompt   │
                │  answer      │    │ with     │
                └──────────────┘    │ chunks   │
                                    └──────────┘
Step 1: Document Processing
Split documents into chunks (typically 256โ512 tokens with overlap), then embed each chunk using a sentence embedding model:
$$\mathbf{e}_i = f_{\text{embed}}(\text{chunk}_i) \in \mathbb{R}^d$$
Popular embedding models: E5, BGE, Cohere embed-v3, OpenAI text-embedding-3. Store the embedding vectors in a vector database (FAISS, Pinecone, Milvus, Chroma).
Step 2: Retrieval
Embed the user query with the same model, then find the nearest chunks by cosine similarity:
$$\text{score}(q, c_i) = \frac{\mathbf{e}_q^T \mathbf{e}_{c_i}}{\|\mathbf{e}_q\| \|\mathbf{e}_{c_i}\|}$$
Return the top-\(k\) chunks (typically \(k = 3\) to \(10\)).
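The retrieval step reduces to a normalized matrix-vector product. A minimal brute-force sketch; a real system would use an approximate-nearest-neighbor index (FAISS and friends) rather than scoring every chunk:

```python
import numpy as np

def top_k_chunks(query_emb, chunk_embs, k=3):
    """Return indices of the k chunks with highest cosine similarity to the query.

    query_emb: (d,) query embedding; chunk_embs: (n, d) chunk embeddings.
    """
    q = query_emb / np.linalg.norm(query_emb)                        # normalize query
    C = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)  # normalize chunks
    scores = C @ q                                                   # cosine similarities
    return np.argsort(-scores)[:k]                                   # best-first indices
```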
Step 3: Augmented Generation
Construct a prompt like: [System: Use the following context to answer. Context: {chunk1} {chunk2} ...] [User: {query}]
Advanced RAG Techniques
Reranking: After initial retrieval, use a cross-encoder model (which sees query + document jointly, not just embeddings) to re-score and reorder results. Much more accurate than embedding-only similarity but too slow for the initial search.
Hybrid search: Combine semantic search (embeddings) with lexical search (BM25/TF-IDF) for better recall. Semantic search finds paraphrases; lexical search catches exact keywords.
Query transformation: Rewrite the query to improve retrieval: decompose complex questions, expand abbreviations, generate hypothetical answer passages (HyDE).
· · ·
Vision Transformers (ViT)
Dosovitskiy et al. (2020): "An Image Is Worth 16x16 Words." Applied the transformer directly to images by splitting them into patches, treating each patch as a "token." Proved that pure transformer architectures match or beat CNNs for vision tasks when trained at sufficient scale.
How ViT Works
Image → Sequence of Tokens
$$\text{Image} \in \mathbb{R}^{H \times W \times C} \longrightarrow \text{Patches} \in \mathbb{R}^{N \times (P^2 \cdot C)} \xrightarrow{\mathbf{W}_{\text{patch}}} \text{Tokens} \in \mathbb{R}^{N \times d}$$
A 224×224 image with 16×16 patches gives \(N = (224/16)^2 = 196\) tokens. Each patch is linearly projected to dimension \(d\), then processed by a standard transformer encoder. Classification uses a special [CLS] token prepended to the sequence.
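The image-to-token conversion is a pair of reshapes. A minimal sketch of the patchify step (the linear projection \(\mathbf{W}_{\text{patch}}\) and the [CLS] token are omitted):

```python
import numpy as np

def patchify(image, P=16):
    """Split an H x W x C image into N = (H/P)*(W/P) flattened patches of length P*P*C."""
    H, W, C = image.shape
    patches = image.reshape(H // P, P, W // P, P, C)   # split both spatial axes into blocks
    patches = patches.transpose(0, 2, 1, 3, 4)          # group the two block indices together
    return patches.reshape(-1, P * P * C)               # one row per patch
```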
CLIP โ Connecting Vision and Language
Radford et al. (2021): "Learning Transferable Visual Models From Natural Language Supervision." Trained a vision encoder and text encoder jointly on 400M image-text pairs from the internet, creating a shared embedding space where images and text are directly comparable.
Contrastive Learning Objective
CLIP Loss (for a batch of N image-text pairs)
$$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(\mathbf{v}_i^T\mathbf{t}_i / \tau)}{\sum_{j=1}^{N}\exp(\mathbf{v}_i^T\mathbf{t}_j / \tau)} + \log\frac{\exp(\mathbf{t}_i^T\mathbf{v}_i / \tau)}{\sum_{j=1}^{N}\exp(\mathbf{t}_i^T\mathbf{v}_j / \tau)}\right]$$
Where \(\mathbf{v}_i\) is the image embedding, \(\mathbf{t}_i\) is the text embedding, and \(\tau\) is a learned temperature parameter. The loss maximizes similarity between matched pairs and minimizes similarity between unmatched pairs.
Vision-Language Models (LLaVA Pattern)
Modern vision-language models follow a simple recipe:
┌────────────┐    ┌──────────────┐    ┌──────────────────┐
│   Image    │───▶│   Vision     │───▶│   Projection     │
│            │    │   Encoder    │    │ (MLP or linear)  │
│            │    │  (ViT/CLIP)  │    │ → visual tokens  │
└────────────┘    └──────────────┘    └────────┬─────────┘
                                               │
                                               ▼
┌────────────┐                        ┌──────────────────┐
│   Text     │───────────────────────▶│   LLM Decoder    │
│   tokens   │                        │ [visual tokens]  │
│            │                        │ [text tokens]    │
└────────────┘                        │ → response       │
                                      └──────────────────┘
The vision encoder extracts visual features, a projection layer maps them into the LLM's embedding space as "visual tokens," and the LLM processes both visual and text tokens with its standard self-attention mechanism. From the LLM's perspective, image features look just like text tokens.
· · ·
Perplexity: The Fundamental LM Metric
Perplexity
$$\text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log P(x_t \mid x_{<t})\right)$$
Perplexity is the exponential of the average cross-entropy loss. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options at each step. Lower is better. A perfectly confident model would have perplexity 1.
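Given per-token log-probabilities, the computation is a one-liner:

```python
import math

def perplexity(token_logprobs):
    """PPL = exp(-(1/T) * sum_t log P(x_t | x_<t)) for a list of per-token log-probs."""
    T = len(token_logprobs)
    return math.exp(-sum(token_logprobs) / T)
```

A model that assigns every token probability 1/10 gets perplexity exactly 10, matching the "choosing uniformly among 10 options" intuition.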
Generation Metrics
| Metric | Formula Intuition | Best For |
| BLEU | Precision of n-gram overlap with reference | Machine translation |
| ROUGE-L | Longest common subsequence with reference | Summarization |
| BERTScore | Cosine similarity of BERT embeddings per token | Semantic similarity |
| METEOR | Harmonic mean of precision & recall with synonyms | Translation (improved) |
N-gram metrics are poor evaluators of LLMs. BLEU and ROUGE only measure surface-level overlap. A paraphrase that perfectly captures the meaning but uses different words scores poorly. Modern LLM evaluation increasingly relies on human evaluation and LLM-as-judge methods (using a stronger model to evaluate a weaker model's outputs).
LLM Benchmarks
| Benchmark | What It Tests | Format |
| MMLU | Broad knowledge across 57 subjects | Multiple choice |
| HumanEval / MBPP | Code generation correctness | Function completion → unit tests |
| GSM8K | Grade-school math reasoning | Word problems → numerical answer |
| MATH | Competition-level math | Hard problems → solutions |
| ARC-Challenge | Science reasoning | Multiple choice |
| HellaSwag | Common sense completion | Choose most plausible continuation |
| TruthfulQA | Resistance to common misconceptions | Open-ended generation |
| MT-Bench / Chatbot Arena | Overall conversational quality | LLM-as-judge / human preference |
The Benchmark Contamination Problem: If benchmark data appears in the training corpus, the model has "seen the test." This is increasingly problematic as training datasets grow to encompass most of the internet. Solutions include: dynamic benchmarks that change over time, private held-out test sets, and evaluating on tasks that require reasoning rather than recall.
· · ·
Knowledge Distillation
Train a smaller "student" model to mimic a larger "teacher" model by matching the teacher's output probability distribution, not just the hard labels:
Distillation Loss
$$\mathcal{L}_{\text{distill}} = (1-\alpha)\,\mathcal{L}_{\text{CE}}(y, p_S) + \alpha\,T^2 \cdot D_{\text{KL}}(p_T^{(T)} \| p_S^{(T)})$$
Where \(p_T^{(T)}\) and \(p_S^{(T)}\) are the teacher and student distributions at temperature \(T\). Higher temperature makes the distributions softer, revealing more structure (the relative probabilities of wrong answers contain information about similarity between classes).
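A minimal NumPy sketch of the combined loss for a single example, with logits as vectors and \(T\) the softening temperature (the helper names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    """Numerically stable softmax at temperature T."""
    e = np.exp(z / T - np.max(z / T))
    return e / e.sum()

def distillation_loss(student_logits, teacher_logits, y, T=2.0, alpha=0.5):
    """(1-alpha) * CE(hard label) + alpha * T^2 * KL(teacher_T || student_T)."""
    ce = -np.log(softmax(student_logits)[y])          # hard-label cross-entropy
    p_t = softmax(teacher_logits, T)                  # softened teacher distribution
    p_s = softmax(student_logits, T)                  # softened student distribution
    kl = np.sum(p_t * (np.log(p_t) - np.log(p_s)))    # KL(p_T^(T) || p_S^(T))
    return (1 - alpha) * ce + alpha * T**2 * kl
```

The \(T^2\) factor keeps the gradient magnitudes of the soft term comparable to the hard term as the temperature changes.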
Speculative Decoding
LLM inference is memory-bandwidth-bound, not compute-bound: the GPU sits idle while loading weights from memory. Speculative decoding uses a small, fast "draft" model to generate several candidate tokens, then verifies them all at once with the large model. If the draft model is correct (which it often is for easy tokens), you get multiple tokens from one large-model forward pass.
Acceptance probability
$$P(\text{accept}) = \min\!\left(1, \frac{p_{\text{target}}(x)}{p_{\text{draft}}(x)}\right)$$
Tokens where the target model assigns at least as much probability as the draft model (\(p_{\text{target}} \ge p_{\text{draft}}\)) are always accepted. Tokens where the draft was overconfident are accepted with probability equal to the ratio, and on rejection a replacement token is resampled from the normalized residual \(\max(0, p_{\text{target}} - p_{\text{draft}})\). This guarantees the output distribution is identical to the target model: speculative decoding is an exact optimization, not an approximation.
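The acceptance rule for a single drafted token can be sketched as follows (the residual resampling performed on rejection by the full algorithm is omitted here):

```python
import random

def accept_draft_token(p_target, p_draft, u=None):
    """Accept a drafted token with probability min(1, p_target / p_draft).

    `u` is a uniform(0, 1) sample; drawn from `random` if not supplied.
    p_target and p_draft are the two models' probabilities for the drafted token.
    """
    if u is None:
        u = random.random()
    return u < min(1.0, p_target / p_draft)
```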
State-Space Models (SSMs)
Gu & Dao (2023): "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." Proposed a selective state-space model that processes sequences in \(O(n)\) time (vs \(O(n^2)\) for attention) while matching transformer quality on language tasks up to moderate scale.
The SSM Equation
Continuous State-Space Model
$$\mathbf{h}'(t) = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t)$$
$$y(t) = \mathbf{C}\mathbf{h}(t) + Dx(t)$$
Discretized for sequences (with step size \(\Delta\)): \(\mathbf{h}_t = \bar{\mathbf{A}}\mathbf{h}_{t-1} + \bar{\mathbf{B}}x_t\). This is a linear recurrence: it can be computed either recurrently (\(O(n)\) sequential) or as a convolution (\(O(n \log n)\) parallel).
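The discrete recurrence runs as a plain \(O(n)\) scan. A minimal sketch with input-independent \(\bar{\mathbf{A}}, \bar{\mathbf{B}}, \mathbf{C}\) (Mamba's selective variant makes these depend on the input; the \(D\) feedthrough is omitted):

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Discretized SSM: h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t.

    A_bar: (N, N) state transition; B_bar, C: (N,); x: scalar sequence.
    """
    h = np.zeros(A_bar.shape[0])       # hidden state, initialized to zero
    ys = []
    for x_t in x:                      # O(n) sequential scan over the sequence
        h = A_bar @ h + B_bar * x_t
        ys.append(float(C @ h))
    return ys
```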
Mamba's key innovation โ selective mechanism: Making \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\) input-dependent (functions of \(x_t\)) allows the model to selectively remember or forget information, similar to LSTM gates but within the SSM framework. This breaks the linearity needed for convolution mode, so Mamba uses a hardware-aware scan algorithm instead.
Constitutional AI (CAI)
Bai et al. (2022): "Constitutional AI: Harmlessness from AI Feedback." Instead of relying entirely on human feedback for alignment, provide the model with a set of principles ("constitution") and use AI-generated feedback: the model critiques its own outputs according to the constitution, then revises them.
The RLAIF pipeline: (1) Generate responses. (2) Ask the model to critique responses against principles. (3) Ask the model to revise responses. (4) Train a preference model on (original, revised) pairs. (5) Use this as the reward model in standard RL training.
· · ·
Solutions to all exercises from all three parts, in order.
Part I: Mathematical Foundations
§1.1 Linear Algebra
(a) \(\mathbf{w}^T\mathbf{x}_1 = 0.5(4) + (-1.0)(2) = 2.0 - 2.0 = 0.0\)
\(\mathbf{w}^T\mathbf{x}_2 = 0.5(1) + (-1.0)(5) = 0.5 - 5.0 = -4.5\)
The weight vector "prefers" \(\mathbf{x}_1\) (score 0.0 vs -4.5); or more precisely, \(\mathbf{x}_1\) is less negatively aligned with \(\mathbf{w}\). In a classifier context, \(\mathbf{x}_1\) is closer to the positive side of the decision boundary.
(b) \(\mathbf{x}_1 \cdot \mathbf{x}_2 = 4(1) + 2(5) = 14\)
\(\|\mathbf{x}_1\| = \sqrt{16 + 4} = \sqrt{20} \approx 4.47\)
\(\|\mathbf{x}_2\| = \sqrt{1 + 25} = \sqrt{26} \approx 5.10\)
\(\cos\theta = 14 / (4.47 \times 5.10) \approx 14 / 22.8 \approx 0.614\)
Moderately similar: the angle between them is about 52°.
§1.3 Optimization
(a) \(f(\theta) = (\theta - 3)^2\), so \(\nabla f = 2(\theta - 3)\).
Step 0: \(\theta_0 = 0\), gradient = \(2(0-3) = -6\), update: \(\theta_1 = 0 - 0.1(-6) = 0.6\)
Step 1: \(\theta_1 = 0.6\), gradient = \(2(0.6-3) = -4.8\), update: \(\theta_2 = 0.6 + 0.48 = 1.08\)
Step 2: \(\theta_2 = 1.08\), gradient = \(2(1.08-3) = -3.84\), update: \(\theta_3 = 1.08 + 0.384 = 1.464\)
Moving toward the minimum at \(\theta = 3\), but slowly with \(\eta = 0.1\).
(b) \(y = \sigma(wx + b)\). Let \(z = wx + b\).
\(\frac{\partial y}{\partial w} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w} = \sigma(z)(1 - \sigma(z)) \cdot x\)
Or substituting: \(\frac{\partial y}{\partial w} = y(1-y) \cdot x\).
Part I: Core Machine Learning
§2.2 Logistic Regression
(a) \(z = w_{\text{free}} x_{\text{free}} + w_{\text{meeting}} x_{\text{meeting}} + b\)
\(z = 2.0(1) + (-1.5)(1) + (-0.5) = 2.0 - 1.5 - 0.5 = 0.0\)
\(P(\text{spam}) = \sigma(0.0) = 0.5\).
The model is perfectly uncertain: the positive signal from "free" is exactly cancelled by "meeting" and the bias.
(b) With \(y = 1\) and \(\hat{p} = 0.5\):
\(\mathcal{L} = -[1 \cdot \log(0.5) + 0 \cdot \log(0.5)] = -\log(0.5) = \ln 2 \approx 0.693\)
This is the loss at maximum uncertainty, \(\ln 2\). (It is not the maximum possible loss: cross-entropy grows without bound as the predicted probability of the true class approaches zero.)
(c) \(\frac{\partial \mathcal{L}}{\partial w_{\text{free}}} = (\hat{p} - y) \cdot x_{\text{free}} = (0.5 - 1)(1) = -0.5\)
Update: \(w_{\text{free}}' = 2.0 - 0.1(-0.5) = 2.0 + 0.05 = 2.05\)
The weight for "free" increases, making the model slightly more inclined to classify emails with "free" as spam.
Part I: Neural Networks
§3.3 Backpropagation
From the forward pass: \(a^{[2]} = 0.59\), \(y = 1\), \(\mathbf{a}^{[1]} = [0.25, 0.40]^T\), \(\mathbf{x} = [1, 0.5]^T\).
\(\mathbf{W}^{[2]} = [0.5, 0.6]\).
(a) \(\delta^{[2]} = 0.59 - 1 = -0.41\)
(b) \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} \cdot (\mathbf{a}^{[1]})^T = -0.41 \times [0.25, 0.40] = [-0.1025, -0.164]\)
(c) \(\boldsymbol{\delta}^{[1]} = (\mathbf{W}^{[2]})^T \delta^{[2]} \odot \text{ReLU}'(\mathbf{z}^{[1]})\)
\(= \begin{bmatrix}0.5\\0.6\end{bmatrix} \times (-0.41) \odot \begin{bmatrix}1\\1\end{bmatrix} = \begin{bmatrix}-0.205\\-0.246\end{bmatrix}\)
(d) With \(\eta = 0.1\):
\(W_1^{[2]}: 0.5 - 0.1(-0.1025) = 0.5103\)
\(W_2^{[2]}: 0.6 - 0.1(-0.164) = 0.6164\)
\(W_{11}^{[1]}: 0.1 - 0.1(-0.205)(1) = 0.1205\)
\(W_{12}^{[1]}: 0.3 - 0.1(-0.205)(0.5) = 0.3103\)
\(W_{21}^{[1]}: 0.2 - 0.1(-0.246)(1) = 0.2246\)
\(W_{22}^{[1]}: 0.4 - 0.1(-0.246)(0.5) = 0.4123\)
All weights increase (since \(\delta < 0\) means the model under-predicted), pushing the output closer to 1.
Part I: Deep Learning Architectures
§4.2 RNNs & LSTMs
(a) If forget gate \(\mathbf{f}_t = \mathbf{1}\) and input gate \(\mathbf{i}_t = \mathbf{0}\):
\(\mathbf{c}_t = \mathbf{1} \odot \mathbf{c}_{t-1} + \mathbf{0} \odot \tilde{\mathbf{c}}_t = \mathbf{c}_{t-1}\)
The cell state is perfectly preserved: nothing is forgotten, nothing new is added. This is the "information highway" mode.
Reverse (\(\mathbf{f}_t = \mathbf{0}\), \(\mathbf{i}_t = \mathbf{1}\)):
\(\mathbf{c}_t = \mathbf{0} \odot \mathbf{c}_{t-1} + \mathbf{1} \odot \tilde{\mathbf{c}}_t = \tilde{\mathbf{c}}_t\)
Previous memory is completely erased and replaced with the new candidate. This is a "reset" mode.
(b) An additive update means \(\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t\) (just the forget gate, an element-wise scaling). If \(\mathbf{f}_t \approx 1\), the gradient flows through unchanged: no vanishing!
A multiplicative update like \(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} \odot \tilde{\mathbf{c}}_t\) would make the gradient a product of many terms, leading to vanishing/exploding. The additive structure is the key insight.
Part I: NLP Foundations
§5.1 Word Embeddings
(a) The equation \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\) implies that semantic relationships are encoded as linear directions in the embedding space. Specifically:
\(\vec{\text{king}} - \vec{\text{queen}} \approx \vec{\text{man}} - \vec{\text{woman}}\)
This means there exists a "gender direction" that is approximately consistent across different word pairs. Individual dimensions don't correspond to single concepts; rather, directions (linear combinations of dimensions) encode semantic attributes like gender, royalty, plurality, tense, etc.
(b) Using the same matrix for center and context embeddings would force \(\mathbf{v}_w^T \mathbf{v}_w\) (the self-similarity) to always be the squared norm, which can't be negative. This means the model can't learn that a word is a poor predictor of itself as context (which is usually the case: "the" rarely predicts another "the" nearby). Two separate matrices give more flexibility: the center and context roles are genuinely different, and having separate embeddings lets the model capture these different roles.
Part I: Attention & Transformers
§6.4 Transformer Architecture
(a) With \(d = 256\) and \(h = 4\) heads: \(d_k = d_v = d/h = 64\).
Each \(\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V \in \mathbb{R}^{256 \times 64}\).
(b) Causal mask for sequence length 4:
\(\mathbf{M} = \begin{bmatrix}0 & -\infty & -\infty & -\infty\\0 & 0 & -\infty & -\infty\\0 & 0 & 0 & -\infty\\0 & 0 & 0 & 0\end{bmatrix}\)
After softmax, row \(i\) of the attention matrix is a probability distribution over positions \(0, 1, \ldots, i\). Each row represents: "given that I'm at position \(i\), how much should I attend to each past position (including myself)?"
(c) Residual connections let the gradient flow directly through the identity path: \(\frac{\partial (\mathbf{x} + f(\mathbf{x}))}{\partial \mathbf{x}} = \mathbf{I} + \frac{\partial f}{\partial \mathbf{x}}\). Even if \(\frac{\partial f}{\partial \mathbf{x}}\) is small (vanishing), the identity term \(\mathbf{I}\) guarantees a gradient of at least 1. Without residual connections, a 96-layer transformer would be untrainable due to vanishing gradients; with them, gradients can skip directly from the loss to any layer.
Part I: Modern LLMs
§7.4-7.5 MoE and RLHF
(a) With 64 experts and top-2 routing: \(2/64 = 3.125\%\) of parameters active per token. This is beneficial because: (1) Training: total model capacity is huge (enabling more knowledge storage), but each gradient update is cheap (only updating 2 experts + shared components). (2) Inference: each forward pass only needs to load and compute 2 experts' worth of weights, dramatically reducing latency and memory bandwidth requirements.
(b) If \(\beta \to 0\): the KL constraint vanishes. The model optimizes reward without restriction, likely leading to "reward hacking": finding degenerate outputs that score high on the imperfect reward model but are nonsensical or harmful.
If \(\beta \to \infty\): the KL term dominates. The model is forced to stay extremely close to the reference policy and barely changes. Alignment has almost no effect.
Good values of \(\beta\) balance these extremes.
(c) A model trained only on next-token prediction learns to predict what text looks like on the internet, including toxic, biased, unhelpful, or harmful text. If asked "How do I hotwire a car?", the pretrained model predicts the most likely continuation, which is a direct answer. RLHF teaches the model that the preferred response is to decline or redirect. More generally, RLHF shifts the model from "what would the internet say?" to "what would a helpful, harmless, honest assistant say?"
Part III: Advanced Topics
§I Efficient Attention
(a) \(\mathbf{QK}^T\) for sequence length \(n = 32768\), \(d_k = 128\):
FLOPs: \(2 \times n \times n \times d_k = 2 \times 32768^2 \times 128 \approx 2.75 \times 10^{11}\) FLOPs per head.
Memory for attention matrix in FP16: \(n^2 \times 2\) bytes = \(32768^2 \times 2 = 2.15\) GB per head. With 32 heads, that's ~69 GB just for attention matrices: clearly the bottleneck.
(b) Sliding window (\(W = 4096\)): each token only attends to \(W\) tokens instead of \(n\).
FLOPs ratio: \(W/n = 4096/32768 = 12.5\%\) of the full attention cost.
(c) Flash Attention computes the exact same \(\mathbf{QK}^T\) and \(\text{softmax}(...)\mathbf{V}\): it doesn't reduce the number of arithmetic operations. It reduces memory IO: by tiling the computation to fit in fast SRAM, it avoids the expensive round-trip of writing the \(n \times n\) attention matrix to slow HBM and reading it back. The speedup comes entirely from better memory access patterns, not fewer FLOPs. (In practice, modern GPUs are memory-bandwidth-bound for attention, so reducing IO is more impactful than reducing FLOPs.)
§IV LLM Training Pipeline
(a) \(\text{Parameters} \approx 12 \times L \times d^2 = 12 \times 32 \times 4096^2 = 12 \times 32 \times 16.78M = 6.44B\).
LLaMA-7B has ~6.7B parameters. Our estimate of 6.44B is very close: the \(12Ld^2\) formula accounts for 4 attention projections (\(Q, K, V, O\), each \(d \times d\)), 2 FFN projections (\(d \times 4d\) and \(4d \times d\)), plus biases and layer norms (small). The slight undercount is because the formula ignores the embedding layer (\(V \times d\) where \(V\) is vocab size).
(b) Gradient updates = total tokens / batch size = \(2 \times 10^{12} / (4 \times 10^6) = 500{,}000\) steps.
At typical training throughput, this might take 20โ30 days on a 1024-GPU cluster.
(c) BF16 has 8 exponent bits (same as FP32) but only 7 mantissa bits (vs 23 for FP32 and 10 for FP16). FP16 has only 5 exponent bits, giving a range of \([6 \times 10^{-8}, 65504]\). Gradients often exceed 65504 (overflow → NaN) or fall below \(6 \times 10^{-8}\) (underflow → 0). BF16's range is \([10^{-38}, 3.4 \times 10^{38}]\), matching FP32. You lose some precision, but precision matters less than range for gradient accumulation.
· · ·
What's Next?
You now have a complete mathematical and conceptual foundation spanning from dot products to frontier LLM architectures. The path forward is hands-on:
1. Implement. Build a transformer from scratch in PyTorch. Karpathy's nanoGPT and minbpe are the gold standard starting points. The gap between understanding the math and writing the code is where real understanding forms.
2. Read papers. Start with the annotated reading list from Part I. With the mathematical foundation from these three volumes, you should be able to follow the key equations in any modern NLP paper. When you encounter unfamiliar notation, map it back to the concepts here.
3. Experiment. Fine-tune an open-weight model (Qwen, LLaMA, Mistral) with LoRA on a task you care about. Set up a RAG pipeline. Run ablations โ change one thing at a time and observe the effect. The intuition you build from experimentation is irreplaceable.
4. Stay current. Follow ArXiv (cs.CL, cs.LG), read ML Twitter/X threads, watch conference talks (NeurIPS, ICML, ACL, EMNLP). The field moves fast, but with solid foundations, new papers become incremental extensions of what you already know.
"The only way to learn mathematics is to do mathematics." – Paul Halmos
The same is true for machine learning. The equations in these pages are not the destination; they're the map. The territory is in the code, the experiments, and the papers.