Companion Volume III — Frontier Topics & Complete Solutions

ML First Principles, Part III:
Advanced Topics & Practice

Efficient attention, parameter-efficient fine-tuning, quantization, reasoning models,
multi-modal architectures, RAG, evaluation — and solutions to every exercise.

Contents

  1. Efficient Attention & Long Context
    • Flash Attention · Sliding Window · Linear Attention · RoPE Scaling
  2. Parameter-Efficient Fine-Tuning
    • Full fine-tuning · LoRA derivation · QLoRA · Adapters · Prefix Tuning
  3. Quantization
    • INT8 · GPTQ · AWQ · GGUF · Why 4-bit works
  4. Reasoning Models & Chain-of-Thought
    • CoT prompting · Self-consistency · Tree of Thought · o1/o3 · DeepSeek-R1 · Test-time compute
  5. Retrieval-Augmented Generation
    • Architecture · Embedding models · Vector databases · Chunking · Reranking
  6. Multi-Modal Models
    • Vision Transformers · CLIP · LLaVA · Visual tokenization
  7. Evaluation & Benchmarks
    • Perplexity · BLEU/ROUGE · LLM benchmarks · Contamination · Human eval
  8. Frontier Concepts
    • Distillation · Speculative decoding · Constitutional AI · State-space models
  9. Complete Exercise Solutions
    • All exercises from Parts I, II, and III with detailed solutions
Advanced Topic I

Efficient Attention & Long Context

The Quadratic Bottleneck

Standard self-attention computes an \(n \times n\) attention matrix, where \(n\) is the sequence length. Computation is \(O(n^2 d)\) and memory for the attention matrix is \(O(n^2)\) per head. For a 128K token context: the attention matrix alone has \(128{,}000^2 \approx 16.4\) billion entries per head. This is the fundamental scaling challenge.

Flash Attention — Exact Attention, Hardware-Aware

Dao et al. (2022, 2023) — "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" and "FlashAttention-2." Reformulated attention to be tiling-friendly, reducing memory from \(O(n^2)\) to \(O(n)\) without any approximation.
GPUs have a hierarchy of memory: HBM (large, slow) and SRAM (small, fast). Standard attention materializes the full \(n \times n\) matrix in HBM, then reads it back for softmax — this memory transfer is the bottleneck, not the arithmetic. Flash Attention computes attention in tiles that fit in SRAM, never materializing the full matrix.

The Key Trick: Online Softmax

The challenge is that softmax requires the maximum value across the entire row (for numerical stability) and the sum of all exponentials (for normalization). Flash Attention uses an online algorithm that processes blocks incrementally:

Online Softmax Accumulation $$m^{(j)} = \max(m^{(j-1)}, \max(\mathbf{S}^{(j)}))$$ $$\ell^{(j)} = e^{m^{(j-1)} - m^{(j)}} \ell^{(j-1)} + \text{rowsum}(e^{\mathbf{S}^{(j)} - m^{(j)}})$$ $$\mathbf{O}^{(j)} = e^{m^{(j-1)} - m^{(j)}} \mathbf{O}^{(j-1)} + e^{\mathbf{S}^{(j)} - m^{(j)}} \mathbf{V}^{(j)}$$

Here \(m\) tracks the running maximum, \(\ell\) tracks the running softmax denominator, and \(\mathbf{O}\) accumulates the unnormalized output. After processing all blocks, the final output is \(\mathbf{O} / \ell\). The key insight: the correction factor \(e^{m^{(j-1)} - m^{(j)}}\) adjusts previous accumulations when a new block reveals a larger maximum.

Result: 2–4× faster, \(O(n)\) memory instead of \(O(n^2)\), and the answer is mathematically identical to standard attention.
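The recurrence can be checked numerically. Below is a minimal numpy sketch of block-wise attention using the online-softmax accumulators \(m\), \(\ell\), \(\mathbf{O}\); it illustrates the math only, not the fused SRAM-tiled CUDA kernel:

```python
import numpy as np

def online_softmax_attention(Q, K, V, block_size=64):
    """Single-head attention computed block-by-block over the keys,
    using the online-softmax recurrence for m (max), l (denominator), O (output)."""
    n, d = Q.shape
    O = np.zeros((n, V.shape[1]))
    m = np.full(n, -np.inf)   # running row maxima
    l = np.zeros(n)           # running softmax denominators
    for j in range(0, K.shape[0], block_size):
        S = Q @ K[j:j + block_size].T / np.sqrt(d)   # scores vs. this key block
        m_new = np.maximum(m, S.max(axis=1))
        corr = np.exp(m - m_new)                     # rescale previous accumulators
        P = np.exp(S - m_new[:, None])
        l = corr * l + P.sum(axis=1)
        O = corr[:, None] * O + P @ V[j:j + block_size]
        m = m_new
    return O / l[:, None]    # final normalization O / l
```

Because the correction factor rescales earlier blocks whenever a larger maximum appears, the result matches full-matrix softmax attention to floating-point precision.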

Sliding Window Attention (Mistral/Gemma-style)

Instead of attending to all previous tokens, each layer only attends to the most recent \(W\) tokens:

$$\text{Attention}(q_t, K, V) = \text{softmax}\!\left(\frac{q_t K_{[t-W:t]}^T}{\sqrt{d_k}}\right) V_{[t-W:t]}$$

With \(L\) layers and window size \(W\), the effective receptive field is \(L \times W\) tokens — similar to how stacked CNN layers build large receptive fields from small kernels. Mistral 7B uses \(W = 4096\) with \(L = 32\), giving an effective field of 128K tokens.
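The attention pattern this induces is easy to construct; a numpy sketch of the boolean sliding-window mask (True = may attend):

```python
import numpy as np

def sliding_window_mask(n, W):
    """Causal sliding-window mask: position t attends to positions in [t-W+1, t]."""
    i = np.arange(n)[:, None]   # query position
    j = np.arange(n)[None, :]   # key position
    return (j <= i) & (j > i - W)
```

Each row has at most \(W\) True entries, so the per-layer cost is \(O(nW)\) instead of \(O(n^2)\).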

RoPE Context Extension

Models trained with a maximum sequence length can be extended through RoPE scaling techniques:

Position Interpolation (linear scaling) $$\theta'_i = \theta_i / s \quad \text{where } s = \frac{L_{\text{target}}}{L_{\text{trained}}}$$

YaRN (Yet another RoPE extensioN) applies different scaling factors to different frequency bands — low-frequency components (which encode coarse position) are interpolated aggressively, while high-frequency components (which encode fine-grained local position) are left nearly unscaled. This preserves local position information while extending the overall range.
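The linear-scaling formula above can be sketched directly. A minimal numpy sketch of standard RoPE inverse frequencies and position interpolation (YaRN's per-band ramp and attention temperature are omitted):

```python
import numpy as np

def rope_inv_freq(d, base=10000.0):
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/d), i = 0 .. d/2 - 1
    return base ** (-np.arange(0, d, 2) / d)

def position_interpolation(inv_freq, trained_len, target_len):
    # Linear scaling: divide every frequency by s = L_target / L_trained,
    # squeezing target-length positions back into the trained range.
    s = target_len / trained_len
    return inv_freq / s
```

Extending a 4096-token model to 32K this way divides every rotation frequency by 8.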

Exercises

(a) Standard attention for sequence length 32K with \(d = 128\) per head: how many FLOPs for the \(\mathbf{QK}^T\) multiplication? How many bytes to store the attention matrix in FP16?
(b) With sliding window attention (\(W = 4096\)), what fraction of the full attention FLOPs do you use?
(c) Why does Flash Attention not help with the computational cost of attention, only the memory cost?
· · ·
Advanced Topic II

Parameter-Efficient Fine-Tuning

The Fine-Tuning Landscape

| Approach | Strategy | Trainable Params (7B model) | VRAM | Verdict |
| Full fine-tuning | Update ALL parameters | 7B (100%) | ~28 GB | Best quality |
| LoRA | Freeze base, add small rank updates | ~0.1% | ~6 GB | Great tradeoff |
| Prompt tuning | Freeze ALL, prepend soft tokens only | ~0.001% | minimal | Limited capacity |

LoRA — Low-Rank Adaptation (Full Derivation)

Hu et al. (2021) — "LoRA: Low-Rank Adaptation of Large Language Models." Showed that the weight updates during fine-tuning have low intrinsic rank — you can approximate them with tiny low-rank matrices and achieve comparable quality to full fine-tuning.
During fine-tuning, you're changing a huge matrix \(\mathbf{W}\) to \(\mathbf{W} + \Delta\mathbf{W}\). The key insight is that \(\Delta\mathbf{W}\) typically has low rank — it lives in a small subspace. Instead of storing the full \(\Delta\mathbf{W} \in \mathbb{R}^{d \times d}\), we decompose it into two small matrices.

The Math

For a weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{d \times k}\), instead of updating to \(\mathbf{W}_0 + \Delta\mathbf{W}\), we parametrize:

LoRA Decomposition $$\mathbf{W} = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$ $$\text{where } \mathbf{B} \in \mathbb{R}^{d \times r}, \; \mathbf{A} \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)$$

The forward pass becomes:

$$\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$

Where \(\alpha\) is a scaling factor (controls the magnitude of the LoRA update relative to the pretrained weights).
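This parametrization fits in a few lines. A minimal numpy sketch (not the HuggingFace peft API; the initialization constants are illustrative):

```python
import numpy as np

class LoRALinear:
    """A frozen base weight plus a trainable low-rank update: h = W0 x + (alpha/r) B A x."""
    def __init__(self, W0, r=16, alpha=32, seed=0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                             # frozen pretrained weight, d x k
        self.A = rng.normal(0.0, 0.02, (r, k))   # Gaussian init
        self.B = np.zeros((d, r))                # zero init, so BA = 0 at the start
        self.scale = alpha / r

    def forward(self, x):
        # Base path plus scaled low-rank path; only A and B would receive gradients
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

    def merged_weight(self):
        # Fold the adapter into the base weight: zero extra inference cost
        return self.W0 + self.scale * (self.B @ self.A)
```

At initialization the layer is exactly the pretrained layer (since \(\mathbf{B} = 0\)), and after training the adapter can be merged away.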

Parameter Savings

For a model dimension \(d = 4096\), a weight matrix \(\mathbf{W} \in \mathbb{R}^{4096 \times 4096}\):

Full fine-tuning: \(4096 \times 4096 = 16.8M\) parameters per matrix.
LoRA with \(r = 16\): \(\mathbf{B}: 4096 \times 16 = 65K\) + \(\mathbf{A}: 16 \times 4096 = 65K\) = 131K parameters per matrix.

That's \(131K / 16.8M = 0.78\%\) of the original parameters. For all Q, K, V, and output projections across 32 layers: \(131K \times 4 \times 32 = 16.8M\) trainable parameters out of 7B total = 0.24%.

Initialization

\(\mathbf{A}\) is initialized with a random Gaussian, \(\mathbf{B}\) is initialized to zero. This means \(\Delta\mathbf{W} = \mathbf{BA} = 0\) at the start of training — the model begins exactly as the pretrained model, and the LoRA update grows gradually from zero.

Merging

After training, you can merge LoRA weights back into the base model: \(\mathbf{W}_{\text{merged}} = \mathbf{W}_0 + \frac{\alpha}{r}\mathbf{BA}\). The merged model has zero additional inference cost — it's the same size and speed as the original.

QLoRA — Quantized LoRA

Dettmers et al. (2023) — "QLoRA: Efficient Finetuning of Quantized LLMs." Fine-tune a 65B model on a single 48GB GPU by combining 4-bit quantization of the base model with LoRA adapters in 16-bit.

Three innovations: (1) 4-bit NormalFloat (NF4) — a quantization type optimal for normally-distributed weights. (2) Double quantization — quantize the quantization constants themselves. (3) Paged optimizers — use CPU memory to handle GPU memory spikes.

Other PEFT Methods

| Method | What's Trainable | % Params | Quality |
| Full fine-tuning | Everything | 100% | Best (if enough data) |
| LoRA | Low-rank updates to attention | 0.1–1% | Near-full quality |
| QLoRA | Same as LoRA, base in 4-bit | 0.1–1% | Slightly below LoRA |
| Prefix tuning | Soft prefix tokens per layer | <0.1% | Good for specific tasks |
| Adapters | Small bottleneck layers inserted | 1–5% | Good |
| BitFit | Only bias terms | <0.1% | Surprisingly decent |
· · ·
Advanced Topic III

Quantization — Running LLMs on Consumer Hardware

A 70B parameter model in FP16 requires 140GB of memory — far more than any consumer GPU. Quantization reduces each parameter from 16 bits to 8, 4, or even 2 bits, shrinking the model proportionally. The question is: how much quality do you lose?

The Math of Quantization

Linear quantization maps a floating-point value to a discrete integer:

Symmetric Quantization $$x_q = \text{round}\!\left(\frac{x}{s}\right), \quad s = \frac{\max(|x|)}{2^{b-1} - 1}$$ $$\hat{x} = x_q \cdot s \quad \text{(dequantize for computation)}$$

Where \(s\) is the scale factor and \(b\) is the bit-width. For 8-bit: values map to \([-127, 127]\). For 4-bit: values map to \([-7, 7]\).
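The round-trip is two lines of arithmetic. A numpy sketch of per-tensor symmetric quantization (real systems use per-channel or per-group scales, as discussed below):

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Symmetric linear quantization with a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    s = np.abs(x).max() / qmax            # scale set by the largest-magnitude value
    x_q = np.clip(np.round(x / s), -qmax, qmax).astype(np.int32)
    return x_q, s

def dequantize(x_q, s):
    # Reconstruct approximate floats for computation: x_hat = x_q * s
    return x_q.astype(np.float64) * s
```

The reconstruction error per value is at most \(s/2\), which is why a single large outlier (large \(s\)) hurts every other value in the tensor.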

Why LLM Weights Tolerate Quantization

Neural network weights tend to follow a roughly Gaussian distribution centered near zero, with most values small and rare values large. The large outliers are the problem — they force a wide scale factor \(s\), reducing resolution for the many small values. Modern methods handle this with:

Per-channel quantization: separate scale per output channel (row of weight matrix), so one outlier channel doesn't affect others.

GPTQ (Frantar et al., 2022): Post-training quantization that iteratively quantizes weights while compensating for the quantization error using second-order (Hessian) information. Quantizes one column at a time, adjusting remaining columns to minimize the output error.

AWQ (Lin et al., 2023): Observes that 1% of weights (the salient ones corresponding to important activations) matter much more than the rest. Protects these salient channels with per-channel scaling before quantization.

Memory Savings

| Precision | Bits/param | 7B Model | 70B Model |
| FP32 | 32 | 28 GB | 280 GB |
| FP16/BF16 | 16 | 14 GB | 140 GB |
| INT8 | 8 | 7 GB | 70 GB |
| INT4 (GPTQ/AWQ) | 4 | 3.5 GB | 35 GB |
| 2-bit (experimental) | 2 | 1.75 GB | 17.5 GB |

At 4-bit quantization, a 70B model fits on a single A100-80GB with room for KV cache. A 7B model fits on consumer GPUs with 6GB+ VRAM. Quality loss at INT4 is typically 1–3% on benchmarks.

· · ·
Advanced Topic IV

Reasoning Models & Chain-of-Thought

Chain-of-Thought Prompting

Wei et al. (2022) — "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Showed that simply adding "Let's think step by step" or providing step-by-step examples dramatically improves reasoning performance.
An LLM predicting the answer to "What is 17 × 24?" in a single forward pass must compute the answer in ~100 layer operations. But if it generates intermediate steps ("17 × 20 = 340, 17 × 4 = 68, 340 + 68 = 408"), each step is a much simpler computation, and the answer from each step feeds into the next through the context.

Why CoT works mathematically: Each generated token triggers a full forward pass through the network. Chain-of-thought effectively gives the model \(O(T)\) serial compute steps (where \(T\) is the number of reasoning tokens) instead of just the fixed depth of the network. It transforms a bounded-depth computation into an arbitrarily long sequential one.

Self-Consistency

Wang et al. (2022) — "Self-Consistency Improves Chain of Thought Reasoning." Sample multiple reasoning paths and take a majority vote on the final answer. Diverse reasoning paths that converge on the same answer are more likely correct.
Self-Consistency $$\hat{a} = \arg\max_a \sum_{i=1}^{K} \mathbb{1}[a_i = a]$$

Generate \(K\) different reasoning chains (using temperature sampling), extract the final answer from each, and take the majority vote. This exploits the fact that errors in reasoning tend to be diverse (different chains fail differently), while correct reasoning converges.
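The vote itself is one line. A sketch assuming the final answers have already been extracted from each of the \(K\) sampled chains:

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the final answers from K independently sampled chains."""
    return Counter(final_answers).most_common(1)[0][0]
```

For example, five chains yielding 408, 408, 412, 408, 398 would vote 408.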

Test-Time Compute Scaling

OpenAI o1/o3 (2024–2025) and DeepSeek-R1 (2025) — Models trained with reinforcement learning to generate extended internal reasoning chains. These models spend more computation at inference time, trading tokens for accuracy.

The Scaling Law of Thought

Traditional scaling increases training compute (more parameters, more data). Reasoning models introduce a second axis: inference compute. The longer the model "thinks" (generates internal reasoning tokens), the better it performs on hard problems.

Test-Time Compute Scaling (empirical) $$\text{Accuracy}(C_{\text{test}}) \approx A - B \cdot C_{\text{test}}^{-\alpha}$$

Where \(C_{\text{test}}\) is the number of inference tokens. Like training scaling laws, performance improves as a power law of compute — but now the compute is spent at inference time.

DeepSeek-R1: RL for Reasoning

DeepSeek-R1 trains the model to produce long chains of thought using reinforcement learning with a simple reward: correctness of the final answer. The model learns to decompose problems, verify intermediate steps, backtrack when stuck, and allocate more tokens to harder sub-problems.

Key innovation — emergent behaviors: The model spontaneously learns strategies like self-verification ("let me check this"), exploration of alternatives ("alternatively, I could..."), and reflection ("wait, that doesn't seem right") — all from the simple reward signal of getting the final answer correct. No explicit instruction to reason this way was provided.

Tree of Thought

Yao et al. (2023) — "Tree of Thoughts." Generalizes chain-of-thought from a single chain to a tree — the model explores multiple reasoning branches, evaluates them, and can backtrack.
Chain-of-Thought follows one linear path: Problem → Step 1 → Step 2 → Answer. Tree of Thoughts branches at each step (e.g., Steps 1a/1b/1c, then further branches at Step 2), prunes branches judged unpromising, and aggregates the surviving answers by voting.
· · ·
Advanced Topic V

Retrieval-Augmented Generation

Lewis et al. (2020) — "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Combined a retriever (finds relevant documents) with a generator (produces the answer), allowing LLMs to access information beyond their training data.
An LLM's knowledge is frozen at training time. RAG gives it a "reference library" — before answering, the model retrieves relevant documents and includes them in its context. This reduces hallucinations, enables up-to-date information, and provides citations.

The RAG Pipeline

User query → embed the query → vector similarity search → top-k chunks → augment the prompt with the retrieved chunks → LLM generates the answer.

Step 1: Document Processing

Split documents into chunks (typically 256–512 tokens with overlap), then embed each chunk using a sentence embedding model:

$$\mathbf{e}_i = f_{\text{embed}}(\text{chunk}_i) \in \mathbb{R}^d$$

Popular embedding models: E5, BGE, Cohere embed-v3, OpenAI text-embedding-3. Store the embedding vectors in a vector database (FAISS, Pinecone, Milvus, Chroma).

Step 2: Retrieval

Embed the user query with the same model, then find the nearest chunks by cosine similarity:

$$\text{score}(q, c_i) = \frac{\mathbf{e}_q^T \mathbf{e}_{c_i}}{\|\mathbf{e}_q\| \|\mathbf{e}_{c_i}\|}$$

Return the top-\(k\) chunks (typically \(k = 3\)–\(10\)).
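The scoring step above is a normalized matrix-vector product. A minimal numpy sketch of brute-force retrieval (a vector database replaces the argsort with an approximate nearest-neighbor index):

```python
import numpy as np

def top_k_chunks(query_emb, chunk_embs, k=3):
    """Rank stored chunk embeddings by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    C = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = C @ q                      # cosine similarity per chunk
    idx = np.argsort(-scores)[:k]       # indices of the k most similar chunks
    return idx, scores[idx]
```

The returned indices select the chunk texts that get pasted into the augmented prompt in Step 3.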

Step 3: Augmented Generation

Construct a prompt like: [System: Use the following context to answer. Context: {chunk1} {chunk2} ...] [User: {query}]

Advanced RAG Techniques

Reranking: After initial retrieval, use a cross-encoder model (which sees query + document jointly, not just embeddings) to re-score and reorder results. Much more accurate than embedding-only similarity but too slow for the initial search.

Hybrid search: Combine semantic search (embeddings) with lexical search (BM25/TF-IDF) for better recall. Semantic search finds paraphrases; lexical search catches exact keywords.

Query transformation: Rewrite the query to improve retrieval — decompose complex questions, expand abbreviations, generate hypothetical answer passages (HyDE).

· · ·
Advanced Topic VI

Multi-Modal Models

Vision Transformers (ViT)

Dosovitskiy et al. (2020) — "An Image Is Worth 16x16 Words." Applied the transformer directly to images by splitting them into patches, treating each patch as a "token." Proved that pure transformer architectures match or beat CNNs for vision tasks when trained at sufficient scale.

How ViT Works

Image โ†’ Sequence of Tokens $$\text{Image} \in \mathbb{R}^{H \times W \times C} \longrightarrow \text{Patches} \in \mathbb{R}^{N \times (P^2 \cdot C)} \xrightarrow{\mathbf{W}_{\text{patch}}} \text{Tokens} \in \mathbb{R}^{N \times d}$$

A 224×224 image with 16×16 patches gives \(N = (224/16)^2 = 196\) tokens. Each patch is linearly projected to dimension \(d\), then processed by a standard transformer encoder. Classification uses a special [CLS] token prepended to the sequence.
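The image-to-patches reshape is the only non-standard step; a numpy sketch (the linear projection \(\mathbf{W}_{\text{patch}}\) would follow as an ordinary matrix multiply):

```python
import numpy as np

def patchify(image, P=16):
    """Split an H x W x C image into N = (H/P) * (W/P) flattened patches."""
    H, W, C = image.shape
    x = image.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)          # group the two patch-grid axes together
    return x.reshape(-1, P * P * C)         # (N, P^2 * C), ready for W_patch
```

For a 224×224×3 image this produces the 196×768 patch matrix described above.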

CLIP — Connecting Vision and Language

Radford et al. (2021) — "Learning Transferable Visual Models From Natural Language Supervision." Trained a vision encoder and text encoder jointly on 400M image-text pairs from the internet, creating a shared embedding space where images and text are directly comparable.

Contrastive Learning Objective

CLIP Loss (for a batch of N image-text pairs) $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(\mathbf{v}_i^T\mathbf{t}_i / \tau)}{\sum_{j=1}^{N}\exp(\mathbf{v}_i^T\mathbf{t}_j / \tau)} + \log\frac{\exp(\mathbf{t}_i^T\mathbf{v}_i / \tau)}{\sum_{j=1}^{N}\exp(\mathbf{t}_i^T\mathbf{v}_j / \tau)}\right]$$

Where \(\mathbf{v}_i\) is the image embedding, \(\mathbf{t}_i\) is the text embedding, and \(\tau\) is a learned temperature parameter. The loss maximizes similarity between matched pairs and minimizes similarity between unmatched pairs.
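The loss can be written compactly over the \(N \times N\) similarity matrix. A numpy sketch (batch version, with the matched pairs on the diagonal; \(\tau\) fixed rather than learned):

```python
import numpy as np

def clip_loss(v, t, tau=0.07):
    """Symmetric contrastive loss over N matched image/text embedding pairs."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = v @ t.T / tau                      # N x N cosine similarities / temperature

    def xent_diag(L):                           # cross-entropy with targets on the diagonal
        L = L - L.max(axis=1, keepdims=True)
        logp = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Image-to-text and text-to-image directions, averaged
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Perfectly matched pairs put the largest logit on the diagonal of both directions, driving the loss toward zero.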

Vision-Language Models (LLaVA Pattern)

Modern vision-language models follow a simple recipe:

Image → Vision Encoder (ViT/CLIP) → Projection (MLP or linear) → visual tokens. The LLM decoder then consumes [visual tokens] + [text tokens] and generates the response.

The vision encoder extracts visual features, a projection layer maps them into the LLM's embedding space as "visual tokens," and the LLM processes both visual and text tokens with its standard self-attention mechanism. From the LLM's perspective, image features look just like text tokens.

· · ·
Advanced Topic VII

Evaluation & Benchmarks

Perplexity — The Fundamental LM Metric

Perplexity $$\text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log P(x_t \mid x_{<t})\right)$$

Perplexity is the exponential of the average cross-entropy loss. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options at each step. Lower is better. A perfectly confident model would have perplexity 1.
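Given the per-token log-probabilities from an evaluation run, the computation is one line; a sketch:

```python
import numpy as np

def perplexity(token_log_probs):
    """PPL = exp of the mean negative log-likelihood over T next-token predictions."""
    return float(np.exp(-np.mean(token_log_probs)))
```

Feeding in uniform probabilities of 0.1 per token recovers a perplexity of exactly 10, matching the "choosing among 10 options" intuition.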

Generation Metrics

| Metric | Formula Intuition | Best For |
| BLEU | Precision of n-gram overlap with reference | Machine translation |
| ROUGE-L | Longest common subsequence with reference | Summarization |
| BERTScore | Cosine similarity of BERT embeddings per token | Semantic similarity |
| METEOR | Harmonic mean of precision & recall with synonyms | Translation (improved) |
N-gram metrics are poor evaluators of LLMs. BLEU and ROUGE only measure surface-level overlap. A paraphrase that perfectly captures the meaning but uses different words scores poorly. Modern LLM evaluation increasingly relies on human evaluation and LLM-as-judge methods (using a stronger model to evaluate a weaker model's outputs).

LLM Benchmarks

| Benchmark | What It Tests | Format |
| MMLU | Broad knowledge across 57 subjects | Multiple choice |
| HumanEval / MBPP | Code generation correctness | Function completion → unit tests |
| GSM8K | Grade-school math reasoning | Word problems → numerical answer |
| MATH | Competition-level math | Hard problems → solutions |
| ARC-Challenge | Science reasoning | Multiple choice |
| HellaSwag | Common sense completion | Choose most plausible continuation |
| TruthfulQA | Resistance to common misconceptions | Open-ended generation |
| MT-Bench / Chatbot Arena | Overall conversational quality | LLM-as-judge / human preference |
The Benchmark Contamination Problem: If benchmark data appears in the training corpus, the model has "seen the test." This is increasingly problematic as training datasets grow to encompass most of the internet. Solutions include: dynamic benchmarks that change over time, private held-out test sets, and evaluating on tasks that require reasoning rather than recall.
· · ·
Advanced Topic VIII

Frontier Concepts

Knowledge Distillation

Train a smaller "student" model to mimic a larger "teacher" model by matching the teacher's output probability distribution, not just the hard labels:

Distillation Loss $$\mathcal{L}_{\text{distill}} = (1-\alpha)\,\mathcal{L}_{\text{CE}}(y, p_S) + \alpha\,T^2 \cdot D_{\text{KL}}(p_T^{(T)} \| p_S^{(T)})$$

Where \(p_T^{(T)}\) and \(p_S^{(T)}\) are the teacher and student distributions at temperature \(T\). Higher temperature makes the distributions softer, revealing more structure (the relative probabilities of wrong answers contain information about similarity between classes).
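The loss combines two softmaxes at different temperatures. A numpy sketch for a single example (logit vectors rather than a batch):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, y, T=2.0, alpha=0.5):
    """(1-alpha) * hard-label CE + alpha * T^2 * KL(teacher || student) at temperature T."""
    pS = softmax(student_logits, T)
    pT = softmax(teacher_logits, T)
    kl = float(np.sum(pT * (np.log(pT) - np.log(pS))))    # D_KL(p_T^(T) || p_S^(T))
    ce = float(-np.log(softmax(student_logits)[y]))        # CE against the hard label y
    return (1 - alpha) * ce + alpha * T ** 2 * kl
```

The \(T^2\) factor compensates for the \(1/T^2\) shrinkage of the soft-target gradients, keeping the two terms on comparable scales.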

Speculative Decoding

LLM inference is memory-bandwidth-bound, not compute-bound — the GPU sits idle while loading weights from memory. Speculative decoding uses a small, fast "draft" model to generate several candidate tokens, then verifies them all at once with the large model. If the draft model is correct (which it often is for easy tokens), you get multiple tokens from one large-model forward pass.
Acceptance probability $$P(\text{accept}) = \min\!\left(1, \frac{p_{\text{target}}(x)}{p_{\text{draft}}(x)}\right)$$

Tokens to which the target model assigns at least as much probability as the draft model are always accepted. Tokens where the target assigns less are accepted with probability equal to the ratio (and on rejection, a replacement token is resampled from a corrected distribution). This guarantees the output distribution is identical to the target model — speculative decoding is an exact optimization, not an approximation.
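A minimal sketch of the acceptance rule, assuming `p_target` and `p_draft` are the two models' next-token distributions (the rejection-resampling step is omitted):

```python
import numpy as np

def accept_draft_token(x, p_target, p_draft, rng):
    """Accept drafted token x with probability min(1, p_target[x] / p_draft[x])."""
    ratio = p_target[x] / p_draft[x]
    return bool(rng.random() < min(1.0, ratio))
```

Whenever the target probability meets or exceeds the draft probability, the ratio clips to 1 and the token is accepted unconditionally.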

State-Space Models (SSMs)

Gu & Dao (2023) — "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." Proposed a selective state-space model that processes sequences in \(O(n)\) time (vs \(O(n^2)\) for attention) while matching transformer quality on language tasks up to moderate scale.

The SSM Equation

Continuous State-Space Model $$\mathbf{h}'(t) = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t)$$ $$y(t) = \mathbf{C}\mathbf{h}(t) + Dx(t)$$

Discretized for sequences (with step size \(\Delta\)): \(\mathbf{h}_t = \bar{\mathbf{A}}\mathbf{h}_{t-1} + \bar{\mathbf{B}}x_t\). This is a linear recurrence — it can be computed either recurrently (\(O(n)\) sequential) or as a convolution (\(O(n \log n)\) parallel).
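The recurrent form is a short loop. A numpy sketch of the discretized scan for a scalar input channel (already-discretized \(\bar{\mathbf{A}}, \bar{\mathbf{B}}\) assumed; the \(Dx_t\) skip term omitted):

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Run the discretized recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t     # O(1) state update per token
        ys.append(float(C @ h))
    return np.array(ys)
```

With a 1-dimensional state and \(\bar{A} = 0.5\), an impulse input decays geometrically, the hallmark of a linear recurrence.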

Mamba's key innovation — selective mechanism: Making \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\) input-dependent (functions of \(x_t\)) allows the model to selectively remember or forget information, similar to LSTM gates but within the SSM framework. This breaks the time-invariance needed for convolution mode, so Mamba uses a hardware-aware scan algorithm instead.

Constitutional AI (CAI)

Bai et al. (2022) — "Constitutional AI: Harmlessness from AI Feedback." Instead of relying entirely on human feedback for alignment, provide the model with a set of principles ("constitution") and use AI-generated feedback: the model critiques its own outputs according to the constitution, then revises them.

The RLAIF pipeline: (1) Generate responses. (2) Ask the model to critique responses against principles. (3) Ask the model to revise responses. (4) Train a preference model on (original, revised) pairs. (5) Use this as the reward model in standard RL training.

· · ·
Reference

Complete Exercise Solutions

Solutions to all exercises from all three parts, in order.

Part I — Mathematical Foundations

§1.1 Linear Algebra

(a) \(\mathbf{w}^T\mathbf{x}_1 = 0.5(4) + (-1.0)(2) = 2.0 - 2.0 = 0.0\)
\(\mathbf{w}^T\mathbf{x}_2 = 0.5(1) + (-1.0)(5) = 0.5 - 5.0 = -4.5\)
The weight vector "prefers" \(\mathbf{x}_1\) (score 0.0 vs -4.5) — or more precisely, \(\mathbf{x}_1\) is less negatively aligned with \(\mathbf{w}\). In a classifier context, \(\mathbf{x}_1\) is closer to the positive side of the decision boundary.

(b) \(\mathbf{x}_1 \cdot \mathbf{x}_2 = 4(1) + 2(5) = 14\)
\(\|\mathbf{x}_1\| = \sqrt{16 + 4} = \sqrt{20} \approx 4.47\)
\(\|\mathbf{x}_2\| = \sqrt{1 + 25} = \sqrt{26} \approx 5.10\)
\(\cos\theta = 14 / (4.47 \times 5.10) \approx 14 / 22.8 \approx 0.614\)
Moderately similar — the angle between them is about 52°.

§1.3 Optimization

(a) \(f(\theta) = (\theta - 3)^2\), so \(\nabla f = 2(\theta - 3)\).
Step 0: \(\theta_0 = 0\), gradient = \(2(0-3) = -6\), update: \(\theta_1 = 0 - 0.1(-6) = 0.6\)
Step 1: \(\theta_1 = 0.6\), gradient = \(2(0.6-3) = -4.8\), update: \(\theta_2 = 0.6 + 0.48 = 1.08\)
Step 2: \(\theta_2 = 1.08\), gradient = \(2(1.08-3) = -3.84\), update: \(\theta_3 = 1.08 + 0.384 = 1.464\)
Moving toward the minimum at \(\theta = 3\), but slowly with \(\eta = 0.1\).

(b) \(y = \sigma(wx + b)\). Let \(z = wx + b\).
\(\frac{\partial y}{\partial w} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w} = \sigma(z)(1 - \sigma(z)) \cdot x\)
Or substituting: \(\frac{\partial y}{\partial w} = y(1-y) \cdot x\).

Part I — Core Machine Learning

§2.2 Logistic Regression

(a) \(z = w_{\text{free}} x_{\text{free}} + w_{\text{meeting}} x_{\text{meeting}} + b\)
\(z = 2.0(1) + (-1.5)(1) + (-0.5) = 2.0 - 1.5 - 0.5 = 0.0\)
\(P(\text{spam}) = \sigma(0.0) = 0.5\).
The model is perfectly uncertain — the positive signal from "free" is exactly cancelled by "meeting" and the bias.

(b) With \(y = 1\) and \(\hat{p} = 0.5\):
\(\mathcal{L} = -[1 \cdot \log(0.5) + 0 \cdot \log(0.5)] = -\log(0.5) = \ln 2 \approx 0.693\)
This is the loss at maximum uncertainty — note it is not the worst case: a confidently wrong prediction (\(\hat{p} \to 0\) with \(y = 1\)) drives the loss toward infinity.

(c) \(\frac{\partial \mathcal{L}}{\partial w_{\text{free}}} = (\hat{p} - y) \cdot x_{\text{free}} = (0.5 - 1)(1) = -0.5\)
Update: \(w_{\text{free}}' = 2.0 - 0.1(-0.5) = 2.0 + 0.05 = 2.05\)
The weight for "free" increases, making the model slightly more inclined to classify emails with "free" as spam.

Part I — Neural Networks

§3.3 Backpropagation

From the forward pass: \(a^{[2]} = 0.59\), \(y = 1\), \(\mathbf{a}^{[1]} = [0.25, 0.40]^T\), \(\mathbf{x} = [1, 0.5]^T\).
\(\mathbf{W}^{[2]} = [0.5, 0.6]\).

(a) \(\delta^{[2]} = 0.59 - 1 = -0.41\)

(b) \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} \cdot (\mathbf{a}^{[1]})^T = -0.41 \times [0.25, 0.40] = [-0.1025, -0.164]\)

(c) \(\boldsymbol{\delta}^{[1]} = (\mathbf{W}^{[2]})^T \delta^{[2]} \odot \text{ReLU}'(\mathbf{z}^{[1]})\)
\(= \begin{bmatrix}0.5\\0.6\end{bmatrix} \times (-0.41) \odot \begin{bmatrix}1\\1\end{bmatrix} = \begin{bmatrix}-0.205\\-0.246\end{bmatrix}\)

(d) With \(\eta = 0.1\):
\(W_1^{[2]}: 0.5 - 0.1(-0.1025) = 0.5103\)
\(W_2^{[2]}: 0.6 - 0.1(-0.164) = 0.6164\)
\(W_{11}^{[1]}: 0.1 - 0.1(-0.205)(1) = 0.1205\)
\(W_{12}^{[1]}: 0.3 - 0.1(-0.205)(0.5) = 0.3103\)
\(W_{21}^{[1]}: 0.2 - 0.1(-0.246)(1) = 0.2246\)
\(W_{22}^{[1]}: 0.4 - 0.1(-0.246)(0.5) = 0.4123\)
All weights increase (since \(\delta < 0\) means the model under-predicted), pushing the output closer to 1.

Part I — Deep Learning Architectures

§4.2 RNNs & LSTMs

(a) If forget gate \(\mathbf{f}_t = \mathbf{1}\) and input gate \(\mathbf{i}_t = \mathbf{0}\):
\(\mathbf{c}_t = \mathbf{1} \odot \mathbf{c}_{t-1} + \mathbf{0} \odot \tilde{\mathbf{c}}_t = \mathbf{c}_{t-1}\)
The cell state is perfectly preserved — nothing is forgotten, nothing new is added. This is the "information highway" mode.

Reverse (\(\mathbf{f}_t = \mathbf{0}\), \(\mathbf{i}_t = \mathbf{1}\)):
\(\mathbf{c}_t = \mathbf{0} \odot \mathbf{c}_{t-1} + \mathbf{1} \odot \tilde{\mathbf{c}}_t = \tilde{\mathbf{c}}_t\)
Previous memory is completely erased and replaced with the new candidate. This is a "reset" mode.

(b) An additive update means \(\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t\) (just the forget gate, an element-wise scaling). If \(\mathbf{f}_t \approx 1\), the gradient flows through unchanged — no vanishing!
A multiplicative update like \(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} \odot \tilde{\mathbf{c}}_t\) would make the gradient a product of many terms, leading to vanishing/exploding. The additive structure is the key insight.

Part I — NLP Foundations

§5.1 Word Embeddings

(a) The equation \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\) implies that semantic relationships are encoded as linear directions in the embedding space. Specifically:
\(\vec{\text{king}} - \vec{\text{queen}} \approx \vec{\text{man}} - \vec{\text{woman}}\)
This means there exists a "gender direction" that is approximately consistent across different word pairs. Individual dimensions don't correspond to single concepts — rather, directions (linear combinations of dimensions) encode semantic attributes like gender, royalty, plurality, tense, etc.

(b) Using the same matrix for center and context embeddings would force \(\mathbf{v}_w^T \mathbf{v}_w\) (the self-similarity) to always be the squared norm — it can't be negative. This means the model can't learn that a word is a poor predictor of itself as context (which is usually the case — "the" rarely predicts another "the" nearby). Two separate matrices give more flexibility: the center and context roles are genuinely different, and having separate embeddings lets the model capture these different roles.

Part I — Attention & Transformers

§6.4 Transformer Architecture

(a) With \(d = 256\) and \(h = 4\) heads: \(d_k = d_v = d/h = 64\).
Each \(\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V \in \mathbb{R}^{256 \times 64}\).

(b) Causal mask for sequence length 4:
\(\mathbf{M} = \begin{bmatrix}0 & -\infty & -\infty & -\infty\\0 & 0 & -\infty & -\infty\\0 & 0 & 0 & -\infty\\0 & 0 & 0 & 0\end{bmatrix}\)
After softmax, row \(i\) of the attention matrix is a probability distribution over positions \(0, 1, \ldots, i\). Each row represents: "given that I'm at position \(i\), how much should I attend to each past position (including myself)?"

(c) Residual connections let the gradient flow directly through the identity path: \(\frac{\partial (\mathbf{x} + f(\mathbf{x}))}{\partial \mathbf{x}} = \mathbf{I} + \frac{\partial f}{\partial \mathbf{x}}\). Even if \(\frac{\partial f}{\partial \mathbf{x}}\) is small (vanishing), the identity term \(\mathbf{I}\) keeps the layer's Jacobian close to the identity, so the gradient cannot vanish through that layer. Without residual connections, a 96-layer transformer would be effectively untrainable due to vanishing gradients; with them, gradients can flow directly from the loss to any layer.
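A scalar caricature of this effect (the 0.1 local derivative is a made-up stand-in for a "vanishing" layer; real Jacobians are matrices):

```python
# Depth-96 chain of layers whose local derivative is a small constant 0.1.
# Without residuals, the end-to-end gradient is the product 0.1**96;
# with residuals, each factor becomes (1 + 0.1), so the signal survives.
depth = 96
plain = 0.1 ** depth      # numerically zero for training purposes
residual = 1.1 ** depth   # large: the identity path keeps gradients alive
print(plain, residual)
```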

Part I โ€” Modern LLMs

ยง7.4โ€“7.5 MoE and RLHF

(a) With 64 experts and top-2 routing, \(2/64 = 3.125\%\) of the expert (FFN) parameters are active per token; the shared components (attention, embeddings, the router itself) always run. This is beneficial because: (1) Training: total model capacity is huge (enabling more knowledge storage), but each gradient update is cheap (only 2 experts plus the shared components receive gradients). (2) Inference: each forward pass only needs to load and compute 2 experts' worth of FFN weights, dramatically reducing latency and memory-bandwidth requirements.
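Top-2 routing itself is a few lines. This is a generic sketch of top-k gating (argsort, then softmax over the two selected logits), not the exact router of any particular MoE paper:

```python
import numpy as np

def top2_route(logits):
    """Pick 2 of E experts per token and renormalize their gate weights.
    Minimal top-k gating sketch; real routers add load-balancing losses."""
    idx = np.argsort(logits, axis=-1)[:, -2:]             # (tokens, 2) expert ids
    top = np.take_along_axis(logits, idx, axis=-1)        # their logits
    gates = np.exp(top) / np.exp(top).sum(-1, keepdims=True)
    return idx, gates

E, tokens = 64, 3
logits = np.random.default_rng(0).standard_normal((tokens, E))
idx, gates = top2_route(logits)
print(idx.shape)       # (3, 2): each token is sent to 2 of 64 experts
print(gates.sum(-1))   # each token's two gate weights sum to 1
```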

(b) If \(\beta \to 0\): the KL constraint vanishes. The model optimizes reward without restriction, likely leading to "reward hacking" โ€” finding degenerate outputs that score high on the imperfect reward model but are nonsensical or harmful.
If \(\beta \to \infty\): the KL term dominates. The model is forced to stay extremely close to the reference policy and barely changes. Alignment has almost no effect.
Good values of \(\beta\) balance these extremes.

(c) A model trained only on next-token prediction learns to predict what text looks like on the internet โ€” including toxic, biased, unhelpful, or harmful text. If asked "How do I hotwire a car?", the pretrained model predicts the most likely continuation, which is a direct answer. RLHF teaches the model that the preferred response is to decline or redirect. More generally, RLHF shifts the model from "what would the internet say?" to "what would a helpful, harmless, honest assistant say?"

Part III โ€” Advanced Topics

ยงI Efficient Attention

(a) \(\mathbf{QK}^T\) for sequence length \(n = 32768\), \(d_k = 128\):
FLOPs: \(2 \times n \times n \times d_k = 2 \times 32768^2 \times 128 \approx 2.75 \times 10^{11}\) FLOPs per head.
Memory for attention matrix in FP16: \(n^2 \times 2\) bytes = \(32768^2 \times 2 = 2.15\) GB per head. With 32 heads, that's ~69 GB just for attention matrices โ€” clearly the bottleneck.

(b) Sliding window (\(W = 4096\)): each token only attends to \(W\) tokens instead of \(n\).
FLOPs ratio: \(W/n = 4096/32768 = 12.5\%\) of the full attention cost.
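The arithmetic in parts (a) and (b) re-derived in a few lines:

```python
# Cost of the n x n attention matrix at n = 32768, d_k = 128, 32 heads
n, d_k, heads = 32768, 128, 32

flops_per_head = 2 * n * n * d_k          # multiply-accumulate for QK^T
mem_per_head_gb = n * n * 2 / 1e9         # FP16 = 2 bytes per entry
print(f"{flops_per_head:.3e} FLOPs")                      # ~2.75e11
print(f"{mem_per_head_gb:.2f} GB/head, "
      f"{heads * mem_per_head_gb:.0f} GB total")          # ~2.15 GB, ~69 GB

W = 4096
print(W / n)  # 0.125 -> sliding window costs 12.5% of full attention
```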

(c) Flash Attention computes the exact same \(\mathbf{QK}^T\) and \(\text{softmax}(...)\mathbf{V}\) โ€” it doesn't reduce the number of arithmetic operations. It reduces memory IO: by tiling the computation to fit in fast SRAM, it avoids the expensive round-trip of writing the \(n \times n\) attention matrix to slow HBM and reading it back. The speedup comes entirely from better memory access patterns, not fewer FLOPs. (In practice, modern GPUs are memory-bandwidth-bound for attention, so reducing IO is more impactful than reducing FLOPs.)
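The tiling trick rests on the online softmax: keep a running row-max \(m\) and normalizer \(l\), rescale them whenever a new tile raises the max, and never form the \(n \times n\) matrix. The sketch below is a single-threaded NumPy illustration of that recurrence (real Flash Attention also tiles \(\mathbf{Q}\) and fuses everything into one GPU kernel); it produces exactly the same output as the naive version:

```python
import numpy as np

def attention_full(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(-1, keepdims=True))
    return (P / P.sum(-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=16):
    """Process K/V in tiles with running max m and normalizer l,
    never materializing the full (n, n) score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(V, dtype=float)
    m = np.full(n, -np.inf)              # running row max
    l = np.zeros(n)                      # running softmax denominator
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T * scale     # only an (n, block) tile
        m_new = np.maximum(m, S.max(-1))
        corr = np.exp(m - m_new)             # rescale previous statistics
        P = np.exp(S - m_new[:, None])
        l = l * corr + P.sum(-1)
        O = O * corr[:, None] + P @ V[j:j + block]
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
assert np.allclose(attention_full(Q, K, V), attention_tiled(Q, K, V))
```

The exactness is the point: unlike sliding-window or linear attention, nothing is approximated; only the order of memory accesses changes.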

ยงIV LLM Training Pipeline

(a) \(\text{Parameters} \approx 12 \times L \times d^2 = 12 \times 32 \times 4096^2 = 12 \times 32 \times 16.78M = 6.44B\).
LLaMA-7B has ~6.7B parameters, so our estimate of 6.44B is very close. The \(12Ld^2\) formula counts the 4 attention projections (\(Q, K, V, O\), each \(d \times d\), giving \(4d^2\)) and the 2 FFN projections (\(d \times 4d\) and \(4d \times d\), giving \(8d^2\)); biases and layer norms are small enough to neglect. The slight undercount is mainly because the formula ignores the embedding layer (\(V \times d\), where \(V\) is the vocabulary size).
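The estimate broken into its components:

```python
# Checking the 12 * L * d^2 estimate term by term
L_layers, d = 32, 4096
attn = 4 * d * d          # Q, K, V, O projections: 4 d^2
ffn = 2 * d * (4 * d)     # d -> 4d and 4d -> d projections: 8 d^2
per_layer = attn + ffn    # = 12 d^2
total = L_layers * per_layer
print(total / 1e9)        # ~6.44 (billion), close to LLaMA-7B's ~6.7B
```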

(b) Gradient updates = total tokens / batch size = \(2 \times 10^{12} / (4 \times 10^6) = 500{,}000\) steps.
At typical training throughput, this might take 20โ€“30 days on a 1024-GPU cluster.

(c) BF16 has 8 exponent bits (same as FP32) but only 7 mantissa bits (vs 23 for FP32 and 10 for FP16). FP16 has only 5 exponent bits, giving a range of \([6 \times 10^{-8}, 65504]\). Gradients often exceed 65504 (overflow โ†’ NaN) or fall below \(6 \times 10^{-8}\) (underflow โ†’ 0). BF16's range is \([10^{-38}, 3.4 \times 10^{38}]\), matching FP32. You lose some precision, but precision matters less than range for gradient accumulation.
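FP16's narrow exponent range is easy to observe. Stock NumPy has no BF16 dtype, so FP32 stands in for it below (the two share the same 8-bit exponent, hence the same range):

```python
import numpy as np

x = np.float16(70000.0)   # above FP16's max of 65504 -> overflows to inf
y = np.float16(1e-8)      # below FP16's ~6e-8 floor  -> underflows to 0
z = np.float32(70000.0)   # fine: FP32/BF16 range reaches ~3.4e38
print(x, y, z)
```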
ยท ยท ยท

What's Next?

You now have a complete mathematical and conceptual foundation spanning from dot products to frontier LLM architectures. The path forward is hands-on:

1. Implement. Build a transformer from scratch in PyTorch. Karpathy's nanoGPT and minbpe are the gold standard starting points. The gap between understanding the math and writing the code is where real understanding forms.

2. Read papers. Start with the annotated reading list from Part I. With the mathematical foundation from these three volumes, you should be able to follow the key equations in any modern NLP paper. When you encounter unfamiliar notation, map it back to the concepts here.

3. Experiment. Fine-tune an open-weight model (Qwen, LLaMA, Mistral) with LoRA on a task you care about. Set up a RAG pipeline. Run ablations โ€” change one thing at a time and observe the effect. The intuition you build from experimentation is irreplaceable.

4. Stay current. Follow ArXiv (cs.CL, cs.LG), read ML Twitter/X threads, watch conference talks (NeurIPS, ICML, ACL, EMNLP). The field moves fast, but with solid foundations, new papers become incremental extensions of what you already know.

"The only way to learn mathematics is to do mathematics." โ€” Paul Halmos

The same is true for machine learning. The equations in these pages are not the destination โ€” they're the map. The territory is in the code, the experiments, and the papers.