Companion Volume III — Frontier Topics & Complete Solutions

ML First Principles, Part III:
Advanced Topics & Practice

Efficient attention, parameter-efficient fine-tuning, quantization, reasoning models,
multi-modal architectures, RAG, evaluation — and solutions to every exercise.

Contents

  1. Efficient Attention & Long Context
    • Flash Attention · Sliding Window · Linear Attention · RoPE Scaling
  2. Parameter-Efficient Fine-Tuning
    • Full fine-tuning · LoRA derivation · QLoRA · Adapters · Prefix Tuning
  3. Quantization
    • INT8 · GPTQ · AWQ · GGUF · Why 4-bit works
  4. Reasoning Models & Chain-of-Thought
    • CoT prompting · Self-consistency · Tree of Thought · o1/o3 · DeepSeek-R1 · Test-time compute
  5. Retrieval-Augmented Generation
    • Architecture · Embedding models · Vector databases · Chunking · Reranking
  6. Multi-Modal Models
    • Vision Transformers · CLIP · LLaVA · Visual tokenization
  7. Evaluation & Benchmarks
    • Perplexity · BLEU/ROUGE · LLM benchmarks · Contamination · Human eval
  8. Frontier Concepts
    • Distillation · Speculative decoding · Constitutional AI · State-space models
  9. Complete Exercise Solutions
    • All exercises from Parts I, II, and III with detailed solutions
Advanced Topic I

Efficient Attention & Long Context

The Quadratic Bottleneck

Standard self-attention computes an \(n \times n\) attention matrix, where \(n\) is the sequence length. Computation is \(O(n^2 d)\) and memory for the attention matrix is \(O(n^2)\) per head. For a 128K token context: the attention matrix alone has \(128{,}000^2 \approx 16.4\) billion entries per head. This is the fundamental scaling challenge.

Flash Attention — Exact Attention, Hardware-Aware

Dao et al. (2022, 2023) — "FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness" and "FlashAttention-2." Reformulated attention to be tiling-friendly, reducing memory from \(O(n^2)\) to \(O(n)\) without any approximation.
GPUs have a hierarchy of memory: HBM (large, slow) and SRAM (small, fast). Standard attention materializes the full \(n \times n\) matrix in HBM, then reads it back for softmax — this memory transfer is the bottleneck, not the arithmetic. Flash Attention computes attention in tiles that fit in SRAM, never materializing the full matrix.

The Key Trick: Online Softmax

The challenge is that softmax requires the maximum value across the entire row (for numerical stability) and the sum of all exponentials (for normalization). Flash Attention uses an online algorithm that processes blocks incrementally:

Online Softmax Accumulation $$m^{(j)} = \max(m^{(j-1)}, \max(\mathbf{S}^{(j)}))$$ $$\ell^{(j)} = e^{m^{(j-1)} - m^{(j)}} \ell^{(j-1)} + \text{rowsum}(e^{\mathbf{S}^{(j)} - m^{(j)}})$$ $$\mathbf{O}^{(j)} = e^{m^{(j-1)} - m^{(j)}} \mathbf{O}^{(j-1)} + e^{\mathbf{S}^{(j)} - m^{(j)}} \mathbf{V}^{(j)}$$

Here \(m\) tracks the running maximum, \(\ell\) tracks the running softmax denominator, and \(\mathbf{O}\) accumulates the unnormalized output. After processing all blocks, the final output is \(\mathbf{O} / \ell\). The key insight: the correction factor \(e^{m^{(j-1)} - m^{(j)}}\) adjusts previous accumulations when a new block reveals a larger maximum.

Result: 2–4× faster, \(O(n)\) memory instead of \(O(n^2)\), and the answer is mathematically identical to standard attention.
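The recurrence can be checked numerically. Below is a minimal numpy sketch of block-wise attention using the online-softmax accumulators \(m\), \(\ell\), \(\mathbf{O}\); it illustrates the math only, not the fused SRAM-tiled CUDA kernel:

```python
import numpy as np

def online_softmax_attention(Q, K, V, block_size=64):
    """Single-head attention computed block-by-block over the keys,
    using the online-softmax recurrence for m (max), l (denominator), O (output)."""
    n, d = Q.shape
    O = np.zeros((n, V.shape[1]))
    m = np.full(n, -np.inf)   # running row maxima
    l = np.zeros(n)           # running softmax denominators
    for j in range(0, K.shape[0], block_size):
        S = Q @ K[j:j + block_size].T / np.sqrt(d)   # scores vs. this key block
        m_new = np.maximum(m, S.max(axis=1))
        corr = np.exp(m - m_new)                     # rescale previous accumulators
        P = np.exp(S - m_new[:, None])
        l = corr * l + P.sum(axis=1)
        O = corr[:, None] * O + P @ V[j:j + block_size]
        m = m_new
    return O / l[:, None]    # final normalization O / l
```

Because the correction factor rescales earlier blocks whenever a larger maximum appears, the result matches full-matrix softmax attention to floating-point precision.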

Sliding Window Attention (Mistral/Gemma-style)

Instead of attending to all previous tokens, each layer only attends to the most recent \(W\) tokens:

$$\text{Attention}(q_t, K, V) = \text{softmax}\!\left(\frac{q_t K_{[t-W:t]}^T}{\sqrt{d_k}}\right) V_{[t-W:t]}$$

With \(L\) layers and window size \(W\), the effective receptive field is \(L \times W\) tokens — similar to how stacked CNN layers build large receptive fields from small kernels. Mistral 7B uses \(W = 4096\) with \(L = 32\), giving an effective field of 128K tokens.
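The attention pattern this induces is easy to construct; a numpy sketch of the boolean sliding-window mask (True = may attend):

```python
import numpy as np

def sliding_window_mask(n, W):
    """Causal sliding-window mask: position t attends to positions in [t-W+1, t]."""
    i = np.arange(n)[:, None]   # query position
    j = np.arange(n)[None, :]   # key position
    return (j <= i) & (j > i - W)
```

Each row has at most \(W\) True entries, so the per-layer cost is \(O(nW)\) instead of \(O(n^2)\).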

RoPE Context Extension

Models trained with a maximum sequence length can be extended through RoPE scaling techniques:

Position Interpolation (linear scaling) $$\theta'_i = \theta_i / s \quad \text{where } s = \frac{L_{\text{target}}}{L_{\text{trained}}}$$

YaRN (Yet another RoPE extensioN) applies different scaling factors to different frequency bands — low-frequency components (which encode coarse position) are interpolated aggressively, while high-frequency components (which encode fine-grained local position) are left nearly unscaled. This preserves local position information while extending the overall range.
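The linear-scaling formula above can be sketched directly. A minimal numpy sketch of standard RoPE inverse frequencies and position interpolation (YaRN's per-band ramp and attention temperature are omitted):

```python
import numpy as np

def rope_inv_freq(d, base=10000.0):
    # Standard RoPE inverse frequencies: theta_i = base^(-2i/d), i = 0 .. d/2 - 1
    return base ** (-np.arange(0, d, 2) / d)

def position_interpolation(inv_freq, trained_len, target_len):
    # Linear scaling: divide every frequency by s = L_target / L_trained,
    # squeezing target-length positions back into the trained range.
    s = target_len / trained_len
    return inv_freq / s
```

Extending a 4096-token model to 32K this way divides every rotation frequency by 8.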

Exercises

(a) Standard attention for sequence length 32K with \(d = 128\) per head: how many FLOPs for the \(\mathbf{QK}^T\) multiplication? How many bytes to store the attention matrix in FP16?
(b) With sliding window attention (\(W = 4096\)), what fraction of the full attention FLOPs do you use?
(c) Why does Flash Attention not help with the computational cost of attention, only the memory cost?
· · ·
Advanced Topic II

Parameter-Efficient Fine-Tuning

The Fine-Tuning Landscape

| Approach | Strategy | Trainable Params (7B model) | VRAM | Verdict |
| Full fine-tuning | Update ALL parameters | 7B (100%) | ~28 GB | Best quality |
| LoRA | Freeze base, add small rank updates | ~0.1% | ~6 GB | Great tradeoff |
| Prompt tuning | Freeze ALL, prepend soft tokens only | ~0.001% | minimal | Limited capacity |

LoRA — Low-Rank Adaptation (Full Derivation)

Hu et al. (2021) — "LoRA: Low-Rank Adaptation of Large Language Models." Showed that the weight updates during fine-tuning have low intrinsic rank — you can approximate them with tiny low-rank matrices and achieve comparable quality to full fine-tuning.
During fine-tuning, you're changing a huge matrix \(\mathbf{W}\) to \(\mathbf{W} + \Delta\mathbf{W}\). The key insight is that \(\Delta\mathbf{W}\) typically has low rank — it lives in a small subspace. Instead of storing the full \(\Delta\mathbf{W} \in \mathbb{R}^{d \times d}\), we decompose it into two small matrices.

The Math

For a weight matrix \(\mathbf{W}_0 \in \mathbb{R}^{d \times k}\), instead of updating to \(\mathbf{W}_0 + \Delta\mathbf{W}\), we parametrize:

LoRA Decomposition $$\mathbf{W} = \mathbf{W}_0 + \Delta\mathbf{W} = \mathbf{W}_0 + \mathbf{B}\mathbf{A}$$ $$\text{where } \mathbf{B} \in \mathbb{R}^{d \times r}, \; \mathbf{A} \in \mathbb{R}^{r \times k}, \; r \ll \min(d, k)$$

The forward pass becomes:

$$\mathbf{h} = \mathbf{W}_0 \mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x}$$

Where \(\alpha\) is a scaling factor (controls the magnitude of the LoRA update relative to the pretrained weights).
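This parametrization fits in a few lines. A minimal numpy sketch (not the HuggingFace peft API; the initialization constants are illustrative):

```python
import numpy as np

class LoRALinear:
    """A frozen base weight plus a trainable low-rank update: h = W0 x + (alpha/r) B A x."""
    def __init__(self, W0, r=16, alpha=32, seed=0):
        d, k = W0.shape
        rng = np.random.default_rng(seed)
        self.W0 = W0                             # frozen pretrained weight, d x k
        self.A = rng.normal(0.0, 0.02, (r, k))   # Gaussian init
        self.B = np.zeros((d, r))                # zero init, so BA = 0 at the start
        self.scale = alpha / r

    def forward(self, x):
        # Base path plus scaled low-rank path; only A and B would receive gradients
        return self.W0 @ x + self.scale * (self.B @ (self.A @ x))

    def merged_weight(self):
        # Fold the adapter into the base weight: zero extra inference cost
        return self.W0 + self.scale * (self.B @ self.A)
```

At initialization the layer is exactly the pretrained layer (since \(\mathbf{B} = 0\)), and after training the adapter can be merged away.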

Parameter Savings

For a model dimension \(d = 4096\), a weight matrix \(\mathbf{W} \in \mathbb{R}^{4096 \times 4096}\):

Full fine-tuning: \(4096 \times 4096 = 16.8M\) parameters per matrix.
LoRA with \(r = 16\): \(\mathbf{B}: 4096 \times 16 = 65K\) + \(\mathbf{A}: 16 \times 4096 = 65K\) = 131K parameters per matrix.

That's \(131K / 16.8M = 0.78\%\) of the original parameters. For all Q, K, V, and output projections across 32 layers: \(131K \times 4 \times 32 = 16.8M\) trainable parameters out of 7B total = 0.24%.

Initialization

\(\mathbf{A}\) is initialized with a random Gaussian, \(\mathbf{B}\) is initialized to zero. This means \(\Delta\mathbf{W} = \mathbf{BA} = 0\) at the start of training — the model begins exactly as the pretrained model, and the LoRA update grows gradually from zero.

Merging

After training, you can merge LoRA weights back into the base model: \(\mathbf{W}_{\text{merged}} = \mathbf{W}_0 + \frac{\alpha}{r}\mathbf{BA}\). The merged model has zero additional inference cost — it's the same size and speed as the original.

QLoRA — Quantized LoRA

Dettmers et al. (2023) — "QLoRA: Efficient Finetuning of Quantized LLMs." Fine-tune a 65B model on a single 48GB GPU by combining 4-bit quantization of the base model with LoRA adapters in 16-bit.

Three innovations: (1) 4-bit NormalFloat (NF4) — a quantization type optimal for normally-distributed weights. (2) Double quantization — quantize the quantization constants themselves. (3) Paged optimizers — use CPU memory to handle GPU memory spikes.

Other PEFT Methods

| Method | What's Trainable | % Params | Quality |
| Full fine-tuning | Everything | 100% | Best (if enough data) |
| LoRA | Low-rank updates to attention | 0.1–1% | Near-full quality |
| QLoRA | Same as LoRA, base in 4-bit | 0.1–1% | Slightly below LoRA |
| Prefix tuning | Soft prefix tokens per layer | <0.1% | Good for specific tasks |
| Adapters | Small bottleneck layers inserted | 1–5% | Good |
| BitFit | Only bias terms | <0.1% | Surprisingly decent |
· · ·
Advanced Topic III

Quantization — Running LLMs on Consumer Hardware

A 70B parameter model in FP16 requires 140GB of memory — far more than any consumer GPU. Quantization reduces each parameter from 16 bits to 8, 4, or even 2 bits, shrinking the model proportionally. The question is: how much quality do you lose?

The Math of Quantization

Linear quantization maps a floating-point value to a discrete integer:

Symmetric Quantization $$x_q = \text{round}\!\left(\frac{x}{s}\right), \quad s = \frac{\max(|x|)}{2^{b-1} - 1}$$ $$\hat{x} = x_q \cdot s \quad \text{(dequantize for computation)}$$

Where \(s\) is the scale factor and \(b\) is the bit-width. For 8-bit: values map to \([-127, 127]\). For 4-bit: values map to \([-7, 7]\).
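The round-trip is two lines of arithmetic. A numpy sketch of per-tensor symmetric quantization (real systems use per-channel or per-group scales, as discussed below):

```python
import numpy as np

def quantize_symmetric(x, bits=4):
    """Symmetric linear quantization with a single scale for the whole tensor."""
    qmax = 2 ** (bits - 1) - 1            # 7 for 4-bit, 127 for 8-bit
    s = np.abs(x).max() / qmax            # scale set by the largest-magnitude value
    x_q = np.clip(np.round(x / s), -qmax, qmax).astype(np.int32)
    return x_q, s

def dequantize(x_q, s):
    # Reconstruct approximate floats for computation: x_hat = x_q * s
    return x_q.astype(np.float64) * s
```

The reconstruction error per value is at most \(s/2\), which is why a single large outlier (large \(s\)) hurts every other value in the tensor.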

Why LLM Weights Tolerate Quantization

Neural network weights tend to follow a roughly Gaussian distribution centered near zero, with most values small and rare values large. The large outliers are the problem — they force a wide scale factor \(s\), reducing resolution for the many small values. Modern methods handle this with:

Per-channel quantization: separate scale per output channel (row of weight matrix), so one outlier channel doesn't affect others.

GPTQ (Frantar et al., 2022): Post-training quantization that iteratively quantizes weights while compensating for the quantization error using second-order (Hessian) information. Quantizes one column at a time, adjusting remaining columns to minimize the output error.

AWQ (Lin et al., 2023): Observes that 1% of weights (the salient ones corresponding to important activations) matter much more than the rest. Protects these salient channels with per-channel scaling before quantization.

Memory Savings

| Precision | Bits/param | 7B Model | 70B Model |
| FP32 | 32 | 28 GB | 280 GB |
| FP16/BF16 | 16 | 14 GB | 140 GB |
| INT8 | 8 | 7 GB | 70 GB |
| INT4 (GPTQ/AWQ) | 4 | 3.5 GB | 35 GB |
| 2-bit (experimental) | 2 | 1.75 GB | 17.5 GB |

At 4-bit quantization, a 70B model fits on a single A100-80GB with room for KV cache. A 7B model fits on consumer GPUs with 6GB+ VRAM. Quality loss at INT4 is typically 1–3% on benchmarks.

· · ·
Advanced Topic IV

Reasoning Models & Chain-of-Thought

Chain-of-Thought Prompting

Wei et al. (2022) — "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." Showed that simply adding "Let's think step by step" or providing step-by-step examples dramatically improves reasoning performance.
An LLM predicting the answer to "What is 17 × 24?" in a single forward pass must compute the answer in ~100 layer operations. But if it generates intermediate steps ("17 × 20 = 340, 17 × 4 = 68, 340 + 68 = 408"), each step is a much simpler computation, and the answer from each step feeds into the next through the context.

Why CoT works mathematically: Each generated token triggers a full forward pass through the network. Chain-of-thought effectively gives the model \(O(T)\) serial compute steps (where \(T\) is the number of reasoning tokens) instead of just the fixed depth of the network. It transforms a bounded-depth computation into an arbitrarily long sequential one.

Self-Consistency

Wang et al. (2022) — "Self-Consistency Improves Chain of Thought Reasoning." Sample multiple reasoning paths and take a majority vote on the final answer. Diverse reasoning paths that converge on the same answer are more likely correct.
Self-Consistency $$\hat{a} = \arg\max_a \sum_{i=1}^{K} \mathbb{1}[a_i = a]$$

Generate \(K\) different reasoning chains (using temperature sampling), extract the final answer from each, and take the majority vote. This exploits the fact that errors in reasoning tend to be diverse (different chains fail differently), while correct reasoning converges.
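The vote itself is one line. A sketch assuming the final answers have already been extracted from each of the \(K\) sampled chains:

```python
from collections import Counter

def self_consistency(final_answers):
    """Majority vote over the final answers from K independently sampled chains."""
    return Counter(final_answers).most_common(1)[0][0]
```

For example, five chains yielding 408, 408, 412, 408, 398 would vote 408.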

Test-Time Compute Scaling

OpenAI o1/o3 (2024–2025) and DeepSeek-R1 (2025) — Models trained with reinforcement learning to generate extended internal reasoning chains. These models spend more computation at inference time, trading tokens for accuracy.

The Scaling Law of Thought

Traditional scaling increases training compute (more parameters, more data). Reasoning models introduce a second axis: inference compute. The longer the model "thinks" (generates internal reasoning tokens), the better it performs on hard problems.

Test-Time Compute Scaling (empirical) $$\text{Accuracy}(C_{\text{test}}) \approx A - B \cdot C_{\text{test}}^{-\alpha}$$

Where \(C_{\text{test}}\) is the number of inference tokens. Like training scaling laws, performance improves as a power law of compute — but now the compute is spent at inference time.

DeepSeek-R1: RL for Reasoning

DeepSeek-R1 trains the model to produce long chains of thought using reinforcement learning with a simple reward: correctness of the final answer. The model learns to decompose problems, verify intermediate steps, backtrack when stuck, and allocate more tokens to harder sub-problems.

Key innovation — emergent behaviors: The model spontaneously learns strategies like self-verification ("let me check this"), exploration of alternatives ("alternatively, I could..."), and reflection ("wait, that doesn't seem right") — all from the simple reward signal of getting the final answer correct. No explicit instruction to reason this way was provided.

Tree of Thought

Yao et al. (2023) — "Tree of Thoughts." Generalizes chain-of-thought from a single chain to a tree — the model explores multiple reasoning branches, evaluates them, and can backtrack.
Chain-of-Thought follows one linear path: Problem → Step 1 → Step 2 → Answer. Tree of Thoughts branches at each step (e.g., Steps 1a/1b/1c, then further branches at Step 2), prunes branches judged unpromising, and aggregates the surviving answers by voting.
· · ·
Advanced Topic V

Retrieval-Augmented Generation

Lewis et al. (2020) — "Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks." Combined a retriever (finds relevant documents) with a generator (produces the answer), allowing LLMs to access information beyond their training data.
An LLM's knowledge is frozen at training time. RAG gives it a "reference library" — before answering, the model retrieves relevant documents and includes them in its context. This reduces hallucinations, enables up-to-date information, and provides citations.

The RAG Pipeline

User query → embed the query → vector similarity search → top-k chunks → augment the prompt with the retrieved chunks → LLM generates the answer.

Step 1: Document Processing

Split documents into chunks (typically 256–512 tokens with overlap), then embed each chunk using a sentence embedding model:

$$\mathbf{e}_i = f_{\text{embed}}(\text{chunk}_i) \in \mathbb{R}^d$$

Popular embedding models: E5, BGE, Cohere embed-v3, OpenAI text-embedding-3. Store the embedding vectors in a vector database (FAISS, Pinecone, Milvus, Chroma).

Step 2: Retrieval

Embed the user query with the same model, then find the nearest chunks by cosine similarity:

$$\text{score}(q, c_i) = \frac{\mathbf{e}_q^T \mathbf{e}_{c_i}}{\|\mathbf{e}_q\| \|\mathbf{e}_{c_i}\|}$$

Return the top-\(k\) chunks (typically \(k = 3\)–\(10\)).
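The scoring step above is a normalized matrix-vector product. A minimal numpy sketch of brute-force retrieval (a vector database replaces the argsort with an approximate nearest-neighbor index):

```python
import numpy as np

def top_k_chunks(query_emb, chunk_embs, k=3):
    """Rank stored chunk embeddings by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    C = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    scores = C @ q                      # cosine similarity per chunk
    idx = np.argsort(-scores)[:k]       # indices of the k most similar chunks
    return idx, scores[idx]
```

The returned indices select the chunk texts that get pasted into the augmented prompt in Step 3.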

Step 3: Augmented Generation

Construct a prompt like: [System: Use the following context to answer. Context: {chunk1} {chunk2} ...] [User: {query}]

Advanced RAG Techniques

Reranking: After initial retrieval, use a cross-encoder model (which sees query + document jointly, not just embeddings) to re-score and reorder results. Much more accurate than embedding-only similarity but too slow for the initial search.

Hybrid search: Combine semantic search (embeddings) with lexical search (BM25/TF-IDF) for better recall. Semantic search finds paraphrases; lexical search catches exact keywords.

Query transformation: Rewrite the query to improve retrieval — decompose complex questions, expand abbreviations, generate hypothetical answer passages (HyDE).

· · ·
Advanced Topic VI

Multi-Modal Models

Vision Transformers (ViT)

Dosovitskiy et al. (2020) — "An Image Is Worth 16x16 Words." Applied the transformer directly to images by splitting them into patches, treating each patch as a "token." Proved that pure transformer architectures match or beat CNNs for vision tasks when trained at sufficient scale.

How ViT Works

Image โ†’ Sequence of Tokens $$\text{Image} \in \mathbb{R}^{H \times W \times C} \longrightarrow \text{Patches} \in \mathbb{R}^{N \times (P^2 \cdot C)} \xrightarrow{\mathbf{W}_{\text{patch}}} \text{Tokens} \in \mathbb{R}^{N \times d}$$

A 224×224 image with 16×16 patches gives \(N = (224/16)^2 = 196\) tokens. Each patch is linearly projected to dimension \(d\), then processed by a standard transformer encoder. Classification uses a special [CLS] token prepended to the sequence.
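The image-to-patches reshape is the only non-standard step; a numpy sketch (the linear projection \(\mathbf{W}_{\text{patch}}\) would follow as an ordinary matrix multiply):

```python
import numpy as np

def patchify(image, P=16):
    """Split an H x W x C image into N = (H/P) * (W/P) flattened patches."""
    H, W, C = image.shape
    x = image.reshape(H // P, P, W // P, P, C)
    x = x.transpose(0, 2, 1, 3, 4)          # group the two patch-grid axes together
    return x.reshape(-1, P * P * C)         # (N, P^2 * C), ready for W_patch
```

For a 224×224×3 image this produces the 196×768 patch matrix described above.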

CLIP — Connecting Vision and Language

Radford et al. (2021) — "Learning Transferable Visual Models From Natural Language Supervision." Trained a vision encoder and text encoder jointly on 400M image-text pairs from the internet, creating a shared embedding space where images and text are directly comparable.

Contrastive Learning Objective

CLIP Loss (for a batch of N image-text pairs) $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[\log\frac{\exp(\mathbf{v}_i^T\mathbf{t}_i / \tau)}{\sum_{j=1}^{N}\exp(\mathbf{v}_i^T\mathbf{t}_j / \tau)} + \log\frac{\exp(\mathbf{t}_i^T\mathbf{v}_i / \tau)}{\sum_{j=1}^{N}\exp(\mathbf{t}_i^T\mathbf{v}_j / \tau)}\right]$$

Where \(\mathbf{v}_i\) is the image embedding, \(\mathbf{t}_i\) is the text embedding, and \(\tau\) is a learned temperature parameter. The loss maximizes similarity between matched pairs and minimizes similarity between unmatched pairs.
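The loss can be written compactly over the \(N \times N\) similarity matrix. A numpy sketch (batch version, with the matched pairs on the diagonal; \(\tau\) fixed rather than learned):

```python
import numpy as np

def clip_loss(v, t, tau=0.07):
    """Symmetric contrastive loss over N matched image/text embedding pairs."""
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    t = t / np.linalg.norm(t, axis=1, keepdims=True)
    logits = v @ t.T / tau                      # N x N cosine similarities / temperature

    def xent_diag(L):                           # cross-entropy with targets on the diagonal
        L = L - L.max(axis=1, keepdims=True)
        logp = L - np.log(np.exp(L).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Image-to-text and text-to-image directions, averaged
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))
```

Perfectly matched pairs put the largest logit on the diagonal of both directions, driving the loss toward zero.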

Vision-Language Models (LLaVA Pattern)

Modern vision-language models follow a simple recipe:

Image → Vision Encoder (ViT/CLIP) → Projection (MLP or linear) → visual tokens. The LLM decoder then consumes [visual tokens] + [text tokens] and generates the response.

The vision encoder extracts visual features, a projection layer maps them into the LLM's embedding space as "visual tokens," and the LLM processes both visual and text tokens with its standard self-attention mechanism. From the LLM's perspective, image features look just like text tokens.

· · ·
Advanced Topic VII

Evaluation & Benchmarks

Perplexity — The Fundamental LM Metric

Perplexity $$\text{PPL} = \exp\!\left(-\frac{1}{T}\sum_{t=1}^{T}\log P(x_t \mid x_{<t})\right)$$

Perplexity is the exponential of the average cross-entropy loss. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options at each step. Lower is better. A perfectly confident model would have perplexity 1.
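Given the per-token log-probabilities from an evaluation run, the computation is one line; a sketch:

```python
import numpy as np

def perplexity(token_log_probs):
    """PPL = exp of the mean negative log-likelihood over T next-token predictions."""
    return float(np.exp(-np.mean(token_log_probs)))
```

Feeding in uniform probabilities of 0.1 per token recovers a perplexity of exactly 10, matching the "choosing among 10 options" intuition.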

Generation Metrics

| Metric | Formula Intuition | Best For |
| BLEU | Precision of n-gram overlap with reference | Machine translation |
| ROUGE-L | Longest common subsequence with reference | Summarization |
| BERTScore | Cosine similarity of BERT embeddings per token | Semantic similarity |
| METEOR | Harmonic mean of precision & recall with synonyms | Translation (improved) |
N-gram metrics are poor evaluators of LLMs. BLEU and ROUGE only measure surface-level overlap. A paraphrase that perfectly captures the meaning but uses different words scores poorly. Modern LLM evaluation increasingly relies on human evaluation and LLM-as-judge methods (using a stronger model to evaluate a weaker model's outputs).

LLM Benchmarks

| Benchmark | What It Tests | Format |
| MMLU | Broad knowledge across 57 subjects | Multiple choice |
| HumanEval / MBPP | Code generation correctness | Function completion → unit tests |
| GSM8K | Grade-school math reasoning | Word problems → numerical answer |
| MATH | Competition-level math | Hard problems → solutions |
| ARC-Challenge | Science reasoning | Multiple choice |
| HellaSwag | Common sense completion | Choose most plausible continuation |
| TruthfulQA | Resistance to common misconceptions | Open-ended generation |
| MT-Bench / Chatbot Arena | Overall conversational quality | LLM-as-judge / human preference |
The Benchmark Contamination Problem: If benchmark data appears in the training corpus, the model has "seen the test." This is increasingly problematic as training datasets grow to encompass most of the internet. Solutions include: dynamic benchmarks that change over time, private held-out test sets, and evaluating on tasks that require reasoning rather than recall.
· · ·
Advanced Topic VIII

Frontier Concepts

Knowledge Distillation

Train a smaller "student" model to mimic a larger "teacher" model by matching the teacher's output probability distribution, not just the hard labels:

Distillation Loss $$\mathcal{L}_{\text{distill}} = (1-\alpha)\,\mathcal{L}_{\text{CE}}(y, p_S) + \alpha\,T^2 \cdot D_{\text{KL}}(p_T^{(T)} \| p_S^{(T)})$$

Where \(p_T^{(T)}\) and \(p_S^{(T)}\) are the teacher and student distributions at temperature \(T\). Higher temperature makes the distributions softer, revealing more structure (the relative probabilities of wrong answers contain information about similarity between classes).
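The loss combines two softmaxes at different temperatures. A numpy sketch for a single example (logit vectors rather than a batch):

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max()               # numerical stability
    e = np.exp(z)
    return e / e.sum()

def distill_loss(student_logits, teacher_logits, y, T=2.0, alpha=0.5):
    """(1-alpha) * hard-label CE + alpha * T^2 * KL(teacher || student) at temperature T."""
    pS = softmax(student_logits, T)
    pT = softmax(teacher_logits, T)
    kl = float(np.sum(pT * (np.log(pT) - np.log(pS))))    # D_KL(p_T^(T) || p_S^(T))
    ce = float(-np.log(softmax(student_logits)[y]))        # CE against the hard label y
    return (1 - alpha) * ce + alpha * T ** 2 * kl
```

The \(T^2\) factor compensates for the \(1/T^2\) shrinkage of the soft-target gradients, keeping the two terms on comparable scales.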

Speculative Decoding

LLM inference is memory-bandwidth-bound, not compute-bound — the GPU sits idle while loading weights from memory. Speculative decoding uses a small, fast "draft" model to generate several candidate tokens, then verifies them all at once with the large model. If the draft model is correct (which it often is for easy tokens), you get multiple tokens from one large-model forward pass.
Acceptance probability $$P(\text{accept}) = \min\!\left(1, \frac{p_{\text{target}}(x)}{p_{\text{draft}}(x)}\right)$$

Tokens to which the target model assigns at least as much probability as the draft model are always accepted. Tokens where the target assigns less are accepted with probability equal to the ratio (and on rejection, a replacement token is resampled from a corrected distribution). This guarantees the output distribution is identical to the target model — speculative decoding is an exact optimization, not an approximation.
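A minimal sketch of the acceptance rule, assuming `p_target` and `p_draft` are the two models' next-token distributions (the rejection-resampling step is omitted):

```python
import numpy as np

def accept_draft_token(x, p_target, p_draft, rng):
    """Accept drafted token x with probability min(1, p_target[x] / p_draft[x])."""
    ratio = p_target[x] / p_draft[x]
    return bool(rng.random() < min(1.0, ratio))
```

Whenever the target probability meets or exceeds the draft probability, the ratio clips to 1 and the token is accepted unconditionally.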

State-Space Models (SSMs)

Gu & Dao (2023) — "Mamba: Linear-Time Sequence Modeling with Selective State Spaces." Proposed a selective state-space model that processes sequences in \(O(n)\) time (vs \(O(n^2)\) for attention) while matching transformer quality on language tasks up to moderate scale.

The SSM Equation

Continuous State-Space Model $$\mathbf{h}'(t) = \mathbf{A}\mathbf{h}(t) + \mathbf{B}x(t)$$ $$y(t) = \mathbf{C}\mathbf{h}(t) + Dx(t)$$

Discretized for sequences (with step size \(\Delta\)): \(\mathbf{h}_t = \bar{\mathbf{A}}\mathbf{h}_{t-1} + \bar{\mathbf{B}}x_t\). This is a linear recurrence — it can be computed either recurrently (\(O(n)\) sequential) or as a convolution (\(O(n \log n)\) parallel).
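The recurrent form is a short loop. A numpy sketch of the discretized scan for a scalar input channel (already-discretized \(\bar{\mathbf{A}}, \bar{\mathbf{B}}\) assumed; the \(Dx_t\) skip term omitted):

```python
import numpy as np

def ssm_scan(A_bar, B_bar, C, x):
    """Run the discretized recurrence h_t = A_bar h_{t-1} + B_bar x_t, y_t = C h_t."""
    h = np.zeros(A_bar.shape[0])
    ys = []
    for x_t in x:
        h = A_bar @ h + B_bar * x_t     # O(1) state update per token
        ys.append(float(C @ h))
    return np.array(ys)
```

With a 1-dimensional state and \(\bar{A} = 0.5\), an impulse input decays geometrically, the hallmark of a linear recurrence.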

Mamba's key innovation — selective mechanism: Making \(\mathbf{B}\), \(\mathbf{C}\), and \(\Delta\) input-dependent (functions of \(x_t\)) allows the model to selectively remember or forget information, similar to LSTM gates but within the SSM framework. This breaks the time-invariance needed for convolution mode, so Mamba uses a hardware-aware scan algorithm instead.

Constitutional AI (CAI)

Bai et al. (2022) — "Constitutional AI: Harmlessness from AI Feedback." Instead of relying entirely on human feedback for alignment, provide the model with a set of principles ("constitution") and use AI-generated feedback: the model critiques its own outputs according to the constitution, then revises them.

The RLAIF pipeline: (1) Generate responses. (2) Ask the model to critique responses against principles. (3) Ask the model to revise responses. (4) Train a preference model on (original, revised) pairs. (5) Use this as the reward model in standard RL training.

· · ·
Reference

Complete Exercise Solutions

Solutions to all exercises from all three parts, in order.

Part I — Mathematical Foundations

§1.1 Linear Algebra

(a) \(\mathbf{w}^T\mathbf{x}_1 = 0.5(4) + (-1.0)(2) = 2.0 - 2.0 = 0.0\)
\(\mathbf{w}^T\mathbf{x}_2 = 0.5(1) + (-1.0)(5) = 0.5 - 5.0 = -4.5\)
The weight vector "prefers" \(\mathbf{x}_1\) (score 0.0 vs -4.5) — or more precisely, \(\mathbf{x}_1\) is less negatively aligned with \(\mathbf{w}\). In a classifier context, \(\mathbf{x}_1\) is closer to the positive side of the decision boundary.

(b) \(\mathbf{x}_1 \cdot \mathbf{x}_2 = 4(1) + 2(5) = 14\)
\(\|\mathbf{x}_1\| = \sqrt{16 + 4} = \sqrt{20} \approx 4.47\)
\(\|\mathbf{x}_2\| = \sqrt{1 + 25} = \sqrt{26} \approx 5.10\)
\(\cos\theta = 14 / (4.47 \times 5.10) \approx 14 / 22.8 \approx 0.614\)
Moderately similar — the angle between them is about 52°.

§1.3 Optimization

(a) \(f(\theta) = (\theta - 3)^2\), so \(\nabla f = 2(\theta - 3)\).
Step 0: \(\theta_0 = 0\), gradient = \(2(0-3) = -6\), update: \(\theta_1 = 0 - 0.1(-6) = 0.6\)
Step 1: \(\theta_1 = 0.6\), gradient = \(2(0.6-3) = -4.8\), update: \(\theta_2 = 0.6 + 0.48 = 1.08\)
Step 2: \(\theta_2 = 1.08\), gradient = \(2(1.08-3) = -3.84\), update: \(\theta_3 = 1.08 + 0.384 = 1.464\)
Moving toward the minimum at \(\theta = 3\), but slowly with \(\eta = 0.1\).

(b) \(y = \sigma(wx + b)\). Let \(z = wx + b\).
\(\frac{\partial y}{\partial w} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial w} = \sigma(z)(1 - \sigma(z)) \cdot x\)
Or substituting: \(\frac{\partial y}{\partial w} = y(1-y) \cdot x\).

Part I — Core Machine Learning

§2.2 Logistic Regression

(a) \(z = w_{\text{free}} x_{\text{free}} + w_{\text{meeting}} x_{\text{meeting}} + b\)
\(z = 2.0(1) + (-1.5)(1) + (-0.5) = 2.0 - 1.5 - 0.5 = 0.0\)
\(P(\text{spam}) = \sigma(0.0) = 0.5\).
The model is perfectly uncertain — the positive signal from "free" is exactly cancelled by "meeting" and the bias.

(b) With \(y = 1\) and \(\hat{p} = 0.5\):
\(\mathcal{L} = -[1 \cdot \log(0.5) + 0 \cdot \log(0.5)] = -\log(0.5) = \ln 2 \approx 0.693\)
This is the loss at maximum uncertainty — note it is not the worst case: a confidently wrong prediction (\(\hat{p} \to 0\) with \(y = 1\)) drives the loss toward infinity.

(c) \(\frac{\partial \mathcal{L}}{\partial w_{\text{free}}} = (\hat{p} - y) \cdot x_{\text{free}} = (0.5 - 1)(1) = -0.5\)
Update: \(w_{\text{free}}' = 2.0 - 0.1(-0.5) = 2.0 + 0.05 = 2.05\)
The weight for "free" increases, making the model slightly more inclined to classify emails with "free" as spam.

Part I — Neural Networks

§3.3 Backpropagation

From the forward pass: \(a^{[2]} = 0.59\), \(y = 1\), \(\mathbf{a}^{[1]} = [0.25, 0.40]^T\), \(\mathbf{x} = [1, 0.5]^T\).
\(\mathbf{W}^{[2]} = [0.5, 0.6]\).

(a) \(\delta^{[2]} = 0.59 - 1 = -0.41\)

(b) \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} \cdot (\mathbf{a}^{[1]})^T = -0.41 \times [0.25, 0.40] = [-0.1025, -0.164]\)

(c) \(\boldsymbol{\delta}^{[1]} = (\mathbf{W}^{[2]})^T \delta^{[2]} \odot \text{ReLU}'(\mathbf{z}^{[1]})\)
\(= \begin{bmatrix}0.5\\0.6\end{bmatrix} \times (-0.41) \odot \begin{bmatrix}1\\1\end{bmatrix} = \begin{bmatrix}-0.205\\-0.246\end{bmatrix}\)

(d) With \(\eta = 0.1\):
\(W_1^{[2]}: 0.5 - 0.1(-0.1025) = 0.5103\)
\(W_2^{[2]}: 0.6 - 0.1(-0.164) = 0.6164\)
\(W_{11}^{[1]}: 0.1 - 0.1(-0.205)(1) = 0.1205\)
\(W_{12}^{[1]}: 0.3 - 0.1(-0.205)(0.5) = 0.3103\)
\(W_{21}^{[1]}: 0.2 - 0.1(-0.246)(1) = 0.2246\)
\(W_{22}^{[1]}: 0.4 - 0.1(-0.246)(0.5) = 0.4123\)
All weights increase (since \(\delta < 0\) means the model under-predicted), pushing the output closer to 1.

Part I — Deep Learning Architectures

§4.2 RNNs & LSTMs

(a) If forget gate \(\mathbf{f}_t = \mathbf{1}\) and input gate \(\mathbf{i}_t = \mathbf{0}\):
\(\mathbf{c}_t = \mathbf{1} \odot \mathbf{c}_{t-1} + \mathbf{0} \odot \tilde{\mathbf{c}}_t = \mathbf{c}_{t-1}\)
The cell state is perfectly preserved — nothing is forgotten, nothing new is added. This is the "information highway" mode.

Reverse (\(\mathbf{f}_t = \mathbf{0}\), \(\mathbf{i}_t = \mathbf{1}\)):
\(\mathbf{c}_t = \mathbf{0} \odot \mathbf{c}_{t-1} + \mathbf{1} \odot \tilde{\mathbf{c}}_t = \tilde{\mathbf{c}}_t\)
Previous memory is completely erased and replaced with the new candidate. This is a "reset" mode.

(b) An additive update means \(\frac{\partial \mathbf{c}_t}{\partial \mathbf{c}_{t-1}} = \mathbf{f}_t\) (just the forget gate, an element-wise scaling). If \(\mathbf{f}_t \approx 1\), the gradient flows through unchanged — no vanishing!
A multiplicative update like \(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} \odot \tilde{\mathbf{c}}_t\) would make the gradient a product of many terms, leading to vanishing/exploding. The additive structure is the key insight.

Part I — NLP Foundations

§5.1 Word Embeddings

(a) The equation \(\vec{\text{king}} - \vec{\text{man}} + \vec{\text{woman}} \approx \vec{\text{queen}}\) implies that semantic relationships are encoded as linear directions in the embedding space. Specifically:
\(\vec{\text{king}} - \vec{\text{queen}} \approx \vec{\text{man}} - \vec{\text{woman}}\)
This means there exists a "gender direction" that is approximately consistent across different word pairs. Individual dimensions don't correspond to single concepts — rather, directions (linear combinations of dimensions) encode semantic attributes like gender, royalty, plurality, tense, etc.

(b) Using the same matrix for center and context embeddings would force \(\mathbf{v}_w^T \mathbf{v}_w\) (the self-similarity) to always be the squared norm — it can't be negative. This means the model can't learn that a word is a poor predictor of itself as context (which is usually the case — "the" rarely predicts another "the" nearby). Two separate matrices give more flexibility: the center and context roles are genuinely different, and having separate embeddings lets the model capture these different roles.

Part I — Attention & Transformers

§6.4 Transformer Architecture

(a) With \(d = 256\) and \(h = 4\) heads: \(d_k = d_v = d/h = 64\).
Each \(\mathbf{W}_i^Q, \mathbf{W}_i^K, \mathbf{W}_i^V \in \mathbb{R}^{256 \times 64}\).

(b) Causal mask for sequence length 4:
\(\mathbf{M} = \begin{bmatrix}0 & -\infty & -\infty & -\infty\\0 & 0 & -\infty & -\infty\\0 & 0 & 0 & -\infty\\0 & 0 & 0 & 0\end{bmatrix}\)
After softmax, row \(i\) of the attention matrix is a probability distribution over positions \(0, 1, \ldots, i\). Each row represents: "given that I'm at position \(i\), how much should I attend to each past position (including myself)?"

(c) Residual connections let the gradient flow directly through the identity path: \(\frac{\partial (\mathbf{x} + f(\mathbf{x}))}{\partial \mathbf{x}} = \mathbf{I} + \frac{\partial f}{\partial \mathbf{x}}\). Even if \(\frac{\partial f}{\partial \mathbf{x}}\) is small (vanishing), the identity term \(\mathbf{I}\) keeps the layer's Jacobian close to the identity, so the gradient cannot vanish through that layer. Without residual connections, a 96-layer transformer would be effectively untrainable due to vanishing gradients; with them, gradients can flow directly from the loss to any layer.
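A scalar caricature of this effect (the 0.1 local derivative is a made-up stand-in for a "vanishing" layer; real Jacobians are matrices):

```python
# Depth-96 chain of layers whose local derivative is a small constant 0.1.
# Without residuals, the end-to-end gradient is the product 0.1**96;
# with residuals, each factor becomes (1 + 0.1), so the signal survives.
depth = 96
plain = 0.1 ** depth      # numerically zero for training purposes
residual = 1.1 ** depth   # large: the identity path keeps gradients alive
print(plain, residual)
```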

Part I โ€” Modern LLMs

ยง7.4โ€“7.5 MoE and RLHF

(a) With 64 experts and top-2 routing, \(2/64 = 3.125\%\) of the expert (FFN) parameters are active per token; the shared components (attention, embeddings, the router itself) always run. This is beneficial because: (1) Training: total model capacity is huge (enabling more knowledge storage), but each gradient update is cheap (only 2 experts plus the shared components receive gradients). (2) Inference: each forward pass only needs to load and compute 2 experts' worth of FFN weights, dramatically reducing latency and memory-bandwidth requirements.
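Top-2 routing itself is a few lines. This is a generic sketch of top-k gating (argsort, then softmax over the two selected logits), not the exact router of any particular MoE paper:

```python
import numpy as np

def top2_route(logits):
    """Pick 2 of E experts per token and renormalize their gate weights.
    Minimal top-k gating sketch; real routers add load-balancing losses."""
    idx = np.argsort(logits, axis=-1)[:, -2:]             # (tokens, 2) expert ids
    top = np.take_along_axis(logits, idx, axis=-1)        # their logits
    gates = np.exp(top) / np.exp(top).sum(-1, keepdims=True)
    return idx, gates

E, tokens = 64, 3
logits = np.random.default_rng(0).standard_normal((tokens, E))
idx, gates = top2_route(logits)
print(idx.shape)       # (3, 2): each token is sent to 2 of 64 experts
print(gates.sum(-1))   # each token's two gate weights sum to 1
```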

(b) If \(\beta \to 0\): the KL constraint vanishes. The model optimizes reward without restriction, likely leading to "reward hacking" โ€” finding degenerate outputs that score high on the imperfect reward model but are nonsensical or harmful.
If \(\beta \to \infty\): the KL term dominates. The model is forced to stay extremely close to the reference policy and barely changes. Alignment has almost no effect.
Good values of \(\beta\) balance these extremes.

(c) A model trained only on next-token prediction learns to predict what text looks like on the internet โ€” including toxic, biased, unhelpful, or harmful text. If asked "How do I hotwire a car?", the pretrained model predicts the most likely continuation, which is a direct answer. RLHF teaches the model that the preferred response is to decline or redirect. More generally, RLHF shifts the model from "what would the internet say?" to "what would a helpful, harmless, honest assistant say?"

Part III โ€” Advanced Topics

ยงI Efficient Attention

(a) \(\mathbf{QK}^T\) for sequence length \(n = 32768\), \(d_k = 128\):
FLOPs: \(2 \times n \times n \times d_k = 2 \times 32768^2 \times 128 \approx 2.75 \times 10^{11}\) FLOPs per head.
Memory for attention matrix in FP16: \(n^2 \times 2\) bytes = \(32768^2 \times 2 = 2.15\) GB per head. With 32 heads, that's ~69 GB just for attention matrices โ€” clearly the bottleneck.

(b) Sliding window (\(W = 4096\)): each token only attends to \(W\) tokens instead of \(n\).
FLOPs ratio: \(W/n = 4096/32768 = 12.5\%\) of the full attention cost.
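The arithmetic in parts (a) and (b) re-derived in a few lines:

```python
# Cost of the n x n attention matrix at n = 32768, d_k = 128, 32 heads
n, d_k, heads = 32768, 128, 32

flops_per_head = 2 * n * n * d_k          # multiply-accumulate for QK^T
mem_per_head_gb = n * n * 2 / 1e9         # FP16 = 2 bytes per entry
print(f"{flops_per_head:.3e} FLOPs")                      # ~2.75e11
print(f"{mem_per_head_gb:.2f} GB/head, "
      f"{heads * mem_per_head_gb:.0f} GB total")          # ~2.15 GB, ~69 GB

W = 4096
print(W / n)  # 0.125 -> sliding window costs 12.5% of full attention
```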

(c) Flash Attention computes the exact same \(\mathbf{QK}^T\) and \(\text{softmax}(...)\mathbf{V}\) โ€” it doesn't reduce the number of arithmetic operations. It reduces memory IO: by tiling the computation to fit in fast SRAM, it avoids the expensive round-trip of writing the \(n \times n\) attention matrix to slow HBM and reading it back. The speedup comes entirely from better memory access patterns, not fewer FLOPs. (In practice, modern GPUs are memory-bandwidth-bound for attention, so reducing IO is more impactful than reducing FLOPs.)
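The tiling trick rests on the online softmax: keep a running row-max \(m\) and normalizer \(l\), rescale them whenever a new tile raises the max, and never form the \(n \times n\) matrix. The sketch below is a single-threaded NumPy illustration of that recurrence (real Flash Attention also tiles \(\mathbf{Q}\) and fuses everything into one GPU kernel); it produces exactly the same output as the naive version:

```python
import numpy as np

def attention_full(Q, K, V):
    S = Q @ K.T / np.sqrt(Q.shape[-1])
    P = np.exp(S - S.max(-1, keepdims=True))
    return (P / P.sum(-1, keepdims=True)) @ V

def attention_tiled(Q, K, V, block=16):
    """Process K/V in tiles with running max m and normalizer l,
    never materializing the full (n, n) score matrix."""
    n, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros_like(V, dtype=float)
    m = np.full(n, -np.inf)              # running row max
    l = np.zeros(n)                      # running softmax denominator
    for j in range(0, K.shape[0], block):
        S = Q @ K[j:j + block].T * scale     # only an (n, block) tile
        m_new = np.maximum(m, S.max(-1))
        corr = np.exp(m - m_new)             # rescale previous statistics
        P = np.exp(S - m_new[:, None])
        l = l * corr + P.sum(-1)
        O = O * corr[:, None] + P @ V[j:j + block]
        m = m_new
    return O / l[:, None]

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((64, 8)) for _ in range(3))
assert np.allclose(attention_full(Q, K, V), attention_tiled(Q, K, V))
```

The exactness is the point: unlike sliding-window or linear attention, nothing is approximated; only the order of memory accesses changes.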

ยงIV LLM Training Pipeline

(a) \(\text{Parameters} \approx 12 \times L \times d^2 = 12 \times 32 \times 4096^2 = 12 \times 32 \times 16.78M = 6.44B\).
LLaMA-7B has ~6.7B parameters, so our estimate of 6.44B is very close. The \(12Ld^2\) formula counts the 4 attention projections (\(Q, K, V, O\), each \(d \times d\), giving \(4d^2\)) and the 2 FFN projections (\(d \times 4d\) and \(4d \times d\), giving \(8d^2\)); biases and layer norms are small enough to neglect. The slight undercount is mainly because the formula ignores the embedding layer (\(V \times d\), where \(V\) is the vocabulary size).
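The estimate broken into its components:

```python
# Checking the 12 * L * d^2 estimate term by term
L_layers, d = 32, 4096
attn = 4 * d * d          # Q, K, V, O projections: 4 d^2
ffn = 2 * d * (4 * d)     # d -> 4d and 4d -> d projections: 8 d^2
per_layer = attn + ffn    # = 12 d^2
total = L_layers * per_layer
print(total / 1e9)        # ~6.44 (billion), close to LLaMA-7B's ~6.7B
```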

(b) Gradient updates = total tokens / batch size = \(2 \times 10^{12} / (4 \times 10^6) = 500{,}000\) steps.
At typical training throughput, this might take 20โ€“30 days on a 1024-GPU cluster.

(c) BF16 has 8 exponent bits (same as FP32) but only 7 mantissa bits (vs 23 for FP32 and 10 for FP16). FP16 has only 5 exponent bits, giving a range of \([6 \times 10^{-8}, 65504]\). Gradients often exceed 65504 (overflow โ†’ NaN) or fall below \(6 \times 10^{-8}\) (underflow โ†’ 0). BF16's range is \([10^{-38}, 3.4 \times 10^{38}]\), matching FP32. You lose some precision, but precision matters less than range for gradient accumulation.
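FP16's narrow exponent range is easy to observe. Stock NumPy has no BF16 dtype, so FP32 stands in for it below (the two share the same 8-bit exponent, hence the same range):

```python
import numpy as np

x = np.float16(70000.0)   # above FP16's max of 65504 -> overflows to inf
y = np.float16(1e-8)      # below FP16's ~6e-8 floor  -> underflows to 0
z = np.float32(70000.0)   # fine: FP32/BF16 range reaches ~3.4e38
print(x, y, z)
```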
ยท ยท ยท

What's Next?

You now have a complete mathematical and conceptual foundation spanning from dot products to frontier LLM architectures. The path forward is hands-on:

1. Implement. Build a transformer from scratch in PyTorch. Karpathy's nanoGPT and minbpe are the gold standard starting points. The gap between understanding the math and writing the code is where real understanding forms.

2. Read papers. Start with the annotated reading list from Part I. With the mathematical foundation from these three volumes, you should be able to follow the key equations in any modern NLP paper. When you encounter unfamiliar notation, map it back to the concepts here.

3. Experiment. Fine-tune an open-weight model (Qwen, LLaMA, Mistral) with LoRA on a task you care about. Set up a RAG pipeline. Run ablations โ€” change one thing at a time and observe the effect. The intuition you build from experimentation is irreplaceable.

4. Stay current. Follow ArXiv (cs.CL, cs.LG), read ML Twitter/X threads, watch conference talks (NeurIPS, ICML, ACL, EMNLP). The field moves fast, but with solid foundations, new papers become incremental extensions of what you already know.

"The only way to learn mathematics is to do mathematics." โ€” Paul Halmos

The same is true for machine learning. The equations in these pages are not the destination โ€” they're the map. The territory is in the code, the experiments, and the papers.