A Complete Mathematical Journey

Machine Learning from First Principles
to Modern NLP

From vectors and probabilities through neural networks to transformers and large language models.
Every concept built from intuition → derivation → example → application.

Contents

  1. Mathematical Foundations
    • Linear Algebra Essentials · Probability & Statistics · Optimization Basics
  2. Core Machine Learning
    • Linear Regression · Logistic Regression · Loss Functions · Gradient Descent
  3. Neural Networks
    • The Perceptron · Multi-Layer Networks · Backpropagation Derivation · Activation Functions
  4. Deep Learning Architectures
    • CNNs · RNNs & LSTMs · Sequence-to-Sequence Models
  5. NLP Foundations
    • Word Embeddings · Word2Vec Derivation · GloVe · Subword Tokenization
  6. Attention & Transformers
    • Attention Mechanism · Self-Attention · Multi-Head Attention · Full Transformer Architecture
  7. Modern Large Language Models
    • GPT Architecture · Training Objectives · Scaling Laws · MoE & DeepSeek · RLHF
  8. Pattern Detection — A Unified View
    • Geometric Interpretation · Probabilistic Interpretation · How Models "See" Patterns
  9. Annotated Reading List & Roadmap
Part I

Mathematical Foundations

We only cover the math you actually need. Three pillars support all of machine learning: linear algebra gives us the language to describe data and transformations; probability gives us the framework to reason under uncertainty; optimization gives us the engine to learn.

1.1   Linear Algebra Essentials

Think of data as points in space. A single house can be described by its square footage and number of bedrooms — that's a point in 2D space. An image of 784 pixels lives in 784-dimensional space. All of ML is about finding structure (lines, planes, curved surfaces) in these high-dimensional spaces.

Vectors

A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered list of \(n\) numbers. It represents a single data point. For example, a house with 1500 sqft and 3 bedrooms is \(\mathbf{x} = [1500, 3]^T\).

Dot product — the most important operation in ML

The dot product of two vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^n\) is:

Definition — Dot Product $$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$$

The first form is algebraic: multiply corresponding elements and sum. The second is geometric: the product of their lengths scaled by the cosine of the angle between them.

Why this matters everywhere in ML: The dot product measures similarity. When \(\theta = 0\) (same direction), cosine is 1 and the dot product is maximized. When \(\theta = 90°\) (orthogonal), the dot product is 0. When \(\theta = 180°\) (opposite), it's negative. Attention mechanisms, linear regression, neural network layers — all are built on dot products.

Let \(\mathbf{a} = [1, 2, 3]\), \(\mathbf{b} = [4, 5, 6]\).
\(\mathbf{a} \cdot \mathbf{b} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 4 + 10 + 18 = 32\).
\(\|\mathbf{a}\| = \sqrt{1+4+9} = \sqrt{14} \approx 3.74\),   \(\|\mathbf{b}\| = \sqrt{16+25+36} = \sqrt{77} \approx 8.77\).
\(\cos\theta = 32 / (3.74 \times 8.77) \approx 0.975\) — the vectors are very aligned.
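
These numbers are easy to check in code; a minimal NumPy sketch of the example above:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = a @ b                                          # algebraic form: 32.0
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # geometric form: cos(theta)
print(dot, round(cos, 3))                            # 32.0 0.975
```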

Matrices and Linear Transformations

A matrix \(\mathbf{W} \in \mathbb{R}^{m \times n}\) transforms a vector from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). Every layer of a neural network does this: \(\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}\).

Matrix-Vector Product $$y_i = \sum_{j=1}^{n} W_{ij} x_j + b_i \qquad \text{for } i = 1, \ldots, m$$

Each row of \(\mathbf{W}\) is a "detector" — its dot product with \(\mathbf{x}\) measures how much \(\mathbf{x}\) matches that pattern. A neural network learns what patterns to detect by adjusting \(\mathbf{W}\).
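
The "row as detector" view can be made concrete: below, the first row of \(\mathbf{W}\) detects the pattern "feature 1 high, feature 2 low" (a toy sketch with invented numbers):

```python
import numpy as np

W = np.array([[ 1.0, -1.0],    # detector for "x1 high, x2 low"
              [ 0.0,  1.0]])   # detector for "x2 high"
b = np.zeros(2)

x = np.array([3.0, 0.5])
y = W @ x + b                  # each y_i is a dot product: row_i . x + b_i
print(y)                       # [2.5 0.5] — the first detector fires strongly
```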

Key operations summary

| Operation | Notation | ML Role |
| --- | --- | --- |
| Dot product | \(\mathbf{a}^T \mathbf{b}\) | Similarity, attention scores, linear predictions |
| Matrix multiply | \(\mathbf{W}\mathbf{x}\) | Linear transformations / neural network layers |
| Transpose | \(\mathbf{A}^T\) | Switching rows ↔ columns, computing \(\mathbf{X}^T\mathbf{X}\) |
| Norm | \(\lVert\mathbf{x}\rVert = \sqrt{\sum x_i^2}\) | Regularization, normalization, distance |
| Outer product | \(\mathbf{a}\mathbf{b}^T\) | Rank-1 updates, attention value computation |

Exercise.
(a) Given \(\mathbf{w} = [0.5, -1.0]\), \(\mathbf{x}_1 = [4, 2]\), \(\mathbf{x}_2 = [1, 5]\). Compute \(\mathbf{w}^T\mathbf{x}_1\) and \(\mathbf{w}^T\mathbf{x}_2\). Which data point does the weight vector "prefer"?
(b) Compute the cosine similarity between \(\mathbf{x}_1\) and \(\mathbf{x}_2\). Are they similar or dissimilar?

1.2   Probability & Statistics

ML models live in a world of uncertainty. A spam classifier doesn't say "this IS spam" — it says "the probability of spam is 0.97." Probability gives us the rigorous framework to make and update beliefs under uncertainty.

Bayes' Theorem — The Engine of Inference

Bayes' Theorem $$P(H \mid D) = \frac{P(D \mid H) \, P(H)}{P(D)}$$

Each term has a name and role:

| Term | Name | Meaning |
| --- | --- | --- |
| \(P(H \mid D)\) | Posterior | Updated belief about hypothesis \(H\) after seeing data \(D\) |
| \(P(D \mid H)\) | Likelihood | How probable the data is if hypothesis \(H\) is true |
| \(P(H)\) | Prior | Initial belief before seeing data |
| \(P(D)\) | Evidence | Total probability of the data (normalizer) |

Example. A medical test for a disease is 99% accurate (both sensitivity and specificity). The disease prevalence is 1%. You test positive — what's the probability you actually have the disease?

\(P(\text{disease} \mid +) = \frac{P(+ \mid \text{disease}) \cdot P(\text{disease})}{P(+)} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \frac{0.0099}{0.0198} = 0.5\)

Only 50%! The low prevalence (prior) dramatically affects the posterior. This is why ML models need to account for class imbalance.
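
A small helper function (the name is ours, for illustration) makes the base-rate effect easy to explore:

```python
def posterior_given_positive(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    # P(+) = P(+|disease) P(disease) + P(+|healthy) P(healthy)
    evidence = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / evidence

print(posterior_given_positive(0.99, 0.99, 0.01))   # 0.5 — the worked example
print(posterior_given_positive(0.99, 0.99, 0.10))   # a 10% prior lifts the posterior to ~0.92
```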

Key Distributions

Gaussian (Normal): \(p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\). Maximizing the likelihood of Gaussian data leads directly to minimizing squared error — which is why least squares regression is so fundamental.

Bernoulli: \(P(y=1) = p\), \(P(y=0) = 1-p\). The basis of binary classification. Maximizing Bernoulli likelihood leads to the cross-entropy loss.

Categorical / Softmax: Generalizes Bernoulli to \(K\) classes. \(P(y = k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}\). This is the softmax function — it appears in logistic regression, neural network outputs, and attention mechanisms.

Maximum Likelihood Estimation (MLE)

Given data \(\mathcal{D} = \{x_1, \ldots, x_N\}\), find the parameters \(\theta\) that maximize the probability of observing this data:

MLE Principle $$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{N} p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^{N} \log p(x_i \mid \theta)$$

We take the log because: (1) products become sums (easier to work with), (2) log is monotonic so the maximum doesn't change, (3) it avoids numerical underflow from multiplying many small probabilities.

MLE is the unifying principle behind almost all ML training. Linear regression minimizes squared error (= MLE under Gaussian noise). Logistic regression minimizes cross-entropy (= MLE under Bernoulli). Neural network training with cross-entropy loss is also MLE. When you understand MLE, you understand why we use these specific loss functions.
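
The MLE principle can be seen numerically: for Gaussian data with known variance, the log-likelihood peaks at the sample mean. A small sketch with made-up data:

```python
import numpy as np

data = np.array([2.1, 1.9, 2.4, 2.0, 1.6])

def log_likelihood(mu, x, sigma=1.0):
    # Sum of log Gaussian densities — the log turns the product into a sum
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Scan candidate means: the maximizer coincides with the sample mean,
# which is the closed-form Gaussian MLE.
grid = np.linspace(0, 4, 4001)
best = grid[np.argmax([log_likelihood(m, data) for m in grid])]
print(best, data.mean())   # both ≈ 2.0
```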

1.3   Optimization Basics

Imagine you're blindfolded on a hilly landscape, and you want to find the lowest valley. You can feel the slope under your feet. The strategy: always step in the direction that goes most steeply downhill. That's gradient descent.

Gradient — the Direction of Steepest Ascent

For a function \(f(\theta_1, \theta_2, \ldots, \theta_d)\), the gradient is the vector of all partial derivatives:

Gradient $$\nabla_\theta f = \left[\frac{\partial f}{\partial \theta_1}, \frac{\partial f}{\partial \theta_2}, \ldots, \frac{\partial f}{\partial \theta_d}\right]^T$$

The gradient points in the direction of steepest increase. To minimize a loss function, we go in the opposite direction.

Gradient Descent Update Rule

Update Rule $$\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_\theta \mathcal{L}(\theta^{(t)})$$

Where \(\eta\) is the learning rate — the step size. Too large → you overshoot the minimum and diverge. Too small → you converge painfully slowly. This single hyperparameter is arguably the most important in all of deep learning.
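
The learning-rate trade-off is easy to demonstrate on a toy quadratic, \(f(\theta) = (\theta - 3)^2\) (a sketch of the update rule above):

```python
def gradient_descent(grad, theta, eta, steps):
    for _ in range(steps):
        theta = theta - eta * grad(theta)   # the update rule above
    return theta

# f(theta) = (theta - 3)^2 has gradient 2*(theta - 3) and minimum at theta = 3
grad = lambda t: 2.0 * (t - 3.0)
theta_small = gradient_descent(grad, 0.0, eta=0.1, steps=100)   # converges to ~3
theta_large = gradient_descent(grad, 0.0, eta=1.5, steps=5)     # overshoots and diverges
print(theta_small, theta_large)   # ≈ 3.0 and 99.0
```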

Stochastic Gradient Descent (SGD)

Computing the gradient over all \(N\) data points is expensive. SGD approximates the true gradient using a random subset (mini-batch) of size \(B\):

$$\nabla_\theta \mathcal{L} \approx \frac{1}{B} \sum_{i \in \text{batch}} \nabla_\theta \mathcal{L}_i(\theta)$$

This is noisier but much faster, and the noise actually helps escape shallow local minima. Modern deep learning almost exclusively uses mini-batch SGD or its adaptive variants (Adam, AdamW).

The Chain Rule — Foundation of Backpropagation

If \(y = f(g(x))\), then:

Chain Rule $$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

This extends to arbitrarily long chains — which is exactly what a deep neural network is. Backpropagation is nothing more than the systematic application of the chain rule through the computational graph.

Exercise.
(a) Let \(f(\theta) = (\theta - 3)^2\). Compute \(\nabla_\theta f\). Starting at \(\theta_0 = 0\), perform 3 steps of gradient descent with \(\eta = 0.1\). Where do you end up?
(b) Let \(y = \sigma(w \cdot x + b)\) where \(\sigma(z) = 1/(1 + e^{-z})\). Using the chain rule, compute \(\frac{\partial y}{\partial w}\).
· · ·
Part II

Core Machine Learning

2.1   Linear Regression — Full Derivation

You're a real-estate agent. You notice that bigger houses cost more. You want to draw the "best fitting line" through the data so you can predict the price of any new house. Linear regression finds that line by minimizing the total prediction error.

The Model

We assume a linear relationship between input features \(\mathbf{x}\) and output \(y\):

Linear Model $$\hat{y} = \mathbf{w}^T \mathbf{x} + b = \sum_{j=1}^{d} w_j x_j + b$$

Where \(\mathbf{w} \in \mathbb{R}^d\) are weights (how much each feature matters), \(b\) is the bias (the baseline prediction when all features are 0), and \(\hat{y}\) is the prediction.

The Loss Function — Why Squared Error?

We assume each observation has Gaussian noise: \(y_i = \mathbf{w}^T \mathbf{x}_i + b + \epsilon_i\) where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\).

The likelihood of all data is:

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i - b)^2}{2\sigma^2}\right)$$

Taking the negative log likelihood (since we minimize):

Mean Squared Error (from MLE) $$\mathcal{L}(\mathbf{w}, b) = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i - b)^2$$

The constant terms from the Gaussian vanish because they don't depend on \(\mathbf{w}\). MSE loss is not an arbitrary choice — it's the MLE-optimal loss when noise is Gaussian.

Closed-Form Solution (Normal Equation)

Setting the gradient to zero and solving (absorbing \(b\) into \(\mathbf{w}\) by adding a column of 1s to \(\mathbf{X}\)):

Normal Equation $$\hat{\mathbf{w}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

Derivation: Write the loss in matrix form: \(\mathcal{L} = \frac{1}{N}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})\). Expand, take the gradient w.r.t. \(\mathbf{w}\), set to zero:

$$\nabla_\mathbf{w} \mathcal{L} = -\frac{2}{N}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0 \implies \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$

Gradient Descent Solution

For large datasets where the matrix inverse is expensive, we use gradient descent instead:

Gradient Update for Linear Regression $$w_j \leftarrow w_j - \eta \cdot \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) \cdot x_{ij}$$

Read this equation: The update to weight \(w_j\) is proportional to the average of (prediction error) × (the corresponding feature value). If the model over-predicts (\(\hat{y}_i > y_i\)) and feature \(x_{ij}\) is positive, we decrease \(w_j\).

Worked example. Data: \((x=1, y=2)\), \((x=2, y=4)\), \((x=3, y=5)\). Model: \(\hat{y} = wx + b\).

Normal equation (with \(\mathbf{X}\) augmented with 1s column):
\(\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3\end{bmatrix}\),   \(\mathbf{y} = \begin{bmatrix}2\\4\\5\end{bmatrix}\)

\(\mathbf{X}^T\mathbf{X} = \begin{bmatrix}3 & 6\\6 & 14\end{bmatrix}\),   \(\mathbf{X}^T\mathbf{y} = \begin{bmatrix}11\\25\end{bmatrix}\)

\(\hat{\mathbf{w}} = \begin{bmatrix}3 & 6\\6 & 14\end{bmatrix}^{-1}\begin{bmatrix}11\\25\end{bmatrix} = \frac{1}{6}\begin{bmatrix}14 & -6\\-6 & 3\end{bmatrix}\begin{bmatrix}11\\25\end{bmatrix} = \begin{bmatrix}2/3 \\ 3/2\end{bmatrix}\)

So \(b \approx 0.67\), \(w = 1.5\). The best line is \(\hat{y} = 1.5x + 0.67\).
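
As a sanity check, the normal equation can be evaluated directly with NumPy:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # column of 1s absorbs the bias
y = np.array([2.0, 4.0, 5.0])

# Solve X^T X w = X^T y rather than forming the inverse explicitly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # ≈ [0.667 1.5] → b ≈ 0.67, w = 1.5
```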

2.2   Logistic Regression — Full Derivation

Now you're classifying emails as spam or not-spam. The output should be a probability between 0 and 1. A linear function \(\mathbf{w}^T\mathbf{x}\) can produce any real number, so we squeeze it through the sigmoid function to get a probability.

The Sigmoid Function

Sigmoid $$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$

Properties: \(\sigma(0) = 0.5\), \(\lim_{z \to \infty} \sigma(z) = 1\), \(\lim_{z \to -\infty} \sigma(z) = 0\). Its derivative has a beautiful form: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) — which simplifies gradient calculations enormously.
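
The derivative identity can be verified numerically against finite differences (a quick sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(z) * (1.0 - sigmoid(z))                   # sigma'(z) identity
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6     # central differences
print(np.max(np.abs(analytic - numeric)))                    # tiny — the identity holds
```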

The Model

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

Loss Function — Deriving Cross-Entropy from MLE

Each label follows a Bernoulli: \(P(y_i \mid \mathbf{x}_i) = \hat{p}_i^{y_i}(1 - \hat{p}_i)^{1 - y_i}\) where \(\hat{p}_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b)\).

Negative log-likelihood:

Binary Cross-Entropy Loss $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)\right]$$

Interpreting each term: When \(y_i = 1\), only the first term survives — it penalizes low \(\hat{p}_i\). When \(y_i = 0\), only the second term survives — it penalizes high \(\hat{p}_i\). The loss is 0 when predictions perfectly match labels.

Gradient Derivation

The gradient turns out to be strikingly elegant:

Gradient of Cross-Entropy w.r.t. Weights $$\frac{\partial \mathcal{L}}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N} (\hat{p}_i - y_i) \cdot x_{ij}$$

Derivation sketch: Using \(\frac{\partial}{\partial z}[-y\log\sigma(z) - (1-y)\log(1-\sigma(z))] = \sigma(z) - y\), and then applying the chain rule \(\frac{\partial z}{\partial w_j} = x_j\).

Notice this has exactly the same form as the linear regression gradient! The error signal \((\hat{p}_i - y_i)\) gets multiplied by the input feature \(x_{ij}\). This is not a coincidence — both are members of the generalized linear model family.
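
A minimal training loop makes the shared gradient form concrete; the toy data below is invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data: class 1 when feature 1 dominates feature 2
X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    p = sigmoid(X @ w + b)
    err = p - y                        # the (p_i - y_i) error signal
    w -= eta * (X.T @ err) / len(y)    # gradient from the boxed formula
    b -= eta * err.mean()

print(sigmoid(X @ w + b).round(2))     # predictions approach the labels [1, 1, 0, 0]
```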

Geometric Interpretation — The Decision Boundary

Logistic regression defines a linear decision boundary: the hyperplane where \(\mathbf{w}^T\mathbf{x} + b = 0\) (i.e., \(\hat{p} = 0.5\)). On one side, \(\hat{p} > 0.5\) (predict class 1); on the other, \(\hat{p} < 0.5\) (predict class 0). The weight vector \(\mathbf{w}\) is perpendicular to this boundary and points toward the class-1 side.

Exercise.
(a) A spam classifier has \(w_{\text{free}} = 2.0\), \(w_{\text{meeting}} = -1.5\), \(b = -0.5\). For an email with features \(x_{\text{free}} = 1\), \(x_{\text{meeting}} = 1\), compute \(P(\text{spam})\).
(b) If the true label is \(y = 1\) (spam), compute the binary cross-entropy loss for this example.
(c) Compute the gradient update for \(w_{\text{free}}\) with \(\eta = 0.1\).

2.3   Loss Functions — A Unified View

| Loss | Formula | When to Use | Probabilistic Origin |
| --- | --- | --- | --- |
| MSE | \(\frac{1}{N}\sum(y_i - \hat{y}_i)^2\) | Regression | Gaussian likelihood |
| Binary CE | \(-\frac{1}{N}\sum[y\log\hat{p} + (1-y)\log(1-\hat{p})]\) | Binary classification | Bernoulli likelihood |
| Categorical CE | \(-\frac{1}{N}\sum\sum_{k} y_k \log \hat{p}_k\) | Multi-class | Categorical likelihood |
Every standard loss function is just the negative log-likelihood under a specific probabilistic assumption about the data. Choosing a loss function = choosing your noise model.
· · ·
Part III

Neural Networks

3.1   From Linear Models to Neural Networks

Linear regression can only draw straight lines. But what if the relationship between house size and price is curved? We need a way to bend the line. That's what a neural network does: it stacks linear transformations with nonlinear "activations" to create arbitrarily complex functions.

Why linearity isn't enough

If you compose two linear functions, \(f(\mathbf{x}) = \mathbf{W}_2(\mathbf{W}_1 \mathbf{x}) = (\mathbf{W}_2 \mathbf{W}_1)\mathbf{x}\), you just get another linear function. Stacking linear layers without activations gives you nothing new. The activation function breaks this linearity.

A Single Neuron

Neuron $$a = \phi(\mathbf{w}^T \mathbf{x} + b)$$

Where \(\phi\) is a nonlinear activation function. Common choices:

| Name | Formula | Derivative | Notes |
| --- | --- | --- | --- |
| Sigmoid | \(\frac{1}{1+e^{-z}}\) | \(\sigma(z)(1-\sigma(z))\) | Squashes to (0,1); causes vanishing gradients |
| Tanh | \(\frac{e^z - e^{-z}}{e^z + e^{-z}}\) | \(1 - \tanh^2(z)\) | Squashes to (-1,1); zero-centered |
| ReLU | \(\max(0, z)\) | \(\begin{cases}1 & z > 0\\0 & z \leq 0\end{cases}\) | Simple, fast; default choice for hidden layers |
| GELU | \(z \cdot \Phi(z)\) | smooth approx. | Used in transformers; smooth version of ReLU |

3.2   Multi-Layer Network (MLP) — Forward Pass

An \(L\)-layer network computes:

Forward Pass $$\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$$ $$\mathbf{a}^{[l]} = \phi(\mathbf{z}^{[l]})$$

Where \(\mathbf{a}^{[0]} = \mathbf{x}\) (input), and the final layer's activation depends on the task (sigmoid for binary, softmax for multi-class, identity for regression).

Worked example: a 2-layer network with 2 inputs, 2 ReLU hidden units, and 1 sigmoid output.
Input: \(\mathbf{x} = [1, 0.5]^T\)
\(\mathbf{W}^{[1]} = \begin{bmatrix}0.1 & 0.3\\0.2 & 0.4\end{bmatrix}\), \(\mathbf{b}^{[1]} = \begin{bmatrix}0\\0\end{bmatrix}\)
\(\mathbf{W}^{[2]} = \begin{bmatrix}0.5 & 0.6\end{bmatrix}\), \(b^{[2]} = 0\)

Layer 1: \(\mathbf{z}^{[1]} = \begin{bmatrix}0.1(1) + 0.3(0.5)\\0.2(1)+0.4(0.5)\end{bmatrix} = \begin{bmatrix}0.25\\0.40\end{bmatrix}\)
\(\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]}) = \begin{bmatrix}0.25\\0.40\end{bmatrix}\) (both positive, ReLU passes them through)

Layer 2: \(z^{[2]} = 0.5(0.25) + 0.6(0.40) = 0.365\)
\(a^{[2]} = \sigma(0.365) = \frac{1}{1 + e^{-0.365}} \approx 0.59\)

The network outputs probability ≈ 0.59.
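
The same forward pass in a few lines of NumPy, reproducing the numbers above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([1.0, 0.5])
W1 = np.array([[0.1, 0.3], [0.2, 0.4]]); b1 = np.zeros(2)
W2 = np.array([0.5, 0.6]);               b2 = 0.0

z1 = W1 @ x + b1            # [0.25, 0.40]
a1 = np.maximum(z1, 0.0)    # ReLU passes both through (both positive)
z2 = W2 @ a1 + b2           # 0.365
a2 = sigmoid(z2)
print(round(float(a2), 2))  # 0.59
```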

3.3   Backpropagation — Full Derivation

Backpropagation answers one question: how should I adjust each weight to reduce the loss? It works backwards from the output, using the chain rule to attribute "blame" for the error to each weight in the network.

Step-by-step for a 2-layer network with sigmoid output and BCE loss

Step 1: Output layer error

$$\delta^{[2]} = \frac{\partial \mathcal{L}}{\partial z^{[2]}} = a^{[2]} - y$$

(This elegant result comes from the cancellation between the BCE derivative and sigmoid derivative.)

Step 2: Output layer gradients

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} \cdot (\mathbf{a}^{[1]})^T \qquad \frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]}$$

Step 3: Propagate error backward

$$\boldsymbol{\delta}^{[1]} = (\mathbf{W}^{[2]})^T \delta^{[2]} \odot \phi'(\mathbf{z}^{[1]})$$

Where \(\odot\) is element-wise multiplication. Each hidden neuron's error is proportional to: (1) how strongly it connects to the output, times (2) the output error, times (3) the local derivative of its activation.

Step 4: Hidden layer gradients

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \boldsymbol{\delta}^{[1]} \cdot \mathbf{x}^T \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[1]}} = \boldsymbol{\delta}^{[1]}$$

The General Pattern

Backpropagation — General Form $$\boldsymbol{\delta}^{[l]} = \left[(\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]}\right] \odot \phi'(\mathbf{z}^{[l]})$$ $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T$$

This recurses from the last layer to the first. The gradient for each weight is: (error at that layer) × (activation from the previous layer). This is why it's called backpropagation — the error signal propagates backward through the network.
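
The four steps can be sketched in NumPy for the 2-layer network from the forward-pass example (ReLU hidden layer, sigmoid output, BCE loss, true label \(y = 1\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([1.0, 0.5]); y = 1.0
W1 = np.array([[0.1, 0.3], [0.2, 0.4]]); b1 = np.zeros(2)
W2 = np.array([[0.5, 0.6]]);             b2 = np.zeros(1)

# Forward pass
z1 = W1 @ x + b1; a1 = np.maximum(z1, 0.0)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass (Steps 1-4)
d2  = a2 - y                        # Step 1: output error
dW2 = np.outer(d2, a1); db2 = d2    # Step 2: output-layer gradients
d1  = (W2.T @ d2) * (z1 > 0)        # Step 3: propagate back, ReLU derivative
dW1 = np.outer(d1, x);  db1 = d1    # Step 4: hidden-layer gradients

print(d2, dW2)   # d2 ≈ -0.41: the network under-predicted, so gradients push it up
```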

Vanishing gradients: In deep networks with sigmoid/tanh activations, the gradient at early layers becomes a product of many derivatives less than 1 — it shrinks exponentially. This is why ReLU (derivative = 1 for positive inputs) became standard: it lets gradients flow unchanged through active neurons.

Rumelhart, Hinton & Williams (1986) — "Learning representations by back-propagating errors." The foundational paper that made neural network training practical. It showed that gradient descent + chain rule can learn useful internal representations in hidden layers.

Exercise. Continuing the numerical example above (output = 0.59, true label \(y = 1\)):
(a) Compute \(\delta^{[2]} = a^{[2]} - y\).
(b) Compute \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}}\).
(c) Compute \(\boldsymbol{\delta}^{[1]}\) (using ReLU derivative = 1 since both units were positive).
(d) Update all weights with \(\eta = 0.1\).
· · ·
Part IV

Deep Learning Architectures

4.1   Convolutional Neural Networks

When you look at a photo, you don't analyze every pixel in relation to every other pixel. You first detect local features — edges, textures, colors — and then combine them into larger patterns (eyes, noses) and eventually whole objects (faces). CNNs mimic this hierarchy.

The Convolution Operation

A filter (kernel) \(\mathbf{K} \in \mathbb{R}^{k \times k}\) slides over the input, computing a dot product at each position:

2D Convolution $$(\mathbf{I} * \mathbf{K})_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} I_{i+m, j+n} \cdot K_{m,n}$$

Each term: \(I_{i+m, j+n}\) is the pixel at position \((i+m, j+n)\); \(K_{m,n}\) is the filter weight. The sum is a dot product between the local image patch and the filter.

Key insight: The same filter is applied at every spatial location (weight sharing). If a filter learns to detect vertical edges, it can detect them anywhere in the image. This is translational equivariance.
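
A direct implementation of the formula (a sketch; note that deep-learning "convolution" is cross-correlation, with no kernel flip, exactly as defined above):

```python
import numpy as np

def conv2d(I, K):
    """Valid 2D convolution: slide K over I, dot product at each position."""
    k = K.shape[0]
    H, W = I.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+k, j:j+k] * K)   # patch . filter
    return out

# A vertical-edge detector fires where left and right columns differ
I = np.array([[0, 0, 1, 1]] * 4, dtype=float)
K = np.array([[-1.0, 1.0],
              [-1.0, 1.0]])
print(conv2d(I, K))   # nonzero only at the edge column, at every row: weight sharing
```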

CNN Architecture Pattern

Input → [Conv → ReLU → Pool]×N → Flatten → Fully-Connected → Output

Pooling (e.g., max-pooling) reduces spatial dimensions, providing some translation invariance and reducing computation. A 2×2 max pool takes the maximum in each 2×2 block, halving both height and width.

How CNNs Learn Pattern Hierarchies

Layer 1 filters learn low-level features: edges, color gradients, simple textures. Layer 2 combines these into mid-level features: corners, curves, small texture patterns. Deeper layers learn increasingly abstract features: object parts, shapes, eventually entire object categories. The network builds a compositional representation of visual reality.

LeCun et al. (1998) — "Gradient-based learning applied to document recognition." Introduced LeNet-5 for handwritten digit recognition — the architecture that launched modern CNNs.

Krizhevsky, Sutskever & Hinton (2012) — "ImageNet Classification with Deep Convolutional Neural Networks" (AlexNet). Won ImageNet by a huge margin, igniting the deep learning revolution. Key innovations: ReLU activations, dropout regularization, GPU training.

4.2   Recurrent Neural Networks & LSTMs

Language is sequential: the meaning of "bank" depends on whether the previous words were "river" or "investment." RNNs process sequences one element at a time, maintaining a hidden state that acts as a "memory" of what came before.

RNN Equations

Vanilla RNN $$\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$ $$\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y$$

At each time step \(t\), the hidden state \(\mathbf{h}_t\) is a function of the previous state \(\mathbf{h}_{t-1}\) and the current input \(\mathbf{x}_t\). The same weights \(\mathbf{W}_{hh}, \mathbf{W}_{xh}\) are shared across all time steps.
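
The recurrence can be sketched in a short loop; the weights below are random, for shape-checking only:

```python
import numpy as np

def rnn_forward(xs, Whh, Wxh, bh):
    """Run a vanilla RNN over a sequence, returning all hidden states."""
    h = np.zeros(Whh.shape[0])
    hs = []
    for x in xs:                             # the same weights at every step
        h = np.tanh(Whh @ h + Wxh @ x + bh)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
Whh = rng.normal(size=(3, 3)) * 0.1          # hidden -> hidden
Wxh = rng.normal(size=(3, 2)) * 0.1          # input -> hidden
hs = rnn_forward([rng.normal(size=2) for _ in range(5)], Whh, Wxh, np.zeros(3))
print(len(hs), hs[-1].shape)                 # 5 hidden states, each of dimension 3
```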

The Problem: Vanishing/Exploding Gradients

During backpropagation through time (BPTT), the gradient at time step \(t\) depends on a product of Jacobians: \(\prod_{k=t}^{T} \frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}}\). For long sequences, this product either explodes or vanishes, making it impossible to learn long-range dependencies.

LSTM — Solving the Memory Problem

LSTM Equations $$\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \quad \text{(forget gate)}$$ $$\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \quad \text{(input gate)}$$ $$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \quad \text{(candidate memory)}$$ $$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell state update)}$$ $$\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \quad \text{(output gate)}$$ $$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

The key idea: The cell state \(\mathbf{c}_t\) is an "information highway" — it flows through time with only additive updates, so gradients can flow backward without vanishing. The forget gate decides what to erase from memory; the input gate decides what new information to store.
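
A single LSTM step, following the equations above (the four gates' weights are stacked into one matrix, a common implementation convention; values are random, for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b, H):
    """One LSTM step. W stacks the four gates' weights over [h_prev, x]."""
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])                # forget gate: what to erase
    i = sigmoid(z[H:2*H])              # input gate: what to store
    c_tilde = np.tanh(z[2*H:3*H])      # candidate memory
    o = sigmoid(z[3*H:4*H])            # output gate
    c = f * c_prev + i * c_tilde       # additive cell-state update (the highway)
    h = o * np.tanh(c)
    return h, c

H, D = 3, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(4 * H, H + D)) * 0.1
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, np.zeros(4 * H), H)
print(h.shape, c.shape)   # (3,) (3,)
```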

Hochreiter & Schmidhuber (1997) — "Long Short-Term Memory." The paper that solved the vanishing gradient problem for recurrent networks with the gating mechanism. LSTMs dominated sequence modeling for nearly two decades.

Exercise.
(a) In an LSTM, if the forget gate outputs all 1s and the input gate outputs all 0s, what happens to the cell state? What about the reverse?
(b) Why is the cell state update additive (\(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \ldots\)) rather than multiplicative? How does this help gradient flow?
· · ·
Part V

NLP Foundations

5.1   Word Embeddings — Giving Meaning to Numbers

How do you represent the word "king" as a number a neural network can process? One-hot encoding (a vector of mostly zeros with a single 1) treats every word as equally different from every other — "king" is as different from "queen" as from "banana." Word embeddings learn dense vectors where similar words are near each other in the vector space.

The Distributional Hypothesis

"You shall know a word by the company it keeps" — J.R. Firth, 1957. Words that appear in similar contexts have similar meanings. "Dog" and "cat" appear near "pet," "fur," "vet." This insight is the foundation of all modern word embeddings.

Word2Vec (Skip-Gram) — Derivation

Mikolov et al. (2013) — "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality." These papers showed that simple neural networks trained on word co-occurrence learn vectors with remarkable algebraic properties (e.g., king - man + woman ≈ queen).

Setup: Given a center word \(w_c\), predict the surrounding context words \(w_o\). Each word has two embedding vectors: \(\mathbf{v}_w\) (when it's the center word) and \(\mathbf{u}_w\) (when it's a context word).

Probability model:

Skip-Gram Objective $$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c})}{\sum_{w \in V} \exp(\mathbf{u}_w^T \mathbf{v}_{w_c})}$$

This is a softmax over the entire vocabulary. The numerator is the dot product between the context word's embedding and the center word's embedding — a measure of compatibility. The denominator normalizes across all possible words.

Training objective: Maximize the log-probability of observed (center, context) pairs:

$$\mathcal{L} = \sum_{t=1}^{T}\sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j} \mid w_t)$$

Negative Sampling approximation: Computing the full softmax over 100K+ words is prohibitive. Instead, for each positive pair, we sample \(k\) "negative" words (random words unlikely to be in the context) and train a binary classifier:

Negative Sampling Objective $$\mathcal{L} = \log\sigma(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log\sigma(-\mathbf{u}_{w_i}^T \mathbf{v}_{w_c})\right]$$

This says: maximize the dot product with the true context word (positive pair) while minimizing the dot product with random words (negative pairs). The result: words appearing in similar contexts get pulled together in vector space.
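
A direct transcription of the objective for one (center, context) pair; the vectors below are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_objective(v_center, u_context, u_negatives):
    """Negative-sampling objective for one pair (to be maximized)."""
    pos = np.log(sigmoid(u_context @ v_center))                       # pull together
    neg = sum(np.log(sigmoid(-u @ v_center)) for u in u_negatives)    # push apart
    return pos + neg

v_c    = np.array([0.5, 0.1])                      # center-word embedding
u_o    = np.array([0.4, 0.2])                      # true context word (aligned)
u_negs = [np.array([-0.3, 0.8]), np.array([0.0, -0.5])]   # sampled negatives
print(neg_sampling_objective(v_c, u_o, u_negs))    # a (negative) log-probability
```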

GloVe — Global Vectors

Pennington, Socher & Manning (2014) — "GloVe: Global Vectors for Word Representation." Instead of predicting from local windows, GloVe directly factorizes the global co-occurrence matrix.

Key insight: Let \(X_{ij}\) be the count of how often word \(j\) appears in the context of word \(i\). GloVe optimizes:

GloVe Objective $$\mathcal{L} = \sum_{i,j=1}^{|V|} f(X_{ij}) \left(\mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

Where \(f(X_{ij})\) is a weighting function that prevents very common co-occurrences from dominating. The objective says: the dot product of two word vectors should approximate the log of how often they co-occur.

Subword Tokenization (BPE)

Modern LLMs don't embed whole words — they use subword tokens. Byte Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent pairs: "l o w e r" → "lo w e r" → "low e r" → "low er" → "lower". This handles unseen words by decomposing them into known subwords, and reduces vocabulary size while maintaining expressiveness.
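
The merge loop can be sketched in a few lines (a toy implementation on a made-up word-count dictionary):

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1]); i += 2
        else:
            out.append(symbols[i]); i += 1
    return tuple(out)

def bpe_merges(word_counts, num_merges):
    """Toy BPE: iteratively merge the most frequent adjacent symbol pair."""
    vocab = {tuple(w): c for w, c in word_counts.items()}   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges

print(bpe_merges({"lower": 5, "low": 7, "lowest": 2}, 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')] — "low" emerges as a subword
```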

Exercise.
(a) If king - man + woman ≈ queen works in vector space, what does this tell you about how the embedding dimensions encode meaning?
(b) Why does the skip-gram model need two different embedding matrices (center vs. context)? What would happen if we used the same matrix for both?
· · ·
Part VI

Attention & Transformers

This is the section that makes everything else click. The transformer architecture, built entirely on the attention mechanism, is the foundation of all modern language models.

6.1   The Attention Mechanism — From First Principles

Imagine translating "The cat sat on the mat" into French. When generating the French word for "cat," you need to focus on ("attend to") the English word "cat" — not "the" or "mat." Attention is a mechanism that lets the model dynamically decide which parts of the input to focus on for each output.

Attention as Weighted Retrieval

Think of attention as a soft dictionary lookup. You have a query (what you're looking for), a set of keys (labels in the dictionary), and values (the stored information). The query is compared to all keys to compute relevance scores, and the output is a weighted sum of values.

Dot-Product Attention — Derivation

Step 1: Measure similarity. Given a query \(\mathbf{q}\) and a key \(\mathbf{k}\), their similarity is the dot product \(\mathbf{q}^T \mathbf{k}\). This measures how aligned the query is with the key in the embedding space.

Step 2: Scale. The dot product grows with dimensionality \(d_k\), which pushes softmax into saturated regions with tiny gradients. We divide by \(\sqrt{d_k}\):

$$\text{score}(q, k) = \frac{\mathbf{q}^T \mathbf{k}}{\sqrt{d_k}}$$
Why \(\sqrt{d_k}\)? If the entries of \(\mathbf{q}\) and \(\mathbf{k}\) are i.i.d. with zero mean and unit variance, their dot product has mean 0 and variance \(d_k\). Dividing by \(\sqrt{d_k}\) normalizes the variance to 1, keeping the softmax input in a well-behaved range.

Step 3: Normalize to get attention weights.

$$\alpha_i = \text{softmax}_i = \frac{\exp(\text{score}(q, k_i))}{\sum_j \exp(\text{score}(q, k_j))}$$

Step 4: Compute weighted output.

$$\text{output} = \sum_i \alpha_i \mathbf{v}_i$$

Putting it all together in matrix form:

Scaled Dot-Product Attention $$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

Where \(\mathbf{Q} \in \mathbb{R}^{n \times d_k}\), \(\mathbf{K} \in \mathbb{R}^{m \times d_k}\), \(\mathbf{V} \in \mathbb{R}^{m \times d_v}\). The output is \(\mathbb{R}^{n \times d_v}\).

Tiny attention example. Sequence: ["I", "love", "cats"], \(d_k = 2\).

\(\mathbf{Q} = \begin{bmatrix}1 & 0\\0 & 1\\1 & 1\end{bmatrix}\),   \(\mathbf{K} = \begin{bmatrix}1 & 0\\0 & 1\\0.5 & 0.5\end{bmatrix}\),   \(\mathbf{V} = \begin{bmatrix}1 & 0\\0 & 1\\0.5 & 0.5\end{bmatrix}\)

Scores \(= \mathbf{Q}\mathbf{K}^T / \sqrt{2}\):
\(\begin{bmatrix}1 & 0 & 0.5\\0 & 1 & 0.5\\1 & 1 & 1\end{bmatrix} / 1.41 = \begin{bmatrix}0.71 & 0 & 0.35\\0 & 0.71 & 0.35\\0.71 & 0.71 & 0.71\end{bmatrix}\)

Attention weights (softmax each row):
Row 1: \([0.46, 0.22, 0.32]\) — "I" mostly attends to itself.
Row 3: \([0.33, 0.33, 0.33]\) — "cats" attends equally (all scores equal).

Output = weights × V = weighted combination of value vectors for each position.
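The worked example above can be reproduced in a few lines of NumPy; this is a minimal sketch of scaled dot-product attention using the exact matrices from the text:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, m) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# The tiny example from the text: ["I", "love", "cats"], d_k = 2
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = K.copy()

output, weights = scaled_dot_product_attention(Q, K, V)
print(np.round(weights, 2))  # row 1 ≈ [0.46, 0.22, 0.32]; row 3 is uniform
```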

6.2   Self-Attention

In self-attention, the queries, keys, and values all come from the same sequence. Each token attends to all other tokens (including itself) in the sequence:

Self-Attention $$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$

Where \(\mathbf{X} \in \mathbb{R}^{n \times d}\) is the sequence of input embeddings, and \(\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d \times d_k}\), \(\mathbf{W}^V \in \mathbb{R}^{d \times d_v}\) are learned projection matrices.

What does self-attention compute? For each token, it asks "which other tokens in this sequence are relevant to understanding me?" and creates a new representation that blends information from those relevant tokens. In "The animal didn't cross the street because it was too wide," self-attention at the word "it" can learn to assign high weight to "street" (the thing that's wide), resolving the reference.

6.3   Multi-Head Attention

A single attention head can only capture one type of relationship at a time (e.g., syntactic dependency). Multiple heads let the model capture different types of relationships simultaneously — one head might capture subject-verb agreement, another might capture coreference, another might capture semantic similarity.
Multi-Head Attention $$\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_i^Q, \mathbf{X}\mathbf{W}_i^K, \mathbf{X}\mathbf{W}_i^V)$$ $$\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$

If the model dimension is \(d = 512\) and we use \(h = 8\) heads, each head uses \(d_k = d_v = d/h = 64\). The total computation is similar to single-head attention with full dimensionality, but we get \(h\) different attention patterns.
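A minimal NumPy sketch of multi-head attention with the \(d = 512\), \(h = 8\) configuration above. The weights are random and purely illustrative; real implementations also fuse the per-head projections into single matrices for efficiency:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 512, 8           # sequence length, model dim, number of heads
d_k = d // h                  # 64 dimensions per head

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(n, d))
# One projection per head
WQ = rng.normal(size=(h, d, d_k)) / np.sqrt(d)
WK = rng.normal(size=(h, d, d_k)) / np.sqrt(d)
WV = rng.normal(size=(h, d, d_k)) / np.sqrt(d)
WO = rng.normal(size=(d, d)) / np.sqrt(d)

heads = []
for i in range(h):
    Q, K, V = X @ WQ[i], X @ WK[i], X @ WV[i]   # each (n, d_k)
    A = softmax(Q @ K.T / np.sqrt(d_k))         # (n, n): one pattern per head
    heads.append(A @ V)                         # (n, d_k)

out = np.concatenate(heads, axis=-1) @ WO       # concat back to (n, d), then W^O
print(out.shape)
```

Note that the total work is roughly that of one full-dimension attention, but we obtain \(h = 8\) independent attention patterns.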

6.4   The Transformer — Full Architecture

Vaswani et al. (2017) — "Attention Is All You Need." The paper that replaced RNNs with pure attention for sequence modeling. Introduced the transformer architecture that now underlies GPT, BERT, and all modern LLMs. One of the most impactful papers in ML history.

Encoder Block

Each encoder layer applies two sub-layers with residual connections and layer normalization:

Transformer Encoder Layer $$\mathbf{X}' = \text{LayerNorm}(\mathbf{X} + \text{MultiHead}(\mathbf{X}))$$ $$\mathbf{X}'' = \text{LayerNorm}(\mathbf{X}' + \text{FFN}(\mathbf{X}'))$$

Where the feed-forward network (FFN) is applied position-wise:

$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \, \phi(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Typically \(\mathbf{W}_1 \in \mathbb{R}^{4d \times d}\), \(\mathbf{W}_2 \in \mathbb{R}^{d \times 4d}\) — the FFN expands to 4× the model dimension and then projects back. This is where "knowledge" is stored (the attention layers route information; the FFN layers process it).
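A minimal NumPy sketch of the second encoder sub-layer (residual connection plus LayerNorm around the position-wise FFN). It uses the row-vector convention \(\mathbf{x}\mathbf{W}_1\), so the weight shapes read \((d, 4d)\) and \((4d, d)\); the weights are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 64
X = rng.normal(size=(n, d))

# Row-vector convention: FFN(x) = phi(x W1 + b1) W2 + b2
W1, b1 = rng.normal(size=(d, 4 * d)) * 0.02, np.zeros(4 * d)  # expand to 4d
W2, b2 = rng.normal(size=(4 * d, d)) * 0.02, np.zeros(d)      # project back to d

def ffn(X):
    hidden = np.maximum(0.0, X @ W1 + b1)   # phi = ReLU; shape (n, 4d)
    return hidden @ W2 + b2                 # shape (n, d)

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

# Residual + LayerNorm around the FFN, applied independently at each position
out = layer_norm(X + ffn(X))
print(out.shape)   # (5, 64)
```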

Positional Encoding

Self-attention is permutation-invariant — it doesn't know word order. Positional encodings inject position information:

Sinusoidal Positional Encoding $$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Each dimension oscillates at a different frequency — low dimensions change slowly (encoding coarse position), high dimensions change rapidly (encoding fine position). This allows the model to attend to relative positions.
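The sinusoidal encoding can be sketched directly from the formula:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]            # even dimension indices
    angles = pos / np.power(10000.0, i / d)    # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=50, d=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

The first dimension oscillates once per position (highest frequency); the last pair barely changes across 50 positions (lowest frequency).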

Decoder Block (for autoregressive generation)

The decoder adds causal (masked) self-attention — each position can only attend to previous positions, preventing the model from "seeing the future":

$$\text{CausalAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$

Where \(M_{ij} = 0\) if \(j \leq i\), and \(M_{ij} = -\infty\) if \(j > i\). After the softmax, the \(-\infty\) entries produce attention weights of exactly 0, so each position ignores all future positions.
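A minimal NumPy sketch of the causal mask, applied here to uniform scores so the resulting weights are easy to read:

```python
import numpy as np

def causal_mask(n):
    """M[i, j] = 0 where j <= i (can attend), -inf where j > i (future)."""
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

M = causal_mask(4)
scores = np.zeros((4, 4)) + M   # uniform scores plus the mask
A = softmax(scores)
print(np.round(A, 2))
# Row t attends uniformly over positions 0..t; future positions get weight 0
```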

Putting It All Together

A full transformer for language modeling: Input tokens → Embedding + Positional Encoding → N × Decoder Blocks → Linear projection → Softmax over vocabulary → Predicted next token.

(a) In a model with \(d = 256\), \(h = 4\) heads, what are the dimensions of \(\mathbf{W}_i^Q\), \(\mathbf{W}_i^K\), \(\mathbf{W}_i^V\) for each head?
(b) Write out the causal mask \(\mathbf{M}\) for a sequence of length 4. After softmax, what does each row of the attention matrix represent?
(c) Why is the residual connection (\(\mathbf{X} + \text{sublayer}(\mathbf{X})\)) critical for training deep transformers? (Hint: think about gradient flow.)
· · ·
Part VII

Modern Large Language Models

7.1   The GPT Architecture

Radford et al. (2018, 2019) — GPT and GPT-2. Showed that a decoder-only transformer pretrained on next-token prediction transfers effectively to downstream tasks. GPT-2 demonstrated that scaling up model size and data yields surprisingly strong zero-shot performance across many tasks.

Brown et al. (2020) — "Language Models are Few-Shot Learners" (GPT-3). Scaled to 175B parameters and showed that LLMs can perform tasks from a few examples in the prompt, without any gradient updates.

GPT is a decoder-only transformer. Its single training objective is strikingly simple:

Next-Token Prediction (Causal Language Modeling)

Training Objective $$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta)$$

For each position \(t\), the model predicts the probability distribution over the entire vocabulary for the next token, given all previous tokens. The loss is the cross-entropy between this prediction and the actual next token.
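The training objective can be sketched in a few lines of NumPy, using random logits to stand in for an untrained model's outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
T, V = 6, 100                     # sequence length, vocabulary size

logits = rng.normal(size=(T, V))  # model outputs: one row of scores per position
targets = rng.integers(0, V, T)   # the actual next token at each position

def causal_lm_loss(logits, targets):
    """L = -sum_t log P(x_t | x_<t): cross-entropy against the true next token."""
    shifted = logits - logits.max(axis=-1, keepdims=True)    # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(T), targets].sum()

loss = causal_lm_loss(logits, targets)
print(loss / T)   # mean loss per token; log(V) ≈ 4.6 is the uniform-guessing baseline
```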

How this single objective produces intelligence: To predict the next word well, the model must learn grammar, facts, reasoning, common sense, style, and much more. The prediction task is a "universal" task that requires understanding language at every level.

The Softmax Output Layer

The transformer's output at position \(t\) is a vector \(\mathbf{h}_t \in \mathbb{R}^d\). We project it to vocabulary size and apply softmax:

$$P(x_t = w \mid x_{<t}) = \frac{\exp(\mathbf{e}_w^T \mathbf{h}_t)}{\sum_{w'} \exp(\mathbf{e}_{w'}^T \mathbf{h}_t)}$$

Where \(\mathbf{e}_w\) is the embedding of token \(w\). Often the output embedding matrix is tied (shared) with the input embedding matrix.

7.2   Scaling Laws

Kaplan et al. (2020) — "Scaling Laws for Neural Language Models." Discovered power-law relationships between model performance and compute, dataset size, and parameter count.

Hoffmann et al. (2022) — "Training Compute-Optimal Large Language Models" (Chinchilla). Showed that most LLMs were significantly undertrained — for a given compute budget, it's better to train a smaller model on more data than a larger model on less data.

The cross-entropy loss follows a power law:

Scaling Law $$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Where \(N\) is the number of parameters and \(\alpha_N \approx 0.076\). Similar power laws hold for dataset size and compute. The Chinchilla finding: the compute-optimal ratio is roughly 20 tokens per parameter — a 10B parameter model should be trained on ~200B tokens.

7.3   Modern Architectural Improvements

RMSNorm (replacing LayerNorm)

$$\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}} \odot \boldsymbol{\gamma}$$

Simpler and faster than LayerNorm (no mean subtraction). Used in LLaMA, GPT-4, and most modern LLMs.
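A minimal NumPy sketch of RMSNorm, showing that the output has unit root-mean-square:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by the root-mean-square; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

d = 8
x = np.array([3.0, -1.0, 2.0, 0.5, -2.0, 1.0, 0.0, 4.0])
y = rms_norm(x, gamma=np.ones(d))
print(np.sqrt(np.mean(y ** 2)))   # ≈ 1.0: the output has unit RMS
```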

Rotary Position Embedding (RoPE)

Instead of adding positional information, RoPE encodes relative position directly into the attention computation by rotating query and key vectors:

$$\mathbf{q}_m^T \mathbf{k}_n = (\mathbf{R}_m \mathbf{W}^Q \mathbf{x}_m)^T (\mathbf{R}_n \mathbf{W}^K \mathbf{x}_n)$$

Where \(\mathbf{R}_m\) is a rotation matrix that depends on position \(m\). The dot product then depends on the relative position \(m - n\), which is more natural for language understanding.
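A simplified NumPy sketch of RoPE (rotating consecutive dimension pairs, the standard construction), demonstrating the relative-position property: the dot product is unchanged when both positions shift by the same amount.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # pair up dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # a 2D rotation in each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(3)
q, k = rng.normal(size=8), rng.normal(size=8)

# Key property: the dot product depends only on the relative position m - n
a = rope(q, pos=5) @ rope(k, pos=3)
b = rope(q, pos=12) @ rope(k, pos=10)
print(np.isclose(a, b))   # True: both pairs are exactly 2 positions apart
```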

SwiGLU Activation (in FFN)

$$\text{SwiGLU}(\mathbf{x}) = (\mathbf{x}\mathbf{W}_1) \odot \text{SiLU}(\mathbf{x}\mathbf{W}_\text{gate})$$

A gated variant that outperforms plain ReLU in practice. Used in LLaMA, PaLM, and most recent models.
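A minimal NumPy sketch of the SwiGLU formula, with random illustrative weights:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))   # SiLU(x) = x * sigmoid(x)

def swiglu(x, W1, W_gate):
    """SwiGLU(x) = (x W1) ⊙ SiLU(x W_gate): a gated FFN activation."""
    return (x @ W1) * silu(x @ W_gate)

rng = np.random.default_rng(4)
d, d_ff = 16, 32
W1 = rng.normal(size=(d, d_ff)) * 0.1
W_gate = rng.normal(size=(d, d_ff)) * 0.1
x = rng.normal(size=d)

print(swiglu(x, W1, W_gate).shape)   # (32,)
```

The gate branch can suppress each hidden unit independently, which is the extra flexibility that plain ReLU lacks.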

Grouped-Query Attention (GQA)

Standard multi-head attention uses separate K and V projections per head, which is expensive during inference (the KV cache for all heads must be stored). GQA groups heads so that multiple query heads share the same K and V projections, reducing KV cache by a factor of the group size without significant quality loss.

7.4   Mixture of Experts (MoE) — The DeepSeek Approach

DeepSeek-V2, V3, R1 (2024–2025) — Open-weight models that pushed the efficiency frontier using fine-grained MoE with shared experts. DeepSeek-V3 achieves GPT-4-level quality with far less compute by activating only a small fraction of total parameters per token.

Not every input needs every part of the network. A question about cooking should activate "cooking knowledge" neurons, not "quantum physics" neurons. MoE replaces the single FFN with multiple "expert" FFNs and a router that selects which experts to use for each token.

MoE FFN Layer

Mixture of Experts $$\mathbf{y} = \sum_{i=1}^{N} g_i(\mathbf{x}) \cdot E_i(\mathbf{x})$$ $$g_i(\mathbf{x}) = \begin{cases}\text{softmax}(\text{TopK}(\mathbf{W}_g \mathbf{x}))_i & \text{if expert } i \text{ selected}\\0 & \text{otherwise}\end{cases}$$

Where \(E_i\) are expert FFN networks, and \(\mathbf{W}_g\) is the router/gating network. For each token, only the top-K experts (e.g., K = 2 out of 64) are activated. This means a model with 236B total parameters might only use ~21B parameters per token.
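A minimal NumPy sketch of top-K routing. Each expert stands in for a full FFN; single linear maps keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_experts, top_k = 16, 8, 2

experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
W_g = rng.normal(size=(d, n_experts)) * 0.1   # router / gating network

def moe_forward(x):
    logits = x @ W_g                           # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-K experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over selected experts only
    # Only top_k of n_experts do any computation for this token
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.normal(size=d)
y = moe_forward(x)
print(y.shape)   # (16,); 2 of 8 experts active, i.e. 25% of expert parameters
```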

DeepSeek innovations

Fine-grained experts: Instead of 8 large experts, use 64+ smaller experts — more flexible routing.

Shared experts: Some experts are always active (capturing common knowledge), while routed experts specialize.

Auxiliary load-balancing loss: Without regularization, the router can "collapse" and send all tokens to a few experts. An auxiliary loss encourages balanced expert utilization:

$$\mathcal{L}_{\text{balance}} = \alpha \sum_{i=1}^{N} f_i \cdot P_i$$

Where \(f_i\) is the fraction of tokens routed to expert \(i\) and \(P_i\) is the average routing probability to expert \(i\).

7.5   RLHF — Aligning LLMs with Human Preferences

Ouyang et al. (2022) — "Training language models to follow instructions with human feedback" (InstructGPT/ChatGPT methodology). Showed that RLHF dramatically improves helpfulness, truthfulness, and safety.

The training pipeline has three stages:

Stage 1: Supervised Fine-Tuning (SFT). Fine-tune the pretrained model on high-quality (prompt, response) examples written by humans.

Stage 2: Reward Model Training. Collect human preferences: given a prompt and two responses, which is better? Train a reward model \(R_\phi(x, y)\) to predict these preferences using the Bradley-Terry model:

$$P(y_1 \succ y_2 \mid x) = \sigma(R_\phi(x, y_1) - R_\phi(x, y_2))$$
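The Bradley-Terry objective can be sketched in a few lines, using hypothetical reward-model scores:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preference_prob(r_winner, r_loser):
    """Bradley-Terry: P(y1 > y2 | x) = sigmoid(R(x, y1) - R(x, y2))."""
    return sigmoid(r_winner - r_loser)

def reward_model_loss(r_winner, r_loser):
    """Negative log-likelihood of the human preference label."""
    return -np.log(preference_prob(r_winner, r_loser))

# Hypothetical reward-model scores for two responses to the same prompt
print(preference_prob(2.0, 0.5))     # ≈ 0.82: the model agrees with the label
print(reward_model_loss(2.0, 0.5))   # small loss when scores match the preference
```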

Stage 3: RL Optimization. Use PPO (Proximal Policy Optimization) to maximize the reward while staying close to the SFT policy:

RLHF Objective $$\max_\theta \, \mathbb{E}_{x \sim D, y \sim \pi_\theta(\cdot|x)}\left[R_\phi(x, y)\right] - \beta \, D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$$

The KL divergence term prevents the model from diverging too far from the reference policy (which would lead to reward hacking — finding outputs that score high on the reward model but are actually nonsensical).

DPO — Direct Preference Optimization

A simpler alternative that skips the reward model entirely. Starting from the RLHF objective's closed-form solution, DPO directly optimizes:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log\sigma\!\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Where \(y_w\) is the preferred response and \(y_l\) is the rejected response.
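The DPO loss can be sketched directly, using hypothetical sequence log-probabilities under the policy and the frozen reference model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: push the policy's (chosen - rejected) margin above the reference's."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Hypothetical log-probs; ref_w / ref_l come from the frozen reference policy
ref_w, ref_l = -20.0, -22.0
print(dpo_loss(-18.0, -25.0, ref_w, ref_l))   # policy favors y_w more than ref: low loss
print(dpo_loss(-25.0, -18.0, ref_w, ref_l))   # policy flipped the preference: high loss
```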

(a) In MoE with 64 experts and top-2 routing, what fraction of parameters are active per token? Why is this beneficial for training and inference?
(b) In the RLHF objective, what happens if \(\beta \to 0\)? What about \(\beta \to \infty\)?
(c) A model has been trained with next-token prediction on the entire internet. Why does it still need RLHF? Give a concrete example.
· · ·
Part VIII

Pattern Detection — A Unified View

8.1   How Models "See" — The Geometric View

Every classifier draws a boundary in the feature space that separates one class from another. Linear models draw flat boundaries (hyperplanes). Neural networks warp the space with nonlinear activations until the classes become linearly separable — then draw a flat boundary in the warped space.

Consider binary classification of images (human vs. non-human). Each image is a point in \(\mathbb{R}^{784}\) (for 28×28 pixels). These points form complex, interleaved clusters. A deep network learns a function \(f: \mathbb{R}^{784} \to \mathbb{R}^{d}\) that maps these points to a new space where they are linearly separable.

What a Neural Network Does, Geometrically $$\text{Input space} \xrightarrow{\text{Layer 1}} \text{new space} \xrightarrow{\text{Layer 2}} \text{new space} \xrightarrow{\ldots} \text{linearly separable space} \xrightarrow{\text{linear classifier}} \text{prediction}$$

Each layer performs two operations:

1. Affine transformation (\(\mathbf{Wx} + \mathbf{b}\)): rotates, scales, and shifts the data.

2. Nonlinear activation (\(\phi\)): folds and warps the space (ReLU "folds" by zeroing negative regions).

8.2   The Probabilistic View

A classifier estimates the posterior probability \(P(y = k \mid \mathbf{x})\). By Bayes' theorem:

$$P(y = k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y = k) \, P(y = k)}{P(\mathbf{x})}$$

The model learns the class-conditional density \(P(\mathbf{x} \mid y = k)\) — the "pattern" of what class-\(k\) data looks like — and the prior \(P(y = k)\). Classification becomes: which class makes the observed data most likely?

8.3   Similarity-Based Pattern Detection

Modern NLP models don't use explicit class-conditional densities. Instead, they operate in learned embedding spaces where patterns are encoded as proximity:

Distance metrics and their roles:

Metric · Formula · Use
Euclidean · \(\|\mathbf{a} - \mathbf{b}\|_2\) · K-nearest neighbors, clustering
Cosine similarity · \(\frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}\) · Word embeddings, sentence similarity, retrieval
Dot product · \(\mathbf{a}^T\mathbf{b}\) · Attention scores (unnormalized similarity)
Mahalanobis · \(\sqrt{(\mathbf{a}-\mathbf{b})^T\Sigma^{-1}(\mathbf{a}-\mathbf{b})}\) · Distance accounting for correlations
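All four metrics computed on a pair of toy vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 1.0])

euclidean = np.linalg.norm(a - b)                       # sqrt(1 + 0 + 4)
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b                                             # 2 + 4 + 3 = 9

# Mahalanobis needs a covariance matrix; the identity reduces it to Euclidean
Sigma = np.eye(3)
diff = a - b
mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)

print(round(euclidean, 3), round(cosine, 3), dot, round(mahalanobis, 3))
```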

How Attention Detects Patterns

In a transformer, the attention mechanism computes \(\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\) — this is a similarity matrix between every pair of tokens. High similarity means "these tokens are relevant to each other." The model learns the projection matrices \(\mathbf{W}^Q, \mathbf{W}^K\) such that semantically or syntactically related tokens have high dot-product similarity after projection. This is how a transformer discovers patterns like subject-verb agreement, coreference, and semantic relationships — entirely from the training signal of next-token prediction.

8.4   What LLMs Actually Learn

Recent interpretability research has revealed that transformer layers develop a hierarchy of increasingly abstract representations:

Early layers: encode surface-level patterns — token identity, local syntax, simple co-occurrence statistics.

Middle layers: encode semantic relationships — word meanings in context, entity types, relational knowledge ("Paris is-capital-of France").

Late layers: encode task-specific features — next-token predictions draw on abstract reasoning, world knowledge, and contextual understanding built by earlier layers.

A large language model can be thought of as a giant, differentiable lookup table that has compressed the statistical structure of human language. The "patterns" it detects are not hand-coded rules but emergent statistical regularities — grammar, facts, reasoning strategies — all learned from the gradient signal of predicting the next token.
· · ·
Appendix

Annotated Reading List & Learning Roadmap

Essential Papers (in reading order)

1. Rumelhart, Hinton & Williams — Backpropagation (1986): Made neural network training practical.
2. LeCun et al. — LeNet / CNNs (1998): Proved deep learning works for vision.
3. Hochreiter & Schmidhuber — LSTM (1997): Solved vanishing gradient for sequences.
4. Mikolov et al. — Word2Vec (2013): Efficient word embeddings; word arithmetic.
5. Pennington et al. — GloVe (2014): Global co-occurrence matrix factorization.
6. Bahdanau et al. — Attention for NMT (2015): Introduced attention for seq2seq.
7. Vaswani et al. — Transformer (2017): Replaced RNNs entirely; the foundation of LLMs.
8. Devlin et al. — BERT (2019): Bidirectional pretraining; dominated NLU benchmarks.
9. Radford et al. — GPT-2 (2019): Showed language modeling scales to generation quality.
10. Brown et al. — GPT-3 (2020): In-context learning and few-shot capabilities at scale.
11. Kaplan et al. — Scaling Laws (2020): Power-law relationships for compute-optimal training.
12. Hoffmann et al. — Chinchilla (2022): Compute-optimal: more data, smaller model.
13. Ouyang et al. — InstructGPT / RLHF (2022): Aligning models with human preferences.
14. Touvron et al. — LLaMA (2023): Open-weight models competitive with proprietary.
15. Rafailov et al. — DPO (2023): Simplified alignment without RL.
16. DeepSeek Team — DeepSeek-V3 (2024): MoE with fine-grained routing; open-weight frontier model.

Recommended Learning Sequence

Weeks 1–2: Master Parts I–II. Implement linear and logistic regression from scratch (just NumPy). Derive every gradient by hand.

Weeks 3–4: Master Part III. Implement a 2-layer neural network from scratch with backpropagation. Train it on MNIST.

Weeks 5–6: Study Part IV. Implement a simple CNN in PyTorch. Implement a character-level RNN language model.

Weeks 7–8: Study Parts V–VI. Read the Word2Vec and Transformer papers. Implement self-attention from scratch. Then implement a full transformer decoder in PyTorch.

Weeks 9–12: Study Part VII. Read GPT-2 and GPT-3 papers. Fine-tune a small LLM. Study MoE and RLHF architectures. Read DeepSeek papers.

Ongoing: Follow arXiv for new developments. Join the Hugging Face community. Read interpretability papers (Olah, Elhage et al.) to understand what models learn, not just how they're trained.

Implementation Resources

Karpathy's "Neural Networks: Zero to Hero" — YouTube series that builds GPT from scratch.
Karpathy's nanoGPT — minimal GPT implementation in ~600 lines of PyTorch.
Hugging Face Transformers — production-grade implementations of every major architecture.
The Annotated Transformer (Rush, 2018) — line-by-line walkthrough of the original transformer code.

· · ·
Built as a comprehensive learning reference. Every equation tells a story — learn to read them and the papers will follow.