From vectors and probabilities through neural networks to transformers and large language models.
Every concept built from intuition → derivation → example → application.
We only cover the math you actually need. Three pillars support all of machine learning: linear algebra gives us the language to describe data and transformations; probability gives us the framework to reason under uncertainty; optimization gives us the engine to learn.
A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered list of \(n\) numbers. It represents a single data point. For example, a house with 1500 sqft and 3 bedrooms is \(\mathbf{x} = [1500, 3]^T\).
The dot product of two vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^n\) is:

$$\mathbf{a}^T \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \|\mathbf{a}\| \, \|\mathbf{b}\| \cos\theta$$
The first form is algebraic: multiply corresponding elements and sum. The second is geometric: the product of their lengths scaled by the cosine of the angle between them.
Why this matters everywhere in ML: The dot product measures similarity. When \(\theta = 0\) (same direction), cosine is 1 and the dot product is maximized. When \(\theta = 90°\) (orthogonal), the dot product is 0. When \(\theta = 180°\) (opposite), it's negative. Attention mechanisms, linear regression, neural network layers — all are built on dot products.
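A quick NumPy check of the three regimes (the vector values here are made up for illustration):

```python
import numpy as np

a = np.array([1.0, 2.0])
same = np.array([2.0, 4.0])      # same direction as a
ortho = np.array([-2.0, 1.0])    # perpendicular to a
opp = np.array([-1.0, -2.0])     # opposite direction

def cos_angle(u, v):
    # cos(theta) = (u . v) / (||u|| ||v||)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

print(a @ same, cos_angle(a, same))    # positive dot product, cosine 1
print(a @ ortho, cos_angle(a, ortho))  # dot product 0 (orthogonal)
print(a @ opp, cos_angle(a, opp))      # negative dot product, cosine -1
```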
A matrix \(\mathbf{W} \in \mathbb{R}^{m \times n}\) transforms a vector from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). Every layer of a neural network does this: \(\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}\).
Each row of \(\mathbf{W}\) is a "detector" — its dot product with \(\mathbf{x}\) measures how much \(\mathbf{x}\) matches that pattern. A neural network learns what patterns to detect by adjusting \(\mathbf{W}\).
| Operation | Notation | ML Role |
|---|---|---|
| Dot product | \(\mathbf{a}^T \mathbf{b}\) | Similarity, attention scores, linear predictions |
| Matrix multiply | \(\mathbf{W}\mathbf{x}\) | Linear transformations / neural network layers |
| Transpose | \(\mathbf{A}^T\) | Switching rows ↔ columns, computing \(\mathbf{X}^T\mathbf{X}\) |
| Norm | \(\|\mathbf{x}\| = \sqrt{\sum x_i^2}\) | Regularization, normalization, distance |
| Outer product | \(\mathbf{a}\mathbf{b}^T\) | Rank-1 updates, attention value computation |
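Each row of the table maps to a one-liner in NumPy (the concrete vectors and matrix here are arbitrary examples):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 2.0, 0.0]])   # 2x3: maps R^3 -> R^2

dot = a @ b                # similarity score: 1*4 + 2*5 + 3*6 = 32
y = W @ a                  # linear transformation (a layer without bias)
gram = W @ W.T             # transpose in action; a 2x2 matrix
norm = np.linalg.norm(a)   # sqrt(1 + 4 + 9)
outer = np.outer(a, b)     # 3x3 rank-1 matrix
```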
Each term has a name and role:
| Term | Name | Meaning |
|---|---|---|
| \(P(H \mid D)\) | Posterior | Updated belief about hypothesis \(H\) after seeing data \(D\) |
| \(P(D \mid H)\) | Likelihood | How probable the data is if hypothesis \(H\) is true |
| \(P(H)\) | Prior | Initial belief before seeing data |
| \(P(D)\) | Evidence | Total probability of the data (normalizer) |
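The four terms can be exercised on a classic diagnostic-test example (all the numbers below are hypothetical):

```python
# Hypothetical numbers: a condition with 1% prevalence, a test with
# 95% sensitivity and a 5% false-positive rate.
prior = 0.01              # P(H)
likelihood = 0.95         # P(D | H): positive test given the condition
false_pos = 0.05          # P(D | not H)

# Evidence P(D): total probability of a positive test
evidence = likelihood * prior + false_pos * (1 - prior)

# Posterior P(H | D) via Bayes' theorem
posterior = likelihood * prior / evidence
print(round(posterior, 3))  # ~0.161: still unlikely despite a positive test
```

The low prior dominates: most positive tests come from the much larger healthy population.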
Gaussian (Normal): \(p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\). Maximizing the likelihood of Gaussian data leads directly to minimizing squared error — which is why least squares regression is so fundamental.
Bernoulli: \(P(y=1) = p\), \(P(y=0) = 1-p\). The basis of binary classification. Maximizing Bernoulli likelihood leads to the cross-entropy loss.
Categorical / Softmax: Generalizes Bernoulli to \(K\) classes. \(P(y = k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}\). This is the softmax function — it appears in logistic regression, neural network outputs, and attention mechanisms.
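A minimal softmax in NumPy, with the standard max-subtraction trick for numerical stability (the logits below are arbitrary):

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; this doesn't change the
    # result, since softmax is invariant to adding a constant to all logits.
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

probs = softmax(np.array([2.0, 1.0, 0.1]))
print(probs, probs.sum())  # probabilities summing to 1
```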
Given data \(\mathcal{D} = \{x_1, \ldots, x_N\}\), find the parameters \(\theta\) that maximize the probability of observing this data:

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{N} p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^{N} \log p(x_i \mid \theta)$$
We take the log because: (1) products become sums (easier to work with), (2) log is monotonic so the maximum doesn't change, (3) it avoids numerical underflow from multiplying many small probabilities.
For a function \(f(\theta_1, \theta_2, \ldots, \theta_d)\), the gradient is the vector of all partial derivatives:

$$\nabla f = \left[\frac{\partial f}{\partial \theta_1}, \frac{\partial f}{\partial \theta_2}, \ldots, \frac{\partial f}{\partial \theta_d}\right]^T$$
The gradient points in the direction of steepest increase. To minimize a loss function, we step in the opposite direction:

$$\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}(\theta)$$

Where \(\eta\) is the learning rate — the step size. Too large → you overshoot the minimum and diverge. Too small → you converge painfully slowly. This single hyperparameter is arguably the most important in all of deep learning.
Computing the gradient over all \(N\) data points is expensive. SGD approximates the true gradient using a random subset (mini-batch) of size \(B\):

$$\nabla_\theta \mathcal{L} \approx \frac{1}{B} \sum_{i \in \text{batch}} \nabla_\theta \ell_i(\theta)$$
This is noisier but much faster, and the noise actually helps escape shallow local minima. Modern deep learning almost exclusively uses mini-batch SGD or its adaptive variants (Adam, AdamW).
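A minimal mini-batch SGD loop on synthetic 1D data (the true slope 3.0, intercept 1.0, batch size, and learning rate are all made up for this sketch):

```python
import numpy as np

rng = np.random.default_rng(0)
N, B, eta = 1000, 32, 0.1
x = rng.normal(size=N)
y = 3.0 * x + 1.0 + 0.1 * rng.normal(size=N)   # true w=3, b=1, small noise

w, b = 0.0, 0.0
for step in range(500):
    idx = rng.integers(0, N, size=B)           # random mini-batch
    xb, yb = x[idx], y[idx]
    err = w * xb + b - yb                      # prediction error on the batch
    w -= eta * 2 * np.mean(err * xb)           # noisy gradient estimate
    b -= eta * 2 * np.mean(err)

print(w, b)  # close to 3.0 and 1.0
```

Each step uses only 32 of the 1000 points, yet the noisy updates still find the minimum.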
If \(y = f(g(x))\), then:

$$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx} = f'(g(x)) \, g'(x)$$
This extends to arbitrarily long chains — which is exactly what a deep neural network is. Backpropagation is nothing more than the systematic application of the chain rule through the computational graph.
We assume a linear relationship between input features \(\mathbf{x}\) and output \(y\):

$$\hat{y} = \mathbf{w}^T \mathbf{x} + b$$
Where \(\mathbf{w} \in \mathbb{R}^d\) are weights (how much each feature matters), \(b\) is the bias (the baseline prediction when all features are 0), and \(\hat{y}\) is the prediction.
We assume each observation has Gaussian noise: \(y_i = \mathbf{w}^T \mathbf{x}_i + b + \epsilon_i\) where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\).
The likelihood of all data is:
$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i - b)^2}{2\sigma^2}\right)$$

Taking the negative log likelihood (since we minimize):

$$-\log p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \frac{1}{2\sigma^2} \sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i - b)^2 + \text{const}$$
The constant terms from the Gaussian vanish because they don't depend on \(\mathbf{w}\). MSE loss is not an arbitrary choice — it's the MLE-optimal loss when noise is Gaussian.
Setting the gradient to zero and solving (absorbing \(b\) into \(\mathbf{w}\) by adding a column of 1s to \(\mathbf{X}\)):

$$\mathbf{w}^* = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$
Derivation: Write the loss in matrix form: \(\mathcal{L} = \frac{1}{N}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})\). Expand, take the gradient w.r.t. \(\mathbf{w}\), set to zero:
$$\nabla_\mathbf{w} \mathcal{L} = -\frac{2}{N}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0 \implies \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$

For large datasets where the matrix inverse is expensive, we use gradient descent instead:

$$w_j \leftarrow w_j - \eta \, \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) \, x_{ij}$$
Read this equation: The update to weight \(w_j\) is proportional to the average of (prediction error) × (the corresponding feature value). If the model over-predicts (\(\hat{y}_i > y_i\)) and feature \(x_{ij}\) is positive, we decrease \(w_j\).
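The closed-form solution can be checked in a few lines of NumPy (the synthetic data, true weights, and noise level below are made up; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 200, 2
X = rng.normal(size=(N, d))
true_w = np.array([2.0, -1.0, 0.5])            # last entry plays the role of b
Xb = np.hstack([X, np.ones((N, 1))])           # absorb b via a column of 1s
y = Xb @ true_w + 0.01 * rng.normal(size=N)

# Normal equations: solve X^T X w = X^T y
w_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(w_hat)  # close to [2.0, -1.0, 0.5]
```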
The sigmoid function \(\sigma(z) = \frac{1}{1 + e^{-z}}\) squashes any real number into \((0, 1)\), so its output can be read as a probability. Properties: \(\sigma(0) = 0.5\), \(\lim_{z \to \infty} \sigma(z) = 1\), \(\lim_{z \to -\infty} \sigma(z) = 0\). Its derivative has a beautiful form: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) — which simplifies gradient calculations enormously.
Each label follows a Bernoulli: \(P(y_i \mid \mathbf{x}_i) = \hat{p}_i^{y_i}(1 - \hat{p}_i)^{1 - y_i}\) where \(\hat{p}_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b)\).
Negative log-likelihood:

$$\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i) \right]$$
Interpreting each term: When \(y_i = 1\), only the first term survives — it penalizes low \(\hat{p}_i\). When \(y_i = 0\), only the second term survives — it penalizes high \(\hat{p}_i\). The loss is 0 when predictions perfectly match labels.
The gradient turns out to be strikingly elegant:

$$\frac{\partial \mathcal{L}}{\partial w_j} = \frac{1}{N} \sum_{i=1}^{N} (\hat{p}_i - y_i) \, x_{ij}$$
Derivation sketch: Using \(\frac{\partial}{\partial z}[-y\log\sigma(z) - (1-y)\log(1-\sigma(z))] = \sigma(z) - y\), and then applying the chain rule \(\frac{\partial z}{\partial w_j} = x_j\).
Notice this has exactly the same form as the linear regression gradient! The error signal \((\hat{p}_i - y_i)\) gets multiplied by the input feature \(x_{ij}\). This is not a coincidence — both are members of the generalized linear model family.
Logistic regression defines a linear decision boundary: the hyperplane where \(\mathbf{w}^T\mathbf{x} + b = 0\) (i.e., \(\hat{p} = 0.5\)). On one side, \(\hat{p} > 0.5\) (predict class 1); on the other, \(\hat{p} < 0.5\) (predict class 0). The weight vector \(\mathbf{w}\) is perpendicular to this boundary and points toward the class-1 side.
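Logistic regression is short enough to train from scratch; the sketch below uses made-up 2D data that is separable by the line \(x_0 + x_1 = 0\), so the learned \(\mathbf{w}\) should point toward \((1, 1)\):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 2D data: class 1 above the line x0 + x1 = 0, class 0 below
N = 500
X = rng.normal(size=(N, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

w = np.zeros(2)
b = 0.0
eta = 0.5
for _ in range(300):
    p = sigmoid(X @ w + b)
    err = p - y                     # the (p_hat - y) error signal
    w -= eta * X.T @ err / N        # same form as the linear regression gradient
    b -= eta * err.mean()

acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
print(acc)  # high training accuracy on separable data
```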
| Loss | Formula | When to Use | Probabilistic Origin |
|---|---|---|---|
| MSE | \(\frac{1}{N}\sum(y_i - \hat{y}_i)^2\) | Regression | Gaussian likelihood |
| Binary CE | \(-\frac{1}{N}\sum[y\log\hat{p} + (1-y)\log(1-\hat{p})]\) | Binary classification | Bernoulli likelihood |
| Categorical CE | \(-\frac{1}{N}\sum\sum_{k} y_k \log \hat{p}_k\) | Multi-class | Categorical likelihood |
If you compose two linear functions, \(f(\mathbf{x}) = \mathbf{W}_2(\mathbf{W}_1 \mathbf{x}) = (\mathbf{W}_2 \mathbf{W}_1)\mathbf{x}\), you just get another linear function. Stacking linear layers without activations gives you nothing new. The activation function breaks this linearity.
A hidden layer computes \(\mathbf{h} = \phi(\mathbf{W}\mathbf{x} + \mathbf{b})\), where \(\phi\) is a nonlinear activation function. Common choices:
| Name | Formula | Derivative | Notes |
|---|---|---|---|
| Sigmoid | \(\frac{1}{1+e^{-z}}\) | \(\sigma(z)(1-\sigma(z))\) | Squashes to (0,1); causes vanishing gradients |
| Tanh | \(\frac{e^z - e^{-z}}{e^z + e^{-z}}\) | \(1 - \tanh^2(z)\) | Squashes to (-1,1); zero-centered |
| ReLU | \(\max(0, z)\) | \(\begin{cases}1 & z > 0\\0 & z \leq 0\end{cases}\) | Simple, fast; default choice for hidden layers |
| GELU | \(z \cdot \Phi(z)\) | smooth approx. | Used in transformers; smooth version of ReLU |
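All four activations fit in a few lines of NumPy; for GELU, the tanh approximation commonly used in transformer code is shown (the probe values in `z` are arbitrary):

```python
import numpy as np

def sigmoid(z): return 1 / (1 + np.exp(-z))
def tanh(z):    return np.tanh(z)
def relu(z):    return np.maximum(0, z)

# GELU via the tanh approximation common in transformer implementations
def gelu(z):
    return 0.5 * z * (1 + np.tanh(np.sqrt(2 / np.pi) * (z + 0.044715 * z**3)))

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # squashed into (0, 1)
print(relu(z))     # negatives clipped to 0
print(gelu(z))     # smooth: slightly negative for small negative inputs
```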
An \(L\)-layer network computes:

$$\mathbf{a}^{[l]} = \phi\!\left(\mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}\right), \quad l = 1, \ldots, L$$
Where \(\mathbf{a}^{[0]} = \mathbf{x}\) (input), and the final layer's activation depends on the task (sigmoid for binary, softmax for multi-class, identity for regression).
Step 1: Output layer error. For a sigmoid output with binary cross-entropy loss:

$$\boldsymbol{\delta}^{[L]} = \mathbf{a}^{[L]} - \mathbf{y}$$
(This elegant result comes from the cancellation between the BCE derivative and sigmoid derivative.)
Step 2: Output layer gradients

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[L]}} = \boldsymbol{\delta}^{[L]} \left(\mathbf{a}^{[L-1]}\right)^T, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[L]}} = \boldsymbol{\delta}^{[L]}$$
Step 3: Propagate error backward

$$\boldsymbol{\delta}^{[l]} = \left(\mathbf{W}^{[l+1]}\right)^T \boldsymbol{\delta}^{[l+1]} \odot \phi'(\mathbf{z}^{[l]})$$
Where \(\odot\) is element-wise multiplication. Each hidden neuron's error is proportional to: (1) how strongly it connects to the output, times (2) the output error, times (3) the local derivative of its activation.
Step 4: Hidden layer gradients

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} \left(\mathbf{a}^{[l-1]}\right)^T, \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[l]}} = \boldsymbol{\delta}^{[l]}$$
This recurses from the last layer to the first. The gradient for each weight is: (error at that layer) × (activation from the previous layer). This is why it's called backpropagation — the error signal propagates backward through the network.
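The four steps can be verified numerically: the sketch below runs backprop on a tiny 2-layer network (random made-up data and shapes) and checks one analytic gradient against a finite difference:

```python
import numpy as np

rng = np.random.default_rng(3)

def sigmoid(z): return 1 / (1 + np.exp(-z))

# Arbitrary toy data and a tiny 2-layer network
X = rng.normal(size=(5, 3))
y = rng.integers(0, 2, size=(5, 1)).astype(float)
W1 = rng.normal(size=(3, 4)); b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)

def loss(W1, b1, W2, b2):
    a1 = np.tanh(X @ W1 + b1)                      # hidden layer
    p = sigmoid(a1 @ W2 + b2)                      # output probability
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Backprop, following steps 1-4 from the text
a1 = np.tanh(X @ W1 + b1)
p = sigmoid(a1 @ W2 + b2)
d2 = (p - y) / len(X)              # step 1: output error (BCE + sigmoid cancel)
dW2 = a1.T @ d2                    # step 2: error x previous activation
d1 = (d2 @ W2.T) * (1 - a1**2)     # step 3: propagate back through tanh'
dW1 = X.T @ d1                     # step 4: hidden layer gradient

# Finite-difference check on one entry of W1
eps = 1e-6
W1p = W1.copy(); W1p[0, 0] += eps
num = (loss(W1p, b1, W2, b2) - loss(W1, b1, W2, b2)) / eps
print(dW1[0, 0], num)  # the two numbers should agree closely
```

Gradient checking like this is the standard way to debug a hand-written backward pass.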
A filter (kernel) \(\mathbf{K} \in \mathbb{R}^{k \times k}\) slides over the input, computing a dot product at each position:

$$(\mathbf{I} * \mathbf{K})_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} I_{i+m,\, j+n} \, K_{m,n}$$
Each term: \(I_{i+m, j+n}\) is the pixel at position \((i+m, j+n)\); \(K_{m,n}\) is the filter weight. The sum is a dot product between the local image patch and the filter.
Key insight: The same filter is applied at every spatial location (weight sharing). If a filter learns to detect vertical edges, it can detect them anywhere in the image. This is translational equivariance.
Input → [Conv → ReLU → Pool]×N → Flatten → Fully-Connected → Output
Pooling (e.g., max-pooling) reduces spatial dimensions, providing some translation invariance and reducing computation. A 2×2 max pool takes the maximum in each 2×2 block, halving both height and width.
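A naive convolution and 2×2 max-pool make both operations concrete; the image and edge-detector kernel below are made up, and (as in deep learning frameworks) the "convolution" is really cross-correlation:

```python
import numpy as np

def conv2d(image, kernel):
    # 'Valid' convolution: slide the kernel and take a dot product
    # between it and each local image patch.
    H, W = image.shape
    k = kernel.shape[0]
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+k, j:j+k] * kernel)
    return out

def maxpool2x2(x):
    # Take the max in each 2x2 block, halving height and width
    H, W = x.shape
    return x[:H//2*2, :W//2*2].reshape(H//2, 2, W//2, 2).max(axis=(1, 3))

img = np.zeros((6, 6)); img[:, 3:] = 1.0        # step edge at column 3
edge = np.array([[-1., 1.], [-1., 1.]])          # simple vertical-edge kernel
resp = conv2d(img, edge)
print(resp)                    # strong response along the edge, zero elsewhere
print(maxpool2x2(resp).shape)  # spatial dims halved
```

The same 2×2 kernel fires at every row of the edge, illustrating weight sharing and translational equivariance.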
Layer 1 filters learn low-level features: edges, color gradients, simple textures. Layer 2 combines these into mid-level features: corners, curves, small texture patterns. Deeper layers learn increasingly abstract features: object parts, shapes, eventually entire object categories. The network builds a compositional representation of visual reality.
At each time step \(t\), the hidden state \(\mathbf{h}_t\) is a function of the previous state \(\mathbf{h}_{t-1}\) and the current input \(\mathbf{x}_t\):

$$\mathbf{h}_t = \tanh(\mathbf{W}_{hh} \mathbf{h}_{t-1} + \mathbf{W}_{xh} \mathbf{x}_t + \mathbf{b})$$

The same weights \(\mathbf{W}_{hh}, \mathbf{W}_{xh}\) are shared across all time steps.
During backpropagation through time (BPTT), the gradient at time step \(t\) depends on a product of Jacobians: \(\prod_{k=t}^{T} \frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}}\). For long sequences, this product either explodes or vanishes, making it impossible to learn long-range dependencies.
The key idea: The cell state \(\mathbf{c}_t\) is an "information highway" — it flows through time with only additive updates, so gradients can flow backward without vanishing. The forget gate decides what to erase from memory; the input gate decides what new information to store.
"You shall know a word by the company it keeps" — J.R. Firth, 1957. Words that appear in similar contexts have similar meanings. "Dog" and "cat" appear near "pet," "fur," "vet." This insight is the foundation of all modern word embeddings.
Setup: Given a center word \(w_c\), predict the surrounding context words \(w_o\). Each word has two embedding vectors: \(\mathbf{v}_w\) (when it's the center word) and \(\mathbf{u}_w\) (when it's a context word).
Probability model:

$$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c})}{\sum_{w \in V} \exp(\mathbf{u}_w^T \mathbf{v}_{w_c})}$$
This is a softmax over the entire vocabulary. The numerator is the dot product between the context word's embedding and the center word's embedding — a measure of compatibility. The denominator normalizes across all possible words.
Training objective: Maximize the log-probability of observed (center, context) pairs:
$$\mathcal{L} = \sum_{t=1}^{T}\sum_{\substack{-m \leq j \leq m \\ j \neq 0}} \log P(w_{t+j} \mid w_t)$$

Negative Sampling approximation: Computing the full softmax over 100K+ words is prohibitive. Instead, for each positive pair, we sample \(k\) "negative" words (random words unlikely to be in the context) and train a binary classifier:

$$\log \sigma(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)} \left[ \log \sigma(-\mathbf{u}_{w_i}^T \mathbf{v}_{w_c}) \right]$$
This says: maximize the dot product with the true context word (positive pair) while minimizing the dot product with random words (negative pairs). The result: words appearing in similar contexts get pulled together in vector space.
Key insight: Let \(X_{ij}\) be the count of how often word \(j\) appears in the context of word \(i\). GloVe optimizes:

$$\mathcal{L} = \sum_{i,j} f(X_{ij}) \left( \mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2$$
Where \(f(X_{ij})\) is a weighting function that prevents very common co-occurrences from dominating. The objective says: the dot product of two word vectors should approximate the log of how often they co-occur.
Modern LLMs don't embed whole words — they use subword tokens. Byte Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent pairs: "l o w e r" → "lo w e r" → "low e r" → "low er" → "lower". This handles unseen words by decomposing them into known subwords, and reduces vocabulary size while maintaining expressiveness.
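The merge loop at the heart of BPE is short; the sketch below runs two merge steps on a tiny made-up corpus with word counts:

```python
from collections import Counter

def most_frequent_pair(words):
    # words: dict mapping a tokenized word (tuple of symbols) to its count
    pairs = Counter()
    for word, freq in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    # Replace every occurrence of the chosen adjacent pair with one symbol
    merged = {}
    for word, freq in words.items():
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1]); i += 2
            else:
                out.append(word[i]); i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged

# Tiny made-up corpus: start from characters, merge the most frequent pair
words = {tuple("lower"): 2, tuple("lowest"): 1, tuple("low"): 5}
for _ in range(2):
    words = merge(words, most_frequent_pair(words))
print(words)  # 'l o w' has been merged into the single token 'low'
```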
This is the section that makes everything else click. The transformer architecture, built entirely on the attention mechanism, is the foundation of all modern language models.
Think of attention as a soft dictionary lookup. You have a query (what you're looking for), a set of keys (labels in the dictionary), and values (the stored information). The query is compared to all keys to compute relevance scores, and the output is a weighted sum of values.
Step 1: Measure similarity. Given a query \(\mathbf{q}\) and a key \(\mathbf{k}\), their similarity is the dot product \(\mathbf{q}^T \mathbf{k}\). This measures how aligned the query is with the key in the embedding space.
Step 2: Scale. The dot product grows with dimensionality \(d_k\), which pushes softmax into saturated regions with tiny gradients. We divide by \(\sqrt{d_k}\):
$$\text{score}(\mathbf{q}, \mathbf{k}) = \frac{\mathbf{q}^T \mathbf{k}}{\sqrt{d_k}}$$

Step 3: Normalize to get attention weights.
$$\alpha_i = \frac{\exp(\text{score}(\mathbf{q}, \mathbf{k}_i))}{\sum_j \exp(\text{score}(\mathbf{q}, \mathbf{k}_j))}$$

Step 4: Compute weighted output.
$$\text{output} = \sum_i \alpha_i \mathbf{v}_i$$

Putting it all together in matrix form:

$$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$
Where \(\mathbf{Q} \in \mathbb{R}^{n \times d_k}\), \(\mathbf{K} \in \mathbb{R}^{m \times d_k}\), \(\mathbf{V} \in \mathbb{R}^{m \times d_v}\). The output is \(\mathbb{R}^{n \times d_v}\).
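The four steps fit in a few lines of NumPy; the toy Q, K, V values below are made up so that the query matches the first key almost exactly:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)           # (n, m) similarity matrix
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V, weights

Q = np.array([[10., 0.]])
K = np.array([[10., 0.], [0., 10.]])
V = np.array([[1., 2.], [3., 4.]])
out, w = attention(Q, K, V)
print(w)    # nearly all weight on the first key
print(out)  # output is close to the first value, [1, 2]
```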
In self-attention, the queries, keys, and values all come from the same sequence. Each token attends to all other tokens (including itself) in the sequence:

$$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \qquad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \qquad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$
Where \(\mathbf{X} \in \mathbb{R}^{n \times d}\) is the sequence of input embeddings, and \(\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d \times d_k}\), \(\mathbf{W}^V \in \mathbb{R}^{d \times d_v}\) are learned projection matrices.
What does self-attention compute? For each token, it asks "which other tokens in this sequence are relevant to understanding me?" and creates a new representation that blends information from those relevant tokens. In "The animal didn't cross the street because it was too wide," self-attention at the word "it" can learn to assign high weight to "street" (the thing that's wide), resolving the reference.
If the model dimension is \(d = 512\) and we use \(h = 8\) heads, each head uses \(d_k = d_v = d/h = 64\). The total computation is similar to single-head attention with full dimensionality, but we get \(h\) different attention patterns.
Each encoder layer applies two sub-layers with residual connections and layer normalization:

$$\mathbf{x}' = \text{LayerNorm}(\mathbf{x} + \text{MultiHeadAttention}(\mathbf{x})), \qquad \mathbf{x}'' = \text{LayerNorm}(\mathbf{x}' + \text{FFN}(\mathbf{x}'))$$
Where the feed-forward network (FFN) is applied position-wise:
$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \, \phi(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Typically \(\mathbf{W}_1 \in \mathbb{R}^{4d \times d}\), \(\mathbf{W}_2 \in \mathbb{R}^{d \times 4d}\) — the FFN expands to 4× the model dimension and then projects back. This is where "knowledge" is stored (the attention layers route information; the FFN layers process it).
Self-attention is permutation-invariant — it doesn't know word order. Positional encodings inject position information:

$$PE_{(pos,\, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \qquad PE_{(pos,\, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$
Each dimension oscillates at a different frequency — low dimensions change slowly (encoding coarse position), high dimensions change rapidly (encoding fine position). This allows the model to attend to relative positions.
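The two sinusoid formulas vectorize cleanly (the sequence length 50 and dimension 16 below are arbitrary):

```python
import numpy as np

def positional_encoding(n_pos, d):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    pos = np.arange(n_pos)[:, None]
    i = np.arange(d // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d))
    pe = np.zeros((n_pos, d))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe

pe = positional_encoding(50, 16)
print(pe.shape)  # (50, 16)
print(pe[0])     # position 0: all sin terms are 0, all cos terms are 1
```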
The decoder adds causal (masked) self-attention — each position can only attend to previous positions, preventing the model from "seeing the future":

$$\text{MaskedAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$
Where \(M_{ij} = 0\) if \(i \geq j\), and \(M_{ij} = -\infty\) otherwise. The \(-\infty\) values become 0 after softmax, effectively masking future positions.
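The masking effect is easy to see with all-equal scores (a made-up 4-token example):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

n = 4
scores = np.zeros((n, n))   # pretend all attention scores are equal
# M[i, j] = 0 where i >= j (past and self), -inf where j > i (future)
mask = np.where(np.tril(np.ones((n, n))) == 1, 0.0, -np.inf)
weights = softmax(scores + mask, axis=-1)
print(weights)
# Row i spreads weight uniformly over positions 0..i and gives
# exactly zero weight to future positions.
```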
A full transformer for language modeling: Input tokens → Embedding + Positional Encoding → N × Decoder Blocks → Linear projection → Softmax over vocabulary → Predicted next token.
GPT is a decoder-only transformer. Its single training objective is strikingly simple:

$$\mathcal{L} = -\sum_{t} \log P(x_t \mid x_1, \ldots, x_{t-1})$$
For each position \(t\), the model predicts the probability distribution over the entire vocabulary for the next token, given all previous tokens. The loss is the cross-entropy between this prediction and the actual next token.
How this single objective produces intelligence: To predict the next word well, the model must learn grammar, facts, reasoning, common sense, style, and much more. The prediction task is a "universal" task that requires understanding language at every level.
The transformer's output at position \(t\) is a vector \(\mathbf{h}_t \in \mathbb{R}^d\). We project it to vocabulary size and apply softmax:

$$P(x_{t+1} = w \mid x_{\leq t}) = \frac{\exp(\mathbf{e}_w^T \mathbf{h}_t)}{\sum_{w'} \exp(\mathbf{e}_{w'}^T \mathbf{h}_t)}$$
Where \(\mathbf{e}_w\) is the embedding of token \(w\). Often the output embedding matrix is tied (shared) with the input embedding matrix.
The cross-entropy loss follows a power law in model size:

$$L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}$$
Where \(N\) is the number of parameters and \(\alpha_N \approx 0.076\). Similar power laws hold for dataset size and compute. The Chinchilla finding: the compute-optimal ratio is roughly 20 tokens per parameter — a 10B parameter model should be trained on ~200B tokens.
RMSNorm normalizes by the root mean square of the activations:

$$\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2 + \epsilon}} \odot \mathbf{g}$$

Simpler and faster than LayerNorm (no mean subtraction). Used in LLaMA and most modern open-weight LLMs.
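A minimal RMSNorm in NumPy (the input vector and unit gain are made up; after normalization the output's RMS is 1):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    # Divide by the root-mean-square of the features, then scale by a
    # learned gain. Unlike LayerNorm, no mean is subtracted.
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gain * x / rms

x = np.array([[3.0, 4.0]])
g = np.ones(2)
out = rms_norm(x, g)
print(out)  # RMS of [3, 4] is ~3.536, so the output is ~[0.849, 1.131]
```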
Instead of adding positional information, RoPE encodes relative position directly into the attention computation by rotating query and key vectors:

$$\mathbf{q}_m = \mathbf{R}_m \mathbf{q}, \qquad \mathbf{k}_n = \mathbf{R}_n \mathbf{k}, \qquad \mathbf{q}_m^T \mathbf{k}_n = \mathbf{q}^T \mathbf{R}_{n-m} \mathbf{k}$$
Where \(\mathbf{R}_m\) is a rotation matrix that depends on position \(m\). The dot product then depends on the relative position \(m - n\), which is more natural for language understanding.
SwiGLU is a gated FFN variant:

$$\text{SwiGLU}(\mathbf{x}) = \mathbf{W}_2\left(\text{Swish}(\mathbf{W}_1 \mathbf{x}) \odot \mathbf{W}_3 \mathbf{x}\right), \qquad \text{Swish}(z) = z \cdot \sigma(z)$$

It outperforms plain ReLU FFNs in practice and is used in LLaMA, PaLM, and most recent models.
Standard multi-head attention uses separate K and V projections per head, which is expensive during inference (the KV cache for all heads must be stored). GQA groups heads so that multiple query heads share the same K and V projections, reducing KV cache by a factor of the group size without significant quality loss.
Each MoE layer replaces the dense FFN with a set of experts and a router:

$$\mathbf{y} = \sum_{i \in \text{TopK}} g_i(\mathbf{x}) \, E_i(\mathbf{x})$$

Where \(E_i\) are expert FFN networks, and \(\mathbf{W}_g\) is the router/gating network. For each token, only the top-K experts (e.g., K = 2 out of 64) are activated. This means a model with 236B total parameters might only use ~21B parameters per token.
Fine-grained experts: Instead of 8 large experts, use 64+ smaller experts — more flexible routing. Shared experts: Some experts are always active (capturing common knowledge), while routed experts specialize. Auxiliary load-balancing loss: Without regularization, the router can "collapse" and send all tokens to a few experts. An auxiliary loss encourages balanced expert utilization:
$$\mathcal{L}_{\text{balance}} = \alpha \sum_{i=1}^{N} f_i \cdot P_i$$

Where \(f_i\) is the fraction of tokens routed to expert \(i\) and \(P_i\) is the average routing probability to expert \(i\).
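Top-K routing can be sketched in NumPy; the expert count, dimensions, and linear "experts" below are all toy stand-ins for real FFN experts:

```python
import numpy as np

rng = np.random.default_rng(4)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

n_experts, top_k, d = 8, 2, 16
Wg = rng.normal(size=(d, n_experts))                           # router weights
experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]  # toy linear "experts"

def moe_layer(x):
    # Route the token to its top-K experts, weighted by router probability
    logits = x @ Wg
    top = np.argsort(logits)[-top_k:]      # indices of the K highest-scoring experts
    gates = softmax(logits[top])           # renormalize over the selected experts
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.normal(size=d)                     # one token embedding
y = moe_layer(x)
print(y.shape)  # same dimension as the input; only 2 of 8 experts ran
```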
The training pipeline has three stages:
Stage 1: Supervised Fine-Tuning (SFT). Fine-tune the pretrained model on high-quality (prompt, response) examples written by humans.
Stage 2: Reward Model Training. Collect human preferences: given a prompt and two responses, which is better? Train a reward model \(R_\phi(x, y)\) to predict these preferences using the Bradley-Terry model:

$$P(y_w \succ y_l \mid x) = \sigma\!\left(R_\phi(x, y_w) - R_\phi(x, y_l)\right)$$
Stage 3: RL Optimization. Use PPO (Proximal Policy Optimization) to maximize the reward while staying close to the SFT policy:

$$\max_\theta \; \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ R_\phi(x, y) \right] - \beta \, D_{\text{KL}}\!\left(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{ref}}(\cdot \mid x)\right)$$
The KL divergence term prevents the model from diverging too far from the reference policy (which would lead to reward hacking — finding outputs that score high on the reward model but are actually nonsensical).
A simpler alternative that skips the reward model entirely. Starting from the RLHF objective's closed-form solution, DPO directly optimizes:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\!\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]$$
Where \(y_w\) is the preferred response and \(y_l\) is the rejected response.
Consider binary classification of images (human vs. non-human). Each image is a point in \(\mathbb{R}^{784}\) (for 28×28 pixels). These points form complex, interleaved clusters. A deep network learns a function \(f: \mathbb{R}^{784} \to \mathbb{R}^{d}\) that maps these points to a new space where they are linearly separable.
Each layer performs two operations:
1. Affine transformation (\(\mathbf{Wx} + \mathbf{b}\)): rotates, scales, and shifts the data.
2. Nonlinear activation (\(\phi\)): folds and warps the space (ReLU "folds" by zeroing negative regions).
A classifier estimates the posterior probability \(P(y = k \mid \mathbf{x})\). By Bayes' theorem:
$$P(y = k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y = k) \, P(y = k)}{P(\mathbf{x})}$$

The model learns the class-conditional density \(P(\mathbf{x} \mid y = k)\) — the "pattern" of what class-\(k\) data looks like — and the prior \(P(y = k)\). Classification becomes: which class makes the observed data most likely?
Modern NLP models don't use explicit class-conditional densities. Instead, they operate in learned embedding spaces where patterns are encoded as proximity.
Distance metrics and their roles:
| Metric | Formula | Use |
|---|---|---|
| Euclidean | \(\|\mathbf{a} - \mathbf{b}\|_2\) | K-nearest neighbors, clustering |
| Cosine similarity | \(\frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}\) | Word embeddings, sentence similarity, retrieval |
| Dot product | \(\mathbf{a}^T\mathbf{b}\) | Attention scores (unnormalized similarity) |
| Mahalanobis | \(\sqrt{(\mathbf{a}-\mathbf{b})^T\Sigma^{-1}(\mathbf{a}-\mathbf{b})}\) | Distance accounting for correlations |
In a transformer, the attention mechanism computes \(\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\) — this is a similarity matrix between every pair of tokens. High similarity means "these tokens are relevant to each other." The model learns the projection matrices \(\mathbf{W}^Q, \mathbf{W}^K\) such that semantically or syntactically related tokens have high dot-product similarity after projection. This is how a transformer discovers patterns like subject-verb agreement, coreference, and semantic relationships — entirely from the training signal of next-token prediction.
Recent interpretability research has revealed that transformer layers develop a hierarchy of increasingly abstract representations:
Early layers: encode surface-level patterns — token identity, local syntax, simple co-occurrence statistics.
Middle layers: encode semantic relationships — word meanings in context, entity types, relational knowledge ("Paris is-capital-of France").
Late layers: encode task-specific features — next-token predictions draw on abstract reasoning, world knowledge, and contextual understanding built by earlier layers.
| # | Paper | Year | Why It Matters |
|---|---|---|---|
| 1 | Rumelhart, Hinton & Williams — Backpropagation | 1986 | Made neural network training practical |
| 2 | LeCun et al. — LeNet / CNNs | 1998 | Proved deep learning works for vision |
| 3 | Hochreiter & Schmidhuber — LSTM | 1997 | Solved vanishing gradient for sequences |
| 4 | Mikolov et al. — Word2Vec | 2013 | Efficient word embeddings; word arithmetic |
| 5 | Pennington et al. — GloVe | 2014 | Global co-occurrence matrix factorization |
| 6 | Bahdanau et al. — Attention for NMT | 2015 | Introduced attention for seq2seq |
| 7 | Vaswani et al. — Transformer | 2017 | Replaced RNNs entirely; the foundation of LLMs |
| 8 | Devlin et al. — BERT | 2019 | Bidirectional pretraining; dominated NLU benchmarks |
| 9 | Radford et al. — GPT-2 | 2019 | Showed language modeling scales to generation quality |
| 10 | Brown et al. — GPT-3 | 2020 | In-context learning and few-shot capabilities at scale |
| 11 | Kaplan et al. — Scaling Laws | 2020 | Power-law relationships for compute-optimal training |
| 12 | Hoffmann et al. — Chinchilla | 2022 | Compute-optimal: more data, smaller model |
| 13 | Ouyang et al. — InstructGPT / RLHF | 2022 | Aligning models with human preferences |
| 14 | Touvron et al. — LLaMA | 2023 | Open-weight models competitive with proprietary |
| 15 | Rafailov et al. — DPO | 2023 | Simplified alignment without RL |
| 16 | DeepSeek Team — DeepSeek-V3 | 2024 | MoE with fine-grained routing; open-weight frontier model |
Weeks 1–2: Master Parts I–II. Implement linear and logistic regression from scratch (just NumPy). Derive every gradient by hand.
Weeks 3–4: Master Part III. Implement a 2-layer neural network from scratch with backpropagation. Train it on MNIST.
Weeks 5–6: Study Part IV. Implement a simple CNN in PyTorch. Implement a character-level RNN language model.
Weeks 7–8: Study Parts V–VI. Read the Word2Vec and Transformer papers. Implement self-attention from scratch. Then implement a full transformer decoder in PyTorch.
Weeks 9–12: Study Part VII. Read GPT-2 and GPT-3 papers. Fine-tune a small LLM. Study MoE and RLHF architectures. Read DeepSeek papers.
Ongoing: Follow ArXiv for new developments. Join the Hugging Face community. Read interpretability papers (Olah, Elhage et al.) to understand what models learn, not just how they're trained.
Karpathy's "Neural Networks: Zero to Hero" — YouTube series that builds GPT from scratch. Karpathy's nanoGPT — minimal GPT implementation in ~600 lines of PyTorch. Hugging Face Transformers — production-grade implementations of every major architecture. The Annotated Transformer (Rush, 2018) — line-by-line walkthrough of the original transformer code.