A Complete Mathematical Journey

Machine Learning from First Principles
to Modern NLP

From vectors and probabilities through neural networks to transformers and large language models.
Every concept built from intuition → derivation → example → application.

Contents

  1. Mathematical Foundations
    • Linear Algebra Essentials · Probability & Statistics · Optimization Basics
  2. Core Machine Learning
    • Linear Regression · Logistic Regression · Loss Functions · Gradient Descent
  3. Neural Networks
    • The Perceptron · Multi-Layer Networks · Backpropagation Derivation · Activation Functions
  4. Deep Learning Architectures
    • CNNs · RNNs & LSTMs · Sequence-to-Sequence Models
  5. NLP Foundations
    • Word Embeddings · Word2Vec Derivation · GloVe · Subword Tokenization
  6. Attention & Transformers
    • Attention Mechanism · Self-Attention · Multi-Head Attention · Full Transformer Architecture
  7. Modern Large Language Models
    • GPT Architecture · Training Objectives · Scaling Laws · MoE & DeepSeek · RLHF
  8. Pattern Detection — A Unified View
    • Geometric Interpretation · Probabilistic Interpretation · How Models "See" Patterns
  9. Annotated Reading List & Roadmap
Part I

Mathematical Foundations

We only cover the math you actually need. Three pillars support all of machine learning: linear algebra gives us the language to describe data and transformations; probability gives us the framework to reason under uncertainty; optimization gives us the engine to learn.

1.1   Linear Algebra Essentials

Think of data as points in space. A single house can be described by its square footage and number of bedrooms — that's a point in 2D space. An image of 784 pixels lives in 784-dimensional space. All of ML is about finding structure (lines, planes, curved surfaces) in these high-dimensional spaces.

Vectors

A vector \(\mathbf{x} \in \mathbb{R}^n\) is an ordered list of \(n\) numbers. It represents a single data point. For example, a house with 1500 sqft and 3 bedrooms is \(\mathbf{x} = [1500, 3]^T\).

Dot product — the most important operation in ML

The dot product of two vectors \(\mathbf{a}, \mathbf{b} \in \mathbb{R}^n\) is:

Definition — Dot Product $$\mathbf{a} \cdot \mathbf{b} = \sum_{i=1}^{n} a_i b_i = \|\mathbf{a}\| \|\mathbf{b}\| \cos\theta$$

The first form is algebraic: multiply corresponding elements and sum. The second is geometric: the product of their lengths scaled by the cosine of the angle between them.

Why this matters everywhere in ML: The dot product measures similarity. When \(\theta = 0\) (same direction), cosine is 1 and the dot product is maximized. When \(\theta = 90°\) (orthogonal), the dot product is 0. When \(\theta = 180°\) (opposite), it's negative. Attention mechanisms, linear regression, neural network layers — all are built on dot products.

Let \(\mathbf{a} = [1, 2, 3]\), \(\mathbf{b} = [4, 5, 6]\).
\(\mathbf{a} \cdot \mathbf{b} = 1 \times 4 + 2 \times 5 + 3 \times 6 = 4 + 10 + 18 = 32\).
\(\|\mathbf{a}\| = \sqrt{1+4+9} = \sqrt{14} \approx 3.74\),   \(\|\mathbf{b}\| = \sqrt{16+25+36} = \sqrt{77} \approx 8.77\).
\(\cos\theta = 32 / (3.74 \times 8.77) \approx 0.975\) — the vectors are very aligned.
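
These numbers are easy to check in code; a minimal NumPy sketch of the example above:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dot = a @ b                                          # algebraic form: 32.0
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # geometric form: cos(theta)
print(dot, round(cos, 3))                            # 32.0 0.975
```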

Matrices and Linear Transformations

A matrix \(\mathbf{W} \in \mathbb{R}^{m \times n}\) transforms a vector from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). Every layer of a neural network does this: \(\mathbf{y} = \mathbf{W}\mathbf{x} + \mathbf{b}\).

Matrix-Vector Product $$y_i = \sum_{j=1}^{n} W_{ij} x_j + b_i \qquad \text{for } i = 1, \ldots, m$$

Each row of \(\mathbf{W}\) is a "detector" — its dot product with \(\mathbf{x}\) measures how much \(\mathbf{x}\) matches that pattern. A neural network learns what patterns to detect by adjusting \(\mathbf{W}\).
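
The "row as detector" view can be made concrete: below, the first row of \(\mathbf{W}\) detects the pattern "feature 1 high, feature 2 low" (a toy sketch with invented numbers):

```python
import numpy as np

W = np.array([[ 1.0, -1.0],    # detector for "x1 high, x2 low"
              [ 0.0,  1.0]])   # detector for "x2 high"
b = np.zeros(2)

x = np.array([3.0, 0.5])
y = W @ x + b                  # each y_i is a dot product: row_i . x + b_i
print(y)                       # [2.5 0.5] — the first detector fires strongly
```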

Key operations summary

| Operation | Notation | ML Role |
| --- | --- | --- |
| Dot product | \(\mathbf{a}^T \mathbf{b}\) | Similarity, attention scores, linear predictions |
| Matrix multiply | \(\mathbf{W}\mathbf{x}\) | Linear transformations / neural network layers |
| Transpose | \(\mathbf{A}^T\) | Switching rows ↔ columns, computing \(\mathbf{X}^T\mathbf{X}\) |
| Norm | \(\lVert\mathbf{x}\rVert = \sqrt{\sum x_i^2}\) | Regularization, normalization, distance |
| Outer product | \(\mathbf{a}\mathbf{b}^T\) | Rank-1 updates, attention value computation |

Exercise.
(a) Given \(\mathbf{w} = [0.5, -1.0]\), \(\mathbf{x}_1 = [4, 2]\), \(\mathbf{x}_2 = [1, 5]\). Compute \(\mathbf{w}^T\mathbf{x}_1\) and \(\mathbf{w}^T\mathbf{x}_2\). Which data point does the weight vector "prefer"?
(b) Compute the cosine similarity between \(\mathbf{x}_1\) and \(\mathbf{x}_2\). Are they similar or dissimilar?

1.2   Probability & Statistics

ML models live in a world of uncertainty. A spam classifier doesn't say "this IS spam" — it says "the probability of spam is 0.97." Probability gives us the rigorous framework to make and update beliefs under uncertainty.

Bayes' Theorem — The Engine of Inference

Bayes' Theorem $$P(H \mid D) = \frac{P(D \mid H) \, P(H)}{P(D)}$$

Each term has a name and role:

| Term | Name | Meaning |
| --- | --- | --- |
| \(P(H \mid D)\) | Posterior | Updated belief about hypothesis \(H\) after seeing data \(D\) |
| \(P(D \mid H)\) | Likelihood | How probable the data is if hypothesis \(H\) is true |
| \(P(H)\) | Prior | Initial belief before seeing data |
| \(P(D)\) | Evidence | Total probability of the data (normalizer) |

Example. A medical test for a disease is 99% accurate (both sensitivity and specificity). The disease prevalence is 1%. You test positive — what's the probability you actually have the disease?

\(P(\text{disease} \mid +) = \frac{P(+ \mid \text{disease}) \cdot P(\text{disease})}{P(+)} = \frac{0.99 \times 0.01}{0.99 \times 0.01 + 0.01 \times 0.99} = \frac{0.0099}{0.0198} = 0.5\)

Only 50%! The low prevalence (prior) dramatically affects the posterior. This is why ML models need to account for class imbalance.
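
A small helper function (the name is ours, for illustration) makes the base-rate effect easy to explore:

```python
def posterior_given_positive(sensitivity, specificity, prevalence):
    """P(disease | positive test) via Bayes' theorem."""
    # P(+) = P(+|disease) P(disease) + P(+|healthy) P(healthy)
    evidence = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
    return sensitivity * prevalence / evidence

print(posterior_given_positive(0.99, 0.99, 0.01))   # 0.5 — the worked example
print(posterior_given_positive(0.99, 0.99, 0.10))   # a 10% prior lifts the posterior to ~0.92
```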

Key Distributions

Gaussian (Normal): \(p(x) = \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)\). Maximizing the likelihood of Gaussian data leads directly to minimizing squared error — which is why least squares regression is so fundamental.

Bernoulli: \(P(y=1) = p\), \(P(y=0) = 1-p\). The basis of binary classification. Maximizing Bernoulli likelihood leads to the cross-entropy loss.

Categorical / Softmax: Generalizes Bernoulli to \(K\) classes. \(P(y = k) = \frac{e^{z_k}}{\sum_{j=1}^{K} e^{z_j}}\). This is the softmax function — it appears in logistic regression, neural network outputs, and attention mechanisms.

Maximum Likelihood Estimation (MLE)

Given data \(\mathcal{D} = \{x_1, \ldots, x_N\}\), find the parameters \(\theta\) that maximize the probability of observing this data:

MLE Principle $$\hat{\theta}_{\text{MLE}} = \arg\max_\theta \prod_{i=1}^{N} p(x_i \mid \theta) = \arg\max_\theta \sum_{i=1}^{N} \log p(x_i \mid \theta)$$

We take the log because: (1) products become sums (easier to work with), (2) log is monotonic so the maximum doesn't change, (3) it avoids numerical underflow from multiplying many small probabilities.

MLE is the unifying principle behind almost all ML training. Linear regression minimizes squared error (= MLE under Gaussian noise). Logistic regression minimizes cross-entropy (= MLE under Bernoulli). Neural network training with cross-entropy loss is also MLE. When you understand MLE, you understand why we use these specific loss functions.
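
The MLE principle can be seen numerically: for Gaussian data with known variance, the log-likelihood peaks at the sample mean. A small sketch with made-up data:

```python
import numpy as np

data = np.array([2.1, 1.9, 2.4, 2.0, 1.6])

def log_likelihood(mu, x, sigma=1.0):
    # Sum of log Gaussian densities — the log turns the product into a sum
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

# Scan candidate means: the maximizer coincides with the sample mean,
# which is the closed-form Gaussian MLE.
grid = np.linspace(0, 4, 4001)
best = grid[np.argmax([log_likelihood(m, data) for m in grid])]
print(best, data.mean())   # both ≈ 2.0
```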

1.3   Optimization Basics

Imagine you're blindfolded on a hilly landscape, and you want to find the lowest valley. You can feel the slope under your feet. The strategy: always step in the direction that goes most steeply downhill. That's gradient descent.

Gradient — the Direction of Steepest Ascent

For a function \(f(\theta_1, \theta_2, \ldots, \theta_d)\), the gradient is the vector of all partial derivatives:

Gradient $$\nabla_\theta f = \left[\frac{\partial f}{\partial \theta_1}, \frac{\partial f}{\partial \theta_2}, \ldots, \frac{\partial f}{\partial \theta_d}\right]^T$$

The gradient points in the direction of steepest increase. To minimize a loss function, we go in the opposite direction.

Gradient Descent Update Rule

Update Rule $$\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_\theta \mathcal{L}(\theta^{(t)})$$

Where \(\eta\) is the learning rate — the step size. Too large → you overshoot the minimum and diverge. Too small → you converge painfully slowly. This single hyperparameter is arguably the most important in all of deep learning.
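
The learning-rate trade-off is easy to demonstrate on a toy quadratic, \(f(\theta) = (\theta - 3)^2\) (a sketch of the update rule above):

```python
def gradient_descent(grad, theta, eta, steps):
    for _ in range(steps):
        theta = theta - eta * grad(theta)   # the update rule above
    return theta

# f(theta) = (theta - 3)^2 has gradient 2*(theta - 3) and minimum at theta = 3
grad = lambda t: 2.0 * (t - 3.0)
theta_small = gradient_descent(grad, 0.0, eta=0.1, steps=100)   # converges to ~3
theta_large = gradient_descent(grad, 0.0, eta=1.5, steps=5)     # overshoots and diverges
print(theta_small, theta_large)   # ≈ 3.0 and 99.0
```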

Stochastic Gradient Descent (SGD)

Computing the gradient over all \(N\) data points is expensive. SGD approximates the true gradient using a random subset (mini-batch) of size \(B\):

$$\nabla_\theta \mathcal{L} \approx \frac{1}{B} \sum_{i \in \text{batch}} \nabla_\theta \mathcal{L}_i(\theta)$$

This is noisier but much faster, and the noise actually helps escape shallow local minima. Modern deep learning almost exclusively uses mini-batch SGD or its adaptive variants (Adam, AdamW).

The Chain Rule — Foundation of Backpropagation

If \(y = f(g(x))\), then:

Chain Rule $$\frac{dy}{dx} = \frac{dy}{dg} \cdot \frac{dg}{dx}$$

This extends to arbitrarily long chains — which is exactly what a deep neural network is. Backpropagation is nothing more than the systematic application of the chain rule through the computational graph.

Exercise.
(a) Let \(f(\theta) = (\theta - 3)^2\). Compute \(\nabla_\theta f\). Starting at \(\theta_0 = 0\), perform 3 steps of gradient descent with \(\eta = 0.1\). Where do you end up?
(b) Let \(y = \sigma(w \cdot x + b)\) where \(\sigma(z) = 1/(1 + e^{-z})\). Using the chain rule, compute \(\frac{\partial y}{\partial w}\).
· · ·
Part II

Core Machine Learning

2.1   Linear Regression — Full Derivation

You're a real-estate agent. You notice that bigger houses cost more. You want to draw the "best fitting line" through the data so you can predict the price of any new house. Linear regression finds that line by minimizing the total prediction error.

The Model

We assume a linear relationship between input features \(\mathbf{x}\) and output \(y\):

Linear Model $$\hat{y} = \mathbf{w}^T \mathbf{x} + b = \sum_{j=1}^{d} w_j x_j + b$$

Where \(\mathbf{w} \in \mathbb{R}^d\) are weights (how much each feature matters), \(b\) is the bias (the baseline prediction when all features are 0), and \(\hat{y}\) is the prediction.

The Loss Function — Why Squared Error?

We assume each observation has Gaussian noise: \(y_i = \mathbf{w}^T \mathbf{x}_i + b + \epsilon_i\) where \(\epsilon_i \sim \mathcal{N}(0, \sigma^2)\).

The likelihood of all data is:

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}) = \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\sigma} \exp\!\left(-\frac{(y_i - \mathbf{w}^T \mathbf{x}_i - b)^2}{2\sigma^2}\right)$$

Taking the negative log likelihood (since we minimize):

Mean Squared Error (from MLE) $$\mathcal{L}(\mathbf{w}, b) = \frac{1}{N}\sum_{i=1}^{N} (y_i - \hat{y}_i)^2 = \frac{1}{N}\sum_{i=1}^{N} (y_i - \mathbf{w}^T \mathbf{x}_i - b)^2$$

The constant terms from the Gaussian vanish because they don't depend on \(\mathbf{w}\). MSE loss is not an arbitrary choice — it's the MLE-optimal loss when noise is Gaussian.

Closed-Form Solution (Normal Equation)

Setting the gradient to zero and solving (absorbing \(b\) into \(\mathbf{w}\) by adding a column of 1s to \(\mathbf{X}\)):

Normal Equation $$\hat{\mathbf{w}} = (\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$$

Derivation: Write the loss in matrix form: \(\mathcal{L} = \frac{1}{N}(\mathbf{y} - \mathbf{X}\mathbf{w})^T(\mathbf{y} - \mathbf{X}\mathbf{w})\). Expand, take the gradient w.r.t. \(\mathbf{w}\), set to zero:

$$\nabla_\mathbf{w} \mathcal{L} = -\frac{2}{N}\mathbf{X}^T(\mathbf{y} - \mathbf{X}\mathbf{w}) = 0 \implies \mathbf{X}^T\mathbf{X}\mathbf{w} = \mathbf{X}^T\mathbf{y}$$

Gradient Descent Solution

For large datasets where the matrix inverse is expensive, we use gradient descent instead:

Gradient Update for Linear Regression $$w_j \leftarrow w_j - \eta \cdot \frac{2}{N} \sum_{i=1}^{N} (\hat{y}_i - y_i) \cdot x_{ij}$$

Read this equation: The update to weight \(w_j\) is proportional to the average of (prediction error) × (the corresponding feature value). If the model over-predicts (\(\hat{y}_i > y_i\)) and feature \(x_{ij}\) is positive, we decrease \(w_j\).

Worked example. Data: \((x=1, y=2)\), \((x=2, y=4)\), \((x=3, y=5)\). Model: \(\hat{y} = wx + b\).

Normal equation (with \(\mathbf{X}\) augmented with 1s column):
\(\mathbf{X} = \begin{bmatrix} 1 & 1 \\ 1 & 2 \\ 1 & 3\end{bmatrix}\),   \(\mathbf{y} = \begin{bmatrix}2\\4\\5\end{bmatrix}\)

\(\mathbf{X}^T\mathbf{X} = \begin{bmatrix}3 & 6\\6 & 14\end{bmatrix}\),   \(\mathbf{X}^T\mathbf{y} = \begin{bmatrix}11\\25\end{bmatrix}\)

\(\hat{\mathbf{w}} = \begin{bmatrix}3 & 6\\6 & 14\end{bmatrix}^{-1}\begin{bmatrix}11\\25\end{bmatrix} = \frac{1}{6}\begin{bmatrix}14 & -6\\-6 & 3\end{bmatrix}\begin{bmatrix}11\\25\end{bmatrix} = \begin{bmatrix}2/3 \\ 3/2\end{bmatrix}\)

So \(b \approx 0.67\), \(w = 1.5\). The best line is \(\hat{y} = 1.5x + 0.67\).
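
As a sanity check, the normal equation can be evaluated directly with NumPy:

```python
import numpy as np

X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])   # column of 1s absorbs the bias
y = np.array([2.0, 4.0, 5.0])

# Solve X^T X w = X^T y rather than forming the inverse explicitly
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)   # ≈ [0.667 1.5] → b ≈ 0.67, w = 1.5
```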

2.2   Logistic Regression — Full Derivation

Now you're classifying emails as spam or not-spam. The output should be a probability between 0 and 1. A linear function \(\mathbf{w}^T\mathbf{x}\) can produce any real number, so we squeeze it through the sigmoid function to get a probability.

The Sigmoid Function

Sigmoid $$\sigma(z) = \frac{1}{1 + e^{-z}} = \frac{e^z}{1 + e^z}$$

Properties: \(\sigma(0) = 0.5\), \(\lim_{z \to \infty} \sigma(z) = 1\), \(\lim_{z \to -\infty} \sigma(z) = 0\). Its derivative has a beautiful form: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) — which simplifies gradient calculations enormously.
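
The derivative identity can be verified numerically against finite differences (a quick sketch):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = np.linspace(-5.0, 5.0, 11)
analytic = sigmoid(z) * (1.0 - sigmoid(z))                   # sigma'(z) identity
numeric = (sigmoid(z + 1e-6) - sigmoid(z - 1e-6)) / 2e-6     # central differences
print(np.max(np.abs(analytic - numeric)))                    # tiny — the identity holds
```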

The Model

$$P(y = 1 \mid \mathbf{x}) = \sigma(\mathbf{w}^T \mathbf{x} + b)$$

Loss Function — Deriving Cross-Entropy from MLE

Each label follows a Bernoulli: \(P(y_i \mid \mathbf{x}_i) = \hat{p}_i^{y_i}(1 - \hat{p}_i)^{1 - y_i}\) where \(\hat{p}_i = \sigma(\mathbf{w}^T\mathbf{x}_i + b)\).

Negative log-likelihood:

Binary Cross-Entropy Loss $$\mathcal{L} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log \hat{p}_i + (1 - y_i) \log(1 - \hat{p}_i)\right]$$

Interpreting each term: When \(y_i = 1\), only the first term survives — it penalizes low \(\hat{p}_i\). When \(y_i = 0\), only the second term survives — it penalizes high \(\hat{p}_i\). The loss is 0 when predictions perfectly match labels.

Gradient Derivation

The gradient turns out to be strikingly elegant:

Gradient of Cross-Entropy w.r.t. Weights $$\frac{\partial \mathcal{L}}{\partial w_j} = \frac{1}{N}\sum_{i=1}^{N} (\hat{p}_i - y_i) \cdot x_{ij}$$

Derivation sketch: Using \(\frac{\partial}{\partial z}[-y\log\sigma(z) - (1-y)\log(1-\sigma(z))] = \sigma(z) - y\), and then applying the chain rule \(\frac{\partial z}{\partial w_j} = x_j\).

Notice this has exactly the same form as the linear regression gradient! The error signal \((\hat{p}_i - y_i)\) gets multiplied by the input feature \(x_{ij}\). This is not a coincidence — both are members of the generalized linear model family.
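
A minimal training loop makes the shared gradient form concrete; the toy data below is invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy linearly separable data: class 1 when feature 1 dominates feature 2
X = np.array([[2.0, 0.0], [1.5, 0.5], [0.0, 2.0], [0.5, 1.5]])
y = np.array([1.0, 1.0, 0.0, 0.0])

w, b, eta = np.zeros(2), 0.0, 0.5
for _ in range(1000):
    p = sigmoid(X @ w + b)
    err = p - y                        # the (p_i - y_i) error signal
    w -= eta * (X.T @ err) / len(y)    # gradient from the boxed formula
    b -= eta * err.mean()

print(sigmoid(X @ w + b).round(2))     # predictions approach the labels [1, 1, 0, 0]
```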

Geometric Interpretation — The Decision Boundary

Logistic regression defines a linear decision boundary: the hyperplane where \(\mathbf{w}^T\mathbf{x} + b = 0\) (i.e., \(\hat{p} = 0.5\)). On one side, \(\hat{p} > 0.5\) (predict class 1); on the other, \(\hat{p} < 0.5\) (predict class 0). The weight vector \(\mathbf{w}\) is perpendicular to this boundary and points toward the class-1 side.

Exercise.
(a) A spam classifier has \(w_{\text{free}} = 2.0\), \(w_{\text{meeting}} = -1.5\), \(b = -0.5\). For an email with features \(x_{\text{free}} = 1\), \(x_{\text{meeting}} = 1\), compute \(P(\text{spam})\).
(b) If the true label is \(y = 1\) (spam), compute the binary cross-entropy loss for this example.
(c) Compute the gradient update for \(w_{\text{free}}\) with \(\eta = 0.1\).

2.3   Loss Functions — A Unified View

| Loss | Formula | When to Use | Probabilistic Origin |
| --- | --- | --- | --- |
| MSE | \(\frac{1}{N}\sum(y_i - \hat{y}_i)^2\) | Regression | Gaussian likelihood |
| Binary CE | \(-\frac{1}{N}\sum[y\log\hat{p} + (1-y)\log(1-\hat{p})]\) | Binary classification | Bernoulli likelihood |
| Categorical CE | \(-\frac{1}{N}\sum\sum_{k} y_k \log \hat{p}_k\) | Multi-class | Categorical likelihood |
Every standard loss function is just the negative log-likelihood under a specific probabilistic assumption about the data. Choosing a loss function = choosing your noise model.
· · ·
Part III

Neural Networks

3.1   From Linear Models to Neural Networks

Linear regression can only draw straight lines. But what if the relationship between house size and price is curved? We need a way to bend the line. That's what a neural network does: it stacks linear transformations with nonlinear "activations" to create arbitrarily complex functions.

Why linearity isn't enough

If you compose two linear functions, \(f(\mathbf{x}) = \mathbf{W}_2(\mathbf{W}_1 \mathbf{x}) = (\mathbf{W}_2 \mathbf{W}_1)\mathbf{x}\), you just get another linear function. Stacking linear layers without activations gives you nothing new. The activation function breaks this linearity.

A Single Neuron

Neuron $$a = \phi(\mathbf{w}^T \mathbf{x} + b)$$

Where \(\phi\) is a nonlinear activation function. Common choices:

| Name | Formula | Derivative | Notes |
| --- | --- | --- | --- |
| Sigmoid | \(\frac{1}{1+e^{-z}}\) | \(\sigma(z)(1-\sigma(z))\) | Squashes to (0,1); causes vanishing gradients |
| Tanh | \(\frac{e^z - e^{-z}}{e^z + e^{-z}}\) | \(1 - \tanh^2(z)\) | Squashes to (-1,1); zero-centered |
| ReLU | \(\max(0, z)\) | \(\begin{cases}1 & z > 0\\0 & z \leq 0\end{cases}\) | Simple, fast; default choice for hidden layers |
| GELU | \(z \cdot \Phi(z)\) | smooth approx. | Used in transformers; smooth version of ReLU |

3.2   Multi-Layer Network (MLP) — Forward Pass

An \(L\)-layer network computes:

Forward Pass $$\mathbf{z}^{[l]} = \mathbf{W}^{[l]} \mathbf{a}^{[l-1]} + \mathbf{b}^{[l]}$$ $$\mathbf{a}^{[l]} = \phi(\mathbf{z}^{[l]})$$

Where \(\mathbf{a}^{[0]} = \mathbf{x}\) (input), and the final layer's activation depends on the task (sigmoid for binary, softmax for multi-class, identity for regression).

Worked example: a 2-layer network with 2 inputs, 2 ReLU hidden units, and 1 sigmoid output.
Input: \(\mathbf{x} = [1, 0.5]^T\)
\(\mathbf{W}^{[1]} = \begin{bmatrix}0.1 & 0.3\\0.2 & 0.4\end{bmatrix}\), \(\mathbf{b}^{[1]} = \begin{bmatrix}0\\0\end{bmatrix}\)
\(\mathbf{W}^{[2]} = \begin{bmatrix}0.5 & 0.6\end{bmatrix}\), \(b^{[2]} = 0\)

Layer 1: \(\mathbf{z}^{[1]} = \begin{bmatrix}0.1(1) + 0.3(0.5)\\0.2(1)+0.4(0.5)\end{bmatrix} = \begin{bmatrix}0.25\\0.40\end{bmatrix}\)
\(\mathbf{a}^{[1]} = \text{ReLU}(\mathbf{z}^{[1]}) = \begin{bmatrix}0.25\\0.40\end{bmatrix}\) (both positive, ReLU passes them through)

Layer 2: \(z^{[2]} = 0.5(0.25) + 0.6(0.40) = 0.365\)
\(a^{[2]} = \sigma(0.365) = \frac{1}{1 + e^{-0.365}} \approx 0.59\)

The network outputs probability ≈ 0.59.
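
The same forward pass in a few lines of NumPy, reproducing the numbers above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([1.0, 0.5])
W1 = np.array([[0.1, 0.3], [0.2, 0.4]]); b1 = np.zeros(2)
W2 = np.array([0.5, 0.6]);               b2 = 0.0

z1 = W1 @ x + b1            # [0.25, 0.40]
a1 = np.maximum(z1, 0.0)    # ReLU passes both through (both positive)
z2 = W2 @ a1 + b2           # 0.365
a2 = sigmoid(z2)
print(round(float(a2), 2))  # 0.59
```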

3.3   Backpropagation — Full Derivation

Backpropagation answers one question: how should I adjust each weight to reduce the loss? It works backwards from the output, using the chain rule to attribute "blame" for the error to each weight in the network.

Step-by-step for a 2-layer network with sigmoid output and BCE loss

Step 1: Output layer error

$$\delta^{[2]} = \frac{\partial \mathcal{L}}{\partial z^{[2]}} = a^{[2]} - y$$

(This elegant result comes from the cancellation between the BCE derivative and sigmoid derivative.)

Step 2: Output layer gradients

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}} = \delta^{[2]} \cdot (\mathbf{a}^{[1]})^T \qquad \frac{\partial \mathcal{L}}{\partial b^{[2]}} = \delta^{[2]}$$

Step 3: Propagate error backward

$$\boldsymbol{\delta}^{[1]} = (\mathbf{W}^{[2]})^T \delta^{[2]} \odot \phi'(\mathbf{z}^{[1]})$$

Where \(\odot\) is element-wise multiplication. Each hidden neuron's error is proportional to: (1) how strongly it connects to the output, times (2) the output error, times (3) the local derivative of its activation.

Step 4: Hidden layer gradients

$$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[1]}} = \boldsymbol{\delta}^{[1]} \cdot \mathbf{x}^T \qquad \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{[1]}} = \boldsymbol{\delta}^{[1]}$$

The General Pattern

Backpropagation — General Form $$\boldsymbol{\delta}^{[l]} = \left[(\mathbf{W}^{[l+1]})^T \boldsymbol{\delta}^{[l+1]}\right] \odot \phi'(\mathbf{z}^{[l]})$$ $$\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[l]}} = \boldsymbol{\delta}^{[l]} (\mathbf{a}^{[l-1]})^T$$

This recurses from the last layer to the first. The gradient for each weight is: (error at that layer) × (activation from the previous layer). This is why it's called backpropagation — the error signal propagates backward through the network.
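
The four steps can be sketched in NumPy for the 2-layer network from the forward-pass example (ReLU hidden layer, sigmoid output, BCE loss, true label \(y = 1\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x  = np.array([1.0, 0.5]); y = 1.0
W1 = np.array([[0.1, 0.3], [0.2, 0.4]]); b1 = np.zeros(2)
W2 = np.array([[0.5, 0.6]]);             b2 = np.zeros(1)

# Forward pass
z1 = W1 @ x + b1; a1 = np.maximum(z1, 0.0)
z2 = W2 @ a1 + b2; a2 = sigmoid(z2)

# Backward pass (Steps 1-4)
d2  = a2 - y                        # Step 1: output error
dW2 = np.outer(d2, a1); db2 = d2    # Step 2: output-layer gradients
d1  = (W2.T @ d2) * (z1 > 0)        # Step 3: propagate back, ReLU derivative
dW1 = np.outer(d1, x);  db1 = d1    # Step 4: hidden-layer gradients

print(d2, dW2)   # d2 ≈ -0.41: the network under-predicted, so gradients push it up
```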

Vanishing gradients: In deep networks with sigmoid/tanh activations, the gradient at early layers becomes a product of many derivatives less than 1 — it shrinks exponentially. This is why ReLU (derivative = 1 for positive inputs) became standard: it lets gradients flow unchanged through active neurons.

Rumelhart, Hinton & Williams (1986) — "Learning representations by back-propagating errors." The foundational paper that made neural network training practical. It showed that gradient descent + chain rule can learn useful internal representations in hidden layers.

Exercise. Continuing the numerical example above (output = 0.59, true label \(y = 1\)):
(a) Compute \(\delta^{[2]} = a^{[2]} - y\).
(b) Compute \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{[2]}}\).
(c) Compute \(\boldsymbol{\delta}^{[1]}\) (using ReLU derivative = 1 since both units were positive).
(d) Update all weights with \(\eta = 0.1\).
· · ·
Part IV

Deep Learning Architectures

4.1   Convolutional Neural Networks

When you look at a photo, you don't analyze every pixel in relation to every other pixel. You first detect local features — edges, textures, colors — and then combine them into larger patterns (eyes, noses) and eventually whole objects (faces). CNNs mimic this hierarchy.

The Convolution Operation

A filter (kernel) \(\mathbf{K} \in \mathbb{R}^{k \times k}\) slides over the input, computing a dot product at each position:

2D Convolution $$(\mathbf{I} * \mathbf{K})_{i,j} = \sum_{m=0}^{k-1}\sum_{n=0}^{k-1} I_{i+m, j+n} \cdot K_{m,n}$$

Each term: \(I_{i+m, j+n}\) is the pixel at position \((i+m, j+n)\); \(K_{m,n}\) is the filter weight. The sum is a dot product between the local image patch and the filter.

Key insight: The same filter is applied at every spatial location (weight sharing). If a filter learns to detect vertical edges, it can detect them anywhere in the image. This is translational equivariance.
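
A direct implementation of the formula (a sketch; note that deep-learning "convolution" is cross-correlation, with no kernel flip, exactly as defined above):

```python
import numpy as np

def conv2d(I, K):
    """Valid 2D convolution: slide K over I, dot product at each position."""
    k = K.shape[0]
    H, W = I.shape
    out = np.zeros((H - k + 1, W - k + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(I[i:i+k, j:j+k] * K)   # patch . filter
    return out

# A vertical-edge detector fires where left and right columns differ
I = np.array([[0, 0, 1, 1]] * 4, dtype=float)
K = np.array([[-1.0, 1.0],
              [-1.0, 1.0]])
print(conv2d(I, K))   # nonzero only at the edge column, at every row: weight sharing
```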

CNN Architecture Pattern

Input → [Conv → ReLU → Pool]×N → Flatten → Fully-Connected → Output

Pooling (e.g., max-pooling) reduces spatial dimensions, providing some translation invariance and reducing computation. A 2×2 max pool takes the maximum in each 2×2 block, halving both height and width.

How CNNs Learn Pattern Hierarchies

Layer 1 filters learn low-level features: edges, color gradients, simple textures. Layer 2 combines these into mid-level features: corners, curves, small texture patterns. Deeper layers learn increasingly abstract features: object parts, shapes, eventually entire object categories. The network builds a compositional representation of visual reality.

LeCun et al. (1998) — "Gradient-based learning applied to document recognition." Introduced LeNet-5 for handwritten digit recognition — the architecture that launched modern CNNs.

Krizhevsky, Sutskever & Hinton (2012) — "ImageNet Classification with Deep Convolutional Neural Networks" (AlexNet). Won ImageNet by a huge margin, igniting the deep learning revolution. Key innovations: ReLU activations, dropout regularization, GPU training.

4.2   Recurrent Neural Networks & LSTMs

Language is sequential: the meaning of "bank" depends on whether the previous words were "river" or "investment." RNNs process sequences one element at a time, maintaining a hidden state that acts as a "memory" of what came before.

RNN Equations

Vanilla RNN $$\mathbf{h}_t = \tanh(\mathbf{W}_{hh}\mathbf{h}_{t-1} + \mathbf{W}_{xh}\mathbf{x}_t + \mathbf{b}_h)$$ $$\mathbf{y}_t = \mathbf{W}_{hy}\mathbf{h}_t + \mathbf{b}_y$$

At each time step \(t\), the hidden state \(\mathbf{h}_t\) is a function of the previous state \(\mathbf{h}_{t-1}\) and the current input \(\mathbf{x}_t\). The same weights \(\mathbf{W}_{hh}, \mathbf{W}_{xh}\) are shared across all time steps.
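
The recurrence can be sketched in a short loop; the weights below are random, for shape-checking only:

```python
import numpy as np

def rnn_forward(xs, Whh, Wxh, bh):
    """Run a vanilla RNN over a sequence, returning all hidden states."""
    h = np.zeros(Whh.shape[0])
    hs = []
    for x in xs:                             # the same weights at every step
        h = np.tanh(Whh @ h + Wxh @ x + bh)
        hs.append(h)
    return hs

rng = np.random.default_rng(0)
Whh = rng.normal(size=(3, 3)) * 0.1          # hidden -> hidden
Wxh = rng.normal(size=(3, 2)) * 0.1          # input -> hidden
hs = rnn_forward([rng.normal(size=2) for _ in range(5)], Whh, Wxh, np.zeros(3))
print(len(hs), hs[-1].shape)                 # 5 hidden states, each of dimension 3
```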

The Problem: Vanishing/Exploding Gradients

During backpropagation through time (BPTT), the gradient at time step \(t\) depends on a product of Jacobians: \(\prod_{k=t}^{T} \frac{\partial \mathbf{h}_{k}}{\partial \mathbf{h}_{k-1}}\). For long sequences, this product either explodes or vanishes, making it impossible to learn long-range dependencies.

LSTM — Solving the Memory Problem

LSTM Equations $$\mathbf{f}_t = \sigma(\mathbf{W}_f[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_f) \quad \text{(forget gate)}$$ $$\mathbf{i}_t = \sigma(\mathbf{W}_i[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_i) \quad \text{(input gate)}$$ $$\tilde{\mathbf{c}}_t = \tanh(\mathbf{W}_c[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_c) \quad \text{(candidate memory)}$$ $$\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \mathbf{i}_t \odot \tilde{\mathbf{c}}_t \quad \text{(cell state update)}$$ $$\mathbf{o}_t = \sigma(\mathbf{W}_o[\mathbf{h}_{t-1}, \mathbf{x}_t] + \mathbf{b}_o) \quad \text{(output gate)}$$ $$\mathbf{h}_t = \mathbf{o}_t \odot \tanh(\mathbf{c}_t)$$

The key idea: The cell state \(\mathbf{c}_t\) is an "information highway" — it flows through time with only additive updates, so gradients can flow backward without vanishing. The forget gate decides what to erase from memory; the input gate decides what new information to store.
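
A single LSTM step, following the equations above (the four gates' weights are stacked into one matrix, a common implementation convention; values are random, for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b, H):
    """One LSTM step. W stacks the four gates' weights over [h_prev, x]."""
    z = W @ np.concatenate([h_prev, x]) + b
    f = sigmoid(z[0:H])                # forget gate: what to erase
    i = sigmoid(z[H:2*H])              # input gate: what to store
    c_tilde = np.tanh(z[2*H:3*H])      # candidate memory
    o = sigmoid(z[3*H:4*H])            # output gate
    c = f * c_prev + i * c_tilde       # additive cell-state update (the highway)
    h = o * np.tanh(c)
    return h, c

H, D = 3, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(4 * H, H + D)) * 0.1
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, np.zeros(4 * H), H)
print(h.shape, c.shape)   # (3,) (3,)
```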

Hochreiter & Schmidhuber (1997) — "Long Short-Term Memory." The paper that solved the vanishing gradient problem for recurrent networks with the gating mechanism. LSTMs dominated sequence modeling for nearly two decades.

Exercise.
(a) In an LSTM, if the forget gate outputs all 1s and the input gate outputs all 0s, what happens to the cell state? What about the reverse?
(b) Why is the cell state update additive (\(\mathbf{c}_t = \mathbf{f}_t \odot \mathbf{c}_{t-1} + \ldots\)) rather than multiplicative? How does this help gradient flow?
· · ·
Part V

NLP Foundations

5.1   Word Embeddings — Giving Meaning to Numbers

How do you represent the word "king" as a number a neural network can process? One-hot encoding (a vector of mostly zeros with a single 1) treats every word as equally different from every other — "king" is as different from "queen" as from "banana." Word embeddings learn dense vectors where similar words are near each other in the vector space.

The Distributional Hypothesis

"You shall know a word by the company it keeps" — J.R. Firth, 1957. Words that appear in similar contexts have similar meanings. "Dog" and "cat" appear near "pet," "fur," "vet." This insight is the foundation of all modern word embeddings.

Word2Vec (Skip-Gram) — Derivation

Mikolov et al. (2013) — "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality." These papers showed that simple neural networks trained on word co-occurrence learn vectors with remarkable algebraic properties (e.g., king - man + woman ≈ queen).

Setup: Given a center word \(w_c\), predict the surrounding context words \(w_o\). Each word has two embedding vectors: \(\mathbf{v}_w\) (when it's the center word) and \(\mathbf{u}_w\) (when it's a context word).

Probability model:

Skip-Gram Objective $$P(w_o \mid w_c) = \frac{\exp(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c})}{\sum_{w \in V} \exp(\mathbf{u}_w^T \mathbf{v}_{w_c})}$$

This is a softmax over the entire vocabulary. The numerator is the dot product between the context word's embedding and the center word's embedding — a measure of compatibility. The denominator normalizes across all possible words.

Training objective: Maximize the log-probability of observed (center, context) pairs:

$$\mathcal{L} = \sum_{t=1}^{T}\sum_{-m \leq j \leq m, j \neq 0} \log P(w_{t+j} \mid w_t)$$

Negative Sampling approximation: Computing the full softmax over 100K+ words is prohibitive. Instead, for each positive pair, we sample \(k\) "negative" words (random words unlikely to be in the context) and train a binary classifier:

Negative Sampling Objective $$\mathcal{L} = \log\sigma(\mathbf{u}_{w_o}^T \mathbf{v}_{w_c}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log\sigma(-\mathbf{u}_{w_i}^T \mathbf{v}_{w_c})\right]$$

This says: maximize the dot product with the true context word (positive pair) while minimizing the dot product with random words (negative pairs). The result: words appearing in similar contexts get pulled together in vector space.
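
A direct transcription of the objective for one (center, context) pair; the vectors below are invented for illustration:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_objective(v_center, u_context, u_negatives):
    """Negative-sampling objective for one pair (to be maximized)."""
    pos = np.log(sigmoid(u_context @ v_center))                       # pull together
    neg = sum(np.log(sigmoid(-u @ v_center)) for u in u_negatives)    # push apart
    return pos + neg

v_c    = np.array([0.5, 0.1])                      # center-word embedding
u_o    = np.array([0.4, 0.2])                      # true context word (aligned)
u_negs = [np.array([-0.3, 0.8]), np.array([0.0, -0.5])]   # sampled negatives
print(neg_sampling_objective(v_c, u_o, u_negs))    # a (negative) log-probability
```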

GloVe — Global Vectors

Pennington, Socher & Manning (2014) — "GloVe: Global Vectors for Word Representation." Instead of predicting from local windows, GloVe directly factorizes the global co-occurrence matrix.

Key insight: Let \(X_{ij}\) be the count of how often word \(j\) appears in the context of word \(i\). GloVe optimizes:

GloVe Objective $$\mathcal{L} = \sum_{i,j=1}^{|V|} f(X_{ij}) \left(\mathbf{w}_i^T \tilde{\mathbf{w}}_j + b_i + \tilde{b}_j - \log X_{ij}\right)^2$$

Where \(f(X_{ij})\) is a weighting function that prevents very common co-occurrences from dominating. The objective says: the dot product of two word vectors should approximate the log of how often they co-occur.

Subword Tokenization (BPE)

Modern LLMs don't embed whole words — they use subword tokens. Byte Pair Encoding (BPE) starts with individual characters and iteratively merges the most frequent pairs: "l o w e r" → "lo w e r" → "low e r" → "low er" → "lower". This handles unseen words by decomposing them into known subwords, and reduces vocabulary size while maintaining expressiveness.
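
The merge loop can be sketched in a few lines (a toy implementation on a made-up word-count dictionary):

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` with its concatenation."""
    out, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1]); i += 2
        else:
            out.append(symbols[i]); i += 1
    return tuple(out)

def bpe_merges(word_counts, num_merges):
    """Toy BPE: iteratively merge the most frequent adjacent symbol pair."""
    vocab = {tuple(w): c for w, c in word_counts.items()}   # words as symbol tuples
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, count in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += count
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        vocab = {merge_pair(s, best): c for s, c in vocab.items()}
    return merges

print(bpe_merges({"lower": 5, "low": 7, "lowest": 2}, 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')] — "low" emerges as a subword
```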

Exercise.
(a) If king - man + woman ≈ queen works in vector space, what does this tell you about how the embedding dimensions encode meaning?
(b) Why does the skip-gram model need two different embedding matrices (center vs. context)? What would happen if we used the same matrix for both?
· · ·
Part VI

Attention & Transformers

This is the section that makes everything else click. The transformer architecture, built entirely on the attention mechanism, is the foundation of all modern language models.

6.1   The Attention Mechanism — From First Principles

Imagine translating "The cat sat on the mat" into French. When generating the French word for "cat," you need to focus on ("attend to") the English word "cat" — not "the" or "mat." Attention is a mechanism that lets the model dynamically decide which parts of the input to focus on for each output.

Attention as Weighted Retrieval

Think of attention as a soft dictionary lookup. You have a query (what you're looking for), a set of keys (labels in the dictionary), and values (the stored information). The query is compared to all keys to compute relevance scores, and the output is a weighted sum of values.

Dot-Product Attention — Derivation

Step 1: Measure similarity. Given a query \(\mathbf{q}\) and a key \(\mathbf{k}\), their similarity is the dot product \(\mathbf{q}^T \mathbf{k}\). This measures how aligned the query is with the key in the embedding space.

Step 2: Scale. The dot product grows with dimensionality \(d_k\), which pushes softmax into saturated regions with tiny gradients. We divide by \(\sqrt{d_k}\):

$$\text{score}(q, k) = \frac{\mathbf{q}^T \mathbf{k}}{\sqrt{d_k}}$$
Why \(\sqrt{d_k}\)? If the entries of \(\mathbf{q}\) and \(\mathbf{k}\) are i.i.d. with zero mean and unit variance, their dot product has mean 0 and variance \(d_k\). Dividing by \(\sqrt{d_k}\) normalizes the variance to 1, keeping the softmax input in a well-behaved range.

Step 3: Normalize to get attention weights.

$$\alpha_i = \text{softmax}_i = \frac{\exp(\text{score}(q, k_i))}{\sum_j \exp(\text{score}(q, k_j))}$$

Step 4: Compute weighted output.

$$\text{output} = \sum_i \alpha_i \mathbf{v}_i$$

Putting it all together in matrix form:

Scaled Dot-Product Attention $$\text{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}}\right)\mathbf{V}$$

Where \(\mathbf{Q} \in \mathbb{R}^{n \times d_k}\), \(\mathbf{K} \in \mathbb{R}^{m \times d_k}\), \(\mathbf{V} \in \mathbb{R}^{m \times d_v}\). The output is \(\mathbb{R}^{n \times d_v}\).

Tiny attention example. Sequence: ["I", "love", "cats"], \(d_k = 2\).

\(\mathbf{Q} = \begin{bmatrix}1 & 0\\0 & 1\\1 & 1\end{bmatrix}\),   \(\mathbf{K} = \begin{bmatrix}1 & 0\\0 & 1\\0.5 & 0.5\end{bmatrix}\),   \(\mathbf{V} = \begin{bmatrix}1 & 0\\0 & 1\\0.5 & 0.5\end{bmatrix}\)

Scores \(= \mathbf{Q}\mathbf{K}^T / \sqrt{2}\):
\(\begin{bmatrix}1 & 0 & 0.5\\0 & 1 & 0.5\\1 & 1 & 1\end{bmatrix} / 1.41 = \begin{bmatrix}0.71 & 0 & 0.35\\0 & 0.71 & 0.35\\0.71 & 0.71 & 0.71\end{bmatrix}\)

Attention weights (softmax each row):
Row 1: \([0.46, 0.22, 0.32]\) — "I" mostly attends to itself.
Row 3: \([0.33, 0.33, 0.33]\) — "cats" attends equally (all scores equal).

Output = weights × V = weighted combination of value vectors for each position.
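The worked example above can be reproduced in a few lines of NumPy; this is a minimal sketch of scaled dot-product attention using the exact matrices from the text:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # (n, m) similarity matrix
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# The tiny example from the text: ["I", "love", "cats"], d_k = 2
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
V = K.copy()

output, weights = scaled_dot_product_attention(Q, K, V)
print(np.round(weights, 2))  # row 1 ≈ [0.46, 0.22, 0.32]; row 3 is uniform
```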

6.2   Self-Attention

In self-attention, the queries, keys, and values all come from the same sequence. Each token attends to all other tokens (including itself) in the sequence:

Self-Attention $$\mathbf{Q} = \mathbf{X}\mathbf{W}^Q, \quad \mathbf{K} = \mathbf{X}\mathbf{W}^K, \quad \mathbf{V} = \mathbf{X}\mathbf{W}^V$$

Where \(\mathbf{X} \in \mathbb{R}^{n \times d}\) is the sequence of input embeddings, and \(\mathbf{W}^Q, \mathbf{W}^K \in \mathbb{R}^{d \times d_k}\), \(\mathbf{W}^V \in \mathbb{R}^{d \times d_v}\) are learned projection matrices.

What does self-attention compute? For each token, it asks "which other tokens in this sequence are relevant to understanding me?" and creates a new representation that blends information from those relevant tokens. In "The animal didn't cross the street because it was too wide," self-attention at the word "it" can learn to assign high weight to "street" (the thing that's wide), resolving the reference.

6.3   Multi-Head Attention

A single attention head can only capture one type of relationship at a time (e.g., syntactic dependency). Multiple heads let the model capture different types of relationships simultaneously — one head might capture subject-verb agreement, another might capture coreference, another might capture semantic similarity.
Multi-Head Attention $$\text{head}_i = \text{Attention}(\mathbf{X}\mathbf{W}_i^Q, \mathbf{X}\mathbf{W}_i^K, \mathbf{X}\mathbf{W}_i^V)$$ $$\text{MultiHead}(\mathbf{X}) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\mathbf{W}^O$$

If the model dimension is \(d = 512\) and we use \(h = 8\) heads, each head uses \(d_k = d_v = d/h = 64\). The total computation is similar to single-head attention with full dimensionality, but we get \(h\) different attention patterns.
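A minimal NumPy sketch of multi-head attention with the \(d = 512\), \(h = 8\) configuration above. The weights are random and purely illustrative; real implementations also fuse the per-head projections into single matrices for efficiency:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 5, 512, 8           # sequence length, model dim, number of heads
d_k = d // h                  # 64 dimensions per head

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

X = rng.normal(size=(n, d))
# One projection per head
WQ = rng.normal(size=(h, d, d_k)) / np.sqrt(d)
WK = rng.normal(size=(h, d, d_k)) / np.sqrt(d)
WV = rng.normal(size=(h, d, d_k)) / np.sqrt(d)
WO = rng.normal(size=(d, d)) / np.sqrt(d)

heads = []
for i in range(h):
    Q, K, V = X @ WQ[i], X @ WK[i], X @ WV[i]   # each (n, d_k)
    A = softmax(Q @ K.T / np.sqrt(d_k))         # (n, n): one pattern per head
    heads.append(A @ V)                         # (n, d_k)

out = np.concatenate(heads, axis=-1) @ WO       # concat back to (n, d), then W^O
print(out.shape)
```

Note that the total work is roughly that of one full-dimension attention, but we obtain \(h = 8\) independent attention patterns.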

6.4   The Transformer — Full Architecture

Vaswani et al. (2017) — "Attention Is All You Need." The paper that replaced RNNs with pure attention for sequence modeling. Introduced the transformer architecture that now underlies GPT, BERT, and all modern LLMs. One of the most impactful papers in ML history.

Encoder Block

Each encoder layer applies two sub-layers with residual connections and layer normalization:

Transformer Encoder Layer $$\mathbf{X}' = \text{LayerNorm}(\mathbf{X} + \text{MultiHead}(\mathbf{X}))$$ $$\mathbf{X}'' = \text{LayerNorm}(\mathbf{X}' + \text{FFN}(\mathbf{X}'))$$

Where the feed-forward network (FFN) is applied position-wise:

$$\text{FFN}(\mathbf{x}) = \mathbf{W}_2 \, \phi(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) + \mathbf{b}_2$$

Typically \(\mathbf{W}_1 \in \mathbb{R}^{4d \times d}\), \(\mathbf{W}_2 \in \mathbb{R}^{d \times 4d}\) — the FFN expands to 4× the model dimension and then projects back. This is where "knowledge" is stored (the attention layers route information; the FFN layers process it).
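A minimal NumPy sketch of the second encoder sub-layer (residual connection plus LayerNorm around the position-wise FFN). It uses the row-vector convention \(\mathbf{x}\mathbf{W}_1\), so the weight shapes read \((d, 4d)\) and \((4d, d)\); the weights are random and purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 64
X = rng.normal(size=(n, d))

# Row-vector convention: FFN(x) = phi(x W1 + b1) W2 + b2
W1, b1 = rng.normal(size=(d, 4 * d)) * 0.02, np.zeros(4 * d)  # expand to 4d
W2, b2 = rng.normal(size=(4 * d, d)) * 0.02, np.zeros(d)      # project back to d

def ffn(X):
    hidden = np.maximum(0.0, X @ W1 + b1)   # phi = ReLU; shape (n, 4d)
    return hidden @ W2 + b2                 # shape (n, d)

def layer_norm(X, eps=1e-5):
    mu = X.mean(axis=-1, keepdims=True)
    var = X.var(axis=-1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

# Residual + LayerNorm around the FFN, applied independently at each position
out = layer_norm(X + ffn(X))
print(out.shape)   # (5, 64)
```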

Positional Encoding

Self-attention is permutation-invariant — it doesn't know word order. Positional encodings inject position information:

Sinusoidal Positional Encoding $$PE_{(pos, 2i)} = \sin\!\left(\frac{pos}{10000^{2i/d}}\right), \quad PE_{(pos, 2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d}}\right)$$

Each dimension oscillates at a different frequency — low dimensions change slowly (encoding coarse position), high dimensions change rapidly (encoding fine position). This allows the model to attend to relative positions.
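The sinusoidal encoding can be sketched directly from the formula:

```python
import numpy as np

def sinusoidal_pe(max_len, d):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)); PE[pos, 2i+1] = cos(same angle)."""
    pos = np.arange(max_len)[:, None]          # (max_len, 1)
    i = np.arange(0, d, 2)[None, :]            # even dimension indices
    angles = pos / np.power(10000.0, i / d)    # (max_len, d/2)
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(max_len=50, d=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: sin(0) = 0 and cos(0) = 1, alternating
```

The first dimension oscillates once per position (highest frequency); the last pair barely changes across 50 positions (lowest frequency).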

Decoder Block (for autoregressive generation)

The decoder adds causal (masked) self-attention — each position can only attend to previous positions, preventing the model from "seeing the future":

$$\text{CausalAttention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \text{softmax}\!\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d_k}} + \mathbf{M}\right)\mathbf{V}$$

Where \(M_{ij} = 0\) if \(j \leq i\), and \(M_{ij} = -\infty\) if \(j > i\). After the softmax, the \(-\infty\) entries produce attention weights of exactly 0, so each position ignores all future positions.
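A minimal NumPy sketch of the causal mask, applied here to uniform scores so the resulting weights are easy to read:

```python
import numpy as np

def causal_mask(n):
    """M[i, j] = 0 where j <= i (can attend), -inf where j > i (future)."""
    M = np.zeros((n, n))
    M[np.triu_indices(n, k=1)] = -np.inf
    return M

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

M = causal_mask(4)
scores = np.zeros((4, 4)) + M   # uniform scores plus the mask
A = softmax(scores)
print(np.round(A, 2))
# Row t attends uniformly over positions 0..t; future positions get weight 0
```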

Putting It All Together

A full transformer for language modeling: Input tokens → Embedding + Positional Encoding → N × Decoder Blocks → Linear projection → Softmax over vocabulary → Predicted next token.

(a) In a model with \(d = 256\), \(h = 4\) heads, what are the dimensions of \(\mathbf{W}_i^Q\), \(\mathbf{W}_i^K\), \(\mathbf{W}_i^V\) for each head?
(b) Write out the causal mask \(\mathbf{M}\) for a sequence of length 4. After softmax, what does each row of the attention matrix represent?
(c) Why is the residual connection (\(\mathbf{X} + \text{sublayer}(\mathbf{X})\)) critical for training deep transformers? (Hint: think about gradient flow.)
· · ·
Part VII

Modern Large Language Models

7.1   The GPT Architecture

Radford et al. (2018, 2019) — GPT and GPT-2. Showed that a decoder-only transformer pretrained on next-token prediction transfers effectively to downstream tasks. GPT-2 demonstrated that scaling up model size and data yields surprisingly strong zero-shot performance across many tasks.

Brown et al. (2020) — "Language Models are Few-Shot Learners" (GPT-3). Scaled to 175B parameters and showed that LLMs can perform tasks from a few examples in the prompt, without any gradient updates.

GPT is a decoder-only transformer. Its single training objective is strikingly simple:

Next-Token Prediction (Causal Language Modeling)

Training Objective $$\mathcal{L} = -\sum_{t=1}^{T} \log P(x_t \mid x_1, x_2, \ldots, x_{t-1}; \theta)$$

For each position \(t\), the model predicts the probability distribution over the entire vocabulary for the next token, given all previous tokens. The loss is the cross-entropy between this prediction and the actual next token.
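The training objective can be sketched in a few lines of NumPy, using random logits to stand in for an untrained model's outputs:

```python
import numpy as np

rng = np.random.default_rng(2)
T, V = 6, 100                     # sequence length, vocabulary size

logits = rng.normal(size=(T, V))  # model outputs: one row of scores per position
targets = rng.integers(0, V, T)   # the actual next token at each position

def causal_lm_loss(logits, targets):
    """L = -sum_t log P(x_t | x_<t): cross-entropy against the true next token."""
    shifted = logits - logits.max(axis=-1, keepdims=True)    # stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(T), targets].sum()

loss = causal_lm_loss(logits, targets)
print(loss / T)   # mean loss per token; log(V) ≈ 4.6 is the uniform-guessing baseline
```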

How this single objective produces intelligence: To predict the next word well, the model must learn grammar, facts, reasoning, common sense, style, and much more. The prediction task is a "universal" task that requires understanding language at every level.

The Softmax Output Layer

The transformer's output at position \(t\) is a vector \(\mathbf{h}_t \in \mathbb{R}^d\). We project it to vocabulary size and apply softmax:

$$P(x_t = w \mid x_{<t}) = \frac{\exp(\mathbf{e}_w^T \mathbf{h}_t)}{\sum_{w'} \exp(\mathbf{e}_{w'}^T \mathbf{h}_t)}$$

Where \(\mathbf{e}_w\) is the embedding of token \(w\). Often the output embedding matrix is tied (shared) with the input embedding matrix.

7.2   Scaling Laws

Kaplan et al. (2020) — "Scaling Laws for Neural Language Models." Discovered power-law relationships between model performance and compute, dataset size, and parameter count.

Hoffmann et al. (2022) — "Training Compute-Optimal Large Language Models" (Chinchilla). Showed that most LLMs were significantly undertrained — for a given compute budget, it's better to train a smaller model on more data than a larger model on less data.

The cross-entropy loss follows a power law:

Scaling Law $$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}$$

Where \(N\) is the number of parameters and \(\alpha_N \approx 0.076\). Similar power laws hold for dataset size and compute. The Chinchilla finding: the compute-optimal ratio is roughly 20 tokens per parameter — a 10B parameter model should be trained on ~200B tokens.

7.3   Modern Architectural Improvements

RMSNorm (replacing LayerNorm)

$$\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}} \odot \boldsymbol{\gamma}$$

Simpler and faster than LayerNorm (no mean subtraction). Used in LLaMA, GPT-4, and most modern LLMs.
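A minimal NumPy sketch of RMSNorm, showing that the output has unit root-mean-square:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by the root-mean-square; no mean subtraction."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gamma

d = 8
x = np.array([3.0, -1.0, 2.0, 0.5, -2.0, 1.0, 0.0, 4.0])
y = rms_norm(x, gamma=np.ones(d))
print(np.sqrt(np.mean(y ** 2)))   # ≈ 1.0: the output has unit RMS
```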

Rotary Position Embedding (RoPE)

Instead of adding positional information, RoPE encodes relative position directly into the attention computation by rotating query and key vectors:

$$\mathbf{q}_m^T \mathbf{k}_n = (\mathbf{R}_m \mathbf{W}^Q \mathbf{x}_m)^T (\mathbf{R}_n \mathbf{W}^K \mathbf{x}_n)$$

Where \(\mathbf{R}_m\) is a rotation matrix that depends on position \(m\). The dot product then depends on the relative position \(m - n\), which is more natural for language understanding.
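A simplified NumPy sketch of RoPE (rotating consecutive dimension pairs, the standard construction), demonstrating the relative-position property: the dot product is unchanged when both positions shift by the same amount.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate consecutive dimension pairs of x by position-dependent angles."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)   # one frequency per pair
    angles = pos * theta
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # pair up dimensions
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # a 2D rotation in each pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(3)
q, k = rng.normal(size=8), rng.normal(size=8)

# Key property: the dot product depends only on the relative position m - n
a = rope(q, pos=5) @ rope(k, pos=3)
b = rope(q, pos=12) @ rope(k, pos=10)
print(np.isclose(a, b))   # True: both pairs are exactly 2 positions apart
```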

SwiGLU Activation (in FFN)

$$\text{SwiGLU}(\mathbf{x}) = (\mathbf{x}\mathbf{W}_1) \odot \text{SiLU}(\mathbf{x}\mathbf{W}_\text{gate})$$

A gated variant that outperforms plain ReLU in practice. Used in LLaMA, PaLM, and most recent models.
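A minimal NumPy sketch of the SwiGLU formula, with random illustrative weights:

```python
import numpy as np

def silu(x):
    return x / (1.0 + np.exp(-x))   # SiLU(x) = x * sigmoid(x)

def swiglu(x, W1, W_gate):
    """SwiGLU(x) = (x W1) ⊙ SiLU(x W_gate): a gated FFN activation."""
    return (x @ W1) * silu(x @ W_gate)

rng = np.random.default_rng(4)
d, d_ff = 16, 32
W1 = rng.normal(size=(d, d_ff)) * 0.1
W_gate = rng.normal(size=(d, d_ff)) * 0.1
x = rng.normal(size=d)

print(swiglu(x, W1, W_gate).shape)   # (32,)
```

The gate branch can suppress each hidden unit independently, which is the extra flexibility that plain ReLU lacks.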

Grouped-Query Attention (GQA)

Standard multi-head attention uses separate K and V projections per head, which is expensive during inference (the KV cache for all heads must be stored). GQA groups heads so that multiple query heads share the same K and V projections, reducing KV cache by a factor of the group size without significant quality loss.

7.4   Mixture of Experts (MoE) — The DeepSeek Approach

DeepSeek-V2, V3, R1 (2024–2025) — Open-weight models that pushed the efficiency frontier using fine-grained MoE with shared experts. DeepSeek-V3 achieves GPT-4-level quality with far less compute by activating only a small fraction of total parameters per token.

Not every input needs every part of the network. A question about cooking should activate "cooking knowledge" neurons, not "quantum physics" neurons. MoE replaces the single FFN with multiple "expert" FFNs and a router that selects which experts to use for each token.

MoE FFN Layer

Mixture of Experts $$\mathbf{y} = \sum_{i=1}^{N} g_i(\mathbf{x}) \cdot E_i(\mathbf{x})$$ $$g_i(\mathbf{x}) = \begin{cases}\text{softmax}(\text{TopK}(\mathbf{W}_g \mathbf{x}))_i & \text{if expert } i \text{ selected}\\0 & \text{otherwise}\end{cases}$$

Where \(E_i\) are expert FFN networks, and \(\mathbf{W}_g\) is the router/gating network. For each token, only the top-K experts (e.g., K = 2 out of 64) are activated. This means a model with 236B total parameters might only use ~21B parameters per token.
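A minimal NumPy sketch of top-K routing. Each expert stands in for a full FFN; single linear maps keep the sketch short:

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_experts, top_k = 16, 8, 2

experts = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
W_g = rng.normal(size=(d, n_experts)) * 0.1   # router / gating network

def moe_forward(x):
    logits = x @ W_g                           # one router score per expert
    top = np.argsort(logits)[-top_k:]          # indices of the top-K experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                       # softmax over selected experts only
    # Only top_k of n_experts do any computation for this token
    return sum(g * (x @ experts[i]) for g, i in zip(gates, top))

x = rng.normal(size=d)
y = moe_forward(x)
print(y.shape)   # (16,); 2 of 8 experts active, i.e. 25% of expert parameters
```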

DeepSeek innovations

Fine-grained experts: Instead of 8 large experts, use 64+ smaller experts — more flexible routing.

Shared experts: Some experts are always active (capturing common knowledge), while routed experts specialize.

Auxiliary load-balancing loss: Without regularization, the router can "collapse" and send all tokens to a few experts. An auxiliary loss encourages balanced expert utilization:

$$\mathcal{L}_{\text{balance}} = \alpha \sum_{i=1}^{N} f_i \cdot P_i$$

Where \(f_i\) is the fraction of tokens routed to expert \(i\) and \(P_i\) is the average routing probability to expert \(i\).

7.5   RLHF — Aligning LLMs with Human Preferences

Ouyang et al. (2022) — "Training language models to follow instructions with human feedback" (InstructGPT/ChatGPT methodology). Showed that RLHF dramatically improves helpfulness, truthfulness, and safety.

The training pipeline has three stages:

Stage 1: Supervised Fine-Tuning (SFT). Fine-tune the pretrained model on high-quality (prompt, response) examples written by humans.

Stage 2: Reward Model Training. Collect human preferences: given a prompt and two responses, which is better? Train a reward model \(R_\phi(x, y)\) to predict these preferences using the Bradley-Terry model:

$$P(y_1 \succ y_2 \mid x) = \sigma(R_\phi(x, y_1) - R_\phi(x, y_2))$$
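The Bradley-Terry objective can be sketched in a few lines, using hypothetical reward-model scores:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def preference_prob(r_winner, r_loser):
    """Bradley-Terry: P(y1 > y2 | x) = sigmoid(R(x, y1) - R(x, y2))."""
    return sigmoid(r_winner - r_loser)

def reward_model_loss(r_winner, r_loser):
    """Negative log-likelihood of the human preference label."""
    return -np.log(preference_prob(r_winner, r_loser))

# Hypothetical reward-model scores for two responses to the same prompt
print(preference_prob(2.0, 0.5))     # ≈ 0.82: the model agrees with the label
print(reward_model_loss(2.0, 0.5))   # small loss when scores match the preference
```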

Stage 3: RL Optimization. Use PPO (Proximal Policy Optimization) to maximize the reward while staying close to the SFT policy:

RLHF Objective $$\max_\theta \, \mathbb{E}_{x \sim D, y \sim \pi_\theta(\cdot|x)}\left[R_\phi(x, y)\right] - \beta \, D_{\text{KL}}[\pi_\theta \| \pi_{\text{ref}}]$$

The KL divergence term prevents the model from diverging too far from the reference policy (which would lead to reward hacking — finding outputs that score high on the reward model but are actually nonsensical).

DPO — Direct Preference Optimization

A simpler alternative that skips the reward model entirely. Starting from the RLHF objective's closed-form solution, DPO directly optimizes:

$$\mathcal{L}_{\text{DPO}} = -\mathbb{E}\left[\log\sigma\!\left(\beta \log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]$$

Where \(y_w\) is the preferred response and \(y_l\) is the rejected response.
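The DPO loss can be sketched directly, using hypothetical sequence log-probabilities under the policy and the frozen reference model:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO: push the policy's (chosen - rejected) margin above the reference's."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -np.log(sigmoid(margin))

# Hypothetical log-probs; ref_w / ref_l come from the frozen reference policy
ref_w, ref_l = -20.0, -22.0
print(dpo_loss(-18.0, -25.0, ref_w, ref_l))   # policy favors y_w more than ref: low loss
print(dpo_loss(-25.0, -18.0, ref_w, ref_l))   # policy flipped the preference: high loss
```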

(a) In MoE with 64 experts and top-2 routing, what fraction of parameters are active per token? Why is this beneficial for training and inference?
(b) In the RLHF objective, what happens if \(\beta \to 0\)? What about \(\beta \to \infty\)?
(c) A model has been trained with next-token prediction on the entire internet. Why does it still need RLHF? Give a concrete example.
· · ·
Part VIII

Pattern Detection — A Unified View

8.1   How Models "See" — The Geometric View

Every classifier draws a boundary in the feature space that separates one class from another. Linear models draw flat boundaries (hyperplanes). Neural networks warp the space with nonlinear activations until the classes become linearly separable — then draw a flat boundary in the warped space.

Consider binary classification of images (human vs. non-human). Each image is a point in \(\mathbb{R}^{784}\) (for 28×28 pixels). These points form complex, interleaved clusters. A deep network learns a function \(f: \mathbb{R}^{784} \to \mathbb{R}^{d}\) that maps these points to a new space where they are linearly separable.

What a Neural Network Does, Geometrically $$\text{Input space} \xrightarrow{\text{Layer 1}} \text{new space} \xrightarrow{\text{Layer 2}} \text{new space} \xrightarrow{\ldots} \text{linearly separable space} \xrightarrow{\text{linear classifier}} \text{prediction}$$

Each layer performs two operations:

1. Affine transformation (\(\mathbf{Wx} + \mathbf{b}\)): rotates, scales, and shifts the data.

2. Nonlinear activation (\(\phi\)): folds and warps the space (ReLU "folds" by zeroing negative regions).

8.2   The Probabilistic View

A classifier estimates the posterior probability \(P(y = k \mid \mathbf{x})\). By Bayes' theorem:

$$P(y = k \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid y = k) \, P(y = k)}{P(\mathbf{x})}$$

The model learns the class-conditional density \(P(\mathbf{x} \mid y = k)\) — the "pattern" of what class-\(k\) data looks like — and the prior \(P(y = k)\). Classification becomes: which class makes the observed data most likely?

8.3   Similarity-Based Pattern Detection

Modern NLP models don't use explicit class-conditional densities. Instead, they operate in learned embedding spaces where patterns are encoded as proximity:

Distance metrics and their roles:

Metric · Formula · Use
Euclidean · \(\|\mathbf{a} - \mathbf{b}\|_2\) · K-nearest neighbors, clustering
Cosine similarity · \(\frac{\mathbf{a}^T\mathbf{b}}{\|\mathbf{a}\|\|\mathbf{b}\|}\) · Word embeddings, sentence similarity, retrieval
Dot product · \(\mathbf{a}^T\mathbf{b}\) · Attention scores (unnormalized similarity)
Mahalanobis · \(\sqrt{(\mathbf{a}-\mathbf{b})^T\Sigma^{-1}(\mathbf{a}-\mathbf{b})}\) · Distance accounting for correlations
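All four metrics computed on a pair of toy vectors:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 1.0])

euclidean = np.linalg.norm(a - b)                       # sqrt(1 + 0 + 4)
cosine = (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))
dot = a @ b                                             # 2 + 4 + 3 = 9

# Mahalanobis needs a covariance matrix; the identity reduces it to Euclidean
Sigma = np.eye(3)
diff = a - b
mahalanobis = np.sqrt(diff @ np.linalg.inv(Sigma) @ diff)

print(round(euclidean, 3), round(cosine, 3), dot, round(mahalanobis, 3))
```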

How Attention Detects Patterns

In a transformer, the attention mechanism computes \(\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\) — this is a similarity matrix between every pair of tokens. High similarity means "these tokens are relevant to each other." The model learns the projection matrices \(\mathbf{W}^Q, \mathbf{W}^K\) such that semantically or syntactically related tokens have high dot-product similarity after projection. This is how a transformer discovers patterns like subject-verb agreement, coreference, and semantic relationships — entirely from the training signal of next-token prediction.

8.4   What LLMs Actually Learn

Recent interpretability research has revealed that transformer layers develop a hierarchy of increasingly abstract representations:

Early layers: encode surface-level patterns — token identity, local syntax, simple co-occurrence statistics.

Middle layers: encode semantic relationships — word meanings in context, entity types, relational knowledge ("Paris is-capital-of France").

Late layers: encode task-specific features — next-token predictions draw on abstract reasoning, world knowledge, and contextual understanding built by earlier layers.

A large language model can be thought of as a giant, differentiable lookup table that has compressed the statistical structure of human language. The "patterns" it detects are not hand-coded rules but emergent statistical regularities — grammar, facts, reasoning strategies — all learned from the gradient signal of predicting the next token.
· · ·
Appendix

Annotated Reading List & Learning Roadmap

Essential Papers (in reading order)

1. Rumelhart, Hinton & Williams — Backpropagation (1986): Made neural network training practical.
2. LeCun et al. — LeNet / CNNs (1998): Proved deep learning works for vision.
3. Hochreiter & Schmidhuber — LSTM (1997): Solved vanishing gradient for sequences.
4. Mikolov et al. — Word2Vec (2013): Efficient word embeddings; word arithmetic.
5. Pennington et al. — GloVe (2014): Global co-occurrence matrix factorization.
6. Bahdanau et al. — Attention for NMT (2015): Introduced attention for seq2seq.
7. Vaswani et al. — Transformer (2017): Replaced RNNs entirely; the foundation of LLMs.
8. Devlin et al. — BERT (2019): Bidirectional pretraining; dominated NLU benchmarks.
9. Radford et al. — GPT-2 (2019): Showed language modeling scales to generation quality.
10. Brown et al. — GPT-3 (2020): In-context learning and few-shot capabilities at scale.
11. Kaplan et al. — Scaling Laws (2020): Power-law relationships for compute-optimal training.
12. Hoffmann et al. — Chinchilla (2022): Compute-optimal: more data, smaller model.
13. Ouyang et al. — InstructGPT / RLHF (2022): Aligning models with human preferences.
14. Touvron et al. — LLaMA (2023): Open-weight models competitive with proprietary.
15. Rafailov et al. — DPO (2023): Simplified alignment without RL.
16. DeepSeek Team — DeepSeek-V3 (2024): MoE with fine-grained routing; open-weight frontier model.

Recommended Learning Sequence

Weeks 1–2: Master Parts I–II. Implement linear and logistic regression from scratch (just NumPy). Derive every gradient by hand.

Weeks 3–4: Master Part III. Implement a 2-layer neural network from scratch with backpropagation. Train it on MNIST.

Weeks 5–6: Study Part IV. Implement a simple CNN in PyTorch. Implement a character-level RNN language model.

Weeks 7–8: Study Parts V–VI. Read the Word2Vec and Transformer papers. Implement self-attention from scratch. Then implement a full transformer decoder in PyTorch.

Weeks 9–12: Study Part VII. Read GPT-2 and GPT-3 papers. Fine-tune a small LLM. Study MoE and RLHF architectures. Read DeepSeek papers.

Ongoing: Follow arXiv for new developments. Join the Hugging Face community. Read interpretability papers (Olah, Elhage et al.) to understand what models learn, not just how they're trained.

Implementation Resources

Karpathy's "Neural Networks: Zero to Hero" — YouTube series that builds GPT from scratch.
Karpathy's nanoGPT — minimal GPT implementation in ~600 lines of PyTorch.
Hugging Face Transformers — production-grade implementations of every major architecture.
The Annotated Transformer (Rush, 2018) — line-by-line walkthrough of the original transformer code.

· · ·
Built as a comprehensive learning reference. Every equation tells a story — learn to read them and the papers will follow.