Part V — The Theoretical Backbone

Information Theory, Generative Models
& Advanced Mathematics

The mathematical structures that underpin why models work: entropy, KL divergence, VAEs,
diffusion models, the information bottleneck, and the geometry of neural networks.

Contents

  1. Information Theory for ML
    • Entropy · Cross-entropy · KL Divergence · Mutual Information
  2. Variational Autoencoders (VAEs)
    • Latent variables · ELBO derivation · Reparameterization trick
  3. Generative Adversarial Networks (GANs)
    • Minimax game · Nash equilibrium · Training dynamics
  4. Diffusion Models
    • Forward process · Reverse process · Score matching · DDPM
  5. The Geometry of Neural Networks
    • Loss landscapes · Saddle points · Lottery ticket hypothesis · Double descent
  6. Information Bottleneck & Generalization Theory
    • Bias-variance · PAC learning · Compression · Why deep learning generalizes
  7. Advanced Probability for ML
    • Exponential families · Conjugate priors · Variational inference · MCMC
Chapter 1

Information Theory for ML

Information theory answers one question: how much surprise does an event contain? A certain event (sun rises) carries zero information. A rare event (meteor strike) carries lots. Every loss function in ML can be understood through this lens.

Entropy — Measuring Uncertainty

Shannon Entropy $$H(X) = -\sum_{x} p(x) \log_2 p(x) = \mathbb{E}[-\log_2 p(X)]$$

Entropy is the average number of bits needed to encode a draw from distribution \(p\). Maximum entropy = maximum uncertainty (uniform distribution). Zero entropy = complete certainty (degenerate distribution).

Fair coin: \(H = -(0.5 \log_2 0.5 + 0.5 \log_2 0.5) = -(-0.5 - 0.5) = 1\) bit.
Biased coin (p=0.99): \(H = -(0.99 \log_2 0.99 + 0.01 \log_2 0.01) \approx 0.081\) bits.
Certain outcome (p=1): \(H = -(1 \cdot \log_2 1) = 0\) bits.

The fair coin needs 1 bit to encode (just say H or T). The biased coin almost never flips tails, so on average you need almost no information to describe the outcome.
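The three coin calculations above can be checked in a few lines of Python (a minimal sketch; `entropy_bits` is a hypothetical helper, not a library function):

```python
import math

def entropy_bits(p):
    """Shannon entropy in bits; terms with p(x) = 0 contribute nothing."""
    return sum(-px * math.log2(px) for px in p if px > 0)

print(entropy_bits([0.5, 0.5]))    # fair coin: 1.0 bit
print(entropy_bits([0.99, 0.01]))  # biased coin: ~0.081 bits
print(entropy_bits([1.0]))         # certain outcome: 0 bits
```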

Cross-Entropy — The Loss Function

Cross-Entropy $$H(p, q) = -\sum_x p(x) \log q(x) = H(p) + D_{\text{KL}}(p \| q)$$

What it measures: The average number of bits needed to encode data from distribution \(p\) using a code optimized for distribution \(q\). If \(q = p\) (perfect model), cross-entropy equals entropy — the theoretical minimum. Any gap is wasted bits.

This is why cross-entropy is our loss function. When we train a model \(q_\theta\) to match the true data distribution \(p_{\text{data}}\), minimizing cross-entropy \(H(p_{\text{data}}, q_\theta)\) is equivalent to minimizing KL divergence \(D_{\text{KL}}(p \| q_\theta)\) since \(H(p)\) is a constant. And minimizing KL divergence = maximizing likelihood (MLE). It all connects.
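The decomposition \(H(p, q) = H(p) + D_{\text{KL}}(p \| q)\) is easy to confirm numerically (a sketch; the two distributions are arbitrary illustrations):

```python
import math

def entropy(p):
    return sum(-px * math.log2(px) for px in p if px > 0)

def cross_entropy(p, q):
    return sum(-px * math.log2(qx) for px, qx in zip(p, q) if px > 0)

def kl_divergence(p, q):
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

p = [0.7, 0.2, 0.1]  # "true" distribution
q = [0.5, 0.3, 0.2]  # model's distribution

# The gap between cross-entropy and entropy is exactly the KL divergence
gap = cross_entropy(p, q) - entropy(p)
print(gap, kl_divergence(p, q))  # identical
```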

KL Divergence — A "Distance" Between Distributions

Kullback-Leibler Divergence $$D_{\text{KL}}(p \| q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_p\left[\log\frac{p(X)}{q(X)}\right]$$

Properties:

1. \(D_{\text{KL}} \geq 0\) always (Gibbs' inequality). Zero iff \(p = q\).

2. Not symmetric: \(D_{\text{KL}}(p \| q) \neq D_{\text{KL}}(q \| p)\) in general.

3. \(D_{\text{KL}}(p \| q) = H(p, q) - H(p)\): the "extra bits" above entropy.

Forward vs Reverse KL

\(D_{\text{KL}}(p \| q)\) (forward KL) — Penalizes \(q\) for assigning low probability where \(p\) is high. Makes \(q\) "cover" all modes of \(p\) → mean-seeking. Used in training (MLE).

\(D_{\text{KL}}(q \| p)\) (reverse KL) — Penalizes \(q\) for assigning high probability where \(p\) is low. Makes \(q\) concentrate on the highest mode → mode-seeking. Used in variational inference.
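The two behaviors show up numerically when we evaluate both divergences on a grid, comparing two candidate approximations to a bimodal target (a sketch; the mixture and the two candidates are illustrative choices, not canonical ones):

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def kl(a, b, dx):
    return np.sum(a * np.log(a / b)) * dx  # KL in nats on a grid

x = np.linspace(-12, 12, 4001)
dx = x[1] - x[0]

p = 0.5 * normal_pdf(x, -3, 1) + 0.5 * normal_pdf(x, 3, 1)  # bimodal target
q_wide = normal_pdf(x, 0, np.sqrt(10))  # covers both modes (moment-matched)
q_mode = normal_pdf(x, 3, 1)            # sits on a single mode

# Forward KL punishes missing a mode, so it prefers the wide, covering q
print(kl(p, q_wide, dx), kl(p, q_mode, dx))  # small vs. large
# Reverse KL punishes mass where p is low, so it prefers the single-mode q
print(kl(q_mode, p, dx), kl(q_wide, p, dx))  # small vs. large
```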

Mutual Information

Mutual Information $$I(X; Y) = H(X) - H(X \mid Y) = D_{\text{KL}}(p(x,y) \| p(x)p(y))$$

How much knowing \(Y\) reduces uncertainty about \(X\). Zero if independent, maximum if one completely determines the other. Plays a key role in feature selection, the information bottleneck, and understanding what neural networks learn.
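Both identities for \(I(X; Y)\) can be checked on a small joint distribution (a sketch; the 2-by-2 joint table is arbitrary):

```python
import numpy as np

# Joint distribution p(x, y) over 2 x 2 outcomes (rows: x, cols: y)
pxy = np.array([[0.3, 0.1],
                [0.1, 0.5]])
px = pxy.sum(axis=1)  # marginal p(x)
py = pxy.sum(axis=0)  # marginal p(y)

# I(X;Y) as KL( p(x,y) || p(x)p(y) )
mi_kl = np.sum(pxy * np.log2(pxy / np.outer(px, py)))

# I(X;Y) as H(X) - H(X|Y)
hx = -np.sum(px * np.log2(px))
hx_given_y = -np.sum(pxy * np.log2(pxy / py))  # -sum p(x,y) log p(x|y)
mi_cond = hx - hx_given_y

print(mi_kl, mi_cond)  # identical, and positive since X and Y are dependent
```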

(a) A language model assigns probabilities [0.7, 0.2, 0.05, 0.05] to next tokens ["the", "a", "an", "one"]. The true distribution is [0.6, 0.3, 0.08, 0.02]. Compute the cross-entropy and KL divergence.
(b) Why does minimizing cross-entropy \(H(p, q_\theta)\) give the same optimal \(\theta\) as minimizing \(D_{\text{KL}}(p \| q_\theta)\)? Show this algebraically.
(c) A perfectly uniform language model over a 50K vocabulary has what perplexity? What does this tell you about the baseline difficulty of language modeling?
· · ·
Chapter 2

Variational Autoencoders

Kingma & Welling (2014) — "Auto-Encoding Variational Bayes." Introduced a practical framework for training latent variable models using a combination of neural networks and variational inference.

You want to learn a model that can generate new data. The idea: data is generated from low-dimensional latent variables \(\mathbf{z}\) (e.g., "style," "content," "pose" for face images). But you can't observe \(\mathbf{z}\) directly — you need to infer it. VAEs solve this with a clever combination of an encoder (infers \(\mathbf{z}\) from data) and a decoder (generates data from \(\mathbf{z}\)).

The Generative Model

$$p(\mathbf{x}) = \int p(\mathbf{x} \mid \mathbf{z}) p(\mathbf{z}) d\mathbf{z}$$

Where \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, \mathbf{I})\) (prior) and \(p(\mathbf{x} \mid \mathbf{z}) = \text{Decoder}_\theta(\mathbf{z})\). The integral is intractable (we'd need to integrate over all possible \(\mathbf{z}\)), so we can't compute \(p(\mathbf{x})\) or maximize it directly.

The ELBO โ€” Evidence Lower Bound

Introduce an approximate posterior \(q_\phi(\mathbf{z} \mid \mathbf{x})\) (the encoder). We can show:

ELBO Derivation $$\log p(\mathbf{x}) = \underbrace{\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[\log p_\theta(\mathbf{x} \mid \mathbf{z})]}_{\text{reconstruction}} - \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z}))}_{\text{regularization}} + \underbrace{D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z} \mid \mathbf{x}))}_{\geq 0 \text{ (gap)}}$$

Since the last term \(\geq 0\), the first two terms form a lower bound on \(\log p(\mathbf{x})\):

ELBO $$\text{ELBO}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x} \mid \mathbf{z})] - D_{\text{KL}}(q_\phi(\mathbf{z} \mid \mathbf{x}) \| p(\mathbf{z})) \leq \log p(\mathbf{x})$$

Term 1 (Reconstruction): How well can the decoder reconstruct \(\mathbf{x}\) from \(\mathbf{z}\)? Encourages accurate encoding.
Term 2 (KL Regularizer): How close is the encoder's posterior to the prior? Prevents the encoder from "cheating" by making each \(\mathbf{z}\) unique to each \(\mathbf{x}\).

The Reparameterization Trick

Problem: we need to backpropagate through the sampling operation \(\mathbf{z} \sim q_\phi(\mathbf{z} \mid \mathbf{x})\). Sampling is not differentiable. Solution: reparameterize.

Reparameterization $$\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}, \quad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$

Now the randomness is in \(\boldsymbol{\epsilon}\), which doesn't depend on \(\phi\). Gradients flow through \(\boldsymbol{\mu}\) and \(\boldsymbol{\sigma}\) normally.
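A quick numerical sanity check of the trick, using a toy objective \(\mathbb{E}[z^2]\) whose true gradient with respect to \(\mu\) is \(2\mu\) (a sketch; the values of `mu` and `sigma` are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.7, 0.5  # pretend these are the encoder's outputs

# z = mu + sigma * eps pushes all randomness into eps ~ N(0, 1)
eps = rng.standard_normal(200_000)
z = mu + sigma * eps

# Toy objective E[z^2]; per-sample gradient wrt mu is d(z^2)/d(mu) = 2z
grad_mu = np.mean(2.0 * z)

# Analytically E[z^2] = mu^2 + sigma^2, so the true gradient is 2*mu = 1.4
print(grad_mu)  # ~1.4
```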

· · ·
Chapter 3

Generative Adversarial Networks

Goodfellow et al. (2014) — "Generative Adversarial Nets." Introduced the adversarial training framework: a generator learns to produce realistic data while a discriminator learns to distinguish real from fake. One of the most creative ideas in ML.

The Minimax Game

GAN Objective $$\min_G \max_D \; \mathbb{E}_{\mathbf{x} \sim p_{\text{data}}}[\log D(\mathbf{x})] + \mathbb{E}_{\mathbf{z} \sim p(\mathbf{z})}[\log(1 - D(G(\mathbf{z})))]$$

\(D(\mathbf{x})\) is the discriminator's estimate that \(\mathbf{x}\) is real. \(G(\mathbf{z})\) is the generator's output from noise \(\mathbf{z}\). The discriminator maximizes (get better at distinguishing). The generator minimizes (fool the discriminator).

At Nash Equilibrium

Goodfellow proved that the optimal discriminator is \(D^*(\mathbf{x}) = \frac{p_{\text{data}}(\mathbf{x})}{p_{\text{data}}(\mathbf{x}) + p_g(\mathbf{x})}\). Substituting this back, the generator's loss becomes \(2 \cdot D_{\text{JS}}(p_{\text{data}} \| p_g) - \log 4\), where \(D_{\text{JS}}\) is the Jensen-Shannon divergence. At equilibrium: \(p_g = p_{\text{data}}\) and \(D(\mathbf{x}) = 0.5\) everywhere.
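The identity between the value at \(D^*\) and the Jensen-Shannon divergence can be verified on discrete distributions (a sketch; `p_data` and `p_g` are arbitrary illustrative distributions):

```python
import numpy as np

p_data = np.array([0.5, 0.3, 0.2])  # toy "data" distribution
p_g = np.array([0.2, 0.3, 0.5])     # toy generator distribution

d_star = p_data / (p_data + p_g)    # optimal discriminator per outcome

# Value of the inner game at D*
value = np.sum(p_data * np.log(d_star)) + np.sum(p_g * np.log(1.0 - d_star))

# Jensen-Shannon divergence (in nats)
def kl(a, b):
    return np.sum(a * np.log(a / b))

m = 0.5 * (p_data + p_g)
jsd = 0.5 * kl(p_data, m) + 0.5 * kl(p_g, m)

print(value, 2.0 * jsd - np.log(4.0))  # identical
```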

GANs convert the problem of density estimation (hard) into a classification problem (easier). Instead of asking "what's the probability of this image?", they ask "can you tell if this image is real or fake?" — which is a much more tractable question.
· · ·
Chapter 4

Diffusion Models

Ho, Jain & Abbeel (2020) — "Denoising Diffusion Probabilistic Models" (DDPM). Showed that iteratively denoising Gaussian noise can generate high-quality images, establishing diffusion models as the new state-of-the-art for image generation. The foundation of Stable Diffusion, DALL-E 2, and Midjourney.

Imagine you have a beautiful photograph. You gradually add noise to it, step by step, until it becomes pure static. Now train a neural network to reverse each step — to slightly denoise the image. If you can do this perfectly, you can start from pure noise and iteratively refine it into a photograph.

Forward Process (Adding Noise)

Forward Diffusion $$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$

At each step \(t\), add a small amount of Gaussian noise controlled by \(\beta_t\). After \(T\) steps (typically 1000), \(\mathbf{x}_T \approx \mathcal{N}(\mathbf{0}, \mathbf{I})\). A key result lets us jump directly to any step:

$$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$$ $$\text{where } \alpha_t = 1 - \beta_t, \quad \bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$$
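This shortcut can be sanity-checked by noising many chains one step at a time and comparing against the closed form (a sketch with scalar data and DDPM's default linear schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
T, t_check = 1000, 200
betas = np.linspace(1e-4, 0.02, T)   # DDPM's linear beta schedule
alpha_bar = np.cumprod(1.0 - betas)  # \bar{alpha}_t (0-indexed)

# Noise 100k copies of a scalar x0 = 1.0, one step at a time
x = np.ones(100_000)
for beta in betas[:t_check]:
    x = np.sqrt(1.0 - beta) * x + np.sqrt(beta) * rng.standard_normal(x.size)

# Closed form says x_t ~ N(sqrt(abar_t) * x0, (1 - abar_t) I)
print(x.mean(), np.sqrt(alpha_bar[t_check - 1]))  # both ~0.81
print(x.var(), 1.0 - alpha_bar[t_check - 1])      # both ~0.34
```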

Reverse Process (Denoising)

The reverse process is also Gaussian for small \(\beta_t\):

Learned Reverse Process $$p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \sigma_t^2 \mathbf{I})$$

Training Objective โ€” Predict the Noise

Instead of predicting \(\boldsymbol{\mu}\) directly, DDPM reparameterizes to predict the noise \(\boldsymbol{\epsilon}\) that was added:

DDPM Training Loss (simplified) $$\mathcal{L} = \mathbb{E}_{t, \mathbf{x}_0, \boldsymbol{\epsilon}}\left[\|\boldsymbol{\epsilon} - \boldsymbol{\epsilon}_\theta(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\boldsymbol{\epsilon}, \; t)\|^2\right]$$

In plain English: Sample a training image \(\mathbf{x}_0\). Pick a random noise level \(t\). Add noise to get \(\mathbf{x}_t\). Train the network \(\boldsymbol{\epsilon}_\theta\) to predict what noise was added. That's it — an MSE loss on noise prediction.
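That whole training step fits in a few lines (a sketch; `eps_model` is a stand-in placeholder, where a real DDPM would use a trained U-Net):

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

def eps_model(x_t, t):
    """Stand-in for the noise-prediction network; NOT a trained model."""
    return 0.1 * x_t  # arbitrary placeholder

def ddpm_loss(x0):
    t = rng.integers(0, T)                          # random noise level
    eps = rng.standard_normal(x0.shape)             # the noise we add
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return np.mean((eps - eps_model(x_t, t)) ** 2)  # MSE on the noise

x0 = rng.standard_normal(64)  # stand-in "training image"
print(ddpm_loss(x0))          # finite, nonnegative
```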

Sampling (Generation)

Start from \(\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\). For \(t = T, T-1, \ldots, 1\):

$$\mathbf{x}_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{\beta_t}{\sqrt{1-\bar{\alpha}_t}} \boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t)\right) + \sigma_t \mathbf{z}$$

Each step removes a bit of predicted noise, gradually revealing a clean image. Here \(\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})\) is fresh noise injected at each step (set to \(\mathbf{0}\) at the final step \(t = 1\)).

Connection to score matching: The noise predictor \(\boldsymbol{\epsilon}_\theta\) is directly related to the score function (gradient of log probability): \(\nabla_{\mathbf{x}} \log p(\mathbf{x}_t) \approx -\boldsymbol{\epsilon}_\theta(\mathbf{x}_t, t) / \sqrt{1-\bar{\alpha}_t}\). Diffusion models are essentially learning the score at every noise level, then using it to denoise via Langevin dynamics.
· · ·
Chapter 5

The Geometry of Neural Networks

Loss Landscapes

The loss function \(\mathcal{L}(\theta)\) defines a surface over the parameter space \(\theta \in \mathbb{R}^d\). For a 7B parameter model, this is a surface in 7-billion-dimensional space. Despite the extreme dimensionality, some universal properties emerge:

Saddle points dominate over local minima. In high dimensions, a critical point (where \(\nabla\mathcal{L} = 0\)) is a local minimum only if all eigenvalues of the Hessian are positive. The probability of this decreases exponentially with dimension. Most critical points are saddle points โ€” local minima along some directions, local maxima along others. SGD's noise helps it escape these.

Connected low-loss regions. Different solutions (minima) in parameter space tend to be connected by paths of low loss — the "mode connectivity" phenomenon. This suggests the loss landscape is more like a valley system than isolated basins.

The Lottery Ticket Hypothesis

Frankle & Carbin (2019) — "The Lottery Ticket Hypothesis." Dense, randomly-initialized networks contain sparse subnetworks ("winning tickets") that, when trained in isolation, reach comparable accuracy. Implication: most parameters in a trained network are redundant — the network is dramatically overparameterized.

Double Descent

Classical ML says: more parameters → overfitting → worse test error (bias-variance tradeoff). But modern deep learning contradicts this — test error first decreases, then increases (interpolation threshold), then decreases again as you add more parameters.

Double Descent Phenomenon $$\text{Test error} \sim \begin{cases}\text{decreasing} & \text{(underfitting regime)} \\ \text{spike} & \text{(interpolation threshold: params} \approx \text{data)} \\ \text{decreasing again} & \text{(overparameterized regime)}\end{cases}$$
In the overparameterized regime, there are many solutions that fit the training data perfectly. SGD with implicit regularization (from the initialization, learning rate, and noise) selects the "simplest" interpolating solution โ€” which generalizes well. More parameters = more interpolating solutions to choose from = better chance of finding a simple one.
· · ·
Chapter 6

Information Bottleneck & Generalization

The Bias-Variance Decomposition

Bias-Variance Decomposition $$\mathbb{E}[(y - \hat{f}(\mathbf{x}))^2] = \underbrace{\text{Bias}^2[\hat{f}]}_{\text{model too simple}} + \underbrace{\text{Var}[\hat{f}]}_{\text{model too complex}} + \underbrace{\sigma^2_\epsilon}_{\text{irreducible noise}}$$

Bias: error from approximation assumptions (linear model for nonlinear data). Variance: error from sensitivity to training data fluctuations (overfitting). The classic tradeoff: increasing model complexity reduces bias but increases variance. But deep networks break this tradeoff in the overparameterized regime.

The Information Bottleneck

Tishby & Zaslavsky (2015) — "Deep Learning and the Information Bottleneck Principle." Proposed that deep networks learn by first fitting the data (memorization), then compressing the representation (generalization). Each layer forms an information bottleneck that keeps the information relevant for the task while discarding noise.

Information Bottleneck Objective $$\min_{p(\hat{x}|x)} I(X; \hat{X}) - \beta \cdot I(\hat{X}; Y)$$

Minimize information about the input \(X\) kept in the representation \(\hat{X}\), while maximizing information about the label \(Y\). The tradeoff is controlled by \(\beta\). In deep networks, each hidden layer can be seen as a point on this information curve.

PAC Learning โ€” Theoretical Bounds

Probably Approximately Correct (PAC) learning provides guarantees: with probability at least \(1-\delta\), the model's true error is within \(\epsilon\) of the training error, provided the training set is large enough:

$$N \geq \frac{1}{\epsilon}\left(\ln|\mathcal{H}| + \ln\frac{1}{\delta}\right)$$

Where \(|\mathcal{H}|\) is the hypothesis class size. For neural networks, classical bounds based on parameter count are extremely loose — a 7B parameter model would theoretically need astronomically more data than it actually uses. This gap between theory and practice is one of the great open problems in ML theory.
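Plugging illustrative numbers into the bound (a sketch; the hypothesis-class size is made up, and `pac_sample_size` is a hypothetical helper):

```python
import math

def pac_sample_size(h_size, eps, delta):
    """Training-set size sufficient for the finite-class PAC bound."""
    return math.ceil((math.log(h_size) + math.log(1.0 / delta)) / eps)

# A class of one million hypotheses, eps = 0.05, delta = 0.01:
print(pac_sample_size(10**6, 0.05, 0.01))  # a few hundred examples suffice
```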

· · ·
Chapter 7

Advanced Probability for ML

Exponential Family Distributions

Most distributions used in ML belong to the exponential family:

Exponential Family $$p(\mathbf{x} \mid \boldsymbol{\eta}) = h(\mathbf{x}) \exp(\boldsymbol{\eta}^T T(\mathbf{x}) - A(\boldsymbol{\eta}))$$

Where \(\boldsymbol{\eta}\) are natural parameters, \(T(\mathbf{x})\) are sufficient statistics, and \(A(\boldsymbol{\eta})\) is the log-partition function. Gaussian, Bernoulli, Poisson, Gamma, Categorical — all are exponential family. This unifies GLMs (generalized linear models): logistic regression is an exponential family model with Bernoulli likelihood and logit link.

Why this matters for deep learning: The softmax output layer is a natural parameterization for the Categorical exponential family. The cross-entropy loss equals the negative log-likelihood of this family. The "elegant gradient" \(\delta = \hat{y} - y\) that we proved in Part II holds for all exponential family GLMs — it's not specific to sigmoid or softmax. This is the deep mathematical reason why our gradient formulas are so clean.
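For the softmax case, the clean gradient \(\delta = \hat{y} - y\) can be verified against central finite differences (a minimal sketch):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def loss(z, k):
    """Cross-entropy of softmax(z) against a one-hot label at index k."""
    return -np.log(softmax(z)[k])

z = np.array([1.2, -0.3, 0.5, 2.0])  # arbitrary logits
k = 2                                # arbitrary true class
analytic = softmax(z) - np.eye(len(z))[k]  # delta = y_hat - y

# Central finite differences on each logit
h = 1e-5
numeric = np.zeros_like(z)
for i in range(len(z)):
    zp, zm = z.copy(), z.copy()
    zp[i] += h
    zm[i] -= h
    numeric[i] = (loss(zp, k) - loss(zm, k)) / (2 * h)

print(analytic, numeric)  # agree to ~1e-9
```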

Variational Inference

When the true posterior \(p(\mathbf{z} \mid \mathbf{x})\) is intractable, approximate it with a simpler distribution \(q_\phi(\mathbf{z})\) by minimizing:

$$q^* = \arg\min_{q \in \mathcal{Q}} D_{\text{KL}}(q(\mathbf{z}) \| p(\mathbf{z} \mid \mathbf{x}))$$

This is the principle behind VAEs, Bayesian neural networks, and many probabilistic models. The choice of family \(\mathcal{Q}\) trades off expressiveness vs. tractability. Common choice: mean-field (factorized Gaussian) or normalizing flows (flexible, invertible transforms).

Markov Chain Monte Carlo (MCMC)

When variational approximation is insufficient, we can draw samples from the posterior using Markov Chain Monte Carlo:

Metropolis-Hastings $$\alpha = \min\!\left(1, \frac{p(\mathbf{z}')\,q(\mathbf{z} \mid \mathbf{z}')}{p(\mathbf{z})\,q(\mathbf{z}' \mid \mathbf{z})}\right) \quad \text{(acceptance probability)}$$

Propose a new state \(\mathbf{z}' \sim q(\mathbf{z}' \mid \mathbf{z})\) from a proposal distribution \(q\), then accept it with probability \(\alpha\). The resulting chain of samples converges to the target distribution \(p\). Modern variants (Hamiltonian Monte Carlo, NUTS) use gradient information for efficient exploration, but MCMC remains too slow for large neural networks — which is why variational methods dominate in deep learning.
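A minimal Metropolis-Hastings sampler for a 1-D target (a sketch; the target is a standard normal chosen for illustration, and with a symmetric Gaussian proposal the \(q\)-ratio cancels, leaving only the target ratio):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_target(z):
    """Unnormalized log density of N(0, 1); the constant cancels in alpha."""
    return -0.5 * z * z

samples, z = [], 0.0
for _ in range(50_000):
    z_new = z + rng.normal(scale=1.0)        # symmetric random-walk proposal
    log_alpha = log_target(z_new) - log_target(z)
    if np.log(rng.uniform()) < log_alpha:    # accept with prob min(1, ratio)
        z = z_new
    samples.append(z)

samples = np.array(samples[5_000:])   # discard burn-in
print(samples.mean(), samples.std())  # approach 0 and 1, the target's moments
```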

(a) Derive the KL divergence between two Gaussians: \(D_{\text{KL}}(\mathcal{N}(\mu_1, \sigma_1^2) \| \mathcal{N}(\mu_2, \sigma_2^2))\). This appears in the VAE loss.
(b) In a diffusion model with \(T = 1000\) steps and linear schedule \(\beta_t\) from 0.0001 to 0.02, compute \(\bar{\alpha}_{1000}\). How close is \(\mathbf{x}_{1000}\) to pure noise?
(c) The information bottleneck says early training = memorization, late training = compression. How would you test this experimentally using mutual information estimates?
The theoretical structures in this volume aren't just abstract — they explain why every practical technique
in Parts I–IV works. Information theory tells us why cross-entropy is optimal. The ELBO tells us how to
train generative models. Double descent tells us why overparameterized models don't overfit.
Theory and practice illuminate each other.