A complete mathematical journey from linear algebra to modern LLMs — every equation derived, every concept connected.
Linear algebra, probability, Bayes' theorem, MLE, key distributions, gradients, SGD, and the chain rule — the mathematical toolkit for everything ahead.
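As a taste of how the chain rule drives SGD, here is a minimal sketch (not from the text itself) using a toy one-parameter model `y_hat = w * x` with squared-error loss; the gradient `dL/dw` is obtained by chaining `dL/dy_hat` with `dy_hat/dw`:

```python
# Toy model: y_hat = w * x, loss L = (y_hat - y)^2.
# Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = 2 * (y_hat - y) * x.

def sgd_step(w, x, y, lr=0.1):
    y_hat = w * x
    grad = 2.0 * (y_hat - y) * x   # chain rule gives dL/dw
    return w - lr * grad           # gradient descent update

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=2.0, y=6.0)  # data generated by y = 3x
# w converges toward 3.0
```

One data point suffices here because the loss is convex in `w`; with a dataset, the same step would be applied to randomly sampled examples — that sampling is the "stochastic" in SGD.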
Full derivations of linear regression (Gaussian MLE → MSE → normal equation) and logistic regression (Bernoulli MLE → cross-entropy → sigmoid gradient).
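The endpoint of the linear-regression derivation can be sketched in a few lines: minimizing MSE for `y = X w` yields the normal equation `w* = (Xᵀ X)⁻¹ Xᵀ y`. A minimal illustration on noise-free synthetic data (the data-generating weights here are arbitrary choices for the example):

```python
import numpy as np

# Normal equation: the MSE minimizer for y = X @ w is
# w* = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=50)]  # bias column + one feature
true_w = np.array([1.0, 2.0])
y = X @ true_w                               # noise-free for clarity

# Solve (X^T X) w = X^T y rather than inverting explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)
```

With no noise the recovered `w` matches `true_w` exactly (up to floating point); with Gaussian noise it is the MLE, which is the point of the Gaussian → MSE derivation.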
Neural networks from a single neuron through MLPs, with complete step-by-step backpropagation derivation and worked numerical examples.
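The flavor of a worked backprop example can be sketched with a single hidden sigmoid neuron (the specific weights and target below are illustrative, not taken from the text): forward pass, then gradients layer by layer via the chain rule.

```python
import math

# Tiny network: x -> h = sigmoid(w1*x) -> y_hat = w2*h,
# loss L = 0.5 * (y_hat - y)^2.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 1.0
w1, w2 = 0.5, 0.5

# forward pass
h = sigmoid(w1 * x)
y_hat = w2 * h

# backward pass: chain rule, one layer at a time
dL_dyhat = y_hat - y               # dL/dy_hat
dL_dw2 = dL_dyhat * h              # output-layer weight
dL_dh = dL_dyhat * w2              # flow back through y_hat = w2*h
dL_dw1 = dL_dh * h * (1 - h) * x   # sigmoid'(z) = h*(1-h)
```

A useful habit shown in such derivations: check each analytic gradient against a finite-difference estimate before trusting it.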
CNNs as pattern matching, hierarchical feature learning, RNNs/LSTMs with gating mechanisms, and how gating mitigates the vanishing-gradient problem.
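The "convolution as pattern matching" view can be sketched in one dimension (the signal and kernel below are made-up toy values): each output is the dot product of the filter with a sliding window, so the output peaks where the input matches the filter's template.

```python
# 1-D cross-correlation: slide the kernel over the signal and take
# a dot product at each position.
def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 1, 2, 1, 0, 0]   # a bump in the input
kernel = [1, 2, 1]               # a template shaped like that bump
out = conv1d(signal, kernel)     # response peaks where template aligns
```

Stacking such filters, with nonlinearities and pooling in between, is what produces the hierarchical features (edges → textures → parts) the chapter describes.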
Word2Vec skip-gram with negative sampling, GloVe's co-occurrence objective, BPE tokenization, and information-theoretic foundations.
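One BPE merge step can be sketched with a toy corpus (the words and counts below are illustrative): count adjacent symbol pairs across the vocabulary, then merge the most frequent pair into a new token.

```python
from collections import Counter

# Toy vocabulary: each key is a word split into symbols, each value
# its corpus frequency.
def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): f for word, f in words.items()}

vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
pair = most_frequent_pair(vocab)   # ('w', 'e') is most common here
vocab = merge_pair(vocab, pair)    # 'w e' becomes the token 'we'
```

Repeating this loop until a target vocabulary size is reached yields the full BPE tokenizer; frequent substrings become single tokens, rare words stay decomposed.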
Dot-product attention from first principles, self-attention, multi-head attention, the full transformer architecture, positional encoding, and causal masking.
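The core formula, `Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V`, with causal masking, fits in a short numpy sketch (shapes and random inputs here are arbitrary for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products
    if causal:                               # block attention to the future
        mask = np.tril(np.ones_like(scores))
        scores = np.where(mask == 1, scores, -np.inf)
    return softmax(scores) @ V               # weighted average of values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))          # self-attention: shared input
out = attention(Q, K, V, causal=True)
```

With the causal mask, position 0 can attend only to itself, so its output is exactly `V[0]` — a quick sanity check that the masking is right. Multi-head attention runs several such maps in parallel on learned projections of the same input.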
GPT's next-token prediction, scaling laws, RMSNorm, RoPE, SwiGLU, GQA, Mixture of Experts, and RLHF/DPO alignment.
A unified view: geometric (space-warping), probabilistic (posterior estimation), and similarity-based (attention as learned similarity) interpretations.