A complete mathematical journey from linear algebra to modern LLMs — every equation derived, every concept connected.
Linear algebra, probability, Bayes' theorem, MLE, key distributions, gradients, SGD, and the chain rule — the mathematical toolkit for everything ahead.
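As a taste of how the chain rule drives SGD, here is a minimal sketch (not from the text itself) using a toy one-parameter model `y_hat = w * x` with squared-error loss; the gradient `dL/dw` is obtained by chaining `dL/dy_hat` with `dy_hat/dw`:

```python
# Toy model: y_hat = w * x, loss L = (y_hat - y)^2.
# Chain rule: dL/dw = dL/dy_hat * dy_hat/dw = 2 * (y_hat - y) * x.

def sgd_step(w, x, y, lr=0.1):
    y_hat = w * x
    grad = 2.0 * (y_hat - y) * x   # chain rule gives dL/dw
    return w - lr * grad           # gradient descent update

w = 0.0
for _ in range(100):
    w = sgd_step(w, x=2.0, y=6.0)  # data generated by y = 3x
# w converges toward 3.0
```

One data point suffices here because the loss is convex in `w`; with a dataset, the same step would be applied to randomly sampled examples — that sampling is the "stochastic" in SGD.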
Full derivations of linear regression (Gaussian MLE → MSE → normal equation) and logistic regression (Bernoulli MLE → cross-entropy → sigmoid gradient).
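The endpoint of the linear-regression derivation can be sketched in a few lines: minimizing MSE for `y = X w` yields the normal equation `w* = (Xᵀ X)⁻¹ Xᵀ y`. A minimal illustration on noise-free synthetic data (the data-generating weights here are arbitrary choices for the example):

```python
import numpy as np

# Normal equation: the MSE minimizer for y = X @ w is
# w* = (X^T X)^{-1} X^T y.
rng = np.random.default_rng(0)
X = np.c_[np.ones(50), rng.normal(size=50)]  # bias column + one feature
true_w = np.array([1.0, 2.0])
y = X @ true_w                               # noise-free for clarity

# Solve (X^T X) w = X^T y rather than inverting explicitly
w = np.linalg.solve(X.T @ X, X.T @ y)
```

With no noise the recovered `w` matches `true_w` exactly (up to floating point); with Gaussian noise it is the MLE, which is the point of the Gaussian → MSE derivation.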
Neural networks from a single neuron through MLPs, with complete step-by-step backpropagation derivation and worked numerical examples.
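The flavor of a worked backprop example can be sketched with a single hidden sigmoid neuron (the specific weights and target below are illustrative, not taken from the text): forward pass, then gradients layer by layer via the chain rule.

```python
import math

# Tiny network: x -> h = sigmoid(w1*x) -> y_hat = w2*h,
# loss L = 0.5 * (y_hat - y)^2.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, y = 1.0, 1.0
w1, w2 = 0.5, 0.5

# forward pass
h = sigmoid(w1 * x)
y_hat = w2 * h

# backward pass: chain rule, one layer at a time
dL_dyhat = y_hat - y               # dL/dy_hat
dL_dw2 = dL_dyhat * h              # output-layer weight
dL_dh = dL_dyhat * w2              # flow back through y_hat = w2*h
dL_dw1 = dL_dh * h * (1 - h) * x   # sigmoid'(z) = h*(1-h)
```

A useful habit shown in such derivations: check each analytic gradient against a finite-difference estimate before trusting it.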
CNNs as pattern matching, hierarchical feature learning, RNNs/LSTMs with gating mechanisms, and how gating mitigates the vanishing-gradient problem.
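The "convolution as pattern matching" view can be sketched in one dimension (the signal and kernel below are made-up toy values): each output is the dot product of the filter with a sliding window, so the output peaks where the input matches the filter's template.

```python
# 1-D cross-correlation: slide the kernel over the signal and take
# a dot product at each position.
def conv1d(signal, kernel):
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

signal = [0, 0, 1, 2, 1, 0, 0]   # a bump in the input
kernel = [1, 2, 1]               # a template shaped like that bump
out = conv1d(signal, kernel)     # response peaks where template aligns
```

Stacking such filters, with nonlinearities and pooling in between, is what produces the hierarchical features (edges → textures → parts) the chapter describes.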
Word2Vec skip-gram with negative sampling, GloVe's co-occurrence objective, BPE tokenization, and information-theoretic foundations.
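One BPE merge step can be sketched with a toy corpus (the words and counts below are illustrative): count adjacent symbol pairs across the vocabulary, then merge the most frequent pair into a new token.

```python
from collections import Counter

# Toy vocabulary: each key is a word split into symbols, each value
# its corpus frequency.
def most_frequent_pair(words):
    pairs = Counter()
    for word, freq in words.items():
        symbols = word.split()
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    old, new = " ".join(pair), "".join(pair)
    return {word.replace(old, new): f for word, f in words.items()}

vocab = {"l o w": 5, "l o w e r": 2, "n e w e s t": 6}
pair = most_frequent_pair(vocab)   # ('w', 'e') is most common here
vocab = merge_pair(vocab, pair)    # 'w e' becomes the token 'we'
```

Repeating this loop until a target vocabulary size is reached yields the full BPE tokenizer; frequent substrings become single tokens, rare words stay decomposed.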
Dot-product attention from first principles, self-attention, multi-head attention, the full transformer architecture, positional encoding, and causal masking.
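The core formula, `Attention(Q, K, V) = softmax(Q Kᵀ / √d_k) V`, with causal masking, fits in a short numpy sketch (shapes and random inputs here are arbitrary for illustration):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # stable softmax
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V, causal=False):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)          # scaled dot products
    if causal:                               # block attention to the future
        mask = np.tril(np.ones_like(scores))
        scores = np.where(mask == 1, scores, -np.inf)
    return softmax(scores) @ V               # weighted average of values

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))          # self-attention: shared input
out = attention(Q, K, V, causal=True)
```

With the causal mask, position 0 can attend only to itself, so its output is exactly `V[0]` — a quick sanity check that the masking is right. Multi-head attention runs several such maps in parallel on learned projections of the same input.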
GPT's next-token prediction, scaling laws, RMSNorm, RoPE, SwiGLU, GQA, Mixture of Experts, and RLHF/DPO alignment.
A unified view: geometric (space-warping), probabilistic (posterior estimation), and similarity-based (attention as learned similarity) interpretations.