Serving at scale, the open-source ecosystem, safety & alignment,
how to read papers, and a complete glossary of 150+ terms.
Training happens once; inference happens millions of times. Optimizing inference cost-per-token is often more impactful than training improvements.
Prefill is compute-bound (matrix multiplications dominate). Decode is memory-bandwidth-bound (loading weights from HBM for each token dominates). Different optimizations target each phase.
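To see why decode is bandwidth-bound, a back-of-envelope roofline helps: each generated token must stream every weight from HBM once, so the decode ceiling is bandwidth divided by model size. The model size and bandwidth figures below are illustrative assumptions, not measured numbers.

```python
# Roofline estimate for the decode phase: peak tokens/sec is bounded by
# (HBM bandwidth) / (model size in bytes), since every weight is read
# once per generated token.
def decode_roofline_tokens_per_sec(params_billion: float,
                                   bytes_per_param: float,
                                   hbm_bandwidth_gb_per_s: float) -> float:
    model_size_gb = params_billion * bytes_per_param
    return hbm_bandwidth_gb_per_s / model_size_gb

# Assumed numbers: a 7B model on a GPU with ~2,000 GB/s of HBM bandwidth.
print(f"FP16 ceiling: {decode_roofline_tokens_per_sec(7, 2.0, 2000):.0f} tok/s")  # ~143
print(f"INT4 ceiling: {decode_roofline_tokens_per_sec(7, 0.5, 2000):.0f} tok/s")  # ~571
```

Note what falls out of the estimate: shrinking bytes per parameter (quantization) raises the decode ceiling proportionally, which is why quantization speeds up decode even though it barely helps compute-bound prefill.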
Naive batching waits until all sequences in a batch finish before starting new ones. Continuous batching (also called "in-flight batching") immediately fills empty slots as sequences complete, keeping GPU utilization high.
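The gap between the two strategies can be sketched with a toy simulation. The sequence lengths and batch size are arbitrary, and the continuous-batching figure is an idealized lower bound, not a real scheduler:

```python
import random

random.seed(0)
lengths = [random.randint(10, 100) for _ in range(32)]  # tokens per request
BATCH = 4

# Naive static batching: each group of 4 runs until its *longest*
# sequence finishes; finished slots sit idle until the whole batch refills.
naive_steps = sum(max(lengths[i:i + BATCH]) for i in range(0, len(lengths), BATCH))

# Continuous batching (idealized): a slot refills the instant its
# sequence completes, so every step does BATCH tokens of useful work.
continuous_steps = -(-sum(lengths) // BATCH)  # ceiling division

print(f"naive: {naive_steps} steps, continuous: ~{continuous_steps} steps")
```

The idle-slot waste in the naive case grows with the variance of sequence lengths, which is exactly the regime real serving traffic lives in.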
| Framework | Key Feature | Best For |
|---|---|---|
| vLLM | PagedAttention, continuous batching | High-throughput serving |
| TensorRT-LLM | NVIDIA kernel optimization | Maximum NVIDIA GPU perf |
| llama.cpp | CPU/Apple Silicon inference, GGUF | Local/edge deployment |
| Ollama | User-friendly local LLM runner | Development & testing |
| SGLang | Structured generation, RadixAttention | Complex prompting pipelines |
| TGI (HuggingFace) | Easy deployment, quantization | Production HF models |
A 70B model on 1× A100 generates ~30 tokens/sec. At $2/GPU-hour: ~$18.5 per million tokens. With INT4 quantization + continuous batching + speculative decoding: 3–5× improvement → $4–6 per million tokens.
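The arithmetic behind these figures is simple enough to check directly:

```python
def cost_per_million_tokens(tokens_per_sec: float, gpu_usd_per_hour: float) -> float:
    """Dollars to generate 1M tokens on one GPU at a given decode rate."""
    seconds_per_million = 1_000_000 / tokens_per_sec
    return seconds_per_million / 3600 * gpu_usd_per_hour

print(f"${cost_per_million_tokens(30, 2.0):.2f}")      # baseline: ~$18.52
print(f"${cost_per_million_tokens(30 * 4, 2.0):.2f}")  # with a 4x speedup: ~$4.63
```

Because cost scales inversely with throughput, any multiplicative speedup divides cost-per-token by the same factor, which is why the optimizations above compound so effectively.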
| Family | Organization | Sizes | Highlights |
|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | Strong general-purpose; permissive license |
| Qwen 2.5/3 | Alibaba | 0.5B–72B | Strong multilingual; excellent coding |
| Mistral / Mixtral | Mistral AI | 7B, 8×7B, 8×22B | Sliding window attention; MoE pioneers |
| DeepSeek V3/R1 | DeepSeek | 671B (MoE, 37B active) | Fine-grained MoE; strong reasoning |
| Gemma 2 | Google | 2B, 9B, 27B | Knowledge distillation from Gemini |
| Phi-3/4 | Microsoft | 3.8B, 14B | Small models trained on synthetic data |
| Command R+ | Cohere | 104B | Strong RAG capabilities |
- Transformers: Unified API for loading and using any model.
- Datasets: Streaming access to thousands of datasets.
- PEFT: LoRA/QLoRA/prefix-tuning wrappers.
- TRL: Training with reinforcement learning (RLHF, DPO).
- Accelerate: Multi-GPU/multi-node training without code changes.
- Spaces: Deploy demos instantly.
| Framework | Scale | Key Feature |
|---|---|---|
| PyTorch (native) | 1–8 GPUs | Flexibility; research standard |
| DeepSpeed | 8–1000s GPUs | ZeRO optimizer; massive model training |
| FSDP (PyTorch) | 8–100s GPUs | Native PyTorch distributed; simpler than DeepSpeed |
| Megatron-LM | 100s–1000s GPUs | Tensor + pipeline parallelism; frontier model training |
| Axolotl | 1–8 GPUs | Easy fine-tuning config; great for LoRA |
| LitGPT | 1–many GPUs | Clear, hackable GPT implementations |
| Technique | Mechanism | Limitation |
|---|---|---|
| RLHF | Train reward model from preferences, optimize with RL | Reward hacking; human label noise |
| DPO | Directly optimize preference pairs without reward model | Limited to pairwise comparisons |
| Constitutional AI | Self-critique against principles + RLAIF | Quality of constitution matters |
| Red teaming | Adversarial testing for failures | Can't cover all failure modes |
| Input/output filters | Classifier-based content moderation | Cat-and-mouse with adversaries |
Scalable oversight: How do humans supervise AI systems on tasks that exceed human capability? Current approach: use AI to help humans evaluate AI outputs (recursive reward modeling, debate).
Deceptive alignment: Could a model learn to appear aligned during training/evaluation but pursue different goals in deployment? This is a theoretical concern for future, more capable systems.
Dual use: The same capabilities that make LLMs helpful (biological knowledge, coding ability, persuasion) also enable misuse. How do you distribute capabilities widely while limiting harm?
An agent is an LLM that can take actions in the world (calling APIs, searching the web, writing and running code, managing files) in a loop until a task is complete.
Modern LLMs are trained to output structured tool calls:
```json
{
  "tool": "search",
  "arguments": {"query": "current weather Paris France"}
}
```
The framework executes the tool, feeds the result back, and the LLM continues. This is the foundation of AI assistants, coding agents, and autonomous workflows.
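A minimal sketch of this execute-and-feed-back loop, with a stubbed model and a hypothetical `search` tool. The message format and tool names are illustrative, not any particular provider's API:

```python
def search(query: str) -> str:
    """Hypothetical tool: a real implementation would call a search API."""
    return f"(stub) results for: {query}"

TOOLS = {"search": search}

def fake_model(messages):
    # Stand-in for an LLM call: issues one search, then answers
    # once a tool result appears in the conversation.
    if any(m["role"] == "tool" for m in messages):
        return {"type": "answer", "content": "It is sunny in Paris (stub)."}
    return {"type": "tool_call",
            "tool": "search",
            "arguments": {"query": "current weather Paris France"}}

def agent_loop(user_message: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        out = fake_model(messages)
        if out["type"] == "answer":
            return out["content"]
        result = TOOLS[out["tool"]](**out["arguments"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "max steps reached"

print(agent_loop("What's the weather in Paris?"))
```

The `max_steps` cap matters in practice: without it, a model that keeps emitting tool calls would loop forever.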
Complex tasks require decomposition. Research approaches include hierarchical planning (high-level plan → subtasks → tool calls), inner monologue (the model generates reasoning traces that guide its actions), and reflection (the model evaluates its own outputs and retries on failure).
| Section | What to Look For | Time Allocation |
|---|---|---|
| Abstract | Main claim and result. Does this paper matter to you? | 2 min |
| Introduction | Problem motivation. What gap does this fill? | 5 min |
| Figures/Tables | Architecture diagrams, result tables. The visual story. | 5 min (first pass) |
| Method | The new equations and algorithms. This is the core. | 30+ min |
| Experiments | Baselines, ablations. What actually matters? | 15 min |
| Related Work | How this fits in the field. New papers to read. | 5 min |
| Conclusion | Authors' honest assessment of limitations. | 3 min |
Pass 1 (10 min): Abstract, intro, section headings, figures, conclusion. Decide if you need to read deeper.
Pass 2 (1 hour): Read everything except proofs. Understand the main ideas, mark equations you don't follow. Write a one-paragraph summary.
Pass 3 (2–4 hours): Work through the math. Re-derive key equations. Identify assumptions and limitations. Think about what you'd do differently.
- No ablation study: You can't tell which components matter.
- Cherry-picked metrics: Reporting only metrics where the method wins.
- Missing baselines: Comparing against outdated or weak baselines.
- No error bars or confidence intervals: Results might not be significant.
- Excessive reliance on a single benchmark: May not generalize.
- Unreproducible: No code, no hyperparameter details, vague dataset description.
| Year | Milestone | Why It Matters |
|---|---|---|
| 2012 | AlexNet wins ImageNet | Deep learning goes mainstream; GPU training era begins |
| 2013 | Word2Vec | Efficient word embeddings; word arithmetic |
| 2014 | GloVe, GANs, Seq2Seq + Attention | Three foundational ideas in one year |
| 2015 | ResNet, Batch Norm, Bahdanau Attention | Training very deep nets; attention for translation |
| 2017 | "Attention Is All You Need" | Transformer replaces RNNs; the modern era begins |
| 2018 | GPT-1, BERT, ELMo | Pretraining + fine-tuning paradigm established |
| 2019 | GPT-2, T5, DistilBERT | Scaling up; text-to-text unification; distillation |
| 2020 | GPT-3, Scaling Laws, ViT, DDPM | In-context learning; scaling theory; vision transformers |
| 2021 | CLIP, Codex, LoRA | Multimodal; code generation; efficient fine-tuning |
| 2022 | ChatGPT, Chinchilla, Stable Diffusion, InstructGPT | LLMs go mainstream; compute-optimal training; text-to-image |
| 2023 | GPT-4, LLaMA, QLoRA, DPO, Mistral, Mamba | Multimodal GPT; open models; efficient alignment; SSMs |
| 2024 | Claude 3, Gemini 1.5, LLaMA 3, o1, DeepSeek-V3 | Long context; reasoning models; open MoE at frontier |
| 2025 | DeepSeek-R1, Claude 3.5/4, o3, Gemini 2 | Open reasoning models; test-time compute scaling |
Every important term from all six volumes, with precise definitions.
Activation function
Nonlinear function applied element-wise after a linear transformation. Common: ReLU, GELU, sigmoid, tanh. Enables networks to learn nonlinear relationships. [Part I §3.1]
Adam / AdamW
Adaptive optimizer combining momentum (first moment) and RMSProp (second moment) with bias correction. AdamW decouples weight decay from the gradient update. Standard optimizer for LLM training. [Part II §VIII]
Attention
Mechanism that computes weighted combinations of values based on query-key similarity: \(\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\mathbf{V}\). Foundation of the transformer. [Part I §6.1]
Autoregressive
Generating outputs one token at a time, where each token conditions on all previous tokens: \(P(x_t \mid x_{<t})\).
Backpropagation
Algorithm for computing gradients in neural networks by recursive application of the chain rule from output to input. [Part I §3.3, Part II §I]
BERT
Bidirectional Encoder Representations from Transformers. Encoder-only model trained with masked language modeling. Excels at understanding tasks. [Part II §VI]
BPE (Byte Pair Encoding)
Subword tokenization algorithm that iteratively merges the most frequent character pairs. Used by GPT, LLaMA, and most modern LLMs. [Part I §5.1, Part II §IV]
Causal mask
Lower-triangular mask that prevents attention from "seeing the future." Applied in decoder-only transformers. Sets future positions to \(-\infty\) before softmax. [Part I §6.4]
Chain-of-thought (CoT)
Prompting technique that elicits step-by-step reasoning. Converts fixed-depth computation into O(T) serial computation. [Part III §IV]
Cross-entropy
\(H(p, q) = -\sum p(x) \log q(x)\). The standard loss function for classification. Equals negative log-likelihood under categorical distribution. Equivalent to KL divergence plus a constant. [Part I §2.2, Part V §1]
Diffusion model
Generative model that learns to reverse a gradual noising process. Trained to predict noise added to data. State-of-the-art for image generation (Stable Diffusion, DALL-E 3). [Part V §4]
DPO (Direct Preference Optimization)
Alignment method that skips reward model training, directly optimizing the policy from preference pairs. Simpler alternative to RLHF. [Part I §7.5]
Dropout
Regularization: randomly zero out neurons during training with probability \(p\). Trains an implicit ensemble; prevents co-adaptation. [Part II §VII]
ELBO (Evidence Lower Bound)
Lower bound on the log-likelihood used to train VAEs: reconstruction + KL regularization. [Part V §2]
Embedding
Dense vector representation of a discrete object (word, token, image patch). Learned during training. Similar items get nearby vectors. [Part I §5.1]
Entropy
\(H(X) = -\sum p(x) \log p(x)\). Average surprise/information in a distribution. Lower = more certain. [Part V §1]
Flash Attention
IO-aware attention algorithm that reduces memory from \(O(n^2)\) to \(O(n)\) via tiling and online softmax, with identical results to standard attention. [Part III §I]
GAN
Generative Adversarial Network. Generator vs. discriminator in a minimax game. Generator learns to produce realistic data. [Part V §3]
Gradient descent
\(\theta \leftarrow \theta - \eta \nabla\mathcal{L}\). Iterative optimization that moves parameters in the direction of steepest loss decrease. [Part I §1.3]
GQA (Grouped-Query Attention)
Multiple query heads share K/V projections, reducing KV cache memory during inference. Used in LLaMA 3, Gemma 2. [Part I §7.3]
In-context learning
LLM performs a task by conditioning on examples in the prompt, without gradient updates. Emergent ability at scale. [GPT-3, Part I §7.1]
KL divergence
\(D_{\text{KL}}(p \| q) = \sum p(x) \log(p(x)/q(x))\). Non-symmetric measure of difference between distributions. Minimized during training (= MLE). [Part V §1]
KV cache
Cache storing key and value vectors from previous tokens during autoregressive generation. Avoids recomputation. Major memory consumer during inference. [Part II §V]
LayerNorm
Normalization across the feature dimension for each individual example. Standard in transformers. Pre-norm variant applied before sublayers. [Part II §VII]
LoRA
Low-Rank Adaptation. Adds small trainable rank-\(r\) matrices \(\mathbf{BA}\) to frozen pretrained weights. Enables fine-tuning with 0.1–1% of parameters. [Part III §II, Part IV §10]
MLE (Maximum Likelihood Estimation)
Find parameters that maximize probability of observed data. Unifying principle: MSE (Gaussian) and cross-entropy (Bernoulli/categorical) are both MLE. [Part I §1.2]
MoE (Mixture of Experts)
Architecture where a router selects top-K expert networks per token. Increases capacity without proportional compute increase. Used in Mixtral, DeepSeek-V3. [Part I §7.4]
Perplexity
\(\text{PPL} = \exp(\mathcal{L}_{\text{CE}})\). Fundamental LM metric: how "surprised" the model is. A PPL of 10 means the model is, on average, as uncertain as a uniform choice among 10 tokens per position. Lower is better. [Part III §VII]
Positional encoding
Mechanism to inject sequence position information into transformers. Sinusoidal (original), learned, or RoPE (rotary). [Part I §6.4]
Quantization
Reducing numerical precision (FP16 → INT8 → INT4) to decrease memory and increase inference speed. Methods: GPTQ, AWQ, GGUF. [Part III §III]
RAG (Retrieval-Augmented Generation)
Architecture combining a retriever (finds relevant documents) with an LLM generator. Reduces hallucinations, enables current knowledge. [Part III §V]
ReLU
\(\max(0, z)\). Simplest activation function. Derivative is 0 or 1, enabling good gradient flow. Default for hidden layers. [Part I §3.1]
Residual connection
\(\mathbf{x} + f(\mathbf{x})\). Skip connection that adds the input to the output of a sublayer. Critical for training deep networks: the identity path contributes a derivative of 1, giving gradients a direct route to earlier layers. [Part I §6.4]
RLHF
Reinforcement Learning from Human Feedback. Train reward model from preferences, then optimize policy with RL (PPO). How ChatGPT was aligned. [Part I §7.5]
RoPE (Rotary Position Embedding)
Encodes relative position by rotating query/key vectors. Attention score depends on \(m - n\), not absolute positions. Used in LLaMA, Qwen, most modern LLMs. [Part I §7.3]
Scaling laws
Power-law relationships between model performance and compute/data/parameters. Guide compute-optimal training. Chinchilla: ~20 tokens per parameter. [Part I §7.2]
Self-attention
Attention where Q, K, V all come from the same sequence: \(\mathbf{Q} = \mathbf{XW}^Q\), etc. Each token attends to all others. [Part I §6.2]
Softmax
\(\text{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}\). Converts logits to probabilities. Used in attention weights and output layer. [Part I §1.2]
Speculative decoding
Use a small draft model to propose multiple tokens, then verify them in parallel with the large model. Mathematically equivalent output, 2–3× faster. [Part III §VIII]
Transformer
Architecture built on self-attention + FFN with residual connections and layer norm. Encoder, decoder, or encoder-decoder variants. Foundation of all modern LLMs. [Part I §6.4]
VAE (Variational Autoencoder)
Generative model with encoder (infers latent variables) and decoder (generates from latents). Trained by maximizing the ELBO. [Part V §2]
ViT (Vision Transformer)
Applies transformer to images by splitting them into patches treated as tokens. Foundation of modern vision models and multimodal architectures. [Part III §VI]
Weight decay
Regularization that shrinks weights by a factor each step: \(w \leftarrow w(1 - \eta\lambda)\). Equivalent to L2 regularization for SGD. Decoupled in AdamW. [Part II §VII]
Word2Vec
Skip-gram or CBOW model for learning word embeddings. Trained to predict context words from center word (or vice versa). [Part I §5.1]
Zero-shot / Few-shot
Performing a task with zero or few examples in the prompt. Emergent capability of large language models. "Zero-shot": just a task description. "Few-shot": a handful of input-output examples.
Part I: Foundations → Core ML → Neural Networks → Transformers → LLMs
Part II: Backprop Walkthrough → Implementation → Training Pipeline → Inference
Part III: Efficient Attention → LoRA → Quantization → Reasoning → RAG
Part IV: Complete PyTorch Cookbook – 10 Implementations
Part V: Information Theory → VAEs → GANs → Diffusion → Geometry of Learning
Part VI: Production → Ecosystem → Safety → Paper Reading → Glossary
"The purpose of computing is insight, not numbers." – Richard Hamming
You now have the mathematical tools, implementation patterns, and research context
to not just use ML systems, but to understand and build them from first principles.
The next step is yours.