Serving at scale, the open-source ecosystem, safety & alignment,
how to read papers, and a complete glossary of 150+ terms.
Training happens once; inference happens millions of times. Optimizing inference cost-per-token is often more impactful than training improvements.
Prefill is compute-bound (matrix multiplications dominate). Decode is memory-bandwidth-bound (loading weights from HBM for each token dominates). Different optimizations target each phase.
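To see why decode is bandwidth-bound, a back-of-envelope roofline helps: each generated token must stream every weight from HBM once, so the decode ceiling is bandwidth divided by model size. The model size and bandwidth figures below are illustrative assumptions, not measured numbers.

```python
# Roofline estimate for the decode phase: peak tokens/sec is bounded by
# (HBM bandwidth) / (model size in bytes), since every weight is read
# once per generated token.
def decode_roofline_tokens_per_sec(params_billion: float,
                                   bytes_per_param: float,
                                   hbm_bandwidth_gb_per_s: float) -> float:
    model_size_gb = params_billion * bytes_per_param
    return hbm_bandwidth_gb_per_s / model_size_gb

# Assumed numbers: a 7B model on a GPU with ~2,000 GB/s of HBM bandwidth.
print(f"FP16 ceiling: {decode_roofline_tokens_per_sec(7, 2.0, 2000):.0f} tok/s")  # ~143
print(f"INT4 ceiling: {decode_roofline_tokens_per_sec(7, 0.5, 2000):.0f} tok/s")  # ~571
```

Note what falls out of the estimate: shrinking bytes per parameter (quantization) raises the decode ceiling proportionally, which is why quantization speeds up decode even though it barely helps compute-bound prefill.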
Naive batching waits until all sequences in a batch finish before starting new ones. Continuous batching (also called "in-flight batching") immediately fills empty slots as sequences complete, keeping GPU utilization high.
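The gap between the two strategies can be sketched with a toy simulation. The sequence lengths and batch size are arbitrary, and the continuous-batching figure is an idealized lower bound, not a real scheduler:

```python
import random

random.seed(0)
lengths = [random.randint(10, 100) for _ in range(32)]  # tokens per request
BATCH = 4

# Naive static batching: each group of 4 runs until its *longest*
# sequence finishes; finished slots sit idle until the whole batch refills.
naive_steps = sum(max(lengths[i:i + BATCH]) for i in range(0, len(lengths), BATCH))

# Continuous batching (idealized): a slot refills the instant its
# sequence completes, so every step does BATCH tokens of useful work.
continuous_steps = -(-sum(lengths) // BATCH)  # ceiling division

print(f"naive: {naive_steps} steps, continuous: ~{continuous_steps} steps")
```

The idle-slot waste in the naive case grows with the variance of sequence lengths, which is exactly the regime real serving traffic lives in.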
| Framework | Key Feature | Best For |
|---|---|---|
| vLLM | PagedAttention, continuous batching | High-throughput serving |
| TensorRT-LLM | NVIDIA kernel optimization | Maximum NVIDIA GPU perf |
| llama.cpp | CPU/Apple Silicon inference, GGUF | Local/edge deployment |
| Ollama | User-friendly local LLM runner | Development & testing |
| SGLang | Structured generation, RadixAttention | Complex prompting pipelines |
| TGI (HuggingFace) | Easy deployment, quantization | Production HF models |
A 70B model on 1× A100 generates ~30 tokens/sec. At $2/GPU-hour: ~$18.5 per million tokens. With INT4 quantization + continuous batching + speculative decoding: 3–5× improvement → $4–6 per million tokens.
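The arithmetic behind these figures is simple enough to check directly:

```python
def cost_per_million_tokens(tokens_per_sec: float, gpu_usd_per_hour: float) -> float:
    """Dollars to generate 1M tokens on one GPU at a given decode rate."""
    seconds_per_million = 1_000_000 / tokens_per_sec
    return seconds_per_million / 3600 * gpu_usd_per_hour

print(f"${cost_per_million_tokens(30, 2.0):.2f}")      # baseline: ~$18.52
print(f"${cost_per_million_tokens(30 * 4, 2.0):.2f}")  # with a 4x speedup: ~$4.63
```

Because cost scales inversely with throughput, any multiplicative speedup divides cost-per-token by the same factor, which is why the optimizations above compound so effectively.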
| Family | Organization | Sizes | Highlights |
|---|---|---|---|
| LLaMA 3 | Meta | 8B, 70B, 405B | Strong general-purpose; permissive license |
| Qwen 2.5/3 | Alibaba | 0.5B–72B | Strong multilingual; excellent coding |
| Mistral / Mixtral | Mistral AI | 7B, 8×7B, 8×22B | Sliding window attention; MoE pioneers |
| DeepSeek V3/R1 | DeepSeek | 671B (MoE, 37B active) | Fine-grained MoE; strong reasoning |
| Gemma 2 | Google | 2B, 9B, 27B | Knowledge distillation from Gemini |
| Phi-3/4 | Microsoft | 3.8B, 14B | Small models trained on synthetic data |
| Command R+ | Cohere | 104B | Strong RAG capabilities |
- Transformers: Unified API for loading and using any model.
- Datasets: Streaming access to thousands of datasets.
- PEFT: LoRA/QLoRA/prefix-tuning wrappers.
- TRL: Training with reinforcement learning (RLHF, DPO).
- Accelerate: Multi-GPU/multi-node training without code changes.
- Spaces: Deploy demos instantly.
| Framework | Scale | Key Feature |
|---|---|---|
| PyTorch (native) | 1–8 GPUs | Flexibility; research standard |
| DeepSpeed | 8–1000s GPUs | ZeRO optimizer; massive model training |
| FSDP (PyTorch) | 8–100s GPUs | Native PyTorch distributed; simpler than DeepSpeed |
| Megatron-LM | 100s–1000s GPUs | Tensor + pipeline parallelism; frontier model training |
| Axolotl | 1–8 GPUs | Easy fine-tuning config; great for LoRA |
| LitGPT | 1–many GPUs | Clear, hackable GPT implementations |
| Technique | Mechanism | Limitation |
|---|---|---|
| RLHF | Train reward model from preferences, optimize with RL | Reward hacking; human label noise |
| DPO | Directly optimize preference pairs without reward model | Limited to pairwise comparisons |
| Constitutional AI | Self-critique against principles + RLAIF | Quality of constitution matters |
| Red teaming | Adversarial testing for failures | Can't cover all failure modes |
| Input/output filters | Classifier-based content moderation | Cat-and-mouse with adversaries |
Scalable oversight: How do humans supervise AI systems on tasks that exceed human capability? Current approach: use AI to help humans evaluate AI outputs (recursive reward modeling, debate).
Deceptive alignment: Could a model learn to appear aligned during training/evaluation but pursue different goals in deployment? This is a theoretical concern for future, more capable systems.
Dual use: The same capabilities that make LLMs helpful (biological knowledge, coding ability, persuasion) also enable misuse. How do you distribute capabilities widely while limiting harm?
An agent is an LLM that can take actions in the world (calling APIs, searching the web, writing and running code, managing files) in a loop until a task is complete.
Modern LLMs are trained to output structured tool calls:
```json
{
  "tool": "search",
  "arguments": {"query": "current weather Paris France"}
}
```
The framework executes the tool, feeds the result back, and the LLM continues. This is the foundation of AI assistants, coding agents, and autonomous workflows.
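A minimal sketch of this execute-and-feed-back loop, with a stubbed model and a hypothetical `search` tool. The message format and tool names are illustrative, not any particular provider's API:

```python
def search(query: str) -> str:
    """Hypothetical tool: a real implementation would call a search API."""
    return f"(stub) results for: {query}"

TOOLS = {"search": search}

def fake_model(messages):
    # Stand-in for an LLM call: issues one search, then answers
    # once a tool result appears in the conversation.
    if any(m["role"] == "tool" for m in messages):
        return {"type": "answer", "content": "It is sunny in Paris (stub)."}
    return {"type": "tool_call",
            "tool": "search",
            "arguments": {"query": "current weather Paris France"}}

def agent_loop(user_message: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        out = fake_model(messages)
        if out["type"] == "answer":
            return out["content"]
        result = TOOLS[out["tool"]](**out["arguments"])  # execute the tool
        messages.append({"role": "tool", "content": result})
    return "max steps reached"

print(agent_loop("What's the weather in Paris?"))
```

The `max_steps` cap matters in practice: without it, a model that keeps emitting tool calls would loop forever.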
Complex tasks require decomposition. Research approaches include hierarchical planning (high-level plan → subtasks → tool calls), inner monologue (the model generates reasoning traces that guide its actions), and reflection (the model evaluates its own outputs and retries on failure).
| Section | What to Look For | Time Allocation |
|---|---|---|
| Abstract | Main claim and result. Does this paper matter to you? | 2 min |
| Introduction | Problem motivation. What gap does this fill? | 5 min |
| Figures/Tables | Architecture diagrams, result tables. The visual story. | 5 min (first pass) |
| Method | The new equations and algorithms. This is the core. | 30+ min |
| Experiments | Baselines, ablations. What actually matters? | 15 min |
| Related Work | How this fits in the field. New papers to read. | 5 min |
| Conclusion | Authors' honest assessment of limitations. | 3 min |
Pass 1 (10 min): Abstract, intro, section headings, figures, conclusion. Decide if you need to read deeper.
Pass 2 (1 hour): Read everything except proofs. Understand the main ideas, mark equations you don't follow. Write a one-paragraph summary.
Pass 3 (2–4 hours): Work through the math. Re-derive key equations. Identify assumptions and limitations. Think about what you'd do differently.
- No ablation study: You can't tell which components matter.
- Cherry-picked metrics: Reporting only metrics where the method wins.
- Missing baselines: Comparing against outdated or weak baselines.
- No error bars or confidence intervals: Results might not be significant.
- Excessive reliance on a single benchmark: May not generalize.
- Unreproducible: No code, no hyperparameter details, vague dataset description.
| Year | Milestone | Why It Matters |
|---|---|---|
| 2012 | AlexNet wins ImageNet | Deep learning goes mainstream; GPU training era begins |
| 2013 | Word2Vec | Efficient word embeddings; word arithmetic |
| 2014 | GloVe, GANs, Seq2Seq + Attention | Three foundational ideas in one year |
| 2015 | ResNet, Batch Norm, Bahdanau Attention | Training very deep nets; attention for translation |
| 2017 | "Attention Is All You Need" | Transformer replaces RNNs; the modern era begins |
| 2018 | GPT-1, BERT, ELMo | Pretraining + fine-tuning paradigm established |
| 2019 | GPT-2, T5, DistilBERT | Scaling up; text-to-text unification; distillation |
| 2020 | GPT-3, Scaling Laws, ViT, DDPM | In-context learning; scaling theory; vision transformers |
| 2021 | CLIP, Codex, LoRA | Multimodal; code generation; efficient fine-tuning |
| 2022 | ChatGPT, Chinchilla, Stable Diffusion, InstructGPT | LLMs go mainstream; compute-optimal training; text-to-image |
| 2023 | GPT-4, LLaMA, QLoRA, DPO, Mistral, Mamba | Multimodal GPT; open models; efficient alignment; SSMs |
| 2024 | Claude 3, Gemini 1.5, LLaMA 3, o1, DeepSeek-V3 | Long context; reasoning models; open MoE at frontier |
| 2025 | DeepSeek-R1, Claude 3.5/4, o3, Gemini 2 | Open reasoning models; test-time compute scaling |
Every important term from all six volumes, with precise definitions.
Activation function
Nonlinear function applied element-wise after a linear transformation. Common: ReLU, GELU, sigmoid, tanh. Enables networks to learn nonlinear relationships. [Part I §3.1]
Adam / AdamW
Adaptive optimizer combining momentum (first moment) and RMSProp (second moment) with bias correction. AdamW decouples weight decay from the gradient update. Standard optimizer for LLM training. [Part II §VIII]
Attention
Mechanism that computes weighted combinations of values based on query-key similarity: \(\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\mathbf{V}\). Foundation of the transformer. [Part I §6.1]
Autoregressive
Generating outputs one token at a time, where each token conditions on all previous tokens: \(P(x_t \mid x_{<t})\).
Backpropagation
Algorithm for computing gradients in neural networks by recursive application of the chain rule from output to input. [Part I §3.3, Part II §I]
BERT
Bidirectional Encoder Representations from Transformers. Encoder-only model trained with masked language modeling. Excels at understanding tasks. [Part II §VI]
BPE (Byte Pair Encoding)
Subword tokenization algorithm that iteratively merges the most frequent character pairs. Used by GPT, LLaMA, and most modern LLMs. [Part I §5.1, Part II §IV]
Causal mask
Lower-triangular mask that prevents attention from "seeing the future." Applied in decoder-only transformers. Sets future positions to \(-\infty\) before softmax. [Part I §6.4]
Chain-of-thought (CoT)
Prompting technique that elicits step-by-step reasoning. Converts fixed-depth computation into O(T) serial computation. [Part III §IV]
Cross-entropy
\(H(p, q) = -\sum p(x) \log q(x)\). The standard loss function for classification. Equals negative log-likelihood under categorical distribution. Equivalent to KL divergence plus a constant. [Part I §2.2, Part V §1]
Diffusion model
Generative model that learns to reverse a gradual noising process. Trained to predict noise added to data. State-of-the-art for image generation (Stable Diffusion, DALL-E 3). [Part V §4]
DPO (Direct Preference Optimization)
Alignment method that skips reward model training, directly optimizing the policy from preference pairs. Simpler alternative to RLHF. [Part I §7.5]
Dropout
Regularization: randomly zero out neurons during training with probability \(p\). Trains an implicit ensemble; prevents co-adaptation. [Part II §VII]
ELBO (Evidence Lower Bound)
Lower bound on the log-likelihood used to train VAEs: reconstruction + KL regularization. [Part V §2]
Embedding
Dense vector representation of a discrete object (word, token, image patch). Learned during training. Similar items get nearby vectors. [Part I §5.1]
Entropy
\(H(X) = -\sum p(x) \log p(x)\). Average surprise/information in a distribution. Lower = more certain. [Part V §1]
Flash Attention
IO-aware attention algorithm that reduces memory from \(O(n^2)\) to \(O(n)\) via tiling and online softmax, with identical results to standard attention. [Part III §I]
GAN
Generative Adversarial Network. Generator vs. discriminator in a minimax game. Generator learns to produce realistic data. [Part V §3]
Gradient descent
\(\theta \leftarrow \theta - \eta \nabla\mathcal{L}\). Iterative optimization that moves parameters in the direction of steepest loss decrease. [Part I §1.3]
GQA (Grouped-Query Attention)
Multiple query heads share K/V projections, reducing KV cache memory during inference. Used in LLaMA 3, Gemma 2. [Part I §7.3]
In-context learning
LLM performs a task by conditioning on examples in the prompt, without gradient updates. Emergent ability at scale. [GPT-3, Part I §7.1]
KL divergence
\(D_{\text{KL}}(p \| q) = \sum p(x) \log(p(x)/q(x))\). Non-symmetric measure of difference between distributions. Minimized during training (= MLE). [Part V §1]
KV cache
Cache storing key and value vectors from previous tokens during autoregressive generation. Avoids recomputation. Major memory consumer during inference. [Part II §V]
LayerNorm
Normalization across the feature dimension for each individual example. Standard in transformers. Pre-norm variant applied before sublayers. [Part II §VII]
LoRA
Low-Rank Adaptation. Adds small trainable rank-\(r\) matrices \(\mathbf{BA}\) to frozen pretrained weights. Enables fine-tuning with 0.1–1% of parameters. [Part III §II, Part IV §10]
MLE (Maximum Likelihood Estimation)
Find parameters that maximize probability of observed data. Unifying principle: MSE (Gaussian) and cross-entropy (Bernoulli/categorical) are both MLE. [Part I §1.2]
MoE (Mixture of Experts)
Architecture where a router selects top-K expert networks per token. Increases capacity without proportional compute increase. Used in Mixtral, DeepSeek-V3. [Part I §7.4]
Perplexity
\(\text{PPL} = \exp(\mathcal{L}_{\text{CE}})\). Fundamental LM metric: how "surprised" the model is. A PPL of 10 means the model is, on average, as uncertain as a uniform choice among 10 tokens per position. Lower is better. [Part III §VII]
Positional encoding
Mechanism to inject sequence position information into transformers. Sinusoidal (original), learned, or RoPE (rotary). [Part I §6.4]
Quantization
Reducing numerical precision (FP16 → INT8 → INT4) to decrease memory and increase inference speed. Methods: GPTQ, AWQ, GGUF. [Part III §III]
RAG (Retrieval-Augmented Generation)
Architecture combining a retriever (finds relevant documents) with an LLM generator. Reduces hallucinations, enables current knowledge. [Part III §V]
ReLU
\(\max(0, z)\). Simplest activation function. Derivative is 0 or 1, enabling good gradient flow. Default for hidden layers. [Part I §3.1]
Residual connection
\(\mathbf{x} + f(\mathbf{x})\). Skip connection that adds the input to the output of a sublayer. Critical for training deep networks: the identity path contributes a derivative of 1, giving gradients a direct route to earlier layers. [Part I §6.4]
RLHF
Reinforcement Learning from Human Feedback. Train reward model from preferences, then optimize policy with RL (PPO). How ChatGPT was aligned. [Part I §7.5]
RoPE (Rotary Position Embedding)
Encodes relative position by rotating query/key vectors. Attention score depends on \(m - n\), not absolute positions. Used in LLaMA, Qwen, most modern LLMs. [Part I §7.3]
Scaling laws
Power-law relationships between model performance and compute/data/parameters. Guide compute-optimal training. Chinchilla: ~20 tokens per parameter. [Part I §7.2]
Self-attention
Attention where Q, K, V all come from the same sequence: \(\mathbf{Q} = \mathbf{XW}^Q\), etc. Each token attends to all others. [Part I §6.2]
Softmax
\(\text{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}\). Converts logits to probabilities. Used in attention weights and output layer. [Part I §1.2]
Speculative decoding
Use a small draft model to propose multiple tokens, then verify them in parallel with the large model. Mathematically equivalent output, 2–3× faster. [Part III §VIII]
Transformer
Architecture built on self-attention + FFN with residual connections and layer norm. Encoder, decoder, or encoder-decoder variants. Foundation of all modern LLMs. [Part I §6.4]
VAE (Variational Autoencoder)
Generative model with encoder (infers latent variables) and decoder (generates from latents). Trained by maximizing the ELBO. [Part V §2]
ViT (Vision Transformer)
Applies transformer to images by splitting them into patches treated as tokens. Foundation of modern vision models and multimodal architectures. [Part III §VI]
Weight decay
Regularization that shrinks weights by a factor each step: \(w \leftarrow w(1 - \eta\lambda)\). Equivalent to L2 regularization for SGD. Decoupled in AdamW. [Part II §VII]
Word2Vec
Skip-gram or CBOW model for learning word embeddings. Trained to predict context words from center word (or vice versa). [Part I §5.1]
Zero-shot / Few-shot
Performing a task with zero or few examples in the prompt. Emergent capability of large language models. "Zero-shot": just a task description. "Few-shot": a handful of input-output examples.
Part I: Foundations → Core ML → Neural Networks → Transformers → LLMs
Part II: Backprop Walkthrough → Implementation → Training Pipeline → Inference
Part III: Efficient Attention → LoRA → Quantization → Reasoning → RAG
Part IV: Complete PyTorch Cookbook – 10 Implementations
Part V: Information Theory → VAEs → GANs → Diffusion → Geometry of Learning
Part VI: Production → Ecosystem → Safety → Paper Reading → Glossary
"The purpose of computing is insight, not numbers." – Richard Hamming
You now have the mathematical tools, implementation patterns, and research context
to not just use ML systems, but to understand and build them from first principles.
The next step is yours.