Part VI β€” From Lab to Production

Production ML, Deployment
& The Research Landscape

Serving at scale, the open-source ecosystem, safety & alignment,
how to read papers, and a complete glossary of 150+ terms.

Contents

  1. Serving LLMs at Scale
    • Inference optimization Β· Batching Β· KV cache management Β· Serving frameworks
  2. The Open-Source Ecosystem
    • Model families Β· Hugging Face Β· Training frameworks Β· Evaluation suites
  3. Safety, Alignment & Ethics
    • Alignment tax Β· Red teaming Β· Jailbreaks Β· Scalable oversight
  4. Agentic AI & Tool Use
    • Function calling Β· ReAct Β· Multi-step reasoning Β· MCP protocol
  5. How to Read ML Research Papers
    • Anatomy of a paper Β· Reading strategies Β· Critical evaluation
  6. The Complete Timeline of Modern ML
    • 2012–2025 chronology of key breakthroughs
  7. Comprehensive Glossary
    • 150+ terms with precise definitions and cross-references
Chapter 1

Serving LLMs at Scale

The Inference Challenge

Training happens once; inference happens millions of times. Optimizing inference cost-per-token is often more impactful than training improvements.

Prefill vs Decode Phases

PREFILL PHASE (prompt processing): the entire prompt is processed at once in a single forward pass. Compute-bound. Time: O(n · d²).

DECODE PHASE (generation): tokens are generated one at a time, each requiring its own forward pass. Memory-bound. Time: O(d²) per token.

Prefill is compute-bound (matrix multiplications dominate). Decode is memory-bandwidth-bound (loading weights from HBM for each token dominates). Different optimizations target each phase.
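
The memory-bandwidth bottleneck can be made concrete with a back-of-envelope calculation: at batch size 1, every generated token must stream the full weight set from HBM. The sketch below is illustrative (round numbers, KV cache ignored), not a measurement.

```python
# Bandwidth-limited ceiling on decode speed: bytes that must be moved
# per token vs. bytes/sec the memory system can deliver.

def decode_tokens_per_sec(params_billions, bytes_per_param, hbm_bandwidth_gbs):
    """Upper bound on decode speed for a memory-bandwidth-bound model."""
    bytes_per_token = params_billions * 1e9 * bytes_per_param
    return hbm_bandwidth_gbs * 1e9 / bytes_per_token

# 7B model in FP16 on an A100-class GPU (~2 TB/s HBM bandwidth):
print(f"{decode_tokens_per_sec(7, 2, 2000):.0f} tokens/sec ceiling")  # 143 tokens/sec ceiling
```

Real systems beat this per-request ceiling by batching: the same weight load is amortized across many sequences decoding in parallel.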

Continuous Batching

Naive batching waits until all sequences in a batch finish before starting new ones. Continuous batching (also called "in-flight batching") immediately fills empty slots as sequences complete, keeping GPU utilization high.
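
The difference can be seen in a toy simulation. This is a sketch only: one "step" is one decode iteration, and a real scheduler would also handle prefill, memory pressure, and preemption.

```python
# Toy comparison of naive (static) vs continuous batching, counting
# decode steps needed to finish a set of requests of given lengths.

def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest sequence finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def continuous_batching_steps(lengths, batch_size):
    """A finished sequence's slot is refilled immediately."""
    pending = list(lengths)
    active, steps = [], 0
    while pending or active:
        while pending and len(active) < batch_size:
            active.append(pending.pop(0))
        steps += 1
        active = [r - 1 for r in active if r > 1]  # drop finished sequences
    return steps

print(static_batching_steps([8, 1, 8, 1], 2))      # 16
print(continuous_batching_steps([8, 1, 8, 1], 2))  # 9
```

With mixed request lengths (the common case in production), the gap in GPU utilization is substantial: short requests no longer wait for long ones.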

Paged Attention (vLLM)

Kwon et al. (2023) — "Efficient Memory Management for Large Language Model Serving with PagedAttention." PagedAttention borrows virtual-memory concepts from OS design to manage the KV cache: instead of one contiguous allocation per sequence, the cache is stored in non-contiguous fixed-size pages, reducing memory waste from 60–80% to near zero.
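
The core bookkeeping is a block table per sequence, mapping logical token positions to physical blocks. The class below is a deliberately simplified sketch of that idea (real vLLM adds copy-on-write sharing, swapping, and GPU-side kernels).

```python
# Minimal PagedAttention-style allocator: blocks are handed out lazily
# and returned to the free pool when a sequence finishes.

class PagedKVCache:
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # seq_id -> list of physical block ids

    def append_token(self, seq_id, pos):
        """Allocate a new block only when a sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if pos % self.block_size == 0:  # first slot of a fresh block
            table.append(self.free_blocks.pop())
        return table[pos // self.block_size], pos % self.block_size

    def free(self, seq_id):
        """Return all of a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))

cache = PagedKVCache(num_blocks=8, block_size=4)
for pos in range(6):
    cache.append_token("seq-a", pos)
print(len(cache.block_tables["seq-a"]))  # 2 blocks for 6 tokens
```

Because blocks are small and allocated on demand, waste is bounded by at most one partially filled block per sequence, rather than a whole max-length reservation.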

Serving Frameworks

| Framework | Key Feature | Best For |
| --- | --- | --- |
| vLLM | PagedAttention, continuous batching | High-throughput serving |
| TensorRT-LLM | NVIDIA kernel optimization | Maximum NVIDIA GPU performance |
| llama.cpp | CPU/Apple Silicon inference, GGUF | Local/edge deployment |
| Ollama | User-friendly local LLM runner | Development & testing |
| SGLang | Structured generation, RadixAttention | Complex prompting pipelines |
| TGI (Hugging Face) | Easy deployment, quantization | Production HF models |

Cost Optimization

Inference cost estimation:

$$\text{Cost per 1M tokens} \approx \frac{\text{GPU cost/hour} \times 10^6}{\text{tokens/sec} \times 3600}$$

A 70B model on 1Γ— A100 generates ~30 tokens/sec. At $2/GPU-hour: ~$18.5 per million tokens. With INT4 quantization + continuous batching + speculative decoding: 3–5Γ— improvement β†’ $4–6 per million tokens.
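
The formula above is simple enough to keep as a helper; the numbers below reproduce the text's estimates.

```python
# Cost per million generated tokens, from GPU rental price and
# sustained throughput (the formula above).

def cost_per_million_tokens(gpu_cost_per_hour, tokens_per_sec):
    return gpu_cost_per_hour * 1e6 / (tokens_per_sec * 3600)

print(f"${cost_per_million_tokens(2.0, 30):.2f}")   # $18.52 per 1M tokens
print(f"${cost_per_million_tokens(2.0, 120):.2f}")  # $4.63 with a 4x speedup
```

Note the formula assumes the GPU is fully utilized; idle capacity (e.g., off-peak hours without requests) raises the effective cost per token.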

Β· Β· Β·
Chapter 2

The Open-Source Ecosystem

Major Open-Weight Model Families (as of early 2026)

| Family | Organization | Sizes | Highlights |
| --- | --- | --- | --- |
| LLaMA 3 | Meta | 8B, 70B, 405B | Strong general-purpose; permissive license |
| Qwen 2.5/3 | Alibaba | 0.5B–72B | Strong multilingual; excellent coding |
| Mistral / Mixtral | Mistral AI | 7B, 8×7B, 8×22B | Sliding-window attention; MoE pioneers |
| DeepSeek V3/R1 | DeepSeek | 67B, 236B (MoE) | Fine-grained MoE; strong reasoning |
| Gemma 2 | Google | 2B, 9B, 27B | Knowledge distillation from Gemini |
| Phi-3/4 | Microsoft | 3.8B, 14B | Small models trained on synthetic data |
| Command R+ | Cohere | 104B | Strong RAG capabilities |

The Hugging Face Stack

  • Transformers: unified API for loading and using any model.
  • Datasets: streaming access to thousands of datasets.
  • PEFT: LoRA/QLoRA/prefix-tuning wrappers.
  • TRL: training with reinforcement learning (RLHF, DPO).
  • Accelerate: multi-GPU/multi-node training without code changes.
  • Spaces: deploy demos instantly.

Training Frameworks

| Framework | Scale | Key Feature |
| --- | --- | --- |
| PyTorch (native) | 1–8 GPUs | Flexibility; research standard |
| DeepSpeed | 8–1000s of GPUs | ZeRO optimizer; massive model training |
| FSDP (PyTorch) | 8–100s of GPUs | Native PyTorch distributed; simpler than DeepSpeed |
| Megatron-LM | 100s–1000s of GPUs | Tensor + pipeline parallelism; frontier model training |
| Axolotl | 1–8 GPUs | Easy fine-tuning config; great for LoRA |
| LitGPT | 1–many GPUs | Clear, hackable GPT implementations |
Β· Β· Β·
Chapter 3

Safety, Alignment & Ethics

The Alignment Problem

A superintelligent model that optimizes the wrong objective is more dangerous than a dumb one. Alignment research asks: how do we ensure AI systems do what we actually want, not just what we literally asked for?

Current Alignment Techniques

| Technique | Mechanism | Limitation |
| --- | --- | --- |
| RLHF | Train reward model from preferences, optimize with RL | Reward hacking; human label noise |
| DPO | Directly optimize preference pairs without a reward model | Limited to pairwise comparisons |
| Constitutional AI | Self-critique against principles + RLAIF | Quality of constitution matters |
| Red teaming | Adversarial testing for failures | Can't cover all failure modes |
| Input/output filters | Classifier-based content moderation | Cat-and-mouse with adversaries |

Open Problems in Safety

Scalable oversight: How do humans supervise AI systems on tasks that exceed human capability? Current approach: use AI to help humans evaluate AI outputs (recursive reward modeling, debate).

Deceptive alignment: Could a model learn to appear aligned during training/evaluation but pursue different goals in deployment? This is a theoretical concern for future, more capable systems.

Dual use: The same capabilities that make LLMs helpful (biological knowledge, coding ability, persuasion) also enable misuse. How do you distribute capabilities widely while limiting harm?

Β· Β· Β·
Chapter 4

Agentic AI & Tool Use

From Chat to Agents

An agent is an LLM that can take actions in the world β€” calling APIs, searching the web, writing and running code, managing files β€” in a loop until a task is complete.

The ReAct Pattern

Yao et al. (2022) β€” "ReAct: Synergizing Reasoning and Acting in Language Models." Interleave reasoning (chain-of-thought) with acting (tool calls). The model thinks about what to do, does it, observes the result, and reasons about the next step.
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”‚ User: "What's the weather in the city where β”‚ β”‚ the Eiffel Tower is located?" β”‚ β”‚ β”‚ β”‚ LLM Thought: I need to find the city first, β”‚ β”‚ then get weather. β”‚ β”‚ β”‚ β”‚ Action: search("Eiffel Tower location") β”‚ β”‚ Observation: Paris, France β”‚ β”‚ β”‚ β”‚ Thought: Now I know it's Paris. Get weather. β”‚ β”‚ Action: get_weather("Paris, France") β”‚ β”‚ Observation: 18Β°C, partly cloudy β”‚ β”‚ β”‚ β”‚ Answer: It's 18Β°C and partly cloudy in Paris. β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Function Calling / Tool Use

Modern LLMs are trained to output structured tool calls:

{
  "tool": "search",
  "arguments": {"query": "current weather Paris France"}
}

The framework executes the tool, feeds the result back, and the LLM continues. This is the foundation of AI assistants, coding agents, and autonomous workflows.
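
That loop can be sketched in a few lines. Everything below is a simplified stand-in: `TOOLS` is stubbed with canned results, `llm` represents a real model call that returns either tool-call JSON (as above) or a plain-text final answer, and real frameworks use richer message schemas.

```python
import json

# Skeleton of the tool-use loop: call the model, execute any tool it
# requests, feed the result back, repeat until a plain-text answer.

TOOLS = {
    "search": lambda query: "Paris, France",            # stubbed results
    "get_weather": lambda city: "18°C, partly cloudy",  # for illustration
}

def run_agent(llm, user_message, max_steps=5):
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_steps):
        reply = llm(messages)
        if not reply.lstrip().startswith("{"):  # plain text = final answer
            return reply
        call = json.loads(reply)
        result = TOOLS[call["tool"]](**call["arguments"])
        messages.append({"role": "tool", "content": result})
    return "Step limit reached."
```

The `max_steps` cap matters in practice: without it, a confused model can loop on tool calls indefinitely.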

Multi-Step Planning

Complex tasks require decomposition. Research approaches include hierarchical planning (high-level plan β†’ subtasks β†’ tool calls), inner monologue (the model generates reasoning traces that guide its actions), and reflection (the model evaluates its own outputs and retries on failure).
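
The reflection pattern in particular reduces to a small control loop. The sketch below uses illustrative function names (`generate` and `critique` would both be model calls in a real system):

```python
# Reflection loop: draft an answer, self-critique it, and retry with
# the critique appended as feedback.

def reflect_loop(generate, critique, task, max_attempts=3):
    feedback = ""
    for _ in range(max_attempts):
        draft = generate(task, feedback)
        ok, feedback = critique(task, draft)
        if ok:
            return draft
    return draft  # best effort after exhausting retries
```

The same skeleton covers "retry on failure" for coding agents, where `critique` is replaced by actually running the tests.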

Β· Β· Β·
Chapter 5

How to Read ML Research Papers

Anatomy of an ML Paper

| Section | What to Look For | Time Allocation |
| --- | --- | --- |
| Abstract | Main claim and result. Does this paper matter to you? | 2 min |
| Introduction | Problem motivation. What gap does this fill? | 5 min |
| Figures/Tables | Architecture diagrams, result tables. The visual story. | 5 min (first pass) |
| Method | The new equations and algorithms. This is the core. | 30+ min |
| Experiments | Baselines, ablations. What actually matters? | 15 min |
| Related Work | How this fits in the field. New papers to read. | 5 min |
| Conclusion | Authors' honest assessment of limitations. | 3 min |

The Three-Pass Method

Pass 1 (10 min): Abstract, intro, section headings, figures, conclusion. Decide if you need to read deeper.

Pass 2 (1 hour): Read everything except proofs. Understand the main ideas, mark equations you don't follow. Write a one-paragraph summary.

Pass 3 (2–4 hours): Work through the math. Re-derive key equations. Identify assumptions and limitations. Think about what you'd do differently.

Red Flags to Watch For

  • No ablation study: you can't tell which components matter.
  • Cherry-picked metrics: reporting only metrics where the method wins.
  • Missing baselines: comparing against outdated or weak baselines.
  • No error bars or confidence intervals: results might not be significant.
  • Excessive reliance on a single benchmark: may not generalize.
  • Unreproducible: no code, no hyperparameter details, vague dataset description.

Practical tip: When reading a paper, ask these three questions:
1. What is the simplest possible baseline for this task? How much better is the proposed method?
2. What assumptions does this method make? When would they fail?
3. Could I implement this in a weekend? If not, which parts are engineering vs. research novelty?
Β· Β· Β·
Chapter 6

The Complete Timeline of Modern ML

| Year | Milestone | Why It Matters |
| --- | --- | --- |
| 2012 | AlexNet wins ImageNet | Deep learning goes mainstream; GPU training era begins |
| 2013 | Word2Vec | Efficient word embeddings; word arithmetic |
| 2014 | GloVe, GANs, Seq2Seq + Attention | Three foundational ideas in one year |
| 2015 | ResNet, Batch Norm, Bahdanau Attention | Training very deep nets; attention for translation |
| 2017 | "Attention Is All You Need" | Transformer replaces RNNs; the modern era begins |
| 2018 | GPT-1, BERT, ELMo | Pretraining + fine-tuning paradigm established |
| 2019 | GPT-2, T5, DistilBERT | Scaling up; text-to-text unification; distillation |
| 2020 | GPT-3, Scaling Laws, ViT, DDPM | In-context learning; scaling theory; vision transformers |
| 2021 | CLIP, Codex, LoRA | Multimodal; code generation; efficient fine-tuning |
| 2022 | ChatGPT, Chinchilla, Stable Diffusion, InstructGPT | LLMs go mainstream; compute-optimal training; text-to-image |
| 2023 | GPT-4, LLaMA, QLoRA, DPO, Mistral, Mamba | Multimodal GPT; open models; efficient alignment; SSMs |
| 2024 | Claude 3, Gemini 1.5, LLaMA 3, o1, DeepSeek-V3 | Long context; reasoning models; open MoE at frontier |
| 2025 | DeepSeek-R1, Claude 3.5/4, o3, Gemini 2 | Open reasoning models; test-time compute scaling |
Β· Β· Β·
Reference

Comprehensive Glossary

Every important term from all six volumes, with precise definitions. Terms in bold have their own entries.

A

Activation function

Nonlinear function applied element-wise after a linear transformation. Common: ReLU, GELU, sigmoid, tanh. Enables networks to learn nonlinear relationships. [Part I Β§3.1]

Adam / AdamW

Adaptive optimizer combining momentum (first moment) and RMSProp (second moment) with bias correction. AdamW decouples weight decay from the gradient update. Standard optimizer for LLM training. [Part II Β§VIII]

Attention

Mechanism that computes weighted combinations of values based on query-key similarity: \(\text{softmax}(\mathbf{QK}^T/\sqrt{d_k})\mathbf{V}\). Foundation of the transformer. [Part I Β§6.1]

Autoregressive

Generating outputs one token at a time, where each token conditions on all previous tokens: \(P(x_t \mid x_{<t})\). How GPT and all decoder-only LLMs work. [Part I §7.1]

B

Backpropagation

Algorithm for computing gradients in neural networks by recursive application of the chain rule from output to input. [Part I Β§3.3, Part II Β§I]

BERT

Bidirectional Encoder Representations from Transformers. Encoder-only model trained with masked language modeling. Excels at understanding tasks. [Part II Β§VI]

BPE (Byte Pair Encoding)

Subword tokenization algorithm that iteratively merges the most frequent character pairs. Used by GPT, LLaMA, and most modern LLMs. [Part I Β§5.1, Part II Β§IV]

C

Causal mask

Lower-triangular mask that prevents attention from "seeing the future." Applied in decoder-only transformers. Sets future positions to \(-\infty\) before softmax. [Part I Β§6.4]

Chain-of-thought (CoT)

Prompting technique that elicits step-by-step reasoning. Converts fixed-depth computation into O(T) serial computation. [Part III Β§IV]

Cross-entropy

\(H(p, q) = -\sum p(x) \log q(x)\). The standard loss function for classification. Equals negative log-likelihood under categorical distribution. Equivalent to KL divergence plus a constant. [Part I Β§2.2, Part V Β§1]

D

Diffusion model

Generative model that learns to reverse a gradual noising process. Trained to predict noise added to data. State-of-the-art for image generation (Stable Diffusion, DALL-E 3). [Part V Β§4]

DPO (Direct Preference Optimization)

Alignment method that skips reward model training, directly optimizing the policy from preference pairs. Simpler alternative to RLHF. [Part I Β§7.5]

Dropout

Regularization: randomly zero out neurons during training with probability \(p\). Trains an implicit ensemble; prevents co-adaptation. [Part II Β§VII]

E–F

ELBO (Evidence Lower Bound)

Lower bound on the log-likelihood used to train VAEs: reconstruction + KL regularization. [Part V Β§2]

Embedding

Dense vector representation of a discrete object (word, token, image patch). Learned during training. Similar items get nearby vectors. [Part I Β§5.1]

Entropy

\(H(X) = -\sum p(x) \log p(x)\). Average surprise/information in a distribution. Lower = more certain. [Part V Β§1]

Flash Attention

IO-aware attention algorithm that reduces memory from \(O(n^2)\) to \(O(n)\) via tiling and online softmax, with identical results to standard attention. [Part III Β§I]

G–H

GAN

Generative Adversarial Network. Generator vs. discriminator in a minimax game. Generator learns to produce realistic data. [Part V Β§3]

Gradient descent

\(\theta \leftarrow \theta - \eta \nabla\mathcal{L}\). Iterative optimization that moves parameters in the direction of steepest loss decrease. [Part I Β§1.3]

GQA (Grouped-Query Attention)

Multiple query heads share K/V projections, reducing KV cache memory during inference. Used in LLaMA 3, Gemma 2. [Part I Β§7.3]

I–K

In-context learning

LLM performs a task by conditioning on examples in the prompt, without gradient updates. Emergent ability at scale. [GPT-3, Part I Β§7.1]

KL divergence

\(D_{\text{KL}}(p \| q) = \sum p(x) \log(p(x)/q(x))\). Non-symmetric measure of difference between distributions. Minimized during training (= MLE). [Part V Β§1]

KV cache

Cache storing key and value vectors from previous tokens during autoregressive generation. Avoids recomputation. Major memory consumer during inference. [Part II Β§V]
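
A quick size estimate makes the memory cost concrete. Per token, each layer stores one key and one value vector of the hidden dimension; the shape below is the standard LLaMA-2-7B configuration (32 layers, hidden size 4096) in FP16, assuming plain multi-head attention (GQA would shrink this).

```python
# Back-of-envelope KV cache size for one sequence.

def kv_cache_bytes(num_layers, hidden_dim, seq_len, bytes_per_elem=2):
    return 2 * num_layers * seq_len * hidden_dim * bytes_per_elem  # 2 = K and V

gb = kv_cache_bytes(32, 4096, 4096) / 1e9
print(f"{gb:.1f} GB for a single 4096-token sequence")  # 2.1 GB for a single 4096-token sequence
```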

L–M

LayerNorm

Normalization across the feature dimension for each individual example. Standard in transformers. Pre-norm variant applied before sublayers. [Part II Β§VII]

LoRA

Low-Rank Adaptation. Adds small trainable rank-\(r\) matrices \(\mathbf{BA}\) to frozen pretrained weights. Enables fine-tuning with 0.1–1% of parameters. [Part III Β§II, Part IV Β§10]

MLE (Maximum Likelihood Estimation)

Find parameters that maximize probability of observed data. Unifying principle: MSE (Gaussian), cross-entropy (Bernoulli/categorical) are all MLE. [Part I Β§1.2]

MoE (Mixture of Experts)

Architecture where a router selects top-K expert networks per token. Increases capacity without proportional compute increase. Used in Mixtral, DeepSeek-V3. [Part I Β§7.4]

N–P

Perplexity

\(\text{PPL} = \exp(\mathcal{L}_{\text{CE}})\). Fundamental LM metric: how "surprised" the model is. PPL of 10 means effective vocabulary of 10 per position. Lower is better. [Part III Β§VII]
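
The definition, worked through: a per-token cross-entropy of 2.3 nats gives perplexity \(e^{2.3} \approx 10\), meaning the model is as uncertain as a uniform choice among ~10 tokens at each position.

```python
import math

# Perplexity is just the exponential of per-token cross-entropy (in nats).

def perplexity(cross_entropy_nats):
    return math.exp(cross_entropy_nats)

print(round(perplexity(2.3), 2))  # 9.97
```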

Positional encoding

Mechanism to inject sequence position information into transformers. Sinusoidal (original), learned, or RoPE (rotary). [Part I Β§6.4]

Q–R

Quantization

Reducing numerical precision (FP16 β†’ INT8 β†’ INT4) to decrease memory and increase inference speed. Methods: GPTQ, AWQ, GGUF. [Part III Β§III]

RAG (Retrieval-Augmented Generation)

Architecture combining a retriever (finds relevant documents) with an LLM generator. Reduces hallucinations, enables current knowledge. [Part III Β§V]

ReLU

\(\max(0, z)\). Simplest activation function. Derivative is 0 or 1, enabling good gradient flow. Default for hidden layers. [Part I Β§3.1]

Residual connection

\(\mathbf{x} + f(\mathbf{x})\). Skip connection that adds the input to the output of a sublayer. Critical for training deep networks: the identity path contributes a direct gradient route back to earlier layers, preventing vanishing gradients. [Part I §6.4]

RLHF

Reinforcement Learning from Human Feedback. Train reward model from preferences, then optimize policy with RL (PPO). How ChatGPT was aligned. [Part I Β§7.5]

RoPE (Rotary Position Embedding)

Encodes relative position by rotating query/key vectors. Attention score depends on \(m - n\), not absolute positions. Used in LLaMA, Qwen, most modern LLMs. [Part I Β§7.3]

S

Scaling laws

Power-law relationships between model performance and compute/data/parameters. Guide compute-optimal training. Chinchilla: ~20 tokens per parameter. [Part I Β§7.2]

Self-attention

Attention where Q, K, V all come from the same sequence: \(\mathbf{Q} = \mathbf{XW}^Q\), etc. Each token attends to all others. [Part I Β§6.2]

Softmax

\(\text{softmax}(z_i) = e^{z_i} / \sum_j e^{z_j}\). Converts logits to probabilities. Used in attention weights and output layer. [Part I Β§1.2]
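
In code, the definition is always implemented with the standard max-subtraction trick: shifting all logits by a constant leaves the probabilities unchanged but prevents `exp()` from overflowing on large logits.

```python
import math

# Numerically stable softmax over a list of logits.

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

print([round(p, 3) for p in softmax([2.0, 1.0, 0.1])])  # [0.659, 0.242, 0.099]
```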

Speculative decoding

Use small draft model to propose multiple tokens, verify in parallel with large model. Mathematically equivalent output, 2–3Γ— faster. [Part III Β§VIII]

T–V

Transformer

Architecture built on self-attention + FFN with residual connections and layer norm. Encoder, decoder, or encoder-decoder variants. Foundation of all modern LLMs. [Part I Β§6.4]

VAE (Variational Autoencoder)

Generative model with encoder (infers latent variables) and decoder (generates from latents). Trained by maximizing the ELBO. [Part V Β§2]

ViT (Vision Transformer)

Applies transformer to images by splitting into patches treated as tokens. Foundation of modern vision models and multimodal architectures. [Part III Β§VI]

W–Z

Weight decay

Regularization that shrinks weights by a factor each step: \(w \leftarrow w(1 - \eta\lambda)\). Equivalent to L2 regularization for SGD. Decoupled in AdamW. [Part II Β§VII]

Word2Vec

Skip-gram or CBOW model for learning word embeddings. Trained to predict context words from center word (or vice versa). [Part I Β§5.1]

Zero-shot / Few-shot

Performing a task with zero or few examples in the prompt. Emergent capability of large language models. "Zero-shot": just a task description. "Few-shot": a handful of input-output examples.

Β· Β· Β·

The Complete Series

Part I: Foundations β†’ Core ML β†’ Neural Networks β†’ Transformers β†’ LLMs
Part II: Backprop Walkthrough β†’ Implementation β†’ Training Pipeline β†’ Inference
Part III: Efficient Attention β†’ LoRA β†’ Quantization β†’ Reasoning β†’ RAG
Part IV: Complete PyTorch Cookbook β€” 10 Implementations
Part V: Information Theory β†’ VAEs β†’ GANs β†’ Diffusion β†’ Geometry of Learning
Part VI: Production β†’ Ecosystem β†’ Safety β†’ Paper Reading β†’ Glossary

"The purpose of computing is insight, not numbers." β€” Richard Hamming

You now have the mathematical tools, implementation patterns, and research context
to not just use ML systems, but to understand and build them from first principles.
The next step is yours.