Part VIII — Where the Map Ends

The Frontier: Open Problems
& the Future of AI

The unsolved questions, emerging paradigms, and research directions that will define
the next era of machine learning. What we know, what we don't, and what we're building toward.

Contents

  1. The Data Wall & Synthetic Data
    • Running out of internet · Self-play · Model-generated training data · Risks
  2. Beyond Next-Token Prediction
    • Limitations of autoregressive LMs · World models · Planning · Prediction vs understanding
  3. Reasoning & Formal Verification
    • Can LLMs reason? · Process reward models · Formal proofs · Neuro-symbolic
  4. Efficient Architectures
    • Post-transformer designs · SSMs · Hybrid models · Hardware co-design
  5. Mechanistic Interpretability
    • Understanding the black box · Features · Circuits · Sparse autoencoders · Superposition
  6. Multimodal & Embodied Intelligence
    • Native multimodality · Video understanding · Robotics · Embodied agents
  7. Alignment & Safety at Scale
    • Scalable oversight · Reward hacking · Corrigibility · AI governance
  8. The Big Questions
    • Does scale = intelligence? · Consciousness · Emergence · Limits of learning
Frontier I

The Data Wall & Synthetic Data

The Problem

Chinchilla (Part I §7.2) showed that compute-optimal training needs ~20 tokens per parameter. A 10T-parameter model would need ~200 trillion tokens. But high-quality text on the entire internet is estimated at 10–20 trillion tokens. We're approaching the ceiling of naturally occurring training data.

How do we train models beyond the data wall? Current approaches include: repeating data with different orderings (diminishing returns beyond ~4 epochs), curating higher-quality subsets (quality > quantity), synthetic data generation by existing models, and learning from non-text modalities (video, audio, sensor data).
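To make the ceiling concrete, here is the back-of-envelope arithmetic behind the data wall as a short sketch. The 20-tokens-per-parameter ratio and the 10–20T token supply estimate come from the discussion above; the 15T midpoint is an assumption chosen purely for illustration.

```python
# Data-wall arithmetic: Chinchilla-optimal training uses ~20 tokens/parameter.
def chinchilla_tokens(n_params: float, tokens_per_param: float = 20.0) -> float:
    """Compute-optimal token budget for a model with n_params parameters."""
    return n_params * tokens_per_param

available = 15e12  # illustrative midpoint of the 10-20T high-quality text estimate

for n in [70e9, 1e12, 10e12]:
    need = chinchilla_tokens(n)
    print(f"{n/1e9:>8.0f}B params -> need {need/1e12:.1f}T tokens "
          f"({need/available:.1f}x the available supply)")
```

At 10T parameters the compute-optimal budget exceeds the estimated supply by more than an order of magnitude, which is the "wall" the approaches below try to climb.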

Synthetic Data — Promise and Peril

Phi-3 Technical Report (Microsoft, 2024) — Demonstrated that small models (3.8B) trained primarily on synthetic data generated by larger models can match much larger models on benchmarks. The key: carefully curated synthetic data targeted at specific reasoning capabilities.

Self-improvement loop: Model generates data → filter for quality → train next generation on filtered data. DeepSeek-R1 used this: RL for reasoning → generate reasoning traces → train smaller models on the traces. This is a form of knowledge distillation (Part III §VIII) at training-data scale.

Model collapse: Shumailov et al. (2023) showed that training on model-generated data can lead to "model collapse" — progressive degradation of the data distribution. Each generation loses tail events and overrepresents the mode. The mathematical result: after enough generations, the learned distribution converges to a point mass. How do we use synthetic data without collapsing? Current solutions: always mix in real data, use diverse generation strategies, filter aggressively for quality and diversity.
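A minimal numerical caricature of collapse (a heavily simplified toy, not the actual setup of Shumailov et al.): each "generation" fits a Gaussian to samples drawn from the previous generation's fitted Gaussian. The variance decays toward zero, the point-mass behavior described above.

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma = 0.0, 1.0              # generation-0 "real data" distribution
n_samples, n_generations = 25, 500
sigmas = [sigma]

for _ in range(n_generations):
    data = rng.normal(mu, sigma, size=n_samples)  # sample from current model
    mu, sigma = data.mean(), data.std()           # refit on purely synthetic data
    sigmas.append(sigma)

print(f"sigma: generation 0 = {sigmas[0]:.3f}, "
      f"generation {n_generations} = {sigmas[-1]:.3f}")
# Each refit slightly underestimates the spread and loses tail mass, so the
# variance drifts toward zero. Mixing real data back in at each step
# (the first fix listed above) arrests the decay.
```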
· · ·
Frontier II

Beyond Next-Token Prediction

The Autoregressive Limitation

An autoregressive LLM generates one token at a time, left to right. This has a profound limitation: every token gets the same amount of compute. The word "the" gets the same forward pass as a critical logical step. Humans don't think this way — we allocate more thought to harder parts.

Current "reasoning models" (o1, R1) patch this by generating more tokens — but the thinking still happens in token space, one token at a time. Is there a better approach?

Contending Paradigms

World Models

Yann LeCun's vision (2022–ongoing) — "A Path Towards Autonomous Machine Intelligence." Proposes that next-token prediction is fundamentally insufficient for understanding the world. Instead, models should learn a "world model" — an internal representation of how the world works — and use it for planning and prediction in latent space, not token space.
AUTOREGRESSIVE LLM: "The ball is thrown" → predict next token → "upward" → next → "and" → ... (plans in token space, one word at a time)
WORLD MODEL (proposed): [observe scene] → [build internal model of physics] → [simulate: ball trajectory in latent space] → [plan actions] (plans in latent space, then translates to tokens only at the end)

Joint Embedding Predictive Architecture (JEPA)

Instead of predicting exact pixels or tokens (which requires modeling irrelevant details like texture), predict abstract representations. Train by predicting the embedding of the next segment, not the raw content. This focuses the model on learning the structure that matters.

JEPA Objective (simplified) $$\mathcal{L} = \|\text{Predictor}(s_x, \Delta) - \text{sg}(\text{Encoder}(y))\|^2$$

Where \(s_x\) is the encoding of context, \(\Delta\) specifies what to predict, \(y\) is the target, and \(\text{sg}\) is stop-gradient (prevent collapse). The model learns to predict in representation space, not data space.
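A tiny numpy sketch of this objective, with random linear maps standing in for the encoder and predictor and the \(\Delta\) conditioning omitted. The stop-gradient appears as the target embedding being treated as a constant when differentiating, so only the predictor branch receives gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_rep = 8, 4

W_enc = rng.normal(size=(d_rep, d_in))    # encoder (illustrative random map)
W_pred = rng.normal(size=(d_rep, d_rep))  # predictor

x = rng.normal(size=d_in)  # context segment
y = rng.normal(size=d_in)  # target segment

s_x = W_enc @ x
s_y = W_enc @ y            # sg(Encoder(y)): treated as a constant below

pred = W_pred @ s_x
loss = np.sum((pred - s_y) ** 2)

# Because of the stop-gradient, the loss gradient flows only through the
# predictor: dL/dW_pred = 2 (pred - s_y) s_x^T. The target encoder gets no
# gradient, which is what prevents representational collapse.
grad_W_pred = 2.0 * np.outer(pred - s_y, s_x)
print(f"loss = {loss:.3f}, ||dL/dW_pred|| = {np.linalg.norm(grad_W_pred):.3f}")
```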

Can we combine the best of both worlds? Autoregressive LLMs have proven remarkably capable, but they waste compute on trivial predictions and struggle with planning. World models are theoretically elegant but haven't been made to work at scale for language. The frontier: hybrid architectures that use autoregressive generation for language output but internal latent-space reasoning for planning.
· · ·
Frontier III

Reasoning & Formal Verification

Do LLMs Actually Reason?

This is one of the most debated questions in AI. Current LLMs exhibit behaviors that look like reasoning: they solve math problems, write proofs, debug code, and plan multi-step solutions. But there's a fundamental ambiguity: are they performing genuine logical reasoning, or sophisticated pattern matching on reasoning-like text?

Evidence for reasoning: Models solve novel problems not in their training data (verified by withholding problems). They transfer between formats (solve a word problem, then solve the same problem in code). Performance scales with chain-of-thought length in predictable ways. They discover valid mathematical proofs.

Evidence against reasoning: Models fail on trivially modified versions of familiar problems (change the numbers in a well-known puzzle and they fail). They're sensitive to irrelevant surface features. They can't reliably verify their own outputs. Performance degrades on out-of-distribution logical structures.

Process Reward Models

Lightman et al. (2023) — "Let's Verify Step by Step." Train a reward model that evaluates each step of reasoning, not just the final answer. Process reward models (PRMs) dramatically improve mathematical reasoning over outcome reward models (ORMs). The key insight: rewarding correct reasoning process leads to more reliable answers than rewarding correct answers, because a model can reach a correct answer through flawed reasoning.
Process Reward vs Outcome Reward $$R_{\text{ORM}}(\text{solution}) = \mathbb{1}[\text{final answer is correct}]$$ $$R_{\text{PRM}}(\text{solution}) = \prod_{t=1}^{T} P(\text{step } t \text{ is correct} \mid \text{steps}_{1:t})$$
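The two reward definitions above, sketched as code. The per-step probabilities are made-up numbers for illustration.

```python
def orm_reward(final_answer_correct: bool) -> float:
    """Outcome reward: 1 if the final answer is right, else 0."""
    return 1.0 if final_answer_correct else 0.0

def prm_reward(step_correct_probs: list[float]) -> float:
    """Process reward: product of per-step correctness probabilities."""
    r = 1.0
    for p in step_correct_probs:
        r *= p
    return r

# A solution with one shaky middle step is penalized by the PRM even when
# the final answer happens to come out right:
print(orm_reward(True))                          # 1.0
print(round(prm_reward([0.99, 0.5, 0.98]), 3))   # 0.485
```

This is why PRM-guided search avoids lucky-but-flawed derivations: the product form means a single low-confidence step drags down the whole solution's score.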

Formal Mathematics & Automated Theorem Proving

LLMs + formal proof assistants (Lean, Coq, Isabelle): Train LLMs to generate formal proofs that can be mechanically verified. This sidesteps the reasoning ambiguity entirely — if the proof checks, the reasoning is correct by construction. AlphaProof (DeepMind, 2024) solved IMO problems by generating Lean proofs via search. This is a path to provably correct AI reasoning.

Neuro-Symbolic Integration

Combine the flexible pattern recognition of neural networks with the precise logical reasoning of symbolic systems. The LLM generates logical formulas or programs; a symbolic engine executes them exactly. This separates "understanding the problem" (neural) from "computing the answer" (symbolic).

The fundamental question: Is there a clean boundary between "understanding" and "reasoning," or are they the same thing at different levels of abstraction? If neural networks can be made large and well-trained enough, do they subsume symbolic reasoning? Or is there an irreducible gap between pattern matching and logical deduction?
· · ·
Frontier IV

Efficient Architectures

Is the Transformer Optimal?

The transformer has been dominant since 2017, but its quadratic attention cost limits scalability. Several alternative architectures are competing:

Architecture                 | Complexity       | Strengths                                              | Weaknesses
Transformer                  | \(O(n^2 d)\)     | Proven at all scales; rich research ecosystem          | Quadratic cost; large KV cache
Mamba (SSM)                  | \(O(n d)\)       | Linear scaling; fast inference; no KV cache            | Weaker on recall-heavy tasks
RWKV                         | \(O(n d)\)       | Linear; trained like a transformer, infers like an RNN | Limited in-context learning
Hyena                        | \(O(n \log n)\)  | Long convolutions; no attention matrix                 | Less proven at scale
Hybrid (Mamba + Attention)   | Mixed            | Best of both: linear for most, attention where needed  | Complex engineering
Hybrid architectures are the likely near-term winner. Models like Jamba (AI21, 2024) interleave Mamba layers for efficiency with attention layers for recall-heavy tasks. The ratio of SSM-to-attention layers and which positions get attention are key design choices. Research question: what is the minimum number of attention layers needed to preserve full in-context learning ability?
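A quick sketch of why the attention-layer budget matters, using only the asymptotic costs from the table. Constants and architecture details are ignored, so only the growth rates are meaningful.

```python
# Per-layer cost scaling in sequence length n and model width d,
# per the complexity column above (constants dropped).
def attention_cost(n: int, d: int) -> int:
    return n * n * d   # O(n^2 d)

def ssm_cost(n: int, d: int) -> int:
    return n * d       # O(n d)

d = 4096
for n in [1_024, 32_768, 1_048_576]:
    ratio = attention_cost(n, d) // ssm_cost(n, d)   # ratio is exactly n
    print(f"n = {n:>9,}: attention/SSM per-layer cost ratio = {ratio:,}x")
```

At million-token contexts an attention layer is a million times more expensive than an SSM layer under this model, which is why hybrids ration attention to the few positions where recall demands it.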

Hardware-Architecture Co-Design

The transformer was designed for GPU parallelism. New architectures should be designed with hardware constraints: memory bandwidth, communication latency, chip topology. This "hardware lottery" (Hooker, 2020) determines which architectures succeed — not just theoretical efficiency, but practical throughput on real hardware.

· · ·
Frontier V

Mechanistic Interpretability

Part II §X introduced interpretability. Here we go deeper into the frontier of understanding how neural networks compute.

The Superposition Hypothesis

Elhage et al. (2022) — "Toy Models of Superposition." Showed that neural networks can represent more features than they have dimensions by encoding features as nearly orthogonal directions. Sparse features (rarely active together) can share dimensions without interference.
Superposition capacity $$\text{In } \mathbb{R}^d, \text{ you can fit } \sim \exp(c \cdot d) \text{ nearly-orthogonal vectors}$$

This means a 4096-dimensional residual stream can potentially represent millions of features. The cosine similarity between two random vectors in high dimensions is on the order of \(1/\sqrt{d}\) — nearly zero. Features that rarely co-occur can safely share dimensions.
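The near-orthogonality claim is easy to check empirically. A small numpy experiment, with d = 4096 chosen to match the residual-stream example above:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_pairs = 4096, 1000

# Sample pairs of random unit vectors and measure their cosine similarity.
v = rng.normal(size=(n_pairs, d))
w = rng.normal(size=(n_pairs, d))
v /= np.linalg.norm(v, axis=1, keepdims=True)
w /= np.linalg.norm(w, axis=1, keepdims=True)

cos = np.abs(np.sum(v * w, axis=1))   # |cosine| per pair
print(f"mean |cos| = {cos.mean():.4f}, 1/sqrt(d) = {1/np.sqrt(d):.4f}")
```

The measured mean lands at the same order of magnitude as \(1/\sqrt{d}\), i.e. random directions in 4096 dimensions are almost orthogonal by default.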

Sparse Autoencoders — Extracting Features

Anthropic's "Scaling Monosemanticity" (2024) — Trained sparse autoencoders on Claude's activations, finding millions of interpretable features: concepts, entities, syntactic roles, reasoning patterns. Many features correspond to human-understandable concepts (e.g., "Golden Gate Bridge", "sycophantic praise", "code bugs").
Sparse Autoencoder $$\mathbf{f} = \text{ReLU}(\mathbf{W}_{\text{enc}}(\mathbf{x} - \mathbf{b}_{\text{dec}}) + \mathbf{b}_{\text{enc}})$$ $$\hat{\mathbf{x}} = \mathbf{W}_{\text{dec}} \mathbf{f} + \mathbf{b}_{\text{dec}}$$ $$\mathcal{L} = \|\mathbf{x} - \hat{\mathbf{x}}\|^2 + \lambda \|\mathbf{f}\|_1$$

The encoder maps a \(d\)-dimensional activation to a much higher-dimensional (\(kd\), where \(k \gg 1\)) sparse feature space. The L1 penalty encourages most features to be zero โ€” only the relevant ones fire. Each column of \(\mathbf{W}_{\text{dec}}\) is a "feature direction" in activation space.
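A numpy sketch of the SAE forward pass and loss above. The weights here are random and purely illustrative: a real SAE is trained on model activations, and the L1 term only induces sparsity through training.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 64, 8                       # activation dim, expansion factor (kd features)

W_enc = rng.normal(0, 0.1, size=(k * d, d))
W_dec = rng.normal(0, 0.1, size=(d, k * d))
b_enc = np.zeros(k * d)
b_dec = np.zeros(d)
lam = 1e-3                         # L1 sparsity coefficient

x = rng.normal(size=d)             # one residual-stream activation (toy)

f = np.maximum(0.0, W_enc @ (x - b_dec) + b_enc)   # sparse feature activations
x_hat = W_dec @ f + b_dec                          # reconstruction
loss = np.sum((x - x_hat) ** 2) + lam * np.sum(np.abs(f))

# With untrained random weights, ReLU alone zeroes roughly half the features;
# training against the L1 penalty is what pushes most of the rest to zero.
print(f"active features: {int((f > 0).sum())}/{k*d}, loss = {loss:.2f}")
```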

Can we fully understand a neural network? Current interpretability work has found individual features and small circuits, but we're far from a complete understanding of how a model produces any given output. The challenge: a 70B parameter model has trillions of possible circuits. We need automated tools that scale, not just manual inspection. This is arguably the most important research direction for AI safety.
· · ·
Frontier VI

Multimodal & Embodied Intelligence

Native Multimodality

Current vision-language models bolt a vision encoder onto a text LLM (Part III §VI). The frontier is natively multimodal models that process all modalities from the start — text, images, audio, video, and sensor data — in a single unified architecture.

Token-based unification: Convert everything to tokens. Images become patch tokens (ViT), audio becomes spectrogram tokens, video becomes spatiotemporal tokens. A single transformer processes all modalities through the same attention mechanism. This is how models like Gemini and GPT-4o approach multimodality — though the details of their architectures remain proprietary.

Video Understanding

Video is vastly richer than images: temporal dynamics, causality, physics, human behavior. A 1-minute video at 24fps with 256×256 resolution contains ~100M pixels — far more than the context window of any current model. Key challenges: efficient temporal representation, long-range temporal reasoning, and learning physics from observation.

Embodied AI & Robotics

Foundation models for robotics: Train large models on diverse robot data (multiple robot types, environments, tasks), then fine-tune or prompt for specific tasks. This is the "ImageNet moment" for robotics that hasn't happened yet. Challenges: real-world data is expensive and dangerous to collect, sim-to-real transfer remains imperfect, and physical safety requirements are much stricter than text safety.
· · ·
Frontier VII

Alignment & Safety at Scale

The Scalable Oversight Problem

Current alignment depends on humans evaluating model outputs. But what happens when models are smarter than the evaluators? A model that can write better code than any human programmer can also generate outputs that no human can fully verify. How do we ensure alignment when we can't check the work?

Proposed Solutions

Recursive reward modeling: Use AI to help humans evaluate AI. A more capable model assists the human evaluator, who then provides the training signal. The hope: the evaluation task is easier than the generation task, so the evaluator can be less capable than the model being evaluated.

Debate: Two AI models argue opposing sides; a human judge evaluates. In theory, the truthful model should be able to point out the deceptive model's flaws, even if the judge can't independently verify the claims.

Interpretability-based alignment: Instead of evaluating outputs, directly inspect the model's internal representations to verify it's "thinking" aligned thoughts. This requires mechanistic interpretability (Frontier V) to actually work.

Reward Hacking & Goodhart's Law

"When a measure becomes a target, it ceases to be a good measure." — Goodhart's Law. Any reward model is an imperfect proxy for what we actually want. A sufficiently capable optimizer will find inputs that score high on the proxy but violate the spirit. Examples: a model trained to be "helpful" that agrees with everything (sycophancy), or a model trained to "avoid harmful outputs" that refuses everything (over-refusal).
The Overoptimization Problem $$\text{as } D_{\text{KL}}[\pi_\theta \,\|\, \pi_{\text{ref}}] \text{ increases}, \quad R_{\text{gold}}(\pi_\theta) \text{ first rises, then falls}$$

Gao et al. (2023) showed empirically that as you optimize more aggressively against a reward model (increasing KL from the base policy), performance initially improves on the true reward, then degrades. The reward model is exploited beyond its region of validity.
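A toy experiment in the spirit of this result (not the actual setup of Gao et al.): the proxy reward is the gold reward plus a non-negative, exploitable error term, and best-of-n selection under the proxy plays the role of increasingly aggressive optimization. As n grows, the optimizer increasingly selects for the error term rather than the gold reward.

```python
import numpy as np

rng = np.random.default_rng(0)

n_pool = 100_000
gold = rng.normal(size=n_pool)            # true reward of each candidate
exploit = rng.exponential(size=n_pool)    # proxy error the optimizer can game
proxy = gold + exploit                    # the reward model we optimize against

for n in [10, 1_000, 100_000]:
    best = np.argmax(proxy[:n])           # best-of-n selection under the proxy
    print(f"n = {n:>7}: proxy = {proxy[best]:5.2f}, gold = {gold[best]:5.2f}, "
          f"gap = {exploit[best]:4.2f}")
```

The proxy score of the selected sample climbs monotonically with n, but an increasing share of it is the exploit term, mirroring how a reward model is "exploited beyond its region of validity."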

The Governance Challenge

Technical alignment is necessary but not sufficient. Who decides what "aligned" means? Different cultures, legal systems, and individuals have different values. The governance question — how AI development should be regulated, who has access to powerful models, and how to prevent misuse while enabling innovation — is as important as the technical questions, and far less well-defined.

· · ·
Frontier VIII

The Big Questions

Does Scale = Intelligence?

The "scaling hypothesis" — that sufficient scale (parameters, data, compute) will lead to human-level intelligence — is the central bet of the current AI paradigm. Evidence so far: every order of magnitude of scale has produced new capabilities that weren't predicted (in-context learning, chain-of-thought reasoning, code generation). But does this trend continue indefinitely?

The optimistic case: Intelligence is a computational phenomenon, and the transformer + next-token prediction is a sufficiently general learning algorithm. Scale plus data will eventually produce any cognitive capability humans have.

The skeptical case: Current LLMs learn correlations, not causation. They lack grounding, embodiment, and the ability to truly learn from experience. Scale gets you further on the correlation curve but never crosses to genuine understanding. New architectural ideas are needed.

Emergence — What Creates New Capabilities?

Why do capabilities appear suddenly at certain scales? A model with 6B parameters can't do chain-of-thought math; a model with 60B parameters can. Where does this ability come from? Is it genuinely emergent (a qualitative phase transition), or is it a smooth improvement that crosses a visibility threshold? Recent work suggests many "emergent" abilities may be measurement artifacts — they look discontinuous because of the evaluation metric, not the underlying capability. This remains actively debated.
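The measurement-artifact argument can be made concrete with a toy model, assuming (purely for illustration) that per-token accuracy improves smoothly with scale while the benchmark scores exact match on a 10-token answer:

```python
import numpy as np

scales = np.array([1, 3, 10, 30, 100])      # model scale, arbitrary units
p = 0.5 + 0.45 * np.log10(scales) / 2       # smooth per-token accuracy (toy curve)
exact_match = p ** 10                        # all-or-nothing metric: 10 tokens right

for s, pi, em in zip(scales, p, exact_match):
    print(f"scale {s:>3}: per-token acc = {pi:.2f}, exact match = {em:.3f}")
```

The per-token curve is gently sloped, but the exact-match score stays near zero and then shoots up over the last factor of scale: a smooth capability viewed through a discontinuous metric looks like emergence.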

The Nature of Understanding

When a language model generates a correct explanation of quantum mechanics, does it "understand" quantum mechanics? The Chinese Room argument (Searle, 1980) says no — syntax manipulation doesn't constitute semantics. But this may be based on an intuition about understanding that doesn't scale. If a system can answer any question about quantum mechanics, use quantum reasoning to solve novel problems, detect errors in quantum reasoning, and generate new insights — at what point is the distinction between "real understanding" and "mere simulation" meaningful?

A pragmatic perspective: Understanding is not a binary property. It's a spectrum of capability — how broadly, flexibly, and reliably a system can use knowledge. By this measure, LLMs have genuine but limited understanding: broad but shallow, flexible but fragile, capable but unreliable. The frontier is pushing every dimension simultaneously.

Information-Theoretic Limits

The Data Processing Inequality $$X \to Y \to Z \implies I(X; Z) \leq I(X; Y)$$

No processing of the data can create information that wasn't there to begin with. A model trained on text can learn only what's in text — it cannot learn the physical sensation of heat or the experience of seeing red. This is a hard mathematical limit. The question: how much of human-relevant knowledge is capturable in text? Probably much more than we initially thought (given GPT-4's capabilities), but probably not everything.
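The inequality can be verified numerically on a tiny Markov chain. Here X is a fair bit and each arrow is a binary symmetric channel with flip probability 0.1 (an illustrative choice):

```python
import numpy as np

def mutual_information(p_joint: np.ndarray) -> float:
    """I(A;B) in bits, computed exactly from a joint distribution p(a, b)."""
    p_a = p_joint.sum(axis=1, keepdims=True)
    p_b = p_joint.sum(axis=0, keepdims=True)
    mask = p_joint > 0
    return float(np.sum(p_joint[mask] * np.log2(p_joint[mask] / (p_a @ p_b)[mask])))

eps = 0.1
flip = np.array([[1 - eps, eps], [eps, 1 - eps]])   # binary symmetric channel p(out|in)

p_x = np.array([0.5, 0.5])          # X: fair bit
p_xy = p_x[:, None] * flip          # joint p(x, y) after one noisy channel
p_xz = p_xy @ flip                  # joint p(x, z): Z depends on Y only (Markov)

print(f"I(X;Y) = {mutual_information(p_xy):.4f} bits")
print(f"I(X;Z) = {mutual_information(p_xz):.4f} bits  (never exceeds I(X;Y))")
```

Each extra processing step (here, a second noisy channel) can only shed information about X, exactly as the inequality demands.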

What Would It Take to Be Confident?

The honest answer: we don't have a theory of intelligence robust enough to predict when (or if) AI systems will match human cognition in all dimensions. We don't even have agreement on what "intelligence" means. What we do have are engineering systems that are incredibly useful and becoming more so every year, and a set of mathematical tools (the contents of this entire series) that explain how they work at the mechanistic level.

Questions to carry forward:
(a) If you were designing a benchmark to test whether a model truly understands a concept (vs. pattern matching), what would it look like?
(b) What's the minimum amount of architectural inductive bias a learning system needs? Could a model with zero prior structure (no attention, no convolutions, no recurrence) learn language if given enough data and compute?
(c) The information bottleneck (Part V §6) says neural networks compress representations to keep only task-relevant information. Does this mean models must learn something like understanding, or can they achieve high performance without it?
(d) Design a research agenda to determine whether LLMs have "world models" — internal representations that capture causal structure of the world, not just statistical correlations.
· · ·

The Complete Series

I.    Math Foundations → Core ML → Neural Networks → Transformers → Modern LLMs
II.   Backprop Walkthrough → Transformer Trace → Implementation → Training → Inference
III.  Efficient Attention → LoRA → Quantization → Reasoning → RAG → Multimodal
IV.   Complete PyTorch Cookbook — 10 From-Scratch Implementations
V.    Information Theory → VAEs → GANs → Diffusion Models → Geometry of Learning
VI.   Production Serving → Open-Source Ecosystem → Safety → Paper Reading → Glossary
VII.  5 End-to-End Projects — Tokenizer · GPT · RAG · LoRA · Agent
VIII. The Frontier — Open Problems & the Future of AI

"We are like butterflies who flutter for a day and think it is forever."

— Carl Sagan

The field of AI is moving faster than any field in history.
Everything in this series — 8 volumes, ~500 pages of mathematics,
code, and analysis — captures a snapshot of early 2026.
Parts of it will be outdated within a year. The math won't be.

Dot products, gradients, probability, and information theory
are permanent. They are the bedrock on which every future
architecture, training method, and alignment technique will be built.

You now have that bedrock. Build on it.