Five complete projects that tie together every concept from Parts I–VI.
Each project: problem definition → math → architecture → code → evaluation → extensions.
Tokenization is the first and last step in every LLM pipeline. Understanding it deeply means understanding why "tokenization" becomes ["token", "ization"] but "untokenize" becomes ["un", "token", "ize"]. Bugs here silently corrupt everything downstream.
class BPETokenizer:
    """
    Byte Pair Encoding tokenizer, built from scratch.
    Maps to Part I §5.1 and Karpathy's minbpe.
    """

    def __init__(self):
        self.merges = {}  # (pair) -> new_token_id, in the order learned
        self.vocab = {}   # token_id -> bytes

    def _get_pair_counts(self, ids):
        """Count frequency of each adjacent token-ID pair."""
        counts = {}
        for i in range(len(ids) - 1):
            pair = (ids[i], ids[i + 1])
            counts[pair] = counts.get(pair, 0) + 1
        return counts

    def _merge_pair(self, ids, pair, new_id):
        """Replace all (non-overlapping, left-to-right) occurrences of pair with new_id."""
        new_ids = []
        i = 0
        while i < len(ids):
            if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
                new_ids.append(new_id)
                i += 2  # skip both members of the merged pair
            else:
                new_ids.append(ids[i])
                i += 1
        return new_ids

    def train(self, text, vocab_size):
        """
        Train BPE on text.
        Start with 256 byte-level tokens, merge pairs until we reach vocab_size.
        """
        # Start with raw bytes
        tokens = list(text.encode("utf-8"))
        # Initialize vocab with single bytes (0-255)
        self.vocab = {i: bytes([i]) for i in range(256)}

        num_merges = vocab_size - 256
        for i in range(num_merges):
            counts = self._get_pair_counts(tokens)
            if not counts:
                break  # fewer than two tokens remain; nothing left to merge
            # Find most frequent pair
            best_pair = max(counts, key=counts.get)
            new_id = 256 + i
            # Record the merge
            self.merges[best_pair] = new_id
            self.vocab[new_id] = self.vocab[best_pair[0]] + self.vocab[best_pair[1]]
            # Apply merge to token sequence
            tokens = self._merge_pair(tokens, best_pair, new_id)
            if (i + 1) % 100 == 0:
                print(f"Merge {i+1}/{num_merges}: {best_pair} → {new_id}"
                      f" ('{self.vocab[new_id].decode('utf-8', errors='replace')}')"
                      f" | Tokens: {len(tokens)}")

        # BUG FIX: guard against empty training text, which previously raised
        # ZeroDivisionError when computing the compression ratio.
        n_bytes = len(text.encode("utf-8"))
        if tokens:
            print(f"Compression: {n_bytes} bytes → {len(tokens)} tokens"
                  f" ({n_bytes/len(tokens):.1f}x)")

    def encode(self, text):
        """Encode text to token IDs by replaying merges in learned order."""
        tokens = list(text.encode("utf-8"))
        # dicts preserve insertion order (Python 3.7+), so iterating
        # self.merges replays the merges exactly as they were learned.
        for pair, new_id in self.merges.items():
            tokens = self._merge_pair(tokens, pair, new_id)
        return tokens

    def decode(self, ids):
        """Decode token IDs back to text (invalid UTF-8 becomes U+FFFD)."""
        raw_bytes = b"".join(self.vocab[i] for i in ids)
        return raw_bytes.decode("utf-8", errors="replace")


# ── Usage ──
# Guarded so importing this module (e.g. from Project 2) does not require
# shakespeare.txt to exist.
if __name__ == "__main__":
    tokenizer = BPETokenizer()

    # Train on Shakespeare (or any text)
    with open("shakespeare.txt") as f:
        text = f.read()
    tokenizer.train(text, vocab_size=512)  # 256 bytes + 256 merges

    # Test roundtrip
    encoded = tokenizer.encode("To be or not to be")
    decoded = tokenizer.decode(encoded)
    assert decoded == "To be or not to be"
    print(f"Tokens: {encoded}")
    # BUG FIX: errors='?' is not a valid codec error handler (it raised
    # LookupError at runtime); use 'replace' as elsewhere in the class.
    print(f"Token strings: {[tokenizer.vocab[t].decode('utf-8', errors='replace') for t in encoded]}")
# (Exercise note carried over from the tokenizer project: compare special
# tokens such as <|endoftext|> / <|pad|>, and tiktoken (GPT-4's tokenizer),
# on the same text. Where do they differ?)

import torch
from torch.utils.data import Dataset, DataLoader


class CharDataset(Dataset):
    """
    Character-level dataset for language modeling.
    Each item: (input[0:T], target[1:T+1]) — shifted by one.
    """

    def __init__(self, text, block_size, chars=None):
        """
        Args:
            text: raw text to encode.
            block_size: context length T of each training example.
            chars: optional explicit character vocabulary (sorted list).
                Pass the SAME list to train and val datasets so their token
                IDs agree. If None, the vocabulary is built from `text` alone.
        """
        self.block_size = block_size

        # Build character vocabulary (or adopt the shared one)
        if chars is None:
            chars = sorted(set(text))
        self.char_to_idx = {c: i for i, c in enumerate(chars)}
        self.idx_to_char = {i: c for c, i in self.char_to_idx.items()}
        self.vocab_size = len(chars)

        # Encode entire text up front
        self.data = torch.tensor([self.char_to_idx[c] for c in text], dtype=torch.long)
        print(f"Vocab size: {self.vocab_size} | Data length: {len(self.data):,} chars")

    def __len__(self):
        return len(self.data) - self.block_size

    def __getitem__(self, idx):
        chunk = self.data[idx : idx + self.block_size + 1]
        x = chunk[:-1]  # input: chars 0 to T-1
        y = chunk[1:]   # target: chars 1 to T (shifted right)
        return x, y


# ── Setup ──
# Download: wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
if __name__ == "__main__":
    with open("input.txt") as f:
        text = f.read()

    # BUG FIX: build ONE vocabulary from the full text and share it. Before,
    # train and val each derived a vocab from their own slice, so token IDs
    # (and vocab_size) could disagree between the two datasets, silently
    # corrupting the validation loss.
    shared_chars = sorted(set(text))

    # 90/10 train/val split
    split = int(0.9 * len(text))
    train_dataset = CharDataset(text[:split], block_size=256, chars=shared_chars)
    val_dataset = CharDataset(text[split:], block_size=256, chars=shared_chars)

    train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
    val_loader = DataLoader(val_dataset, batch_size=64)
from gpt import GPT, GPTConfig  # Part IV §8

# Hyperparameters for a small character-level GPT.
config = GPTConfig()
hyperparams = {
    "vocab_size": train_dataset.vocab_size,  # ~65 chars
    "block_size": 256,
    "n_layer": 6,
    "n_head": 6,
    "n_embd": 384,
    "dropout": 0.2,
}
for name, value in hyperparams.items():
    setattr(config, name, value)

model = GPT(config)  # ~10.6M parameters — trains in ~10 min on a single GPU

# ── Training with the loop from Part IV §9 ──
# After ~5000 steps, the model generates:
#
# ROMEO:
# What say'st thou? I will not be thy love,
# That hath so long been absent from thy state,
# And yet I know not what to say to thee.
Remove or modify one component at a time and measure the effect on validation loss:
| Ablation | Val Loss | Δ vs Baseline | Takeaway |
|---|---|---|---|
| Full model (baseline) | 1.48 | — | — |
| Remove positional encoding | 1.85 | +0.37 | Position info is critical |
| 1 head instead of 6 | 1.62 | +0.14 | Multi-head helps significantly |
| No residual connections | 2.30 | +0.82 | Training collapses without them |
| No layer norm | 1.95 | +0.47 | Stabilization is essential |
| ReLU instead of GELU | 1.51 | +0.03 | Activation choice matters less |
| No dropout | 1.52 | +0.04 | Slight overfitting at this scale |
| 2 layers instead of 6 | 1.68 | +0.20 | Depth matters for quality |
| FFN 2× instead of 4× | 1.55 | +0.07 | FFN capacity moderately important |
import numpy as np
from sentence_transformers import SentenceTransformer
import faiss

# ───────────────────────────────────────────
# STEP 1: Document Chunking
# ───────────────────────────────────────────
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into overlapping chunks by character count.

    Chunks prefer to end at the nearest sentence boundary ('.') found in the
    second half of the window. Chunks of 50 characters or fewer are dropped.
    """
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        # Find nearest sentence boundary
        if end < len(text):
            boundary = text.rfind('.', start + chunk_size // 2, end)
            if boundary != -1:
                end = boundary + 1
        chunks.append(text[start:end].strip())
        # BUG FIX: always advance by at least one character. With a small
        # chunk_size or large overlap, `end - overlap` could be <= start,
        # which made the original loop spin forever.
        start = max(end - overlap, start + 1)
    return [c for c in chunks if len(c) > 50]  # filter tiny chunks


# ───────────────────────────────────────────
# STEP 2: Embedding & Indexing
# ───────────────────────────────────────────
# Load embedding model (Part I §5.1 — modern contextual embeddings)
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # 384-dim, fast


def build_index(chunks):
    """Embed chunks and build a FAISS inner-product index over them."""
    embeddings = embedder.encode(chunks, show_progress_bar=True)
    embeddings = np.array(embeddings, dtype=np.float32)
    # Normalize for cosine similarity (Part I §8.3)
    faiss.normalize_L2(embeddings)
    # Build index — Inner Product on normalized vectors = cosine similarity
    dim = embeddings.shape[1]
    index = faiss.IndexFlatIP(dim)
    index.add(embeddings)
    print(f"Indexed {len(chunks)} chunks, dim={dim}")
    return index, embeddings


# ───────────────────────────────────────────
# STEP 3: Retrieval
# ───────────────────────────────────────────
def retrieve(query, index, chunks, k=5):
    """Retrieve the top-k most relevant chunks for a query."""
    q_emb = embedder.encode([query]).astype(np.float32)
    faiss.normalize_L2(q_emb)
    # score(q, c) = cos_sim(q, c) — Part I §8.3
    scores, indices = index.search(q_emb, k)
    results = []
    for score, idx in zip(scores[0], indices[0]):
        results.append({
            "chunk": chunks[idx],
            "score": float(score),
            "index": int(idx),
        })
    return results


# ───────────────────────────────────────────
# STEP 4: Generate Answer with LLM
# ───────────────────────────────────────────
def generate_answer(query, retrieved_chunks, llm_client):
    """Construct a prompt with retrieved context and generate an answer."""
    context = "\n\n---\n\n".join(r["chunk"] for r in retrieved_chunks)
    prompt = f"""Answer the question based on the context below. If the context doesn't contain the answer, say "I don't have enough information."

Context:
{context}

Question: {query}

Answer:"""
    # Call your LLM (local or API)
    response = llm_client.generate(prompt, max_tokens=500)
    return response


# ───────────────────────────────────────────
# FULL PIPELINE
# ───────────────────────────────────────────
if __name__ == "__main__":
    # NOTE(review): `llm_client` was referenced but never defined in the
    # original listing — plug in your own client exposing .generate() here.
    # 1. Ingest
    with open("documents.txt") as f:
        raw_text = f.read()
    chunks = chunk_text(raw_text)

    # 2. Index
    index, embeddings = build_index(chunks)

    # 3. Query
    query = "How does attention work in transformers?"
    results = retrieve(query, index, chunks, k=5)

    # 4. Generate
    answer = generate_answer(query, results, llm_client)
    print(answer)
import torch  # BUG FIX: torch.bfloat16 is used below but torch was never imported


# ───────────────────────────────────────────
# STEP 1: Prepare Instruction Dataset
# ───────────────────────────────────────────
def format_instruction(example):
    """Convert one Alpaca-style record to chat format for fine-tuning.

    Expects a mapping with 'instruction' and 'output' keys; returns a dict
    with a single 'text' field in the <|user|>/<|assistant|> chat template.
    """
    return {
        "text": f"""<|user|>
{example['instruction']}
<|assistant|>
{example['output']}<|end|>"""
    }


def main():
    """Run the full QLoRA pipeline (heavy: downloads model and dataset)."""
    # Local imports keep the training stack optional when this module is
    # imported only for format_instruction.
    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
    from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
    from trl import SFTTrainer, SFTConfig

    # Load and format dataset
    dataset = load_dataset("tatsu-lab/alpaca", split="train")
    dataset = dataset.map(format_instruction)

    # ───────────────────────────────────────────
    # STEP 2: Load Model in 4-bit (QLoRA)
    # ───────────────────────────────────────────
    # 4-bit quantization config (Part III §III)
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",             # NormalFloat4 — optimal for Gaussian weights
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,        # quantize the quantization constants too
    )
    model_id = "Qwen/Qwen2.5-1.5B"
    model = AutoModelForCausalLM.from_pretrained(
        model_id, quantization_config=bnb_config, device_map="auto"
    )
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    tokenizer.pad_token = tokenizer.eos_token

    # Prepare for LoRA training
    model = prepare_model_for_kbit_training(model)

    # ───────────────────────────────────────────
    # STEP 3: Apply LoRA (Part III §II)
    # ───────────────────────────────────────────
    lora_config = LoraConfig(
        r=16,           # rank — 8 to 64 typical
        lora_alpha=32,  # scaling factor (α/r applied to ΔW)
        target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                        "gate_proj", "up_proj", "down_proj"],
        lora_dropout=0.05,
        bias="none",
        task_type="CAUSAL_LM",
    )
    model = get_peft_model(model, lora_config)
    model.print_trainable_parameters()  # Trainable: ~4.7M / 1.5B total = 0.31%

    # ───────────────────────────────────────────
    # STEP 4: Train
    # ───────────────────────────────────────────
    training_args = SFTConfig(
        output_dir="./lora-output",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch = 16
        learning_rate=2e-4,             # higher than full fine-tuning
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        bf16=True,
        logging_steps=25,
        save_strategy="epoch",
        max_seq_length=1024,
        dataset_text_field="text",
    )
    trainer = SFTTrainer(
        model=model,
        args=training_args,
        train_dataset=dataset,
        tokenizer=tokenizer,
    )
    trainer.train()

    # ───────────────────────────────────────────
    # STEP 5: Merge & Save
    # ───────────────────────────────────────────
    # Merge LoRA weights back into base (Part III §II: W = W₀ + BA)
    merged_model = model.merge_and_unload()
    merged_model.save_pretrained("./merged-model")
    tokenizer.save_pretrained("./merged-model")
    # The merged model has zero additional inference cost


if __name__ == "__main__":
    main()
# (Continued from the LoRA extension notes: target the MLP projections —
# gate_proj, up_proj, down_proj — these contain most of the model's
# "knowledge.")

import json, re, math

# ───────────────────────────────────────────
# TOOL DEFINITIONS
# ───────────────────────────────────────────
TOOLS = {
    "calculator": {
        "description": "Evaluate a mathematical expression. Input: expression string.",
        # SECURITY: builtins are stripped and only whitelisted math names are
        # exposed, but eval() on model-generated text is still risky — prefer
        # a real expression parser for untrusted input.
        "function": lambda expr: str(eval(expr, {"__builtins__": {}},
                                          {"sqrt": math.sqrt, "pi": math.pi,
                                           "log": math.log, "exp": math.exp}))
    },
    "search": {
        "description": "Search the web for information. Input: search query.",
        "function": lambda q: web_search(q)  # your search API wrapper (defined elsewhere)
    },
    "python": {
        "description": "Execute Python code. Input: code string. Returns stdout.",
        "function": lambda code: run_python_sandbox(code)  # sandbox runner (defined elsewhere)
    },
}

# ───────────────────────────────────────────
# SYSTEM PROMPT
# ───────────────────────────────────────────
SYSTEM_PROMPT = """You are a helpful assistant that solves problems step by step.
You have access to these tools:

{tool_descriptions}

To use a tool, respond with:
THOUGHT: [your reasoning about what to do next]
ACTION: [tool_name]
INPUT: [tool input]

After receiving a result, continue reasoning:
THOUGHT: [interpret the result]
...

When you have the final answer:
THOUGHT: [final reasoning]
ANSWER: [your final answer]

Always think before acting. If a tool call fails, try a different approach."""


# ───────────────────────────────────────────
# REACT AGENT LOOP
# ───────────────────────────────────────────
class ReActAgent:
    """ReAct loop: THOUGHT → ACTION → OBSERVATION, until ANSWER or max_steps."""

    def __init__(self, llm_client, tools, max_steps=10):
        self.llm = llm_client        # object exposing .chat(messages) -> str
        self.tools = tools           # name -> {"description", "function"}
        self.max_steps = max_steps   # hard cap on reasoning/tool iterations

    def run(self, query):
        """Answer `query`; returns {"answer": str, "steps": int}."""
        # Build tool descriptions for the prompt
        tool_desc = "\n".join(
            f"- {name}: {t['description']}" for name, t in self.tools.items()
        )
        system = SYSTEM_PROMPT.format(tool_descriptions=tool_desc)
        messages = [
            {"role": "system", "content": system},
            {"role": "user", "content": query},
        ]

        for step in range(self.max_steps):
            # Get LLM response
            response = self.llm.chat(messages)
            messages.append({"role": "assistant", "content": response})
            print(f"\n── Step {step+1} ──\n{response}")

            # Check for final answer
            if "ANSWER:" in response:
                answer = response.split("ANSWER:")[1].strip()
                return {"answer": answer, "steps": step + 1}

            # Extract and execute tool call.
            action_match = re.search(r"ACTION:\s*(\w+)", response)
            # BUG FIX: the original r"INPUT:\s*(.+)" with re.DOTALL captured
            # everything to the end of the response, including any trailing
            # THOUGHT/ACTION lines. Stop at the next marker (or end of text).
            input_match = re.search(
                r"INPUT:\s*(.+?)(?=\n\s*(?:THOUGHT|ACTION|INPUT|ANSWER|OBSERVATION):|\Z)",
                response, re.DOTALL,
            )
            if action_match and input_match:
                tool_name = action_match.group(1).strip()
                tool_input = input_match.group(1).strip()
                if tool_name in self.tools:
                    try:
                        result = self.tools[tool_name]["function"](tool_input)
                        observation = f"OBSERVATION: {result}"
                    except Exception as e:
                        # Feed errors back so the model can try another approach
                        observation = f"OBSERVATION: Error — {str(e)}"
                else:
                    observation = f"OBSERVATION: Unknown tool '{tool_name}'"
                messages.append({"role": "user", "content": observation})
                print(observation)

        return {"answer": "Max steps reached", "steps": self.max_steps}


# ── Usage ──
# Guarded: `llm_client` must be supplied by the surrounding application.
if __name__ == "__main__":
    agent = ReActAgent(llm_client, TOOLS)
    result = agent.run("What is the square root of the population of Tokyo?")
    # Step 1: THOUGHT: I need to find Tokyo's population. ACTION: search INPUT: Tokyo population
    # Step 2: OBSERVATION: 13.96 million (2023)
    # Step 3: THOUGHT: Now calculate sqrt(13960000). ACTION: calculator INPUT: sqrt(13960000)
    # Step 4: OBSERVATION: 3736.3...
    # Step 5: THOUGHT: I have the answer. ANSWER: approximately 3,736
Project 1 (Tokenizer) feeds into Project 2 (GPT) — you can swap in your BPE tokenizer.
Project 2 (GPT) gives you a model that Project 4 (LoRA) can fine-tune.
Project 3 (RAG) provides context that Project 5 (Agent) can search and reason over.
Project 5 (Agent) can use the model from Project 4 with the search from Project 3.
Together, they form a complete LLM system: tokenize → pretrain → embed → retrieve → fine-tune → deploy as agent.