Write the autoregressive LM objective and next-token loss.
Explain tokenization, pretraining, and scaling laws.
Describe fine-tuning and RLHF / DPO alignment.
Explain in-context learning and decoding.
Describe LLM systems: agents, tool use, and RAG.
What is a large language model?
A decoder-only transformer (L21), stacked \(N\) blocks deep and trained to do one thing: predict the next token.
Everything else — writing code, answering questions, translating — emerges from doing that one task extremely well, at scale.
Step 0: tokenization (BPE)
Text is first split into subword tokens from a fixed vocabulary (~50k). Byte-Pair Encoding starts from characters and repeatedly merges the most frequent adjacent pair.
"unbelievable" → un | believ | able (illustrative split; the actual pieces depend on the learned merges)
Common words = one token; rare words split into pieces.
Each token maps to an id, then to an embedding vector (the \(X\) rows from L21).
The model predicts a distribution over the whole vocabulary at each step.
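The merge loop can be sketched in a few lines. This is a toy version under stated assumptions: it works on characters of whole words, whereas production tokenizers typically operate on bytes and pre-split text first. The names `learn_bpe`, `merge_pair`, and `encode` are illustrative, not a library API.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_bpe(words, num_merges):
    """Learn an ordered list of merges from a list of words (repeats act as frequency)."""
    # Each word starts as a tuple of single characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by how often each word occurs.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

toy_words = ["abab", "abab", "abc"]
toy_merges = learn_bpe(toy_words, num_merges=2)
print(toy_merges)
print(encode("abab", toy_merges), encode("abc", toy_merges))
```

Frequent strings collapse into a single token after enough merges, while rarer ones stay split into pieces, which is the behavior described above.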
The training objective: predict the next token
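Factor the probability of a sequence left to right, and train to maximize it, i.e. minimize next-token cross-entropy:
\[ p_\theta(x_1,\dots,x_T) = \prod_{t=1}^{T} p_\theta(x_t \mid x_{<t}), \qquad \mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t}) \]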
The text is the label — no human annotation needed (self-supervised).
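A minimal sketch of the next-token loss in plain Python, assuming the model has already produced one vector of logits per position. The function names and the list-of-lists layout are illustrative choices; the loss is averaged over positions here, which is a common convention rather than something the slide specifies.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def next_token_loss(logits_per_position, token_ids):
    """Mean cross-entropy where position t is scored against token t+1.

    logits_per_position[t] holds the model's scores for the token that
    follows token_ids[:t+1], so the targets are the input shifted by one.
    """
    targets = token_ids[1:]
    if len(logits_per_position) != len(targets):
        raise ValueError("need one logit vector per predicted token")
    total = 0.0
    for logits, target in zip(logits_per_position, targets):
        total -= log_softmax(logits)[target]
    return total / len(targets)

# A model that knows nothing (uniform scores over 4 tokens) pays log(4) per token.
uniform_loss = next_token_loss([[0.0] * 4, [0.0] * 4], [0, 1, 2])
print(uniform_loss, math.log(4))
```

The targets are just the input sequence shifted by one position, which is why no separate labels are needed.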
Pretraining: one task, enormous scale
Train on trillions of tokens of text (web, books, code).
Self-supervised → no labels to collect; just predict the next token.
The result is a base model: a broad next-token predictor, not yet a helpful assistant.
Why does simply scaling this up work so well?
Scaling laws
Test loss falls as a smooth power law as we scale model size, data, and compute together.
Predictable: you can forecast a bigger model's loss before training it.
Chinchilla: for a compute budget, balance parameters and tokens (don't just grow the model).
Scaling also unlocks emergent abilities not present in small models.
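Both ideas can be sketched numerically. The code below makes assumptions not stated on the slide: a pure power law \(L = a\,x^{-\alpha}\) with no irreducible-loss term, training compute approximated as 6 FLOPs per parameter per token, and the commonly quoted Chinchilla rule of thumb of roughly 20 tokens per parameter. The function names are illustrative.

```python
import math

def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = log(a) - alpha * log(size)."""
    log_x = [math.log(x) for x in sizes]
    log_y = [math.log(y) for y in losses]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(log_x, log_y))
    var = sum((x - mean_x) ** 2 for x in log_x)
    slope = cov / var
    return math.exp(mean_y - slope * mean_x), -slope  # (a, alpha)

def predict_loss(a, alpha, size):
    """Extrapolate the fitted power law to a larger model or dataset size."""
    return a * size ** (-alpha)

def compute_optimal_split(flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget between parameters and tokens.

    Assumes flops ~= flops_per_param_token * params * tokens and
    tokens ~= tokens_per_param * params (both are rule-of-thumb values).
    """
    n_params = math.sqrt(flops / (flops_per_param_token * tokens_per_param))
    return n_params, tokens_per_param * n_params

# Synthetic small-scale runs generated from a known law, then extrapolated.
sizes = [1e6, 1e7, 1e8]
losses = [10.0 * s ** -0.1 for s in sizes]
a, alpha = fit_power_law(sizes, losses)
print(a, alpha, predict_loss(a, alpha, 1e9))
print(compute_optimal_split(1.2e20))
```

Fitting on small runs and extrapolating is the sense in which the loss is forecastable; the split function shows why a fixed budget is better spent on both a larger model and more tokens than on parameters alone.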
From base model to assistant: fine-tuning
A base model completes text; it doesn't reliably follow instructions. Supervised fine-tuning (SFT) continues training on curated pairs:
instruction tuning
Train on \(\{(\text{instruction},\ \text{ideal response})\}\) with the same next-token loss — now the "answer" is a good, on-task response written by humans.
Teaches format and helpfulness, but human-written answers are scarce and don't capture preferences between two decent replies.
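A sketch of how one (instruction, response) pair becomes a training example. The loss is the same next-token cross-entropy as in pretraining. Counting it only on response tokens (`mask_prompt=True`) is a common practice assumed here, not something the slide specifies; setting it to `False` recovers the plain loss over the whole sequence. All names are illustrative.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vector of vocabulary scores."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]

def sft_loss(logits_per_position, prompt_ids, response_ids, mask_prompt=True):
    """Mean next-token loss on the concatenation of prompt and response.

    With mask_prompt=True only predictions of response tokens are counted,
    so the model is trained to answer rather than to reproduce the prompt.
    """
    tokens = list(prompt_ids) + list(response_ids)
    targets = tokens[1:]
    if len(logits_per_position) != len(targets):
        raise ValueError("need one logit vector per predicted token")
    nll = [-log_softmax(logits)[target]
           for logits, target in zip(logits_per_position, targets)]
    # targets[t] is tokens[t + 1], so the first response token is target index
    # len(prompt_ids) - 1 (clamped to 0 for an empty prompt).
    first = max(len(prompt_ids) - 1, 0) if mask_prompt else 0
    kept = nll[first:]
    return sum(kept) / len(kept)

# Toy example with a 2-token vocabulary: the model is badly wrong about the
# prompt continuation (position 0) and uniform on the response positions.
toy_logits = [[5.0, 0.0], [0.0, 0.0], [0.0, 0.0]]
masked = sft_loss(toy_logits, prompt_ids=[0, 1], response_ids=[1, 0])
unmasked = sft_loss(toy_logits, prompt_ids=[0, 1], response_ids=[1, 0], mask_prompt=False)
print(masked, unmasked)
```

With masking, the error on the prompt position does not contribute to the loss, so only the quality of the response is being optimized.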
Alignment: RLHF (and DPO)
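Collect human preferences (which of two responses is better), fit a reward model \(r\), then optimize the policy \(\pi\). This is the RL from L15, applied to text:
\[ \max_{\pi}\ \mathbb{E}_{x,\ y \sim \pi(\cdot \mid x)}\big[\, r(x, y) \,\big] \;-\; \beta\, \mathrm{KL}\big(\pi(\cdot \mid x)\,\|\,\pi_{\text{ref}}(\cdot \mid x)\big) \]
Here, in the standard form of the objective, \(\pi_{\text{ref}}\) is the SFT model and \(\beta\) sets how strongly the policy is kept close to it.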