Lecture 21
RN 21.6 · "Attention Is All You Need" (Vaswani et al., 2017)
A recurrent network's state \(h_t = f(h_{t-1}, x_t)\) carries information forward, one step at a time.
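A minimal sketch of that recurrence, assuming a scalar state and a hypothetical update \(f(h, x) = \tanh(w_h h + w_x x)\); the weights and inputs are made up for illustration. It shows that the loop cannot run in parallel over \(t\), because each step needs the previous state.

```python
import math

def rnn_states(xs, w_h=0.5, w_x=1.0, h0=0.0):
    """Run h_t = f(h_{t-1}, x_t) with f = tanh(w_h*h + w_x*x) (illustrative choice)."""
    h = h0
    states = []
    for x in xs:                      # strictly sequential: step t needs h from step t-1
        h = math.tanh(w_h * h + w_x * x)
        states.append(h)
    return states

states = rnn_states([1.0, 0.0, -1.0])   # hypothetical input sequence
```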
Three problems at scale:
Attention replaces the step-by-step chain with direct connections: to build a token's new representation, it reads a weighted mix of all tokens, choosing the weights based on relevance.
Each of the \(n\) input tokens is mapped to a \(d\)-dimensional embedding. Stack them into a matrix:
\(X = \begin{bmatrix} \text{---}\ \mathbf{x}_1\ \text{---} \\ \vdots \\ \text{---}\ \mathbf{x}_n\ \text{---}\end{bmatrix} \in \mathbb{R}^{n \times d}\)
Row \(i\) is token \(i\)'s current representation. Attention will update every row.
From \(X\), three learned linear projections produce a query, key, and value for every token:
\(Q = XW_Q, \qquad K = XW_K, \qquad V = XW_V\)
Query \(\mathbf{q}_i\): "what am I looking for?"
Key \(\mathbf{k}_j\): "what do I offer?"
Value \(\mathbf{v}_j\): "what I pass on if attended to."
Token \(i\) attends to token \(j\) when query \(\mathbf{q}_i\) matches key \(\mathbf{k}_j\).
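The three projections can be sketched with plain list-of-lists matrices. The sizes (\(n = 3\), \(d = 2\)) and all weight values here are hypothetical; in a real model \(W_Q, W_K, W_V\) are learned.

```python
def matmul(A, B):
    """Multiply an (n x m) list-of-lists matrix by an (m x p) one."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

# Hypothetical toy input: 3 tokens, each a 2-dimensional embedding.
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# Made-up projection weights (learned in practice).
W_Q = [[1.0, 0.0], [0.0, 1.0]]
W_K = [[0.0, 1.0], [1.0, 0.0]]
W_V = [[2.0, 0.0], [0.0, 2.0]]

Q = matmul(X, W_Q)   # row i is the query q_i
K = matmul(X, W_K)   # row j is the key k_j
V = matmul(X, W_V)   # row j is the value v_j
```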
Score every query against every key (dot product), scale by \(\sqrt{d_k}\), where \(d_k\) is the dimension of the queries and keys, softmax each row into weights, then mix the values:
\(\text{Attention}(Q,K,V) = \operatorname{softmax}\!\left(\dfrac{QK^\top}{\sqrt{d_k}}\right) V\)
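A minimal sketch of this formula in plain Python, assuming small list-of-lists matrices. The helper names and the toy \(Q, K, V\) are illustrative, not from the lecture.

```python
import math

def softmax(row):
    m = max(row)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V; also return the weight matrix."""
    d_k = len(K[0])
    scores = [[sum(q * k for q, k in zip(q_i, k_j)) / math.sqrt(d_k) for k_j in K]
              for q_i in Q]
    weights = [softmax(row) for row in scores]          # each row sums to 1
    Z = [[sum(w * v_j[c] for w, v_j in zip(w_i, V)) for c in range(len(V[0]))]
         for w_i in weights]
    return Z, weights

# Hypothetical toy inputs: 2 tokens, d_k = d_v = 2.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
Z, weights = attention(Q, K, V)
```

Here each query matches its own key best, so each output row is pulled toward that token's own value vector.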
Three tokens. Each row of \(\operatorname{softmax}(QK^\top/\sqrt{d_k})\) says how much a token attends to the others (rows sum to 1):
The "cat" row: "cat" attends mostly to itself (0.6) and to "sat" (0.3), with the remaining 0.1 on "the".
Its new vector is the weighted sum of values:
\(\mathbf{z}_{\text{cat}} = 0.1\,\mathbf{v}_{\text{the}} + 0.6\,\mathbf{v}_{\text{cat}} + 0.3\,\mathbf{v}_{\text{sat}}\)
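The same weighted sum as arithmetic. The weights are the "cat" row from the example; the 2-dimensional value vectors are hypothetical, chosen only so the numbers are easy to follow.

```python
weights = {"the": 0.1, "cat": 0.6, "sat": 0.3}   # the "cat" row; sums to 1

values = {                                        # hypothetical value vectors
    "the": [1.0, 0.0],
    "cat": [0.0, 1.0],
    "sat": [1.0, 1.0],
}

# z_cat = 0.1 * v_the + 0.6 * v_cat + 0.3 * v_sat, computed per coordinate.
z_cat = [sum(weights[t] * values[t][c] for t in weights) for c in range(2)]
# approximately [0.4, 0.9]
```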
One attention pattern is limiting. Run \(h\) attention "heads" in parallel, each with its own \(W_Q, W_K, W_V\), then concatenate and project:
\(\text{head}_i = \text{Attention}(XW_Q^i, XW_K^i, XW_V^i)\)
\(\text{MHA}(X) = \big[\,\text{head}_1; \ldots; \text{head}_h\,\big]\,W_O\)
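A sketch of the two formulas above, under stated assumptions: random weights stand in for learned ones, and each head uses \(d_k = d_v = d/h\), a common choice that these slides do not specify. All helper names are illustrative.

```python
import math
import random

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(s - m) for s in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = [[s / math.sqrt(d_k) for s in row] for row in matmul(Q, K_T)]
    return matmul([softmax(row) for row in scores], V)

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_O):
    """heads is a list of (W_Q, W_K, W_V) triples, one per head."""
    outputs = [attention(matmul(X, W_Q), matmul(X, W_K), matmul(X, W_V))
               for W_Q, W_K, W_V in heads]
    # Concatenate the head outputs along the feature axis: n x (h * d_v).
    concat = [sum((head[i] for head in outputs), []) for i in range(len(X))]
    return matmul(concat, W_O)

rng = random.Random(0)
n, d, h = 4, 8, 2                     # hypothetical sizes
d_head = d // h                       # assumed: d_k = d_v = d / h
X = rand_matrix(n, d, rng)
heads = [tuple(rand_matrix(d, d_head, rng) for _ in range(3)) for _ in range(h)]
W_O = rand_matrix(h * d_head, d, rng)
out = multi_head_attention(X, heads, W_O)   # same shape as X: n x d
```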
Attention is a weighted sum, so it is permutation-equivariant: shuffle the tokens and the output just shuffles too. Nothing in it encodes word order, so we add position information to the embeddings.
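The shuffle property can be checked numerically. This sketch assumes identity projections (\(Q = K = V = X\)) to keep it short; the input and the permutation are made up.

```python
import math

def self_attention(X):
    """Self-attention with identity projections (Q = K = V = X), for illustration only."""
    d = len(X[0])
    out = []
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        out.append([sum(e / total * v[c] for e, v in zip(exps, X)) for c in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
perm = [2, 0, 1]                           # hypothetical reordering of the tokens
X_shuffled = [X[i] for i in perm]

Z = self_attention(X)
Z_shuffled = self_attention(X_shuffled)
Z_expected = [Z[i] for i in perm]          # the original outputs, reordered the same way

max_diff = max(abs(a - b)
               for row_a, row_b in zip(Z_shuffled, Z_expected)
               for a, b in zip(row_a, row_b))
```

`max_diff` is zero up to floating-point rounding: each token gets the same output vector wherever it sits in the sequence.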
Sinusoidal encoding (original transformer):
\(PE_{(pos,\,2i)} = \sin\!\big(pos / 10000^{2i/d}\big)\)
\(PE_{(pos,\,2i+1)} = \cos\!\big(pos / 10000^{2i/d}\big)\)
Add \(PE\) to each token's embedding. Modern models often use learned positions or RoPE (rotary).
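A minimal sketch of the sinusoidal formulas, assuming an even embedding dimension \(d\). The function names are illustrative.

```python
import math

def positional_encoding(n, d):
    """Return an n x d matrix with PE[pos][2i] = sin(.) and PE[pos][2i+1] = cos(.)."""
    pe = [[0.0] * d for _ in range(n)]
    for pos in range(n):
        for i in range(d // 2):           # assumes d is even
            angle = pos / (10000 ** (2 * i / d))
            pe[pos][2 * i] = math.sin(angle)
            pe[pos][2 * i + 1] = math.cos(angle)
    return pe

def add_positions(X):
    """Add the encoding to each token's embedding (row of X)."""
    pe = positional_encoding(len(X), len(X[0]))
    return [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, pe)]

PE = positional_encoding(n=4, d=6)
```

At position 0 every sine is 0 and every cosine is 1, so `PE[0]` alternates 0, 1, 0, 1, and so on.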
Each block = two sublayers (multi-head self-attention, then a position-wise feed-forward network), each wrapped in a residual connection and layer norm.
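A sketch of the wrapping, assuming the two sublayers are self-attention and a feed-forward network and that normalization follows the residual add, \(\text{LayerNorm}(x + \text{Sublayer}(x))\), as in the original paper; many later models normalize before the sublayer instead. The two sublayer functions are hypothetical stand-ins with no learned weights, and the layer norm omits its learned gain and bias.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one token vector to zero mean and (nearly) unit variance."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def residual_sublayer(X, sublayer):
    """Apply LayerNorm(x + Sublayer(x)) row by row."""
    Y = sublayer(X)
    return [layer_norm([x + y for x, y in zip(x_row, y_row)])
            for x_row, y_row in zip(X, Y)]

def block(X, attention_fn, feed_forward_fn):
    X = residual_sublayer(X, attention_fn)       # sublayer 1: tokens exchange information
    X = residual_sublayer(X, feed_forward_fn)    # sublayer 2: each token processed alone
    return X

# Hypothetical stand-ins so the sketch runs end to end.
def mean_attention(X):
    avg = [sum(col) / len(X) for col in zip(*X)]
    return [avg[:] for _ in X]                   # attention with uniform weights

def relu_ffn(X):
    return [[max(0.0, v) for v in row] for row in X]

X = [[1.0, 2.0, 3.0], [0.0, -1.0, 4.0]]
out = block(X, mean_attention, relu_ffn)
```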
Every token attends to all tokens (bidirectional). Good for understanding a whole input.
A token may attend only to itself and earlier tokens, so it can generate the next one without peeking ahead.
The causal mask sets future scores to \(-\infty\) before the softmax, zeroing those weights.
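A minimal sketch of the causal mask applied to a score matrix. The scores are made up; the function name is illustrative.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only see positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, so they contribute exp(-inf) = 0.
        masked = [scores[i][j] if j <= i else -math.inf for j in range(n)]
        m = max(masked)                   # finite: the diagonal is never masked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 1.0, 5.0],
          [0.5, 0.5, 0.5]]
W = causal_softmax(scores)
# W[0] == [1.0, 0.0, 0.0]: the first token can only attend to itself.
# W[1] == [0.5, 0.5, 0.0]: the large future score 5.0 is ignored.
```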
Every token attends to every token → the score matrix \(QK^\top\) is \(n\times n\). Cost is \(O(n^2 d)\): quadratic in sequence length \(n\).
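A back-of-the-envelope count of the multiplications in \(QK^\top\) alone, as a rough sketch. The sizes are hypothetical, and mixing the values adds a term of the same order.

```python
def attention_score_mults(n, d):
    """Multiplications needed to form the n x n matrix QK^T with d-dimensional rows."""
    return n * n * d

base = attention_score_mults(1000, 64)      # hypothetical: n = 1000, d = 64
doubled = attention_score_mults(2000, 64)   # same d, twice the sequence length
ratio = doubled / base                      # 4.0: doubling n quadruples the cost
```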
All positions at once — trains fast on GPUs/TPUs.
Any token reaches any other in one hop.
Add data + parameters and it keeps improving.
The same block powers vision, audio, protein folding — and every large language model.
Lecture 22: stack decoder blocks + train at scale → Large Language Models.