Lecture 17
RN 21.7.1 · PM 10.2 · GBC 14.1
🎥 This lecture is pre-recorded — I'm away at ICML this week, so there's no in-person class today. Watch at your own pace.
📅 Due today (Tue Jul 7): Chat 8 and the CS686 project proposal.
💬 Questions? Post on Piazza — the TAs are available, and I'll follow up when I'm back.
No labels — only the raw data \(\{x_i\}\). What can we learn?
Find a low-dimensional summary of each example.
Clustering, PCA, autoencoders.
Learn the data distribution so we can sample new examples.
GANs, VAEs, diffusion.
Today: clustering & PCA, then how modern models learn representations without labels.
Group training examples into clusters, treating them like discovered categories.
Each \(x\) gets exactly one cluster.
\(\text{class}(x) = c\)
Each \(x\) gets a distribution over clusters.
\(\text{class}(x) \sim P(C \mid x)\)
Each color is a discovered cluster — no labels were given.
Inputs: number of clusters \(k\) and \(m\) training examples \(x_i \in \mathbb{R}^n\). Each cluster has a centroid \(c \in \mathbb{R}^n\); each \(x_i\) is assigned to its closest centroid.
\(d(c, x) = \sqrt{\sum_{j = 1}^{n} (c_j - x_j)^2}\)
Triangles = current centroids; circles = data points.
Alternate two steps until convergence:
Same 12 points, three snapshots: random init → mid-iteration → converged.
Centroids drift; assignments stabilize; cost decreases monotonically.
\(k = 2\), Euclidean (squared) distance. Data + initial centroids:
| example | \(x_1\) | \(x_2\) | \(x_3\) |
|---|---|---|---|
| 1 | 0.2 | 0.5 | 0 |
| 2 | −0.6 | 2.1 | 1.2 |
| 3 | −0.5 | 1.9 | 1.3 |
| 4 | 0.1 | 0.5 | −0.3 |
\(c_1 = (0.3, 0.8, -0.5), \;\; c_2 = (-0.1, -0.5, 1.0)\).
Assign step (squared distance to each centroid):
Update step:
\(c_1' = (-0.2, 1.25, 0.55), \;\; c_2' = \varnothing\)
Watch out: \(c_2\) got no examples. Re-initialize it (random restart).
Larger \(k\) always gives lower training error — but defeats the point of compression.
\(s(x) = \dfrac{b(x) - a(x)}{\max(a(x),\, b(x))}\)
High-dimensional data often lives on a low-dimensional manifold. Extra dimensions are mostly noise.
Goal: recover the intrinsic 1D coordinate along the blue ribbon.
Find orthogonal axes that successively maximize the variance of the projected data.
Fraction of variance kept by the \(k\)-th PC: \(\dfrac{\lambda_k}{\sum_j \lambda_j}\). Cumulative across top \(k\) PCs guides how many to keep.
A neural net learns to squeeze and reconstruct its input.
Single linear layer; weights shared between encoder and decoder.
\(\hat{z} = W x, \quad \hat{x} = W^\top \hat{z}\)
With squared loss, the solution recovers the same subspace as PCA.
Stack non-linear layers (ReLU, etc.) in both encoder and decoder; train end-to-end with backprop.
Captures non-linear manifolds — a building block for the learned representations below.
Labels are expensive. Self-supervised learning invents the supervision signal from the data itself, then learns a reusable representation.
CLIP: an image encoder \(f\) and a text encoder \(g\). Train so a matching (image, caption) pair has high similarity, and mismatched pairs low:
maximize \(\;f(\text{img})^\top g(\text{text})\;\) for true pairs
Beyond compressing data, we can learn its distribution \(p(x)\) and sample new examples. Four main families:
Predict the next piece given the past: \(p(x)=\prod_t p(x_t \mid x_{<t})\). Powers LLMs.
An autoencoder with a probabilistic bottleneck; sample the code, then decode.
A generator vs. a discriminator. Sharp samples, but tricky to train — now largely superseded.
Learn to denoise; today's state of the art for images & video.
We go deep on diffusion & multimodal generation in L23.
L18: back to supervised — decision trees & ensembles.