Course page PDF
CS 486/686
Unsupervised & Representation Learning

Lecture 17

RN 21.7.1 · PM 10.2 · GBC 14.1

Recorded lecture (asynchronous)

🎥 This lecture is pre-recorded — I'm away at ICML this week, so there's no in-person class today. Watch at your own pace.

📅 Due today (Tue Jul 7): Chat 8 and the CS686 project proposal.

💬 Questions? Post on Piazza — the TAs are available, and I'll follow up when I'm back.

Search Uncertainty Decisions Learning

Learning goals

  • Understand what unsupervised learning is.
  • Trace the k-means clustering algorithm.
  • Perform PCA: mean-center, covariance, eigendecomposition.
  • Explain autoencoders and self-supervised / contrastive representations.

Two big unsupervised tasks

No labels — only the raw data \(\{x_i\}\). What can we learn?

Representation learning

Find a low-dimensional summary of each example.

Clustering, PCA, autoencoders.

Generative modelling

Learn the data distribution so we can sample new examples.

GANs, VAEs, diffusion.

Today: clustering & PCA, then how modern models learn representations without labels.

Clustering: hard vs soft

Group training examples into clusters, treating them like discovered categories.

Hard clustering

Each \(x\) gets exactly one cluster.

\(\text{class}(x) = c\)

Soft clustering

Each \(x\) gets a distribution over clusters.

\(\text{class}(x) \sim P(C \mid x)\)

Each color is a discovered cluster — no labels were given.

k-Means: centroids and distance

Inputs: number of clusters \(k\) and \(m\) training examples \(x_i \in \mathbb{R}^n\). Each cluster has a centroid \(c \in \mathbb{R}^n\); each \(x_i\) is assigned to its closest centroid.

Euclidean distance (L2)

\(d(c, x) = \sqrt{\sum_{j = 1}^{n} (c_j - x_j)^2}\)

  • Hard clustering: each example goes to one centroid.
  • Centroids live in the same feature space as the data.

Triangles = current centroids; circles = data points.

The k-means algorithm

Alternate two steps until convergence:

k-means(\(X,\ k\))
  1. Initialize \(k\) centroids \(C[0\ldots k-1]\) randomly.
  2. Repeat until assignments stop changing:
    • Assign step: for each example \(i\), \(\ y[i] \leftarrow \arg\min_{c}\, d\!\bigl(C[c],\, X[i]\bigr)\).
    • Update step: for each cluster \(c\), \(\ C[c] \leftarrow\) mean of all \(X[i]\) with \(y[i] = c\).
  • Guaranteed to converge with L2 distance.
  • Not guaranteed to find the global optimum.
  • Mitigate: random restarts, feature scaling.

k-Means in action

Same 12 points, three snapshots: random init → mid-iteration → converged.

Iteration 0 (random init)
Iteration 2 (mid)
Converged

Centroids drift; assignments stabilize; cost decreases monotonically.

One iteration: a numeric walk-through

\(k = 2\), Euclidean (squared) distance. Data + initial centroids:

example\(x_1\)\(x_2\)\(x_3\)
10.20.50
2−0.62.11.2
3−0.51.91.3
40.10.5−0.3

\(c_1 = (0.3, 0.8, -0.5), \;\; c_2 = (-0.1, -0.5, 1.0)\).

Assign step (squared distance to each centroid):

ex 1: d²(c1) = 0.35, d²(c2) = 2.09 → c1
ex 2: d²(c1) = 5.39, d²(c2) = 7.05 → c1
ex 3: d²(c1) = 5.09, d²(c2) = 6.01 → c1
ex 4: d²(c1) = 0.17, d²(c2) = 2.73 → c1

Update step:

\(c_1' = (-0.2, 1.25, 0.55), \;\; c_2' = \varnothing\)

Watch out: \(c_2\) got no examples. Re-initialize it (random restart).

How do we choose \(k\)?

Larger \(k\) always gives lower training error — but defeats the point of compression.

Elbow method

  1. Run k-means for \(k \in \{1, \ldots, k_{\max}\}\).
  2. Plot average within-cluster distance.
  3. Pick the "elbow" where error stops dropping fast.
k error k* = 3

Silhouette score

\(s(x) = \dfrac{b(x) - a(x)}{\max(a(x),\, b(x))}\)

  • \(a(x)\) = mean distance to own cluster.
  • \(b(x)\) = mean distance to nearest other cluster.
  • Pick \(k\) maximizing average \(s\).
k s(x) k* = 3

Dimension reduction: the manifold idea

High-dimensional data often lives on a low-dimensional manifold. Extra dimensions are mostly noise.

x y z

Goal: recover the intrinsic 1D coordinate along the blue ribbon.

Principal Component Analysis (PCA)

The PCA recipe

Find orthogonal axes that successively maximize the variance of the projected data.

  • First PC = direction of largest variance.
  • Second PC = direction orthogonal to first that maximizes remaining variance.
  • Project onto the top \(k\) PCs for a compact representation.
PC1 PC2

How to compute PCA

PCA(\(X,\ k\))
  1. Mean-center the data: \(x \leftarrow x - \text{mean}(x)\).
  2. Compute the covariance matrix \(\Sigma = \tfrac{1}{m}\, X^\top X\).
  3. Eigendecompose \(\Sigma\): \(\Sigma\, v_k = \lambda_k\, v_k\). Eigenvectors \(v_k\) are the principal components, sorted by \(\lambda_1 \ge \lambda_2 \ge \ldots\)
  4. Project onto the top \(k\) PCs: \(z = V_{\text{top-}k}^\top\, x\).

Fraction of variance kept by the \(k\)-th PC:  \(\dfrac{\lambda_k}{\sum_j \lambda_j}\). Cumulative across top \(k\) PCs guides how many to keep.

Autoencoders: encoder + decoder

A neural net learns to squeeze and reconstruct its input.

x e(x) encoder z = e(x) code d(z) x̂ = d(e(x)) decoder Loss: E = ∑i (xi − d(e(xi)))² Encoder + decoder are trained jointly so that the bottleneck z keeps enough info to rebuild x.

Linear vs deep autoencoders

Linear autoencoder

Single linear layer; weights shared between encoder and decoder.

\(\hat{z} = W x, \quad \hat{x} = W^\top \hat{z}\)

With squared loss, the solution recovers the same subspace as PCA.

Deep neural autoencoder

Stack non-linear layers (ReLU, etc.) in both encoder and decoder; train end-to-end with backprop.

Captures non-linear manifolds — a building block for the learned representations below.

Self-supervised & contrastive learning

Labels are expensive. Self-supervised learning invents the supervision signal from the data itself, then learns a reusable representation.

image "a dog" encoders pull together

CLIP: an image encoder \(f\) and a text encoder \(g\). Train so a matching (image, caption) pair has high similarity, and mismatched pairs low:

maximize \(\;f(\text{img})^\top g(\text{text})\;\) for true pairs

  • No human labels — the pairing is the signal.
  • The learned embeddings transfer to many tasks (even zero-shot).

Generative modelling: a quick map

Beyond compressing data, we can learn its distribution \(p(x)\) and sample new examples. Four main families:

Autoregressive

Predict the next piece given the past: \(p(x)=\prod_t p(x_t \mid x_{<t})\). Powers LLMs.

VAE

An autoencoder with a probabilistic bottleneck; sample the code, then decode.

GAN

A generator vs. a discriminator. Sharp samples, but tricky to train — now largely superseded.

Diffusion

Learn to denoise; today's state of the art for images & video.

We go deep on diffusion & multimodal generation in L23.

Learning goals (recap) — Next: decision trees

  • ✓  Understand what unsupervised learning is.
  • ✓  Trace the k-means clustering algorithm.
  • ✓  Perform PCA by mean-centering, covariance, and eigendecomposition.
  • ✓  Explain autoencoders and self-supervised / contrastive representations.

L18: back to supervised — decision trees & ensembles.