Understand the basic idea of autoencoders and GANs.
Two big unsupervised tasks
No labels — only the raw data \(\{x_i\}\). What can we learn?
Representation learning
Find a low-dimensional summary of each example.
Clustering, PCA, autoencoders.
Generative modelling
Learn the data distribution so we can sample new examples.
GANs, VAEs, diffusion.
Today: one task from each, plus two neural variants.
Clustering: hard vs soft
Group training examples into clusters, treating them like discovered categories.
Hard clustering
Each \(x\) gets exactly one cluster.
\(\text{class}(x) = c\)
Soft clustering
Each \(x\) gets a distribution over clusters.
\(\text{class}(x) \sim P(C \mid x)\)
Each color is a discovered cluster — no labels were given.
k-Means: centroids and distance
Inputs: number of clusters \(k\) and \(m\) training examples \(x_i \in \mathbb{R}^n\). Each cluster has a centroid \(c \in \mathbb{R}^n\); each \(x_i\) is assigned to its closest centroid.
\(b(x)\) = mean distance to nearest other cluster.
Pick \(k\) maximizing average \(s\).
Dimension reduction: the manifold idea
High-dimensional data often lives on a low-dimensional manifold. Extra dimensions are mostly noise.
Goal: recover the intrinsic 1D coordinate along the blue ribbon.
Principal Component Analysis (PCA)
The PCA recipe
Find orthogonal axes that successively maximize the variance of the projected data.
First PC = direction of largest variance.
Second PC = direction orthogonal to first that maximizes remaining variance.
Project onto the top \(k\) PCs for a compact representation.
How to compute PCA
PCA(\(X,\ k\))
Mean-center the data: \(x \leftarrow x - \text{mean}(x)\).
Compute the covariance matrix \(\Sigma = \tfrac{1}{m}\, X^\top X\).
Eigendecompose \(\Sigma\): \(\Sigma\, v_k = \lambda_k\, v_k\). Eigenvectors \(v_k\) are the principal components, sorted by \(\lambda_1 \ge \lambda_2 \ge \ldots\)
Project onto the top \(k\) PCs: \(z = V_{\text{top-}k}^\top\, x\).
Fraction of variance kept by the \(k\)-th PC: \(\dfrac{\lambda_k}{\sum_j \lambda_j}\). Cumulative across top \(k\) PCs guides how many to keep.
Autoencoders: encoder + decoder
A neural net learns to squeeze and reconstruct its input.
Linear vs deep autoencoders
Linear autoencoder
Single linear layer; weights shared between encoder and decoder.
\(\hat{z} = W x, \quad \hat{x} = W^\top \hat{z}\)
With squared loss, the solution recovers the same subspace as PCA.
Deep neural autoencoder
Stack non-linear layers (ReLU, etc.) in both encoder and decoder; train end-to-end with backprop.
Captures non-linear manifolds — the modern workhorse for representation learning.
Generative Adversarial Networks (GANs)
Two networks play a game: a generator tries to fool a discriminator.
Generator \(g(z)\): turns Gaussian noise \(z\) into a synthetic example that should look real.
Discriminator \(d(x)\): binary classifier — was \(x\) drawn from training data, or from \(g\)?
GAN training: a minimax game
Both networks optimize the same value, in opposite directions: