Lecture 23
VAEs · diffusion · text-to-image · CLIP
Autoregressive: next-token factorization \(p(x)=\prod_t p(x_t\mid x_{<t})\). Powers LLMs (L22); also images/audio.
GANs: generator vs. discriminator. Sharp samples but unstable training; now largely legacy.
VAEs: probabilistic autoencoder; sample a latent code, decode it.
Diffusion: learn to denoise. Today's state of the art for images & video.
L22 did autoregressive text. Today: VAEs → diffusion → multimodal.
An autoencoder (L17) with a probabilistic bottleneck: the encoder outputs a distribution \(q_\phi(z\mid x)\); the decoder \(p_\theta(x\mid z)\) reconstructs. Train by maximizing the ELBO:
\(\log p(x) \;\ge\; \underbrace{\mathbb{E}_{q_\phi(z\mid x)}\!\big[\log p_\theta(x\mid z)\big]}_{\text{reconstruction}} \;-\; \underbrace{\mathrm{KL}\!\big(q_\phi(z\mid x)\,\|\,p(z)\big)}_{\text{stay near prior } \mathcal{N}(0,I)}\)
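A minimal sketch of a one-sample ELBO estimate, to make the two terms concrete. It assumes a diagonal-Gaussian encoder (outputs `mu` and `log_var`), a unit-variance Gaussian decoder (so the reconstruction term is a squared error, up to a constant), and the reparameterized draw \(z = \mu + \sigma\varepsilon\). None of these choices are fixed by the ELBO itself, and `decode` is a hypothetical stand-in for the decoder network.

```python
import math
import random

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), in closed form."""
    return 0.5 * sum(m * m + math.exp(lv) - 1.0 - lv
                     for m, lv in zip(mu, log_var))

def sample_latent(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def elbo_estimate(x, mu, log_var, decode, rng):
    """One-sample ELBO estimate (decoder's additive constant dropped)."""
    z = sample_latent(mu, log_var, rng)
    x_hat = decode(z)  # hypothetical decoder: latent code -> reconstruction
    log_lik = -0.5 * sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return log_lik - kl_to_standard_normal(mu, log_var)

# Toy usage: an identity "decoder" on a 2-D input.
rng = random.Random(0)
value = elbo_estimate([0.5, -0.5], [0.5, -0.5], [-2.0, -2.0], lambda z: z, rng)
```

Training would maximize this quantity (equivalently, minimize its negative) with respect to both encoder and decoder parameters.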
Slowly add Gaussian noise until an image becomes pure static (forward). Then train a network to undo one step of noise (reverse).
Add a little Gaussian noise at each of \(T\) steps, with a schedule \(\beta_1,\ldots,\beta_T\):
\(q(x_t \mid x_{t-1}) = \mathcal{N}\!\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t I\big)\)
Conveniently, we can jump to any step in closed form (let \(\bar\alpha_t = \prod_{s\le t}(1-\beta_s)\)):
\(x_t = \sqrt{\bar\alpha_t}\,x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon, \qquad \varepsilon\sim\mathcal{N}(0,I)\)
No learning here: the forward process is fixed. For a schedule with \(\bar\alpha_T \approx 0\), \(x_T\) is approximately pure noise, \(\mathcal{N}(0,I)\).
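The closed-form jump can be sketched in a few lines. This is an illustration under stated assumptions: the linear schedule from \(10^{-4}\) to \(0.02\) over \(T = 1000\) steps is one common choice, not something fixed by the equations above, and the function names are made up for this example.

```python
import math
import random

def linear_betas(T, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced schedule beta_1..beta_T (an assumed, common choice)."""
    if T == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (T - 1)
    return [beta_start + i * step for i in range(T)]

def alpha_bars(betas):
    """Cumulative products abar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for b in betas:
        running *= 1.0 - b
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Jump straight to step t (1-indexed). Returns (x_t, eps)."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return x_t, eps

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x_early, _ = q_sample([1.0, -1.0, 0.5], 1, abar, rng)     # still close to x0
x_late, _ = q_sample([1.0, -1.0, 0.5], 1000, abar, rng)   # almost no signal left
```

With this schedule \(\bar\alpha_T\) is far below \(10^{-3}\), so the signal coefficient \(\sqrt{\bar\alpha_T}\) is small and \(x_T\) is dominated by the noise term.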
Train a network \(\varepsilon_\theta(x_t, t)\) to predict the noise that was added. Since \(x_t = \sqrt{\bar\alpha_t}x_0 + \sqrt{1-\bar\alpha_t}\,\varepsilon\), knowing \(\varepsilon\) lets us step back toward \(x_0\).
\(\mathcal{L} = \mathbb{E}_{x_0,\,\varepsilon,\,t}\Big[\ \big\lVert \varepsilon - \varepsilon_\theta(x_t, t)\big\rVert^2\ \Big]\)
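One Monte Carlo sample of this loss might look as follows. It is a sketch, not a training loop: `eps_model` is a hypothetical callable standing in for the network, `abar` is a precomputed list of \(\bar\alpha_t\) values, and drawing \(t\) uniformly from \(1,\ldots,T\) is an assumption (a standard choice, but the expectation above does not specify the distribution over \(t\)).

```python
import math
import random

def training_loss_sample(eps_model, x0, abar, rng):
    """One Monte Carlo sample of || eps - eps_model(x_t, t) ||^2.

    abar[t - 1] holds abar_t for t = 1..T; t is drawn uniformly (assumed).
    """
    t = rng.randint(1, len(abar))  # random timestep, both bounds inclusive
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    # Noise the clean sample in closed form, then ask the model for the noise.
    x_t = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    eps_hat = eps_model(x_t, t)
    return sum((e - p) ** 2 for e, p in zip(eps, eps_hat))

def zero_model(x_t, t):
    """Hypothetical untrained baseline: always predicts zero noise."""
    return [0.0 for _ in x_t]

rng = random.Random(0)
abar = [0.9, 0.5, 0.1]  # toy three-step schedule, for illustration only
loss = training_loss_sample(zero_model, [1.0, -2.0], abar, rng)
```

In practice the sample is averaged over a minibatch of \(x_0\) and the gradient is taken with respect to the network's parameters.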
Sampling starts from pure noise and denoises step by step. DDPM takes many steps (slow, high quality); DDIM cuts this to tens of steps, and distillation to a handful.
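To see why fewer steps are possible, here is a sketch of the deterministic DDIM update (the \(\eta = 0\) case). Given a noise estimate, it first solves the closed-form relation for an estimate of \(x_0\), then re-noises that estimate to any earlier noise level, so consecutive steps need not be adjacent. The function names are illustrative, and `eps_hat` is assumed to come from a trained \(\varepsilon_\theta\).

```python
import math

def predict_x0(x_t, eps_hat, abar_t):
    """Invert x_t = sqrt(abar_t) x0 + sqrt(1 - abar_t) eps for x0."""
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
            for x, e in zip(x_t, eps_hat)]

def ddim_step(x_t, eps_hat, abar_t, abar_prev):
    """Deterministic DDIM update from noise level abar_t to abar_prev."""
    x0_hat = predict_x0(x_t, eps_hat, abar_t)
    return [math.sqrt(abar_prev) * x0 + math.sqrt(1.0 - abar_prev) * e
            for x0, e in zip(x0_hat, eps_hat)]

# Toy check with a perfect noise estimate: jumping to abar_prev = 1.0
# (no noise left) gives back the clean sample.
x0, eps, abar_t = [1.0, -2.0], [0.3, 0.5], 0.25
x_t = [math.sqrt(abar_t) * a + math.sqrt(1.0 - abar_t) * e for a, e in zip(x0, eps)]
recovered = ddim_step(x_t, eps, abar_t, 1.0)
```

A real sampler would call the network again at each intermediate level, since its noise estimate is only approximate.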
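Condition the denoiser on a prompt \(c\) (e.g. text): \(\varepsilon_\theta(x_t, t, c)\). To make it follow the prompt more strongly, use classifier-free guidance: start from the unconditional prediction (empty prompt \(\varnothing\)) and move toward the conditional one, scaled by a guidance weight \(w\):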
\(\tilde\varepsilon = \varepsilon_\theta(x_t,t,\varnothing) + w\,\big(\varepsilon_\theta(x_t,t,c) - \varepsilon_\theta(x_t,t,\varnothing)\big)\)
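The guidance formula is a per-coordinate linear extrapolation, sketched below. The two inputs are assumed to be the outputs of the same network run with and without the prompt; the function name is illustrative.

```python
def classifier_free_guidance(eps_uncond, eps_cond, w):
    """Combine unconditional and conditional noise predictions with weight w."""
    return [u + w * (c - u) for u, c in zip(eps_uncond, eps_cond)]

eps_uncond = [0.0, 1.0]
eps_cond = [1.0, 1.0]
plain = classifier_free_guidance(eps_uncond, eps_cond, 1.0)   # equals eps_cond
strong = classifier_free_guidance(eps_uncond, eps_cond, 3.0)  # pushed past eps_cond
```

Setting \(w = 0\) ignores the prompt, \(w = 1\) recovers the ordinary conditional prediction, and \(w > 1\) extrapolates beyond it. Each sampling step costs two network evaluations.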
Pixel space is very high-dimensional, so run diffusion in a compressed latent space (a VAE encodes/decodes), conditioned on a text embedding. This is the latent diffusion recipe behind Stable Diffusion.
Text conditions the denoiser via cross-attention (L21). Latent space makes it fast enough for one GPU.
To connect text and images, CLIP trains an image encoder \(f\) and a text encoder \(g\) so that matching pairs land close together (contrastive objective, L17):
maximize \(f(\text{image})^\top g(\text{text})\) for true pairs, minimize it for mismatched pairs
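One way to turn this into a loss is a symmetric cross-entropy over a batch, sketched here. The inputs are assumed to be precomputed outputs of \(f\) and \(g\), with row \(i\) of each list forming a true pair. L2-normalizing the embeddings and dividing by a temperature (0.07 here) follow common CLIP-style practice and are assumptions beyond the objective stated above.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where row i's correct class is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a stable log-sum-exp
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_z - row[i]
    return total / len(logits)

def clip_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; pair i is (image_embs[i], text_embs[i])."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    # logits[i][j] = similarity of image i and text j, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))

aligned = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
swapped = clip_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

The loss is near zero when each image is most similar to its own caption and large when the pairs are swapped, which is the maximize/minimize behavior described above.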
Push generation further: instead of one image, generate the next screen of an operating system given your mouse and keyboard input — a generative model of an entire UI.
L24: tie the whole course together — recap & exam prep.