Lecture 16
RN 19.1–19.2, 19.6 · PM 7.1–7.3
🎥 I'm away at ICML next week, so L17 (Tue Jul 7) and L18 (Thu Jul 9) are pre-recorded videos — no in-person class those days. Watch on your own time (links posted before each class).
📅 Deadlines are unchanged: Chat 8 + the CS686 project proposal are due Tue Jul 7; Chat 9 is out and Assignment 2 is due Thu Jul 9.
💬 Questions? Post on Piazza — the TAs are available all week, and I'll follow up when I'm back.
Learning = improving on future tasks based on experience (data).
Supervised learning: given inputs + targets, predict targets for new inputs. Today's focus.
Unsupervised learning: no targets, so find structure (clustering, representations). L17.
Reinforcement learning: learn from rewards what to do. L15 (done).
Q1. We have a user's credit-card transactions and want to flag any that look different from the rest. We have no fraud labels.
Q2. We have historical weather labels (sunny/cloudy/rain/snow) for a given date, and we want to predict the label for the same date next year.
Classification: target is discrete, e.g. dog vs cat.
Regression: target is continuous, e.g. tomorrow's temperature.
Q3. Predict next year's weather label (sunny/rain/...) for a date.
Q4. Predict next month's price of a house.
1. Model. Pick a function \(h_{\mathbf{w}}\) with tunable parameters \(\mathbf{w}\) (weights).
2. Loss. Measure how wrong \(h_{\mathbf{w}}\) is on the \(m\) training examples, using a per-example loss \(\ell\): \(L(\mathbf{w}) = \tfrac{1}{m}\sum_{i=1}^{m} \ell\bigl(h_{\mathbf{w}}(x_i),\, y_i\bigr)\).
3. Fit. Choose \(\mathbf{w}\) that makes \(L(\mathbf{w})\) small — usually by gradient descent.
Change the model and the loss → you get every method in this module (regression → neural nets → transformers).
One real input \(x\), one real output:
\(h_{\mathbf{w}}(x) = w\,x + b\)
With \(n\) features \(\mathbf{x}=(x_1,\dots,x_n)\):
\(h_{\mathbf{w}}(\mathbf{x}) = \mathbf{w}^\top \mathbf{x} + b\)
Parameters to learn: the slope(s) \(\mathbf{w}\) and intercept \(b\).
Loss = mean squared error between prediction and target:
\(L(\mathbf{w}, b) = \dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\bigl(h_{\mathbf{w}}(x_i) - y_i\bigr)^2\)
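\(L\) is convex, so set \(\nabla L = 0\) and solve. With design matrix \(X\) (one row per example, plus a column of ones so that \(b\) is absorbed into \(\mathbf{w}\)) and \(X^\top X\) invertible: \(\mathbf{w}^\star = (X^\top X)^{-1} X^\top \mathbf{y}\).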
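A minimal NumPy sketch of the closed-form solution, under stated assumptions: the toy data (`x`, `y`) are made up for illustration and are exactly linear, and the intercept is handled by appending a column of ones to the design matrix. Solving the linear system is used here in place of forming the inverse explicitly, which is the usual numerical preference.

```python
import numpy as np

# Hypothetical toy data: y = 2x + 1 exactly, so the fit should recover
# slope 2 and intercept 1.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

# Design matrix: one column for the feature, one column of ones so that
# the intercept b is learned as the last entry of w.
X = np.column_stack([x, np.ones_like(x)])

# Normal equations: solve (X^T X) w = X^T y instead of inverting X^T X.
w_star = np.linalg.solve(X.T @ X, X.T @ y)
slope, intercept = w_star
```

On noisy data the same two lines return the least-squares fit rather than an exact recovery.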
Works for any differentiable model. Repeat, with learning rate (step size) \(\eta > 0\):
\(\mathbf{w} \leftarrow \mathbf{w} - \eta\, \nabla_{\mathbf{w}} L\)
Gradient descent is the workhorse for the rest of the module — the models just get bigger.
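The update rule can be sketched for the linear model with mean squared error. This is an illustrative sketch, not a tuned implementation: the function name `mse_gradient_descent`, the step size `eta=0.05`, the step count, and the toy data are all assumptions chosen so that the loop converges on this small problem.

```python
import numpy as np

def mse_gradient_descent(X, y, eta=0.05, steps=2000):
    """Minimise the mean squared error of X @ w against y by gradient descent."""
    m, n = X.shape
    w = np.zeros(n)                        # start from all-zero weights
    for _ in range(steps):
        residual = X @ w - y               # prediction minus target, per example
        grad = (2.0 / m) * (X.T @ residual)  # gradient of (1/m) * sum(residual^2)
        w = w - eta * grad                 # w <- w - eta * grad
    return w

# Hypothetical toy data: y = 2x + 1, with a ones column for the intercept.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0
X = np.column_stack([x, np.ones_like(x)])

w_gd = mse_gradient_descent(X, y)
```

If `eta` is set too large the iterates diverge; if too small, many more steps are needed. Only the `grad` line changes when the model or loss changes.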
Squash the linear score into a probability with the sigmoid:
\(\hat{y} = \sigma(z) = \dfrac{1}{1 + e^{-z}}, \; z = \mathbf{w}^\top\mathbf{x}+b\)
Squared error is a poor fit for probabilities. Instead, maximize the probability of the true labels — equivalently, minimize binary cross-entropy:
\(L(\mathbf{w}) = -\dfrac{1}{m}\displaystyle\sum_{i=1}^{m}\Bigl[\, y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i)\,\Bigr]\)
True label \(y=1\): loss is \(-\log \hat{y}\) — small when \(\hat{y}\to 1\), huge when \(\hat{y}\to 0\).
Confident and wrong is punished the hardest — exactly the behaviour we want.
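To make the penalty concrete, here is a small sketch that evaluates binary cross-entropy for a single example with true label 1 at three scores. The helper names and the clipping constant `eps` are assumptions for illustration; clipping only guards against `log(0)`.

```python
import numpy as np

def sigmoid(z):
    """Squash a real-valued score into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_hat, eps=1e-12):
    """Mean binary cross-entropy; y_hat is clipped to avoid log(0)."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)
    return -np.mean(y_true * np.log(y_hat) + (1.0 - y_true) * np.log(1.0 - y_hat))

y_true = np.array([1.0])

# Same true label, three different scores z.
loss_confident_right = binary_cross_entropy(y_true, sigmoid(np.array([4.0])))
loss_unsure = binary_cross_entropy(y_true, sigmoid(np.array([0.0])))
loss_confident_wrong = binary_cross_entropy(y_true, sigmoid(np.array([-4.0])))
```

The three losses come out at roughly 0.018, 0.693 (that is, \(\log 2\)), and 4.018: the confident wrong prediction costs far more than the unsure one.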
Output one score \(z_k\) per class; softmax turns scores into a probability distribution:
\(P(y{=}k \mid \mathbf{x}) = \dfrac{e^{z_k}}{\sum_j e^{z_j}}, \qquad L = -\log P(y{=}\text{true class})\)
This softmax + cross-entropy is exactly the output layer we'll reuse for neural nets (L20).
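A short sketch of softmax and the cross-entropy loss for one example. The scores are made-up values for three hypothetical classes. Subtracting the maximum score before exponentiating is a standard stability trick added here; it does not change the resulting probabilities.

```python
import numpy as np

def softmax(z):
    """Turn a vector of class scores into a probability distribution."""
    shifted = z - np.max(z)       # avoids overflow in exp; result is unchanged
    exp_z = np.exp(shifted)
    return exp_z / np.sum(exp_z)

def cross_entropy(z, true_class):
    """Negative log-probability that softmax assigns to the true class."""
    return -np.log(softmax(z)[true_class])

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores z_k for 3 classes
probs = softmax(scores)
loss_if_class0 = cross_entropy(scores, true_class=0)
loss_if_class2 = cross_entropy(scores, true_class=2)
```

The loss is small when the true class has the highest score (class 0 here) and larger when the true class is one the model scored low (class 2).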
Same 9 points, increasingly flexible models. The wiggliest one fits training data best — but would you trust it on a new \(x\)?
Even with infinite data, how far off is the model family? High bias = too simple, underfits.
How much does the fit swing with a different training set? High variance = too flexible, overfits.
Dots = predictions from re-trained models. Bullseye = truth.
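The "re-trained models" picture can be simulated. The sketch below is an illustration under assumptions that are not from the lecture: the true function is \(\sin(3x)\), the training inputs are a fixed grid of 12 points, only the label noise is re-drawn for each training set, and the models are polynomials of degree 1 and 7. It estimates squared bias and variance of the prediction at one query point `x0`.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_fn(x):
    """Assumed ground-truth function (the bullseye)."""
    return np.sin(3.0 * x)

x_train = np.linspace(-1.0, 1.0, 12)   # fixed training inputs
x0 = 0.5                               # query point where we compare predictions
noise_std = 0.2
n_trials = 500                         # number of re-drawn training sets

def predictions_at_x0(degree):
    """Re-train a polynomial of the given degree on fresh noisy labels each trial."""
    preds = np.empty(n_trials)
    for t in range(n_trials):
        y_train = true_fn(x_train) + rng.normal(scale=noise_std, size=x_train.size)
        coeffs = np.polyfit(x_train, y_train, deg=degree)
        preds[t] = np.polyval(coeffs, x0)
    return preds

results = {}
for degree in (1, 7):
    preds = predictions_at_x0(degree)
    bias_sq = float((preds.mean() - true_fn(x0)) ** 2)  # how far the average dot is from truth
    variance = float(preds.var())                       # how scattered the dots are
    results[degree] = (bias_sq, variance)
```

In this setup the straight line has the larger squared bias and the degree-7 polynomial has the larger variance, matching the two failure modes described above.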
How to pick the right complexity? Hold out a slice of the training data as a validation set: a stand-in for the test set that the model is never fit on.
Training error keeps dropping; validation error turns back up — that gap is overfitting.
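A sketch of choosing complexity with a held-out validation set, using polynomial degree as the complexity knob. The data-generating function, noise level, 20/10 split, and degree range are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: noisy samples of sin(3x) on [-1, 1].
x = rng.uniform(-1.0, 1.0, size=30)
y = np.sin(3.0 * x) + rng.normal(scale=0.2, size=30)

# Hold out the last 10 points as a validation set; fit only on the first 20.
x_train, y_train = x[:20], y[:20]
x_val, y_val = x[20:], y[20:]

def mse(coeffs, xs, ys):
    """Mean squared error of the polynomial with these coefficients."""
    return float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))

degrees = list(range(1, 9))
train_mse, val_mse = [], []
for d in degrees:
    coeffs = np.polyfit(x_train, y_train, deg=d)   # fit on training data only
    train_mse.append(mse(coeffs, x_train, y_train))
    val_mse.append(mse(coeffs, x_val, y_val))

# Pick the degree with the lowest validation error, not the lowest training error.
best_degree = degrees[int(np.argmin(val_mse))]
```

Because the polynomial families are nested, training error cannot increase with degree, so it is useless for model selection; the validation error is the quantity to compare across degrees.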
Add a penalty on large weights to the loss:
\(L(\mathbf{w}) + \lambda \lVert \mathbf{w}\rVert^2\)
\(\lambda \ge 0\) trades bias against variance: a larger \(\lambda\) shrinks the weights, which adds bias and reduces variance. Tune it with cross-validation.
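For the linear model with mean squared error, the penalised loss still has a closed-form minimiser. Setting the gradient of \(\tfrac{1}{m}\lVert X\mathbf{w} - \mathbf{y}\rVert^2 + \lambda\lVert\mathbf{w}\rVert^2\) to zero gives \((X^\top X + m\lambda I)\,\mathbf{w} = X^\top \mathbf{y}\). The sketch below assumes that setup with no intercept term; the function name `ridge_fit`, the data, and the \(\lambda\) values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise (1/m)*||Xw - y||^2 + lam*||w||^2 for a linear model without intercept."""
    m, n = X.shape
    return np.linalg.solve(X.T @ X + m * lam * np.eye(n), X.T @ y)

rng = np.random.default_rng(0)

# Hypothetical data: 50 examples, 3 features, known weights plus small noise.
X = rng.normal(size=(50, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

lambdas = [0.0, 0.1, 1.0, 10.0]
norms = [float(np.linalg.norm(ridge_fit(X, y, lam))) for lam in lambdas]
```

With `lam=0.0` this reduces to ordinary least squares, and the weight norm shrinks as `lam` grows. In practice each candidate `lam` would be scored on held-out data rather than judged by the norm.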
The classical U-curve isn't the whole story.
Today, labels guided every hypothesis. Without labels, what structure can we still find in data?
L17: clustering (k-means), PCA, autoencoders, and learned representations.