Lecture 16
RN 19.1–19.2 · PM 7.1–7.2
Learning = the ability to improve on future tasks based on experience.
Every learning system is defined by four ingredients:
The behaviour we want to improve (classify, predict, decide).
Experience used to improve (examples, demonstrations, rewards).
What we assume up front (model class, smoothness, priors).
How we score progress (accuracy, speed, new abilities).
Given input + target features, predict targets for new inputs.
Today's focus.
No targets — find structure: clustering, dimensionality reduction.
L17.
Learn from rewards and punishments — what to do.
L15.
Q1. We have a user's credit-card transactions and want to flag any that look different from the rest. We have no fraud labels.
Q2. We have historical weather labels (sunny/cloudy/rain/snow) for a date, and we want to predict the same date next year.
Target features are discrete — e.g. dog vs cat.
Target features are continuous — e.g. tomorrow's temperature.
Q3. Historical weather labels (sunny / cloudy / rain / snow); predict next year's label for a given date.
Q4. Historical house prices over time; predict next month's price for a particular house.
Given training examples \((x_i, f(x_i))\), find a hypothesis \(h\) that approximates the unknown true function \(f: X \to Y\).
A hypothesis space is the set of functions we'll consider (e.g. all polynomials of degree \(\leq k\)).
Learning = a search problem in the hypothesis space — usually some form of local search (gradient descent, etc.).
Same 9 data points, five candidate polynomial models. Watch how the fit changes with complexity.
To learn something useful, we must make assumptions. Those assumptions are the inductive bias of the learner.
With infinite data, how well can the family fit \(f\)? High bias = too simple, underfits.
How much does the learned hypothesis change with different training sets? High variance = too flexible, overfits.
Dots = predictions from re-trained models. Bullseye = truth.
How to pick a hypothesis that generalizes? Use a slice of training data as a surrogate test set.
After CV: pick the best hyperparameters and (optionally) retrain on all data.
Q5. Which is more likely to overfit?
Setup: domain \(X\), label space \(Y\), unknown true function \(f: X \to Y\). Examples drawn from distribution \(D\) over \(X\).
\(L_{D, f}(h) = \Pr_{x \sim D}\bigl[h(x) \neq f(x)\bigr]\)
The probability of being wrong on a fresh random example. We can't compute this — we don't know \(D\) or \(f\).
\(L_S(h) = \dfrac{|\{i \in [m]: h(x_i) \neq y_i\}|}{m}\)
Fraction misclassified on the training set \(S\) of size \(m\).
Hope: minimizing \(L_S\) will also keep \(L_{D, f}\) small — generalization.
Given a hypothesis class \(\mathcal{H}\) and a loss \(\ell\), pick the hypothesis with smallest training loss:
\(\hat{h} = \arg\min_{h \in \mathcal{H}} \dfrac{1}{m} \sum_{i = 1}^{m} \ell\bigl(h(x_i),\, y_i\bigr)\)
Three factors determine how well ERM generalizes:
Larger is better: more examples \(\Rightarrow\) closer empirical \(\to\) true.
Trade-off: small \(\mathcal{H}\) = high bias; large \(\mathcal{H}\) = high variance.
Encodes what kind of error you care about (0/1, squared, hinge, ...).
Today: labels guided every hypothesis we picked. Without labels, what structure can we still find in data?
L17: clustering, k-means, autoencoders.