Lecture 18
RN 19.3, 19.8 · PM 7.3.1, 7.5
🎥 This lecture is pre-recorded — I'm away at ICML this week, so there's no in-person class today. Watch at your own pace.
📅 Due today (Thu Jul 9): Assignment 2 (and Chat 9 is now out).
💬 Questions? Post on Piazza — the TAs are available, and I'll follow up when I'm back.
A decision tree is nothing more than a nested if-then-else — learned automatically from data.
Simple to read, fast to evaluate, and surprisingly effective in practice.
Jeeves wants to predict whether Bertie will play tennis today. He has been recording the weather and Bertie's behavior every morning for 14 days.
14-day training set + 14-day held-out test set. Goal: build a classifier from training, evaluate on test.
| Day | Outlook | Temp | Humidity | Wind | Tennis? |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |
9 Yes, 5 No. We'll learn a tree from these 14 rows.
Sketch only — we'll build the real tree shortly.
Given a new example, follow the tests from the root to a leaf:
Example: Outlook = Overcast, Temp = Hot, Humidity = High, Wind = Weak.
Test Outlook \(\to\) value is Overcast \(\to\) leaf Yes. Done.
Observation: the path itself is a nested if-then-else program. Converting a tree to code is straightforward.
Pick a feature-testing order; recursively split the training examples by that feature's value.
Each example flows down the tree according to its feature values; leaves end up with subsets of the training set.
Built from the 14 training rows by following the feature order above.
Every example at this node has the same label.
Return that label.
We've tested every feature, but examples still disagree (noisy data).
Return the majority class at this node.
A feature value never appeared in training data — no examples reach this leaf.
Return the majority class at the parent node.
"No features left" handles noisy labels; "no examples left" handles unseen feature combinations.
The only open question: which feature do we pick to test?
Pick the feature whose split makes the children most class-pure.
Overcast cleanly predicts Yes — one branch is pure!
Every Temp value still mixes Yes/No — messier.
We need a metric to score "purity". Enter entropy.
\(I\bigl(P(c_1), \ldots, P(c_k)\bigr) = -\sum_{i=1}^{k} P(c_i)\, \log_2 P(c_i)\)
Bits of uncertainty. Bigger = less certain.
If we split on a feature with \(k\) values, weighted-average the children's entropies:
\(H_{\text{before}} = I\!\left(\tfrac{p}{p+n},\, \tfrac{n}{p+n}\right)\)
\(H_{\text{after}} = \sum_{i=1}^{k} \dfrac{p_i + n_i}{p + n}\; I\!\left(\tfrac{p_i}{p_i+n_i},\, \tfrac{n_i}{p_i+n_i}\right)\)
\(IG = H_{\text{before}} - H_{\text{after}}\)
Pick the feature with the largest information gain at every node.
Training set has 9 Yes, 5 No \((p = 9,\ n = 5)\).
Q3. \(H_{\text{before}} = I(9/14, 5/14) = ?\)
\(\Rightarrow\) 0.940 bits.
Q4. Split on Outlook: Sunny (2+/3−), Overcast (4+/0−), Rain (3+/2−).
\(H_{\text{after}} = \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) = 0.694\).
\(\Rightarrow IG(\text{Outlook}) = 0.940 - 0.694 = \mathbf{0.247}\).
Same \(H_{\text{before}} = 0.940\). Now try splitting on Humidity instead.
Q5. Split on Humidity: High (3+/4−), Normal (6+/1−).
\(H_{\text{after}} = \tfrac{7}{14}(0.985) + \tfrac{7}{14}(0.591) = 0.788\).
\(\Rightarrow IG(\text{Humidity}) = 0.940 - 0.788 = \mathbf{0.151}\).
Q6. Which feature do we pick as the root?
Outlook — higher IG: \(0.247 > 0.151\).
Greedy: pick the locally best feature at each step. Not always globally optimal — but fast and effective.
Information gain uses entropy. A popular alternative measures how often a random example would be mislabeled if labeled by the node's class distribution:
\(G = 1 - \displaystyle\sum_{i=1}^{k} P(c_i)^2\)
\(0\) when pure; maximal when classes are balanced — same spirit as entropy, cheaper to compute (no logs).
The tree learner behind scikit-learn: binary splits, Gini (or entropy) for classification, squared error for regression.
Entropy vs. Gini rarely changes the tree much — both reward pure children.
Grown to purity, a tree memorizes the training set (one leaf per noisy example) and generalizes poorly — exactly the overfitting from L16.
Stop growing early: cap max depth, require a minimum #examples per node, or a minimum gain to split.
Grow the full tree, then collapse branches that don't improve validation accuracy.
Pick the depth / pruning strength with cross-validation (L16).
One deep tree is high-variance. Bagging averages many trees, each trained on a bootstrap sample; a random forest also splits on a random feature subset to decorrelate them.
Instead of averaging independent trees, add trees in sequence, each one correcting the residual errors of the trees so far:
\(F_m(x) = F_{m-1}(x) + \gamma\, h_m(x)\)
Forests reduce variance (parallel); boosting reduces bias (sequential).
L19: from hand-engineered trees to neural networks — the building block of deep learning.