Lecture 18
RN 19.3 · PM 7.3.1
A decision tree is nothing more than a nested if-then-else — learned automatically from data.
Simple to read, fast to evaluate, and surprisingly effective in practice.
Jeeves wants to predict whether Bertie will play tennis today. He has been recording the weather and Bertie's behavior every morning for 14 days.
14-day training set + 14-day held-out test set. Goal: build a classifier from training, evaluate on test.
| Day | Outlook | Temp | Humidity | Wind | Tennis? |
|---|---|---|---|---|---|
| 1 | Sunny | Hot | High | Weak | No |
| 2 | Sunny | Hot | High | Strong | No |
| 3 | Overcast | Hot | High | Weak | Yes |
| 4 | Rain | Mild | High | Weak | Yes |
| 5 | Rain | Cool | Normal | Weak | Yes |
| 6 | Rain | Cool | Normal | Strong | No |
| 7 | Overcast | Cool | Normal | Strong | Yes |
| 8 | Sunny | Mild | High | Weak | No |
| 9 | Sunny | Cool | Normal | Weak | Yes |
| 10 | Rain | Mild | Normal | Weak | Yes |
| 11 | Sunny | Mild | Normal | Strong | Yes |
| 12 | Overcast | Mild | High | Strong | Yes |
| 13 | Overcast | Hot | Normal | Weak | Yes |
| 14 | Rain | Mild | High | Strong | No |
9 Yes, 5 No. We'll learn a tree from these 14 rows.
Sketch only — we'll build the real tree shortly.
Given a new example, follow the tests from the root to a leaf:
Example: Outlook = Overcast, Temp = Hot, Humidity = High, Wind = Weak.
Test Outlook \(\to\) value is Overcast \(\to\) leaf Yes. Done.
Observation: the path itself is a nested if-then-else program. Converting a tree to code is straightforward.
Pick a feature-testing order; recursively split the training examples by that feature's value.
Each example flows down the tree according to its feature values; leaves end up with subsets of the training set.
Built from the 14 training rows by following the feature order above.
Every example at this node has the same label.
Return that label.
We've tested every feature, but examples still disagree (noisy data).
Return the majority class at this node.
A feature value never appeared in training data — no examples reach this leaf.
Return the majority class at the parent node.
"No features left" handles noisy labels; "no examples left" handles unseen feature combinations.
The only open question: which feature do we pick to test?
Pick the feature whose split makes the children most class-pure.
Overcast cleanly predicts Yes — one branch is pure!
Every Temp value still mixes Yes/No — messier.
We need a metric to score "purity". Enter entropy.
\(I\bigl(P(c_1), \ldots, P(c_k)\bigr) = -\sum_{i=1}^{k} P(c_i)\, \log_2 P(c_i)\)
Bits of uncertainty. Bigger = less certain.
If we split on a feature with \(k\) values, weighted-average the children's entropies:
\(H_{\text{before}} = I\!\left(\tfrac{p}{p+n},\, \tfrac{n}{p+n}\right)\)
\(H_{\text{after}} = \sum_{i=1}^{k} \dfrac{p_i + n_i}{p + n}\; I\!\left(\tfrac{p_i}{p_i+n_i},\, \tfrac{n_i}{p_i+n_i}\right)\)
\(IG = H_{\text{before}} - H_{\text{after}}\)
Pick the feature with the largest information gain at every node.
Training set has 9 Yes, 5 No \((p = 9,\ n = 5)\).
Q3. \(H_{\text{before}} = I(9/14, 5/14) = ?\)
\(\Rightarrow\) 0.940 bits.
Q4. Split on Outlook: Sunny (2+/3−), Overcast (4+/0−), Rain (3+/2−).
\(H_{\text{after}} = \tfrac{5}{14}(0.971) + \tfrac{4}{14}(0) + \tfrac{5}{14}(0.971) = 0.694\).
\(\Rightarrow IG(\text{Outlook}) = 0.940 - 0.694 = \mathbf{0.247}\).
Same \(H_{\text{before}} = 0.940\). Now try splitting on Humidity instead.
Q5. Split on Humidity: High (3+/4−), Normal (6+/1−).
\(H_{\text{after}} = \tfrac{7}{14}(0.985) + \tfrac{7}{14}(0.591) = 0.788\).
\(\Rightarrow IG(\text{Humidity}) = 0.940 - 0.788 = \mathbf{0.151}\).
Q6. Which feature do we pick as the root?
Outlook — higher IG: \(0.247 > 0.151\).
Greedy: pick the locally best feature at each step. Not always globally optimal — but fast and effective.
L19: from hand-engineered trees to neural networks — the building block of deep learning.