Lecture 6
RN 12.2–12.3 · PM 8.1
Calculate prior, posterior, and joint probabilities using the sum rule, the product (chain) rule, and Bayes' rule.
An agent must reason about uncertainty and decide anyway.
The best it can do is know how uncertain it is and act accordingly.
Probability is the formal measure of uncertainty — but there are two camps.
Frequentist: probability is objective, the long-run frequency of an event.
After \(n\) heads and \(m\) tails: \(\;P(\text{H}) = \frac{n}{n + m}\).
A frequentist can't decide without data.
Bayesian: probability is a degree of belief. Start with a prior, then update on evidence.
Laplace prior of 1H/1T: \(\;P(\text{H}) = \frac{1 + n}{2 + n + m}\).
A Bayesian can decide with no data by using an uninformative prior.
After 2 heads and 4 tails: Bayesian \(= \tfrac{3}{8}\). Frequentist \(= \tfrac{2}{6}\).
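The two estimates can be compared with a minimal Python sketch. The function names `frequentist_estimate` and `laplace_estimate` are illustrative, not from the lecture; exact fractions are used so the results match the slide.

```python
from fractions import Fraction

def frequentist_estimate(heads, tails):
    """Relative frequency n / (n + m); undefined when there is no data."""
    total = heads + tails
    if total == 0:
        raise ValueError("no observations: the frequentist estimate is undefined")
    return Fraction(heads, total)

def laplace_estimate(heads, tails):
    """Estimate (1 + n) / (2 + n + m), i.e. a prior of one head and one tail."""
    return Fraction(1 + heads, 2 + heads + tails)

print(frequentist_estimate(2, 4))  # 1/3 (that is, 2/6)
print(laplace_estimate(2, 4))      # 3/8
print(laplace_estimate(0, 0))      # 1/2, an answer even with no data
```

With no observations the Laplace estimate falls back to the prior, while the frequentist estimate has nothing to divide by.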
\(0 < P(A) < 1\) doesn't mean \(A\) is "partly true" — it means we're uncertain. Probability is a measure of ignorance.
Example: \(P(\text{weather}, \text{temperature})\)
| | Hot | Mild | Cold |
|---|---|---|---|
| Sunny | 0.10 | 0.20 | 0.10 |
| Cloudy | 0.05 | 0.35 | 0.20 |
Probabilities sum to 1 across the whole table.
Mr. Holmes has a burglar alarm and two flaky neighbours (Watson, Gibbon) who call him when they hear it. The alarm also misfires on earthquakes; earthquakes are reported on the radio.
Random variables: Burglary, Earthquake, Alarm, Watson calls, Gibbon calls, Radio report.
6 Boolean variables ⇒ \(2^6 = 64\) joint probabilities.
Given a joint distribution, we can compute the probability over a subset of variables by summing out the rest.
From \(P(A, B, C)\) to \(P(A, B)\): sum out \(C\):
\(P(A \wedge B) = P(A \wedge B \wedge C) + P(A \wedge B \wedge \neg C)\)
From \(P(A, B)\) to \(P(A)\): sum out \(B\):
\(P(A) = P(A \wedge B) + P(A \wedge \neg B)\)
Add up all probabilities while varying the variables we don't care about.
\(P(A, W, G)\) over Watson \(W\), Gibbon \(G\), and the alarm \(A\):
| | \(A \wedge G\) | \(A \wedge \neg G\) | \(\neg A \wedge G\) | \(\neg A \wedge \neg G\) |
|---|---|---|---|---|
| \(W\) | 0.032 | 0.048 | 0.036 | 0.324 |
| \(\neg W\) | 0.008 | 0.012 | 0.054 | 0.486 |
Example: \(P(\neg A \wedge W)\). Sum out \(G\) by adding the two \(\neg A\) cells in the \(W\) row.
\(P(\neg A \wedge W) = 0.036 + 0.324 = \mathbf{0.36}\)
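Summing out can be sketched in Python as a filter over the joint table. The dictionary layout and the helper `marginal` are assumptions made for illustration; the eight numbers are the ones in the table.

```python
# Joint P(A, W, G) from the table, keyed by (alarm, watson, gibbon).
JOINT = {
    (True, True, True): 0.032,
    (True, True, False): 0.048,
    (False, True, True): 0.036,
    (False, True, False): 0.324,
    (True, False, True): 0.008,
    (True, False, False): 0.012,
    (False, False, True): 0.054,
    (False, False, False): 0.486,
}
NAMES = ("A", "W", "G")

def marginal(joint, **fixed):
    """Add up every joint entry that agrees with the fixed values.

    Variables not mentioned in `fixed` are summed out.
    """
    total = 0.0
    for assignment, p in joint.items():
        world = dict(zip(NAMES, assignment))
        if all(world[name] == value for name, value in fixed.items()):
            total += p
    return total

# P(not A and W): G is not mentioned, so it is summed out.
print(round(marginal(JOINT, A=False, W=True), 3))  # 0.36
```

Calling `marginal(JOINT)` with nothing fixed sums the whole table, which is a quick check that the joint adds up to 1.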
Q1. \(P(\neg A \wedge W)\)?
Q2. \(P(A \wedge \neg G)\)?
Q3. \(P(\neg A)\)?
The conditional probability of \(A\) given \(B\) is the fraction of the \(B\)-world in which \(A\) also holds:
\(P(A | B) = \dfrac{P(A \wedge B)}{P(B)}\)
From the previous slides: \(P(\neg A \wedge W) = 0.36\), \(P(\neg A) = 0.9\).
Q. Watson calls, given the alarm is NOT going.
From the previous slides: \(P(A \wedge \neg G) = 0.06\), \(P(\neg A) = 0.9\) (so \(P(A) = 0.1\)).
Q. Gibbon does NOT call, given the alarm is going.
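Both questions reduce to one division. A minimal sketch, assuming the joint and marginal values quoted above; the helper name `conditional` is illustrative.

```python
def conditional(p_a_and_b, p_b):
    """P(A | B) = P(A and B) / P(B); undefined when P(B) is zero."""
    if p_b == 0:
        raise ZeroDivisionError("cannot condition on an event with probability 0")
    return p_a_and_b / p_b

# Watson calls, given the alarm is not going: P(W | not A)
print(round(conditional(0.36, 0.9), 3))  # 0.4

# Gibbon does not call, given the alarm is going: P(not G | A)
print(round(conditional(0.06, 0.1), 3))  # 0.6
```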
The product rule, repeated:
Two variables: \(P(A \wedge B) = P(A | B)\, P(B)\)
Three variables: \(P(A \wedge B \wedge C) = P(A | B \wedge C)\, P(B | C)\, P(C)\)
In general:
\[ P(X_1 \wedge X_2 \wedge \cdots \wedge X_n) = \prod_{i=1}^{n} P(X_i \,|\, X_1 \wedge \cdots \wedge X_{i-1}) \]
Pick any ordering of variables; condition each on its predecessors.
Given: \(P(A) = 0.1\), \(P(W|A) = 0.9\), \(P(G|A \wedge W) = 0.3\).
Q. Apply the chain rule to the conjunction.
Given: \(P(A) = 0.1\), \(P(W|\neg A) = 0.4\), \(P(G|\neg A \wedge \neg W) = 0.1\).
Q. Apply the chain rule, this time to the all-negated conjunction.
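The two chain-rule exercises can be checked with a short Python sketch. The function `chain_rule` is a hypothetical helper that only multiplies the factors; choosing the right factors, including the complements for negated variables, is still done by hand.

```python
from math import prod

def chain_rule(factors):
    """Multiply the factors P(X_i | X_1, ..., X_{i-1}) of one conjunction."""
    return prod(factors)

# P(A and W and G) = P(A) * P(W | A) * P(G | A and W)
print(round(chain_rule([0.1, 0.9, 0.3]), 3))  # 0.027

# P(not A and not W and not G): each factor is the complement of a given value.
p_not_a = 1 - 0.1                    # P(not A)
p_not_w_given_not_a = 1 - 0.4        # P(not W | not A)
p_not_g_given_not_a_not_w = 1 - 0.1  # P(not G | not A and not W)
print(round(chain_rule([p_not_a, p_not_w_given_not_a, p_not_g_given_not_a_not_w]), 3))  # 0.486
```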
Models give us causal knowledge. Diagnosis needs evidential reasoning.
We know \(P(\text{symptom} | \text{disease})\) and want \(P(\text{disease} | \text{symptom})\).
You don't need to memorize Bayes' rule; you can derive it from the product rule:
By the product rule (both directions): \(P(X \wedge Y) = P(X | Y)\, P(Y) = P(Y | X)\, P(X)\).
Divide both sides by \(P(Y)\): \(P(X | Y) = \dfrac{P(Y | X)\, P(X)}{P(Y)}\).
The denominator \(P(Y)\) is just a normalization constant: compute the unnormalized products \(P(Y | X)\, P(X)\) and \(P(Y | \neg X)\, P(\neg X)\), then divide each by their sum.
Given: \(P(A) = 0.1\), \(P(W|A) = 0.9\), \(P(W|\neg A) = 0.4\).
Q. Alarm NOT going, given Watson calls.
Given: \(P(A) = 0.1\), \(P(G|A) = 0.3\), \(P(G|\neg A) = 0.1\).
Q. Alarm going, given Gibbon does NOT call.
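Bayes' rule with the denominator obtained by normalization can be sketched as follows. The function `bayes` and its parameter names are illustrative assumptions; the inputs are the given values from the two questions.

```python
def bayes(prior, likelihood, likelihood_if_false):
    """Return P(X | Y) from P(X), P(Y | X), and P(Y | not X).

    The denominator P(Y) is the sum of the two unnormalized products.
    """
    score_true = likelihood * prior
    score_false = likelihood_if_false * (1 - prior)
    return score_true / (score_true + score_false)

# Alarm NOT going, given Watson calls: P(not A | W) = 1 - P(A | W)
p_a_given_w = bayes(prior=0.1, likelihood=0.9, likelihood_if_false=0.4)
print(round(1 - p_a_given_w, 3))  # 0.8

# Alarm going, given Gibbon does NOT call: the evidence is "not G",
# so the likelihoods are P(not G | A) = 1 - 0.3 and P(not G | not A) = 1 - 0.1.
print(round(bayes(prior=0.1, likelihood=1 - 0.3, likelihood_if_false=1 - 0.1), 4))  # 0.0795
```

Even though Watson calls far more reliably when the alarm is going, the low prior \(P(A) = 0.1\) keeps the alarm unlikely after a single call.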
Any conditional probability can be calculated by combining three rules: the definition of conditional probability, the sum rule, and the chain rule.
Given: \(P(A) = 0.6\); \(P(B | A) = 0.4\), \(P(\neg B | \neg A) = 0.2\); \(P(C | A \wedge B) = 0.1\), \(P(C | \neg A \wedge B) = 0.2\), \(P(C | A \wedge \neg B) = 0.5\), \(P(C | \neg A \wedge \neg B) = 0.8\).
Step 1. Convert the conditional into a ratio of joints:
\(P(A | C) = \dfrac{P(A \wedge C)}{P(C)} = \dfrac{P(A \wedge C)}{P(A \wedge C) + P(\neg A \wedge C)}\)
Step 2. The numerator and denominator from step 1 are partial joints, so sum out the missing variable \(B\):
\(P(A \wedge C) = P(A \wedge B \wedge C) + P(A \wedge \neg B \wedge C)\)
\(P(\neg A \wedge C) = P(\neg A \wedge B \wedge C) + P(\neg A \wedge \neg B \wedge C)\)
Now every term on the right is a full-joint probability, ready for the chain rule.
Step 3. Apply the chain rule \(P(A \wedge B \wedge C) = P(C | A \wedge B)\, P(B | A)\, P(A)\) to each full joint, using the complements \(P(\neg A) = 0.4\), \(P(\neg B | A) = 0.6\), and \(P(B | \neg A) = 0.8\):
\(P(A \wedge B \wedge C) = 0.1 \times 0.4 \times 0.6 = 0.024\)
\(P(A \wedge \neg B \wedge C) = 0.5 \times 0.6 \times 0.6 = 0.180\)
\(P(\neg A \wedge B \wedge C) = 0.2 \times 0.8 \times 0.4 = 0.064\)
\(P(\neg A \wedge \neg B \wedge C) = 0.8 \times 0.2 \times 0.4 = 0.064\)
Step 4. Substitute into the ratio from step 1:
\(P(A | C) = \dfrac{0.024 + 0.180}{0.024 + 0.180 + 0.064 + 0.064} = \dfrac{0.204}{0.332} \approx \mathbf{0.614}\)
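The whole calculation can be reproduced with a short Python sketch. The table names and the helpers `bernoulli`, `joint`, and `p_a_given_c` are illustrative assumptions; the numbers are the given conditionals, with \(P(B | \neg A)\) taken as \(1 - 0.2\).

```python
from itertools import product

P_A = 0.6
P_B_GIVEN_A = {True: 0.4, False: 0.8}  # P(B | A) and P(B | not A) = 1 - 0.2
P_C_GIVEN_AB = {                       # P(C | A, B), keyed by (a, b)
    (True, True): 0.1,
    (False, True): 0.2,
    (True, False): 0.5,
    (False, False): 0.8,
}

def bernoulli(p_true, value):
    """Probability of `value` for a Boolean variable that is true with p_true."""
    return p_true if value else 1 - p_true

def joint(a, b, c):
    """Chain rule: P(a, b, c) = P(c | a, b) * P(b | a) * P(a)."""
    return (bernoulli(P_C_GIVEN_AB[(a, b)], c)
            * bernoulli(P_B_GIVEN_A[a], b)
            * bernoulli(P_A, a))

def p_a_given_c():
    """P(A | C) as a ratio of joints, with B summed out in both."""
    numerator = sum(joint(True, b, True) for b in (True, False))
    denominator = sum(joint(a, b, True)
                      for a, b in product((True, False), repeat=2))
    return numerator / denominator

print(round(p_a_given_c(), 3))  # 0.614
```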
| Rule | Formula | Use when |
|---|---|---|
| Sum | \(P(A) = \sum_b P(A \wedge B{=}b)\) | marginalizing out variables |
| Product | \(P(A \wedge B) = P(A \mid B)\, P(B)\) | defining or applying a conditional |
| Chain | \(P(X_1 \wedge \cdots \wedge X_n) = \prod_i P(X_i \mid X_1 \wedge \cdots \wedge X_{i-1})\) | computing a full joint from conditionals |
| Bayes' | \(P(X \mid Y) = \dfrac{P(Y \mid X)\, P(X)}{P(Y)}\) | flipping causal → evidential |
The chain rule gives a full joint — but \(2^n\) probabilities for \(n\) variables is too many. Conditional independence and Bayesian networks let us write huge joints with far fewer numbers.