Course page
CS 486/686
Probabilities

Lecture 6

RN 12.2–12.3 · PM 8.1

Search Uncertainty Decisions Learning

Learning goals

Calculate prior, posterior, and joint probabilities using:

  • the sum rule (marginalization),
  • the product rule (conditioning),
  • the chain rule (joint factorization),
  • and Bayes' rule (flipping a conditional).
  • Then put it all together with a universal 3-step recipe.

Why handle uncertainty?

  • An agent may not observe everything — it doesn't know exactly what state it is in.
  • An action may not have its intended consequence — it doesn't know exactly where it will end up.

An agent must reason about uncertainty and decide anyway.

The best it can do is know how uncertain it is and act accordingly.

Frequentists vs Bayesians

Probability is the formal measure of uncertainty — but there are two camps.

Frequentist

Probability is objective — the long-run frequency of an event.

After \(n\) heads and \(m\) tails: \(\;P(\text{H}) = \frac{n}{n + m}\).

Can't decide without data.

Bayesian

Probability is a degree of belief: start with a prior, update on evidence.

Laplace prior of 1H/1T: \(\;P(\text{H}) = \frac{1 + n}{2 + n + m}\).

Can decide with no data using an uninformed prior.

After 2 heads and 4 tails:  Bayesian \(= \tfrac{3}{8}\).   Frequentist \(= \tfrac{2}{6}\).

Random variables

  • A random variable has a domain of possible values and a probability distribution.
  • Example:  \(A = \) "the alarm is sounding".  Domain \(\{\text{true}, \text{false}\}\), with \(P(A = \text{true}) = 0.1\), \(P(A = \text{false}) = 0.9\).
  • Boolean shorthand:  \(P(A)\) means \(P(A = \text{true})\);  \(P(\neg A)\) means \(P(A = \text{false})\).

Axioms of probability

  • Every probability is in \([0, 1]\):  \(0 \le P(A) \le 1\).
  • Tautologies have probability 1; contradictions have probability 0:  \(P(\text{true}) = 1,\; P(\text{false}) = 0\).
  • Inclusion-exclusion:  \(P(A \vee B) = P(A) + P(B) - P(A \wedge B)\).
A B A∧B

\(0 < P(A) < 1\) doesn't mean \(A\) is "partly true" — it means we're uncertain. Probability is a measure of ignorance.

Joint probability distribution

  • A probabilistic model contains a set of random variables.
  • An atomic event assigns a value to every variable.
  • A joint probability distribution gives a probability to every atomic event.

Example: \(P(\text{weather}, \text{temperature})\)

HotMildCold
Sunny0.100.200.10
Cloudy0.050.350.20

Probabilities sum to 1 across the whole table.

Prior vs posterior — and our running example

  • Prior \(P(X)\): likelihood of \(X\) with no other info.
  • Posterior \(P(X | Y)\): likelihood of \(X\) given evidence \(Y\).

Mr. Holmes has a burglar alarm and two flaky neighbours (Watson, Gibbon) who call him when they hear it. The alarm also misfires on earthquakes; earthquakes are reported on the radio.

Random variables

  • \(B\): a burglary is happening
  • \(E\): an earthquake is happening
  • \(A\): the alarm is going
  • \(W\): Watson is calling
  • \(G\): Gibbon is calling
  • \(R\): radio reports an earthquake

6 Boolean variables ⇒ \(2^6 = 64\) joint probabilities.

B E R A W G

The sum rule (marginalization)

Given a joint distribution, we can compute the probability over a subset of variables by summing out the rest.

From \(P(A, B, C)\) to \(P(A, B)\): sum out \(C\):

\(P(A \wedge B) = P(A \wedge B \wedge C) + P(A \wedge B \wedge \neg C)\)

From \(P(A, B)\) to \(P(A)\): sum out \(B\):

\(P(A) = P(A \wedge B) + P(A \wedge \neg B)\)

Add up all probabilities while varying the variables we don't care about.

Marginalizing the Holmes joint

\(P(A, W, G)\) over Watson \(W\), Gibbon \(G\), and the alarm \(A\):

\(A\)\(\neg A\)
\(G\)\(\neg G\)\(G\)\(\neg G\)
\(W\) 0.032 0.048 0.036 0.324
\(\neg W\) 0.008 0.012 0.054 0.486

Example — \(P(\neg A \wedge W)\): sum out \(G\) within the highlighted cells.

\(P(\neg A \wedge W) = 0.036 + 0.324 = \mathbf{0.36}\)

Quiz pack: marginalization

Q1. \(P(\neg A \wedge W)\)?

\(= 0.036 + 0.324 = 0.36\).  (A)

Q2. \(P(A \wedge \neg G)\)?

\(= 0.048 + 0.012 = 0.06\).  (B)

Q3. \(P(\neg A)\)?

\(= 0.036 + 0.324 + 0.054 + 0.486 = 0.9\).  (E)
\(A\)\(\neg A\)
\(G\)\(\neg G\)\(G\)\(\neg G\)
\(W\)0.0320.0480.0360.324
\(\neg W\)0.0080.0120.0540.486

The product rule (conditioning)

The conditional probability of \(A\) given \(B\) is the fraction of the \(B\)-world in which \(A\) also holds:

\(P(A | B) = \dfrac{P(A \wedge B)}{P(B)}\)

all worlds  (total prob = 1) B A ∧ B \(P(A|B) =\) A ∧ B B

Q: \(P(W \mid \neg A)\)?

From the previous slides: \(P(\neg A \wedge W) = 0.36\), \(P(\neg A) = 0.9\).

Q. Watson calls, given the alarm is NOT going.

  1. \(0.2\)
  2. \(0.4\)
  3. \(0.6\)
  4. \(0.8\)
B. \(P(W | \neg A) = \dfrac{P(\neg A \wedge W)}{P(\neg A)} = \dfrac{0.36}{0.9} = \mathbf{0.4}\).

Q: \(P(\neg G \mid A)\)?

From the previous slides: \(P(A \wedge \neg G) = 0.06\), \(P(\neg A) = 0.9\) (so \(P(A) = 0.1\)).

Q. Gibbon does NOT call, given the alarm is going.

  1. \(0.2\)
  2. \(0.4\)
  3. \(0.6\)
  4. \(0.8\)
C. \(P(\neg G | A) = \dfrac{P(A \wedge \neg G)}{P(A)} = \dfrac{0.06}{0.1} = \mathbf{0.6}\).

The chain rule (joint factorization)

The product rule, repeated:

Two variables:  \(P(A \wedge B) = P(A | B)\, P(B)\)

Three variables:  \(P(A \wedge B \wedge C) = P(A | B \wedge C)\, P(B | C)\, P(C)\)

In general:

\[ P(X_1 \wedge X_2 \wedge \cdots \wedge X_n) = \prod_{i=1}^{n} P(X_i \,|\, X_1 \wedge \cdots \wedge X_{i-1}) \]

Pick any ordering of variables; condition each on its predecessors.

Q: \(P(A \wedge W \wedge \neg G)\)?

Given: \(P(A) = 0.1\), \(P(W|A) = 0.9\), \(P(G|A \wedge W) = 0.3\).

Q. Apply the chain rule to the conjunction.

  1. \(0.027\)
  2. \(0.037\)
  3. \(0.054\)
  4. \(0.063\)
D. \(P(A) \cdot P(W | A) \cdot P(\neg G | A \wedge W) = 0.1 \times 0.9 \times 0.7 = \mathbf{0.063}\).

Q: \(P(\neg A \wedge \neg W \wedge \neg G)\)?

Given: \(P(A) = 0.1\), \(P(W|\neg A) = 0.4\), \(P(G|\neg A \wedge \neg W) = 0.1\).

Q. Apply the chain rule, this time to the all-negated conjunction.

  1. \(0.486\)
  2. \(0.324\)
  3. \(0.216\)
  4. \(0.054\)
A. \(P(\neg A) \cdot P(\neg W | \neg A) \cdot P(\neg G | \neg A \wedge \neg W) = 0.9 \times 0.6 \times 0.9 = \mathbf{0.486}\).

Why flip conditionals?

Models give us causal knowledge. Diagnosis needs evidential reasoning.

disease → symptom fire → alarm symptom → disease alarm → fire Bayes' rule flips the conditional

We know \(P(\text{symptom} | \text{disease})\) and want \(P(\text{disease} | \text{symptom})\).

Bayes' rule

Bayes' rule.  \(P(X | Y) = \dfrac{P(Y | X)\, P(X)}{P(Y)}\)

You don't memorize it — you derive it from the product rule:

By the product rule (both directions):  \(P(X \wedge Y) = P(X | Y)\, P(Y) = P(Y | X)\, P(X)\).

Divide both sides by \(P(Y)\):  \(P(X | Y) = \dfrac{P(Y | X)\, P(X)}{P(Y)}\).

The denominator \(P(Y)\) is just a normalization constant — compute \(P(X | Y)\) and \(P(\neg X | Y)\), then normalize.

Q: \(P(\neg A \mid W)\)?

Given: \(P(A) = 0.1\), \(P(W|A) = 0.9\), \(P(W|\neg A) = 0.4\).

Q. Alarm NOT going, given Watson calls.

  1. \(0.2\)
  2. \(0.45\)
  3. \(0.8\)
  4. \(0.9\)
C. Marginalize: \(P(W) = P(A) P(W|A) + P(\neg A) P(W|\neg A) = 0.09 + 0.36 = 0.45\).
Bayes: \(P(\neg A | W) = \dfrac{P(\neg A) P(W|\neg A)}{P(W)} = \dfrac{0.36}{0.45} = \mathbf{0.8}\).

Q: \(P(A \mid \neg G)\)?

Given: \(P(A) = 0.1\), \(P(G|A) = 0.3\), \(P(G|\neg A) = 0.1\).

Q. Alarm going, given Gibbon does NOT call.

  1. \(0.03\)
  2. \(0.08\)
  3. \(0.30\)
  4. \(0.70\)
B. Marginalize: \(P(\neg G) = P(A) P(\neg G|A) + P(\neg A) P(\neg G|\neg A) = 0.07 + 0.81 = 0.88\).
Bayes: \(P(A | \neg G) = \dfrac{P(A) P(\neg G|A)}{P(\neg G)} = \dfrac{0.07}{0.88} \approx \mathbf{0.08}\).

A universal approach

Any conditional probability can be calculated by combining three rules.

  1. Conditional → joint:  rewrite \(P(X | Y)\) as a ratio of joint probabilities (the product rule, reversed).
  2. Partial joint → full joint:  introduce any missing variables by summing over them (the sum rule, reversed).
  3. Full joint → chain:  factor each full-joint probability using the chain rule.

Worked example: \(P(A | C)\)

Given:  \(P(A) = 0.6\);  \(P(B | A) = 0.4\), \(P(\neg B | \neg A) = 0.2\);  \(P(C | A \wedge B) = 0.1\), \(P(C | \neg A \wedge B) = 0.2\), \(P(C | A \wedge \neg B) = 0.5\), \(P(C | \neg A \wedge \neg B) = 0.8\).

Step 1. Convert the conditional into a ratio of joints:

\(P(A | C) = \dfrac{P(A \wedge C)}{P(C)} = \dfrac{P(A \wedge C)}{P(A \wedge C) + P(\neg A \wedge C)}\)

Worked example: step 2 (marginalize)

From step 1: \(P(A | C) = \dfrac{P(A \wedge C)}{P(A \wedge C) + P(\neg A \wedge C)}\). The numerator and denominator are partial joints — sum out the missing variable \(B\).

Step 2. Sum out \(B\):

\(P(A \wedge C) = P(A \wedge B \wedge C) + P(A \wedge \neg B \wedge C)\)

\(P(\neg A \wedge C) = P(\neg A \wedge B \wedge C) + P(\neg A \wedge \neg B \wedge C)\)

Now every term on the right is a full-joint probability, ready for the chain rule.

Worked example: step 3 + answer

Apply the chain rule \(P(A \wedge B \wedge C) = P(C | A \wedge B)\, P(B | A)\, P(A)\) to each full joint.

\(P(A \wedge B \wedge C) = 0.1 \times 0.4 \times 0.6 = 0.024\)

\(P(A \wedge \neg B \wedge C) = 0.5 \times 0.6 \times 0.6 = 0.180\)

\(P(\neg A \wedge B \wedge C) = 0.2 \times 0.8 \times 0.4 = 0.064\)

\(P(\neg A \wedge \neg B \wedge C) = 0.8 \times 0.2 \times 0.4 = 0.064\)

\(P(A | C) = \dfrac{0.024 + 0.180}{0.024 + 0.180 + 0.064 + 0.064} = \dfrac{0.204}{0.332} \approx \mathbf{0.614}\)

Four rules at a glance

Rule Formula Use when
Sum \(P(A) = \sum_b P(A \wedge B{=}b)\) marginalizing out variables
Product \(P(A \wedge B) = P(A | B)\, P(B)\) defining or applying a conditional
Chain \(P(X_1 \wedge \cdots \wedge X_n) = \prod_i P(X_i | X_1 \wedge \cdots \wedge X_{i-1})\) computing a full joint from conditionals
Bayes' \(P(X | Y) = \dfrac{P(Y | X)\, P(X)}{P(Y)}\) flipping causal → evidential

Learning goals (recap)

Calculate prior, posterior, and joint probabilities using:

  • the sum rule
  • the product rule
  • the chain rule
  • Bayes' rule
  • the universal 3-step recipe

Next: independence and Bayes nets

The chain rule gives a full joint — but \(2^n\) probabilities for \(n\) variables is too many. Conditional independence and Bayesian networks let us write huge joints with far fewer numbers.