Course page
CS 486/686
Variable Elimination Algorithm

Lecture 9

RN 13.4 · PM 8.4

Search Uncertainty Decisions Learning

Learning goals

  • Do inference in Bayes nets more efficiently than the joint
  • Define factors and the four operations on them
  • Trace the variable elimination algorithm for prior / posterior queries
  • Explain how the elimination ordering affects complexity
  • Know the basics of approximate inference (sampling)

A query in the Holmes BN: \(P(B | w \wedge g)\)

E R B A W G

"What's the probability of a burglary, given that both Watson and Gibbon called?"

  • Query variable:  \(B\)
  • Evidence variables:  \(W = w, G = g\)
  • Hidden variables:  \(E, A, R\) (to be summed out)

Convention: capital letters are random variables, lowercase letters are specific values.

Naive: brute-force sum-out

Marginalize the full joint using the BN factorization:

\(P(B | w \wedge g) \propto \sum_e \sum_a \sum_r P(B)\, P(e)\, P(a | B \wedge e)\, P(w | a)\, P(g | a)\, P(r | e)\)

Q1. How many additions + multiplications does this require?

  1. \(\le 10\)
  2. 11–20
  3. 21–40
  4. 41–60
  5. \(\ge 61\)
D — 47 ops. Eight terms in the product, evaluated for every \((b, e, a, r) \in \{T, F\}^4 = 16\) combinations. Lots of redundant work.

Push the sums to the right

\(P(B)\) doesn't depend on \(e, a, r\) — pull it out. \(P(r | e)\) only depends on \(e, r\) — push \(\sum_r\) inward. And so on:

\(P(B)\, \sum_e P(e)\, \underbrace{\textstyle\sum_r P(r | e)}_{= 1}\, \sum_a P(a | B \wedge e)\, P(w | a)\, P(g | a)\)

\(=\; P(B)\, \sum_e P(e)\, \sum_a P(a | B \wedge e)\, P(w | a)\, P(g | a)\)

Q2. Ops now?

B — 14 ops. Inner sum: \(4 \text{ mul} + 1 \text{ add}\); times 2 for \(\sum_a\) over \(\{T, F\}\); plus \(\sum_e\) (4 ops); plus \(P(B)\) multiplication. ~3x speedup just from re-ordering.

The big idea

  • Reuse intermediate computations — dynamic programming for probabilities.
  • Exploit conditional independence baked into the Bayes net.
  • Push sums in as far as possible, then operate on small tables of probabilities — factors.

This is the Variable Elimination Algorithm.

Factors

A factor \(f(X_1, \ldots, X_j)\) is a function from random-variable assignments to a number. It can represent a joint, a conditional, or a partial assignment.

For the Holmes BN, define one factor per CPT:

\(f_1(B) = P(B), \quad f_2(E) = P(E)\)
\(f_3(A, B, E) = P(A | B \wedge E), \quad f_4(R, E) = P(R | E)\)
\(f_5(W, A) = P(W | A), \quad f_6(G, A) = P(G | A)\)

Restrict: assign a variable

Restricting \(f(X_1, X_2, \ldots, X_j)\) to \(X_1 = v_1\) gives a smaller factor \(f(X_1 = v_1, X_2, \ldots, X_j)\) on \(X_2, \ldots, X_j\).

Example: \(f_1(X, Y, Z) \;\to\; f_2(Y, Z) = f_1(X = t, Y, Z)\)

\(f_1(X, Y, Z)\)
XYZval
ttt0.1
ttf0.9
tft0.2
tff0.8
ftt0.4
ftf0.6
fft0.3
fff0.7

\(\xrightarrow{X = t}\)

\(f_2(Y, Z)\)
YZval
tt0.1
tf0.9
ft0.2
ff0.8

Sum out: marginalize a variable

Summing out \(X_1\) from \(f(X_1, \ldots, X_j)\) gives a factor on \(X_2, \ldots, X_j\) defined by  \(\bigl(\sum_{X_1} f\bigr)(X_2, \ldots, X_j) = \sum_{v} f(X_1 = v, X_2, \ldots, X_j)\).

Example: \(\sum_Y f_1(X, Y, Z) \;\to\; f_2(X, Z)\)

\(f_1(X, Y, Z)\)
XYZval
ttt0.03
ttf0.07
tft0.54
tff0.36
ftt0.06
ftf0.14
fft0.48
fff0.32

\(\xrightarrow{\sum_Y}\)

\(f_2(X, Z) = \sum_Y f_1\)
XZval
tt0.57
tf0.43
ft0.54
ff0.46

Multiply two factors

For \(f_1(X, Y)\) and \(f_2(Y, Z)\) sharing variable \(Y\):  \((f_1 \times f_2)(X, Y, Z) = f_1(X, Y) \cdot f_2(Y, Z)\).  The result is over the union of variables.

Example: \(f_1(X, Y) \times f_2(Y, Z)\) where \(Y\) is shared.

\(f_1(X, Y)\)
XYval
tt0.1
tf0.9
ft0.2
ff0.8
\(f_2(Y, Z)\)
YZval
tt0.3
tf0.7
ft0.6
ff0.4

\(\times\)

\(f_1 \times f_2\)
XYZval
ttt0.03
ttf0.07
tft0.54
tff0.36
ftt0.06
ftf0.14
fft0.48
fff0.32

Normalize a factor

Divide every entry by the sum, so the values sum to 1 (a valid probability distribution).

\(f(Y)\)
Yval
t0.2
f0.6

\(\xrightarrow{\text{normalize}}\)

\(P(Y)\)
Yval
t0.25
f0.75

\(0.2 / (0.2 + 0.6) = 0.25\),  \(0.6 / 0.8 = 0.75\).

The variable elimination algorithm

To compute \(P(X_q | X_{o_1} = v_1 \wedge \cdots \wedge X_{o_j} = v_j)\):

variable-elimination(BN, query, evidence)
  1. Construct a factor for each CPT in the BN.
  2. Restrict every observed variable to its value.
  3. For each hidden variable Xh (in chosen order): multiply all factors that contain Xh, then sum out Xh from the product.
  4. Multiply the remaining factors.
  5. Normalize the result.

Each iteration removes one hidden variable while keeping all factors small.

Worked example: \(P(B | \neg a)\)

E B A W

Sub-BN for the example. \(B\) = query, \(A = \text{false}\) = evidence.

Given CPTs:

\(P(B) = 0.3, \quad P(E) = 0.1\)
\(P(A | E,\, B) = 0.8, \;\; P(A | E,\, \neg B) = 0.2\)
\(P(A | \neg E,\, B) = 0.7, \;\; P(A | \neg E,\, \neg B) = 0.1\)
\(P(W | A) = 0.8, \;\; P(W | \neg A) = 0.4\)

Define factors:

\(f_1(B), \; f_2(E), \; f_3(A, B, E), \; f_4(W, A)\)

Eliminate hidden variables

Restrict \(A = \text{false}\):  \(f_3(A, B, E) \to f_5(B, E)\)   \(f_4(W, A) \to f_6(W)\) Sum out \(W\) from \(f_6\):  \(f_7() = \sum_W f_6(W) = 0.4 + 0.6 = 1.0\) Multiply \(f_2 \times f_5\): \(f_8(B, E)\) entries are products of marginal and conditional Sum out \(E\) from \(f_8\):  \(f_9(B) = \sum_E f_8(B, E)\)

\(f_5(B, E) = f_3(\neg a, B, E)\)
BEval
tt0.2
tf0.3
ft0.8
ff0.9
\(f_8(B, E) = f_2 \times f_5\)
BEval
tt0.2 × 0.1 = 0.02
tf0.3 × 0.9 = 0.27
ft0.8 × 0.1 = 0.08
ff0.9 × 0.9 = 0.81
\(f_9(B) = \sum_E f_8\)
Bval
t0.02 + 0.27 = 0.29
f0.08 + 0.81 = 0.89

Multiply remaining factors, normalize

Remaining factors:  \(f_1(B), f_9(B), f_7() = 1.0\). Multiply them:

\(f_{10}(B) = f_1 \times f_9 \times f_7\)
Bval
t0.3 × 0.29 × 1 = 0.087
f0.7 × 0.89 × 1 = 0.623

\(\xrightarrow{\text{normalize}}\)

\(P(B | \neg a)\)
Bval
t0.087 / 0.710 \(\approx\) 0.1225
f0.623 / 0.710 \(\approx\) 0.8775

Burglary is unlikely (\(\approx 12\%\)) given that the alarm did NOT sound.

Ordering matters

E R B A W G

Compute \(P(G)\) (no evidence). Initial factors:

\(f_1(E), f_2(E, R), f_3(B)\)
\(f_4(E, B, A), f_5(W, A), f_6(G, A)\)

  • Every ordering of \(\{R, W, E, B, A\}\) yields a correct algorithm.
  • Different orderings produce different intermediate factors.
  • Larger intermediate factors \(\Rightarrow\) more space + ops.

Two orderings, very different costs

Good: \(R \to W \to E \to B \to A\)

  • \(f_2(E, R) \to f_7(E)\): 2 ops
  • \(f_5(W, A) \to f_8(A)\): 2 ops
  • \(f_1 \cdot f_7 \cdot f_4 \to f_9(B, A)\): 20 ops
  • \(f_9 \cdot f_8 \to f_{10}(A)\): 6 ops
  • \(f_{10} \cdot f_6 \to f_{11}(G)\): 6 ops

Total: 36 ops

Bad: \(A \to B \to E \to R \to W\)

  • \(f_4 \cdot f_5 \cdot f_6 \to f_7(E, B, W, G)\): 80 ops
  • \(f_7 \cdot f_3 \to f_8(E, W, G)\): 24 ops
  • \(f_1 \cdot f_8 \cdot f_2 \to f_9(W, G, R)\): 24 ops
  • \(f_9 \to f_{10}(W, G)\): 4 ops
  • \(f_{10} \to f_{11}(G)\): 2 ops

Total: 134 ops

The bad ordering builds a 4-variable intermediate factor \(f_7(E, B, W, G)\) — that's what we pay for.

Elimination width + tree width

  • Hypergraph view: each factor is a hyperedge over its variables.
  • Elimination width of ordering \(\pi\): the largest hyperedge size produced during elimination.
  • Tree width \(\omega\) = minimum elim width over all orderings.
  • VE complexity is \(2^{O(\omega)}\) in both time and space.
  • Finding an optimal ordering is NP-hard; in practice, heuristics work well (often \(\omega \approx 8\)–\(10\) even with 1000s of variables).
X Y Z U V

Hyperedges (colored bubbles) = factors.

Polytrees: an easy case

A polytree is a singly-connected BN: at most one undirected path between any two nodes. (Nodes may still have multiple parents.)
  • Heuristic: always eliminate a singly-connected node.
  • A singly-connected node always exists \(\Rightarrow\) factor size never grows.
  • Tree width = max # of parents of any node.
  • VE runs in time linear in the size of the network.
A B C D E F

Eliminate leaves D, E, F first — factors never grow beyond original CPTs.

Forward sampling: estimate \(P(G)\)

E R B A W G

When the BN is too large for exact VE, just simulate it.

for i in 1..n: ei ~ P(E) bi ~ P(B) ai ~ P(A | E=ei, B=bi) gi ~ P(G | A=ai) P(G = T) ≈ (1/n) ∑i gi

Sample in topological order so every CPT input is already known. Estimate converges to true \(P(G)\) as \(n \to \infty\).

Rejection sampling: estimate \(P(G | \neg b)\)

E R B A W G
for i in 1..n: forward-sample ei, bi, ai, gi if bi == T: reject sample else: keep g̃i P(G = T | ¬b) ≈ (∑ii) / ñ
  • Only "accepted" samples (consistent with evidence) count.
  • Wasteful when evidence is rare — most samples thrown away.

Learning goals (recap)

  • ✓  Do inference in Bayes nets more efficiently than the joint
  • ✓  Define factors and the four operations on them
  • ✓  Trace the variable elimination algorithm
  • ✓  Explain how the elimination ordering affects complexity
  • ✓  Know the basics of approximate inference

Next: Hidden Markov Models

A Bayes net unrolled over time — inference about what's happening now given a sequence of observations.

VEA in disguise: forward-backward, Viterbi, smoothing.