CS 486/686 Lecture 9 — Variable Elimination Algorithm

CS 486/686
Variable Elimination Algorithm

Yuntian Deng

Lecture 9

RN 13.4 · PM 8.4

Learning goals

Do inference in Bayes nets more efficiently than the joint
Define factors and the four operations on them
Trace the variable elimination algorithm for prior / posterior queries
Explain how the elimination ordering affects complexity
Know the basics of approximate inference (sampling)

A query in the Holmes BN: \(P(B | w \wedge g)\)

"What's the probability of a burglary, given that both Watson and Gibbon called?"

Query variable: \(B\)
Evidence variables: \(W = w, G = g\)
Hidden variables: \(E, A, R\) (to be summed out)

Convention: capital letters are random variables, lowercase letters are specific values.

Naive: brute-force sum-out

Marginalize the full joint using the BN factorization:

\(P(B | w \wedge g) \propto \sum_e \sum_a \sum_r P(B)\, P(e)\, P(a | B \wedge e)\, P(w | a)\, P(g | a)\, P(r | e)\)

Q1. How many additions + multiplications does this require?

\(\le 10\)
11–20
21–40
41–60
\(\ge 61\)

D — 47 ops. Eight terms in the product, evaluated for every \((b, e, a, r) \in \{T, F\}^4 = 16\) combinations. Lots of redundant work.

Push the sums to the right

\(P(B)\) doesn't depend on \(e, a, r\) — pull it out. \(P(r | e)\) only depends on \(e, r\) — push \(\sum_r\) inward. And so on:

\(P(B)\, \sum_e P(e)\, \underbrace{\textstyle\sum_r P(r | e)}_{= 1}\, \sum_a P(a | B \wedge e)\, P(w | a)\, P(g | a)\)

\(=\; P(B)\, \sum_e P(e)\, \sum_a P(a | B \wedge e)\, P(w | a)\, P(g | a)\)

Q2. Ops now?

B — 14 ops. Inner sum: \(4 \text{ mul} + 1 \text{ add}\); times 2 for \(\sum_a\) over \(\{T, F\}\); plus \(\sum_e\) (4 ops); plus \(P(B)\) multiplication. ~3x speedup just from re-ordering.

The big idea

Reuse intermediate computations — dynamic programming for probabilities.
Exploit conditional independence baked into the Bayes net.
Push sums in as far as possible, then operate on small tables of probabilities — factors.

This is the Variable Elimination Algorithm.

Factors

A factor \(f(X_1, \ldots, X_j)\) is a function from random-variable assignments to a number. It can represent a joint, a conditional, or a partial assignment.

For the Holmes BN, define one factor per CPT:

\(f_1(B) = P(B), \quad f_2(E) = P(E)\)
\(f_3(A, B, E) = P(A | B \wedge E), \quad f_4(R, E) = P(R | E)\)
\(f_5(W, A) = P(W | A), \quad f_6(G, A) = P(G | A)\)

Restrict: assign a variable

Restricting \(f(X_1, X_2, \ldots, X_j)\) to \(X_1 = v_1\) gives a smaller factor \(f(X_1 = v_1, X_2, \ldots, X_j)\) on \(X_2, \ldots, X_j\).

Example: \(f_1(X, Y, Z) \;\to\; f_2(Y, Z) = f_1(X = t, Y, Z)\)

\(f_1(X, Y, Z)\)
X	Y	Z	val
t	t	t	0.1
t	t	f	0.9
t	f	t	0.2
t	f	f	0.8
f	t	t	0.4
f	t	f	0.6
f	f	t	0.3
f	f	f	0.7

\(\xrightarrow{X = t}\)

\(f_2(Y, Z)\)
Y	Z	val
t	t	0.1
t	f	0.9
f	t	0.2
f	f	0.8

Sum out: marginalize a variable

Summing out \(X_1\) from \(f(X_1, \ldots, X_j)\) gives a factor on \(X_2, \ldots, X_j\) defined by \(\bigl(\sum_{X_1} f\bigr)(X_2, \ldots, X_j) = \sum_{v} f(X_1 = v, X_2, \ldots, X_j)\).

Example: \(\sum_Y f_1(X, Y, Z) \;\to\; f_2(X, Z)\)

\(f_1(X, Y, Z)\)
X	Y	Z	val
t	t	t	0.03
t	t	f	0.07
t	f	t	0.54
t	f	f	0.36
f	t	t	0.06
f	t	f	0.14
f	f	t	0.48
f	f	f	0.32

\(\xrightarrow{\sum_Y}\)

\(f_2(X, Z) = \sum_Y f_1\)
X	Z	val
t	t	0.57
t	f	0.43
f	t	0.54
f	f	0.46

Multiply two factors

For \(f_1(X, Y)\) and \(f_2(Y, Z)\) sharing variable \(Y\): \((f_1 \times f_2)(X, Y, Z) = f_1(X, Y) \cdot f_2(Y, Z)\). The result is over the union of variables.

Example: \(f_1(X, Y) \times f_2(Y, Z)\) where \(Y\) is shared.

\(f_1(X, Y)\)
X	Y	val
t	t	0.1
t	f	0.9
f	t	0.2
f	f	0.8

\(f_2(Y, Z)\)
Y	Z	val
t	t	0.3
t	f	0.7
f	t	0.6
f	f	0.4

\(\times\)

\(f_1 \times f_2\)
X	Y	Z	val
t	t	t	0.03
t	t	f	0.07
t	f	t	0.54
t	f	f	0.36
f	t	t	0.06
f	t	f	0.14
f	f	t	0.48
f	f	f	0.32

Normalize a factor

Divide every entry by the sum, so the values sum to 1 (a valid probability distribution).

\(f(Y)\)
Y	val
t	0.2
f	0.6

\(\xrightarrow{\text{normalize}}\)

\(P(Y)\)
Y	val
t	0.25
f	0.75

\(0.2 / (0.2 + 0.6) = 0.25\), \(0.6 / 0.8 = 0.75\).

The variable elimination algorithm

To compute \(P(X_q | X_{o_1} = v_1 \wedge \cdots \wedge X_{o_j} = v_j)\):

variable-elimination(BN, query, evidence)

Construct a factor for each CPT in the BN.
Restrict every observed variable to its value.
For each hidden variable X_h (in chosen order): multiply all factors that contain X_h, then sum out X_h from the product.
Multiply the remaining factors.
Normalize the result.

Each iteration removes one hidden variable while keeping all factors small.

Worked example: \(P(B | \neg a)\)

Sub-BN for the example. \(B\) = query, \(A = \text{false}\) = evidence.

Given CPTs:

Define factors:

\(f_1(B), \; f_2(E), \; f_3(A, B, E), \; f_4(W, A)\)

Eliminate hidden variables

Restrict \(A = \text{false}\): \(f_3(A, B, E) \to f_5(B, E)\) \(f_4(W, A) \to f_6(W)\) Sum out \(W\) from \(f_6\): \(f_7() = \sum_W f_6(W) = 0.4 + 0.6 = 1.0\) Multiply \(f_2 \times f_5\): \(f_8(B, E)\) entries are products of marginal and conditional Sum out \(E\) from \(f_8\): \(f_9(B) = \sum_E f_8(B, E)\)

\(f_5(B, E) = f_3(\neg a, B, E)\)
B	E	val
t	t	0.2
t	f	0.3
f	t	0.8
f	f	0.9

\(f_8(B, E) = f_2 \times f_5\)
B	E	val
t	t	0.2 × 0.1 = 0.02
t	f	0.3 × 0.9 = 0.27
f	t	0.8 × 0.1 = 0.08
f	f	0.9 × 0.9 = 0.81

\(f_9(B) = \sum_E f_8\)
B	val
t	0.02 + 0.27 = 0.29
f	0.08 + 0.81 = 0.89

Multiply remaining factors, normalize

Remaining factors: \(f_1(B), f_9(B), f_7() = 1.0\). Multiply them:

\(f_{10}(B) = f_1 \times f_9 \times f_7\)
B	val
t	0.3 × 0.29 × 1 = 0.087
f	0.7 × 0.89 × 1 = 0.623

\(\xrightarrow{\text{normalize}}\)

\(P(B | \neg a)\)
B	val
t	0.087 / 0.710 \(\approx\) 0.1225
f	0.623 / 0.710 \(\approx\) 0.8775

Burglary is unlikely (\(\approx 12\%\)) given that the alarm did NOT sound.

Ordering matters

Compute \(P(G)\) (no evidence). Initial factors:

\(f_1(E), f_2(E, R), f_3(B)\)
\(f_4(E, B, A), f_5(W, A), f_6(G, A)\)

Every ordering of \(\{R, W, E, B, A\}\) yields a correct algorithm.
Different orderings produce different intermediate factors.
Larger intermediate factors \(\Rightarrow\) more space + ops.

Two orderings, very different costs

Good: \(R \to W \to E \to B \to A\)

\(f_2(E, R) \to f_7(E)\): 2 ops
\(f_5(W, A) \to f_8(A)\): 2 ops
\(f_1 \cdot f_7 \cdot f_4 \to f_9(B, A)\): 20 ops
\(f_9 \cdot f_8 \to f_{10}(A)\): 6 ops
\(f_{10} \cdot f_6 \to f_{11}(G)\): 6 ops

Total: 36 ops

Bad: \(A \to B \to E \to R \to W\)

\(f_4 \cdot f_5 \cdot f_6 \to f_7(E, B, W, G)\): 80 ops
\(f_7 \cdot f_3 \to f_8(E, W, G)\): 24 ops
\(f_1 \cdot f_8 \cdot f_2 \to f_9(W, G, R)\): 24 ops
\(f_9 \to f_{10}(W, G)\): 4 ops
\(f_{10} \to f_{11}(G)\): 2 ops

Total: 134 ops

The bad ordering builds a 4-variable intermediate factor \(f_7(E, B, W, G)\) — that's what we pay for.

Elimination width + tree width

Hypergraph view: each factor is a hyperedge over its variables.
Elimination width of ordering \(\pi\): the largest hyperedge size produced during elimination.
Tree width \(\omega\) = minimum elim width over all orderings.
VE complexity is \(2^{O(\omega)}\) in both time and space.
Finding an optimal ordering is NP-hard; in practice, heuristics work well (often \(\omega \approx 8\)–\(10\) even with 1000s of variables).

Hyperedges (colored bubbles) = factors.

Polytrees: an easy case

A polytree is a singly-connected BN: at most one undirected path between any two nodes. (Nodes may still have multiple parents.)

Heuristic: always eliminate a singly-connected node.
A singly-connected node always exists \(\Rightarrow\) factor size never grows.
Tree width = max # of parents of any node.
VE runs in time linear in the size of the network.

Eliminate leaves D, E, F first — factors never grow beyond original CPTs.

Forward sampling: estimate \(P(G)\)

When the BN is too large for exact VE, just simulate it.

for i in 1..n: e_i ~ P(E) b_i ~ P(B) a_i ~ P(A | E=e_i, B=b_i) g_i ~ P(G | A=a_i) P(G = T) ≈ (1/n) ∑_i g_i

Sample in topological order so every CPT input is already known. Estimate converges to true \(P(G)\) as \(n \to \infty\).

Rejection sampling: estimate \(P(G | \neg b)\)

for i in 1..n: forward-sample e_i, b_i, a_i, g_i if b_i == T: reject sample else: keep g̃_i P(G = T | ¬b) ≈ (∑_i g̃_i) / ñ

Only "accepted" samples (consistent with evidence) count.
Wasteful when evidence is rare — most samples thrown away.

Learning goals (recap)

✓ Do inference in Bayes nets more efficiently than the joint
✓ Define factors and the four operations on them
✓ Trace the variable elimination algorithm
✓ Explain how the elimination ordering affects complexity
✓ Know the basics of approximate inference

Next: Hidden Markov Models

A Bayes net unrolled over time — inference about what's happening now given a sequence of observations.

VEA in disguise: forward-backward, Viterbi, smoothing.