Lecture 14
RN 17.2 · PM 9.5.2–9.5.3
⏰ Assignment 1 is due TODAY (Thu Jun 25), 11:59 PM on LEARN.
📝 Need an extension? Contact Dake Zhang (A1 owner) on Piazza or d346zhan@uwaterloo.ca.
From L13, the optimal policy is one greedy step from \(V^\star\):
\(\pi^\star(s) = \arg\max_a Q^\star(s, a).\)
How do we actually compute \(V^\star\)?
Two algorithms: value iteration (iterate values) and policy iteration (iterate policies).
\(V^\star(s)\) — the best expected total (discounted) reward achievable starting from state \(s\) and acting optimally thereafter.
\(Q^\star(s, a)\) — the best expected total reward if you start in \(s\), take action \(a\) now, then act optimally.
Since you would just pick the best first action:
\(V^\star(s) = \displaystyle\max_a Q^\star(s, a)\)
connection ①
Take action \(a\) in state \(s\). What happens?
\[ \begin{aligned} Q^\star(s, a) &= R(s) + \mathbb{E}_{s' \sim P(\cdot \mid s, a)}\bigl[V^\star(s')\bigr] \\ &= R(s) + \sum_{s'} P(s' \mid s, a)\, V^\star(s') \end{aligned} \]
One fix remaining: future reward should be discounted.
A reward received one step later is worth less than a reward now, so we discount the downstream value \(V^\star(s')\) by \(\gamma \in [0, 1)\):
\(Q^\star(s, a) = R(s) + \gamma \displaystyle\sum_{s'} P(s' \mid s, a)\, V^\star(s')\)
connection ②
Why discount? Future reward is less certain, and \(\gamma < 1\) keeps the long-run sum finite (and the upcoming iteration convergent).
The two connections between \(V^\star\) and \(Q^\star\):
① \(V^\star(s) = \displaystyle\max_a Q^\star(s, a)\)
② \(Q^\star(s, a) = R(s) + \gamma \displaystyle\sum_{s'} P(s' \mid s, a)\, V^\star(s')\)
Substitute ② into ①:
\(V^\star(s) = R(s) + \gamma\, \displaystyle\max_a \sum_{s'} P(s' \mid s, a)\, V^\star(s')\)
\(|S|\) coupled equations; \(V^\star\) is their unique solution. Turn it into an update rule → value iteration.
From \(s_{11}\), each action gives a different mixture (0.8 intended + 0.1 each side):
\(V^\star(s_{11}) = -0.04 + \gamma\, \max\{\)
\(0.8\,V^\star(s_{12}) + 0.1\,V^\star(s_{21}) + 0.1\,V^\star(s_{11})\), // right
\(0.9\,V^\star(s_{11}) + 0.1\,V^\star(s_{12})\), // up
\(0.9\,V^\star(s_{11}) + 0.1\,V^\star(s_{21})\), // left
\(0.8\,V^\star(s_{21}) + 0.1\,V^\star(s_{12}) + 0.1\,V^\star(s_{11})\) // down
\(\}\)
Q1. Can we solve this system of Bellman equations in polynomial time?
Treat the Bellman equation as an update rule. Let \(V_i(s)\) be the \(i^{\text{th}}\) estimate.
Guarantee: \(V_i \to V^\star\) as \(i \to \infty\).
Setup: \(\gamma = 1\), \(R(s) = -0.04\) for non-goal states. Terminals: \(V(s_{24}) = -1, V(s_{34}) = +1\). Init \(V_0(s) = 0\) for all non-terminals.
After one Bellman update, only the cells adjacent to a goal will see something different from \(-0.04\).
\(s_{23}\) is just left of the \(-1\) terminal. Update from \(V_0\) (all \(0\) except the \(\pm1\) terminals), \(\gamma=1\); each action is \(0.8\) intended + \(0.1\) each side (a wall bump stays put).
\(s_{33}\) sits just left of the \(+1\) terminal. Same update from \(V_0\), \(\gamma = 1\):
The \(+1\) reward has "leaked" one cell up; values propagate one cell per iteration.
Apply Bellman again, using \(V_1\) as input. For each cell we show its best action — the one \(\max_a\) selects. Three cells worth a closer look:
Q4. \(V_2(s_{33}):\) best action Right gives \(0.8(+1) + 0.1(-0.04) + 0.1(0.76) = 0.872\). \(\Rightarrow -0.04 + 0.872 = \boxed{0.832}\).
Q5. \(V_2(s_{23}):\) best action Down gives \(0.8(0.76) + 0.1(-0.04) + 0.1(-1) = 0.504\). \(\Rightarrow -0.04 + 0.504 = \boxed{0.464}\).
Q6. \(V_2(s_{32}):\) best action Right gives \(0.8(0.76) + 0.1(-0.04) + 0.1(-0.04) = 0.6\). \(\Rightarrow -0.04 + 0.6 = \boxed{0.56}\).
Values propagate one cell per iteration. After enough sweeps they converge to the L13 \(V^\star\).
\(V^\pi(s)\) — the expected discounted return if you start in \(s\) and follow the fixed policy \(\pi\) forever.
With the action pinned to \(\pi(s)\), it satisfies one linear equation per state:
\(V^\pi(s) = R(s) + \gamma \displaystyle\sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')\)
\(V^\star(s) = \displaystyle\max_\pi V^\pi(s)\). Policy iteration repeatedly computes \(V^\pi\), then makes \(\pi\) greedy w.r.t. it.
Insight: if one action is clearly better, you don't need precise utility values — just enough to rank actions.
Terminates in finitely many steps: there are finitely many policies and each iteration strictly improves.
In PE the action is fixed by \(\pi\), so the system becomes linear in \(V\):
\(V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')\)
No \(\max\) — the action is fixed. \(|S|\) linear equations in \(|S|\) unknowns.
linear: easy to solve\(V^\star(s) = R(s) + \gamma\, \max_a \sum_{s'} P(s' \mid s, a)\, V^\star(s')\)
The \(\max\) makes it nonlinear — need an iterative algorithm (VI).
nonlinear: harderPolicy improvement reintroduces the \(\arg\max\): \(\pi_{i+1}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, V^{\pi_i}(s')\).
Two ways to solve the linear system \(V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')\):
Linear algebra: a single solve of an \(|S| \times |S|\) system.
cost: \(O(|S|^3)\)
Run \(m\) simplified Bellman updates (no \(\max\)):
\(V_{j+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V_j(s')\)
cost: \(O(m \cdot |S| \cdot b)\)
\(b\) = avg # of successor states (those with \(P(s'\mid s, \pi(s)) > 0\)). No sum over actions — \(\pi\) fixes the action.
In practice, a small \(m\) is usually enough — PI converges in far fewer outer iterations than VI.
Same 3×4 grid (\(\gamma = 1\), \(R = -0.04\)). Start from a deliberately mediocre policy — every state goes right.
Policy evaluation = solve one linear equation per state. For \(s_{11}\) under \(\pi_0\) (go right):
\(V^{\pi_0}(s_{11}) = -0.04 + 0.8\,V^{\pi_0}(s_{12}) + 0.1\,V^{\pi_0}(s_{11}) + 0.1\,V^{\pi_0}(s_{21})\)
9 such equations (one per non-terminal state) → hand the linear system to a solver.
\(\pi_0\) drove the top half straight into the \(-1\) (values \(\approx -1.4\)). Greedy improvement turns those states down/around toward the \(+1\).
Values jump up; the top row now steers away from the \(-1\). Only the corner near \(-1\) still needs adjusting.
A couple more rounds settle the top row, and the policy stops changing — that is \(\pi^\star\).
Exactly the \(V^\star\) and optimal policy from L13 — value iteration and policy iteration agree.
Today we knew the dynamics \(P\) and rewards \(R\) of the MDP. What if we don't?
L15: learn the policy by interacting with the environment.