Lecture 14
RN 17.2 · PM 9.5.2–9.5.3
From L13, the optimal policy is one greedy step from \(V^\star\):
\(\pi^\star(s) = \arg\max_a Q^\star(s, a), \quad Q^\star(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\star(s').\)
How do we actually compute \(V^\star\)?
Two algorithms: value iteration (iterate values) and policy iteration (iterate policies).
"Value" = expected discounted future reward.
Return (discounted future reward from time \(t\)):
\(G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = R_{t+1} + \gamma G_{t+1}.\)
Value functions are expected returns:
\(V^\pi(s) = \mathbb{E}_\pi\bigl[G_t \mid s_t = s\bigr], \quad Q^\pi(s, a) = \mathbb{E}_\pi\bigl[G_t \mid s_t = s, a_t = a\bigr].\)
The two are related (each is a sum of the other):
\(V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)\)
\(Q^\pi(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')\)
A picture of the V/Q recursion: policy branches on actions; environment branches on next states.
For the optimal policy, choose the best action at every state:
\(V^\star(s) = \max_a Q^\star(s, a), \quad Q^\star(s, a) = R(s) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\star(s').\)
Substitute one into the other:
\(V^\star(s) = R(s) + \gamma\, \max_a \sum_{s'} P(s' \mid s, a)\, V^\star(s').\)
\(V^\star\) is the unique solution to this system of \(|S|\) equations.
From \(s_{11}\), each action gives a different mixture (0.8 intended + 0.1 each side):
\(V^\star(s_{11}) = -0.04 + \gamma\, \max\{\)
\(0.8\,V^\star(s_{12}) + 0.1\,V^\star(s_{21}) + 0.1\,V^\star(s_{11})\), // right
\(0.9\,V^\star(s_{11}) + 0.1\,V^\star(s_{12})\), // up
\(0.9\,V^\star(s_{11}) + 0.1\,V^\star(s_{21})\), // left
\(0.8\,V^\star(s_{21}) + 0.1\,V^\star(s_{12}) + 0.1\,V^\star(s_{11})\) // down
\(\}\)
Q1. Can we solve this system of Bellman equations in polynomial time?
Treat the Bellman equation as an update rule. Let \(V_i(s)\) be the \(i^{\text{th}}\) estimate.
Guarantee: \(V_i \to V^\star\) as \(i \to \infty\).
Setup: \(\gamma = 1\), \(R(s) = -0.04\) for non-goal states. Terminals: \(V(s_{24}) = -1, V(s_{34}) = +1\). Init \(V_0(s) = 0\) for all non-terminals.
After one Bellman update, only the cells adjacent to a goal will see something different from \(-0.04\).
Apply \(V_1(s) \leftarrow R(s) + \gamma \max_a \sum_{s'} P(s' \mid s, a) V_0(s')\) at every cell. Two cells worth a closer look:
Q2. \(V_1(s_{23})\) (yellow): all neighbors are \(0\); best action is Left (avoids \(-1\)). \(\Rightarrow -0.04 + 0 = \boxed{-0.04}\).
Q3. \(V_1(s_{33})\) (green): Right has \(0.8 \cdot 1 + 0.1 \cdot 0 + 0.1 \cdot 0 = 0.8\). \(\Rightarrow -0.04 + 0.8 = \boxed{0.76}\).
The +1 reward has "leaked" one cell up; the −1 hasn't propagated because we avoid it.
Apply Bellman again, using \(V_1\) as input. Three cells worth a closer look:
Q4. \(V_2(s_{33}):\) Right gives \(0.8(+1) + 0.1(-0.04) + 0.1(0.76) = 0.872\). \(\Rightarrow -0.04 + 0.872 = \boxed{0.832}\).
Q5. \(V_2(s_{23}):\) Down gives \(0.8(0.76) + 0.1(-0.04) + 0.1(-1) = 0.504\). \(\Rightarrow -0.04 + 0.504 = \boxed{0.464}\).
Q6. \(V_2(s_{32}):\) Right gives \(0.8(0.76) + 0.1(-0.04) + 0.1(-0.04) = 0.6\). \(\Rightarrow -0.04 + 0.6 = \boxed{0.56}\).
Values propagate one cell per iteration. After enough sweeps they converge to the L13 \(V^\star\).
Insight: if one action is clearly better, you don't need precise utility values — just enough to rank actions.
Terminates in finitely many steps: there are finitely many policies and each iteration strictly improves.
In PE the action is fixed by \(\pi\), so the system becomes linear in \(V\):
\(V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')\)
No \(\max\) — the action is fixed. \(|S|\) linear equations in \(|S|\) unknowns.
linear: easy to solve\(V^\star(s) = R(s) + \gamma\, \max_a \sum_{s'} P(s' \mid s, a)\, V^\star(s')\)
The \(\max\) makes it nonlinear — need an iterative algorithm (VI).
nonlinear: harderPolicy improvement reintroduces the \(\arg\max\): \(\pi_{i+1}(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\, V^{\pi_i}(s')\).
Two ways to solve the linear system \(V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')\):
Linear algebra: a single solve of an \(|S| \times |S|\) system.
cost: \(O(|S|^3)\)
Run \(m\) simplified Bellman updates (no \(\max\)):
\(V_{j+1}(s) \leftarrow R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V_j(s')\)
cost: \(O(m \cdot |S| \cdot |A_{\text{avg}}|)\)
In practice, a small \(m\) is usually enough — PI converges in far fewer outer iterations than VI.
Today we knew the dynamics \(P\) and rewards \(R\) of the MDP. What if we don't?
L15: learn the policy by interacting with the environment.