Lecture 15
RN 22.1–22.3 · PM 12.1, 12.5, 12.8
🎥 I'll be away at ICML the week of July 7. L17 (Tue Jul 7) and L18 (Thu Jul 9) will be pre-recorded videos — no in-person class those days. Watch on your own time (links posted before class).
📅 Deadlines are unchanged: Chat 8 + the CS686 project proposal are due Tue Jul 7; Chat 9 is out and Assignment 2 is due Thu Jul 9.
💬 Questions? Post on Piazza — the TAs are available all week.
An MDP whose transition \(P\) and reward \(R\) are unknown. The agent sees state \(s\) and reward \(r\), picks an action \(a\), and observes the next state \(s'\).
Goal: maximize discounted return — while learning the world as it goes.
Fix a policy \(\pi\); goal: learn \(V^\pi(s)\) for every state \(s\). The catch: we no longer know the reward \(R(s)\) or the transitions \(P(s' \mid s, a)\) — we only get to observe experiences.
\(V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s').\)
A model-based approach — it builds an explicit model of the world.
Same grid world as before (the agent moves in the intended direction, or slips to either side) — but now we don't know those slip probabilities, so we count outcomes to estimate \(P\).
\(\pi(s_{11}) = \text{down}\), \(\pi(s_{21}) = \text{right}\), \(\gamma = 0.9\).
Counts so far: \(N(s, a) = 5\), with \(N(s, a, s') = 3\) for the intended direction and \(1\) for each side.
New experience: \(\langle s_{11}, \text{down}, s_{21}, -0.04 \rangle\).
Increment counts: \(N(s_{11}, \text{down}){:}\, 5 {\to} 6\), \(\ N(s_{11}, \text{down}, s_{21}){:}\, 3 {\to} 4\).
Re-estimate \(P(\cdot \mid s_{11}, \text{down})\) as count fractions:
\(s_{21}{:}\, \tfrac{4}{6}, \quad \text{stay } s_{11}{:}\, \tfrac{1}{6}, \quad +1{:}\, \tfrac{1}{6}\)
(Going down can slip sideways: left bumps the wall → stay; right → the \(+1\) cell.)
Reminder — policy evaluation: \(V^\pi(s) = R(s) + \gamma \sum_{s'} P(s' \mid s, \pi(s))\, V^\pi(s')\) (\(R = -0.04\), \(\gamma = 0.9\)).
Estimate \(P(\cdot \mid s_{21}, \text{right})\) the same way (counts \(3\) intended, \(1\) each side, out of \(5\)): right \(\to -1\) is \(\tfrac{3}{5}\); up \(\to s_{11}\) is \(\tfrac{1}{5}\); down bumps the wall \(\to\) stay \(s_{21}\) is \(\tfrac{1}{5}\).
\(V(s_{11}) = -0.04 + 0.9\bigl[\tfrac{4}{6} V(s_{21}) + \tfrac{1}{6} V(s_{11}) + \tfrac{1}{6}(+1)\bigr]\)
\(V(s_{21}) = -0.04 + 0.9\bigl[\tfrac{3}{5}(-1) + \tfrac{1}{5} V(s_{11}) + \tfrac{1}{5} V(s_{21})\bigr]\)
Two linear equations, two unknowns — solve:
\(V(s_{11}) = -0.438, \quad V(s_{21}) = -0.803.\)
Q1. In a 3x4 grid, suppose \(N(s_{23}, \text{down}) = 23\) and \(N(s_{23}, \text{down}, s_{24}) = 2\). The agent tries to move down but slips and lands in \(s_{24}\). What is the updated \(P(s_{24} \mid s_{23}, \text{down})\)?
Now the agent chooses its actions. Two competing pressures:
Take the action that maximizes the current \(V(s)\) (or \(Q(s, a)\)). Reap known rewards.
Try a different action. Gather data to learn a better model / better Q-values.
A purely greedy agent seldom converges to the optimal policy — the learned model is biased toward states the agent has already preferred.
With probability \(\varepsilon\): take a random action (explore).
With probability \(1 - \varepsilon\): take the greedy action (exploit).
Start with a large \(\varepsilon\) and decay it over time — explore early, exploit once the values settle.
Simple and popular, but it treats every sub-optimal action as equally worth trying.
Pick actions in proportion to their value, tuned by a temperature \(T\):
\(P(a) = \dfrac{e^{Q(s, a) / T}}{\sum_{a'} e^{Q(s, a') / T}}\)
Unlike \(\varepsilon\)-greedy, better actions are explored more than clearly-bad ones. Cool \(T\) over time (like simulated annealing).
Assume the best about actions you haven't tried enough. Replace a value \(u\) with an exploration function \(f\):
\(f(u, n) = \begin{cases} R^+ & n < N_e \\ u & \text{otherwise}\end{cases}\)
\(R^+\) is an optimistic reward bound; \(N_e\) is how many tries before we trust the real estimate \(u\).
Every action looks great until tried \(N_e\) times, pulling the agent toward the unknown.
Active ADP = passive ADP, but now the agent chooses actions. The trick: act greedily on an optimistic value \(V^+\), seen through the exploration function \(f\) from before.
Pick the action whose successor value — viewed through \(f\) — looks best:
\(a = \arg\max_{a}\, f\Bigl(\, \underbrace{\textstyle\sum_{s'} P(s'\mid s,a)\, V^+(s')}_{\text{estimated value of } a},\ \ \underbrace{N(s,a)}_{\text{times tried}} \,\Bigr)\)
\(f\) inflates the value of under-tried actions (\(N(s,a) < N_e\)), so the agent is pulled toward exploring them.
Same model-based loop as passive ADP — only the action choice and the \(\max\) now pass through \(f\).
From L14, two connections between \(V^\star\) and \(Q^\star\):
① \(V^\star(s) = \displaystyle\max_a Q^\star(s, a)\)
② \(Q^\star(s, a) = R(s) + \gamma \displaystyle\sum_{s'} P(s' \mid s, a)\, V^\star(s')\)
Substitute ② into ① (replace each \(Q^\star\) with reward + future value):
\(V^\star(s) = R(s) + \gamma\, \displaystyle\max_a \sum_{s'} P(s' \mid s, a)\, V^\star(s')\)
Now do the reverse — substitute ① \(\bigl(V^\star(s') = \max_{a'} Q^\star(s', a')\bigr)\) into ②:
\(Q^\star(s, a) = R(s) + \gamma \displaystyle\sum_{s'} P(s' \mid s, a)\, \underline{V^\star(s')}\)
\(= R(s) + \gamma \displaystyle\sum_{s'} P(s' \mid s, a)\, \underline{\max_{a'} Q^\star(s', a')}\)
Given \(Q^\star\), the best action is \(\pi^\star(s) = \arg\max_a Q^\star(s, a)\) — no \(P\) needed. That is what lets \(Q\)-learning be model-free.
After one transition \(\langle s, a, r, s' \rangle\), pretend that transition was certain (\(P(s' \mid s, a) = 1\)). Then by Bellman, \(Q(s, a)\) should equal the target:
\(Q(s, a) \;\stackrel{?}{=}\; \underbrace{R(s) + \gamma\, \max_{a'} Q(s', a')}_{\text{target}}\)
\(\delta \;=\; \bigl(R(s) + \gamma\, \max_{a'} Q(s', a')\bigr) - Q(s, a)\)
How far the current estimate \(Q(s,a)\) is from the one-step target.
One sample is noisy, so don't jump all the way — nudge \(Q(s,a)\) toward the target by a fraction \(\alpha\) of the gap \(\delta\):
\(Q(s, a) \;\leftarrow\; Q(s, a) + \alpha\, \delta\)
\(=\; Q(s, a) + \alpha\bigl(\, \underbrace{R(s) + \gamma \max_{a'} Q(s', a')}_{\text{target}} - Q(s, a)\,\bigr)\)
\(\alpha \to 0\): barely move; \(\alpha = 1\): jump straight to the target. This update rule is Q-learning.
Model-free — it never estimates \(P\); converges to \(Q^\star\) if every \((s,a)\) is visited infinitely often.
The algorithm works for any fixed \(\alpha \in (0, 1]\), but a good schedule lets \(\alpha\) shrink as a state–action pair is visited more often:
\(\alpha(N) = \dfrac{10}{9 + N(s,a)}\)
3x4 grid. Experience: \(\langle s_{32}, \text{right}, s_{33}, -0.04 \rangle\). Take \(\gamma = 1, \alpha = 0.1\).
| \(s_{32}\) | \(s_{33}\) | |
|---|---|---|
| up | 0.55 | 0.4 |
| right | 0.7 | 0.9 |
| down | 0.5 | 0.8 |
| left | 0.4 | 0.5 |
Yellow cell = the one we update; green = the max-value action in \(s_{33}\).
Observe a 5-tuple \(\langle s, a, r, s', a' \rangle\) where \(a'\) is the action actually taken in \(s'\) (per the current policy):
\(Q(s, a) \leftarrow Q(s, a) + \alpha \bigl(r + \gamma\, \max_{a'} Q(s', a') - Q(s, a)\bigr)\)
Target uses the greedy next action. Learns about the optimal policy regardless of what's actually played.
\(Q(s, a) \leftarrow Q(s, a) + \alpha \bigl(r + \gamma\, Q(s', a') - Q(s, a)\bigr)\)
Target uses the actually-played next action \(a'\). Learns about the policy the agent is following (exploration and all).
SARSA = State, Action, Reward, State, Action.
Every step, the agent runs two policies at once:
The update uses \(\max_{a'} Q(s',a')\) — the value of the greedy next action — even when the agent is about to explore a different one.
Behavior \(\ne\) target \(\Rightarrow\) off-policy. Converges to \(Q^\star\) no matter how it explores.
The update uses \(Q(s',a')\) for the action actually chosen by \(\varepsilon\)-greedy — the very move it will play next.
Behavior \(=\) target \(\Rightarrow\) on-policy. Learns the value of the exploring policy itself.
Off-policy: learn about one policy (greedy) while behaving by another (exploratory). On-policy: learn about the policy you actually run.
Update \(Q(s_{13}, \text{down}) = 0.4\) after \(\langle s_{13}, \text{down}, s_{23} \rangle\), with \(r = -0.04,\ \gamma = 0.9,\ \alpha = 0.4\).
The \(\varepsilon\)-greedy step explores \(a' = \text{right}\) in \(s_{23}\) (into the bad cell).
| \(Q(s_{23}, \cdot)\) | up | right | down | left |
|---|---|---|---|---|
| 0.3 | −0.4 | 0.7 | 0.3 |
Yellow = the explored next action \(a' = \text{right}\) (SARSA's target); green = \(\max_{a'} Q\) (Q-learning's target).
The agent is about to play \(a' = \text{right}\) (\(-0.4\)), but Q-learning updates toward \(\max = 0.7\) (down) — not the move it will make. That mismatch is "off-policy."
SARSA is conservative (learns what really happens); Q-learning is aggressive (learns the optimal policy).
Q2. For a high-stakes online learning problem (for example, a self-driving car still in training), which algorithm is safer?
Q3. For each of Q-learning and SARSA: is it model-based or model-free? Is it on-policy or off-policy?
| ADP | Q-learning | SARSA | |
|---|---|---|---|
| Learns transition \(P\)? | yes (model-based) | no (model-free) | no (model-free) |
| On / off policy | n/a (eval-only) | off-policy | on-policy |
| Computation per experience | heavy (solve Bellman) | cheap (one TD update) | cheap (one TD update) |
| Convergence speed | fast (few experiences) | slower, higher variance | slower, higher variance |
ADP is sample-efficient but expensive per step; TD methods (Q, SARSA) flip that trade-off.
RL was learning from rewards. The last module turns to learning from data.