Explain the exploration vs exploitation trade-off.
Trace + implement active ADP.
Trace + implement Q-learning.
Trace + implement SARSA.
Tell on-policy from off-policy learning.
The RL setting: no \(P\), no \(R\) up front
An MDP whose transition \(P\) and reward \(R\) are unknown. The agent sees state \(s\) and reward \(r\), picks an action \(a\), and observes \(s'\). Goal: maximize discounted return.
Three new challenges:
Credit assignment: which past action caused this reward?
Long-term impact: how will this action affect future returns?
Explore vs exploit: try new actions or stick with the best known?
Passive learning: just evaluate a fixed policy
Fix a policy \(\pi\). Goal: learn \(V^\pi(s)\) for every state \(s\).
Adaptive Dynamic Programming (ADP)
Learn the transition \(P(s' \mid s, a)\) and reward \(R(s)\) from observed experience.
Q1. In a 3x4 grid, suppose \(N(s_{23}, \text{down}) = 23\) and \(N(s_{23}, \text{down}, s_{24}) = 2\). The agent tries to move down but slips and lands in \(s_{24}\). What is the updated \(P(s_{24} \mid s_{23}, \text{down})\)?
\(2/23 \approx 0.087\)
\(3/23 \approx 0.130\)
\(3/24 = 0.125\)
\(2/24 \approx 0.083\)
C — 0.125. First increment the counts: \(N(s_{23}, \text{down}) \to 24\) and \(N(s_{23}, \text{down}, s_{24}) \to 3\). Then \(P = 3/24 = 0.125\).
Active learning: explore vs exploit
Now the agent chooses its actions. Two competing pressures:
Exploit
Take the action that maximizes the current \(V(s)\) (or \(Q(s, a)\)). Reap known rewards.
Explore
Try a different action. Gather data to learn a better model / better Q-values.
A purely greedy agent seldom converges to the optimal policy — the learned model is biased toward states the agent has already preferred.
Three exploration strategies
\(\varepsilon\)-greedy
Random with prob \(\varepsilon\); greedy with prob \(1 - \varepsilon\). Decay \(\varepsilon\) over time.
Simple, but treats all sub-optimal actions equally.
Softmax / Boltzmann
\(P(a) = \dfrac{e^{Q(s, a) / T}}{\sum_{a'} e^{Q(s, a') / T}}\)
Low \(T\): peaky / greedy. High \(T\): uniform / exploratory.
Optimistic init
\(f(u, n) = \begin{cases} R^+ & n < N_e \\ u & \text{else}\end{cases}\)
"Shiny new toy" bonus until each action has been tried \(N_e\) times.
SARSA is conservative (learns what really happens); Q-learning is aggressive (learns the optimal policy).
ADP vs Q-learning vs SARSA
ADP
Q-learning
SARSA
Learns transition \(P\)?
yes (model-based)
no (model-free)
no (model-free)
On / off policy
n/a (eval-only)
off-policy
on-policy
Computation per experience
heavy (solve Bellman)
cheap (one TD update)
cheap (one TD update)
Convergence speed
fast (few experiences)
slower, higher variance
slower, higher variance
ADP is sample-efficient but expensive per step; TD methods (Q, SARSA) flip that trade-off.
When to use what (Q2)
Q2. For a high-stakes online learning problem (for example, a self-driving car still in training), which algorithm is safer?
Q-learning
SARSA
I don't know
B — SARSA. SARSA's target uses the action that will actually be taken while exploring, so it learns to avoid catastrophic outcomes along the exploration path. Q-learning's target uses the greedy action, ignoring the cost of exploration mistakes.
Putting the labels together (Q3)
Q3. Which statement is true?
Q-learning is model-free + off-policy; SARSA is model-based + on-policy.
Q-learning is model-based + off-policy; SARSA is model-free + on-policy.
Q-learning is model-free + on-policy; SARSA is model-based + off-policy.
Q-learning is model-based + on-policy; SARSA is model-free + off-policy.
None of the above.
E — None of the above. Both Q-learning and SARSA are model-free. Q-learning is off-policy (target uses \(\max_{a'} Q\)); SARSA is on-policy (target uses the action actually played).
Learning goals (recap)
✓ Trace + implement passive ADP.
✓ Explain the exploration vs exploitation trade-off.
✓ Trace + implement active ADP.
✓ Trace + implement Q-learning.
✓ Trace + implement SARSA.
✓ Tell on-policy from off-policy.
Search›Uncertainty›Decisions›Learning
Next module: Machine Learning and Deep Learning
RL was learning from rewards. The last module turns to learning from data.