CS 486/686
Neural Networks III
Lecture 21
Optimization for deep learning · slides adapted from CMSC 35246 (Shubhendu & Risi, U. Chicago)
Search
›
Uncertainty
›
Decisions
›
Learning
Learning goals
- Stochastic gradient descent: noisy steps, learning-rate schedules, convergence.
- Momentum and its Nesterov look-ahead variant.
- Adaptive learning rates: AdaGrad and RMSProp.
- Adam: adaptive moments combining the two ideas.
Roadmap: first-order optimizers
Backprop gives us \(\nabla_\theta L\). The optimizer decides what to do with it.
GD→
SGD→
+ Momentum→
+ Nesterov
→
AdaGrad→
RMSProp→
Adam
- SGD: trade accuracy of gradient for speed of updates.
- Momentum: add inertia — escape ravines, damp oscillations.
- Adaptive: give each parameter its own learning rate.
- Adam: combine momentum + adaptive in one optimizer.
SGD recap
Batch GD:
\( \hat g \leftarrow \tfrac{1}{N}\, \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)}) \)
SGD (one example per step):
\( \hat g \leftarrow \nabla_\theta L(f(x^{(i)}; \theta), y^{(i)}) \)
In both: \( \theta \leftarrow \theta - \varepsilon \hat g \).
Decay the learning rate linearly until iteration \(\tau\):
\( \varepsilon_k = (1-\alpha)\,\varepsilon_0 + \alpha\,\varepsilon_\tau, \quad \alpha = k/\tau \)
Sufficient for convergence:
\( \displaystyle\sum_k \varepsilon_k = \infty, \quad \sum_k \varepsilon_k^2 < \infty \)
Mini-batch SGD (e.g. 32, 64): cheaper than full-batch, less noisy than per-example, scales to huge datasets.
What goes wrong with plain SGD
Local minima. Non-convex losses have many. SGD shrinks gradients to zero there and stalls.
Noisy oscillations. Each step uses only one example — the descent path zig-zags, slowing convergence.
Both issues motivate the next big idea: momentum.
Momentum: add inertia
Introduce a velocity vector \(\bm v\) — an exponentially weighted average of past gradients:
\[ \bm v \leftarrow \alpha\, \bm v - \varepsilon\, \nabla_\theta L(\theta), \qquad \theta \leftarrow \theta + \bm v \]
- \(\alpha \in [0, 1)\) controls how much past gradients persist. Common: \(\alpha = 0.9\).
- Recent gradients dominate — older ones decay geometrically.
- Intuition: a ball rolling down a hill keeps moving in the average direction, smoothing oscillations and powering through small bumps.
Worked example
Minimize \( y = 0.3x^4 - 0.1x^3 - 2x^2 - 0.8x \), with gradient \( g = 1.2x^3 - 0.3x^2 - 4x - 0.8 \). Start at \(x_0 = -2\).
SGD (no momentum)
| iter | \(g_i\) | \(\theta_i\) |
| 1 | -18.2 | -1.885 |
| 2 | -2.36 | -1.76 |
| 3 | -1.2 | -1.70 |
| 4 | -0.78 | -1.66 |
| ... | ... | ... |
| 99 | 0.0002 | -1.586 |
stuck in local min
With momentum (\(\alpha=0.9\))
| iter | \(g_i\) | \(v_i\) | \(\theta_i\) |
| 1 | -18.2 | 0 | -1.885 |
| 2 | -2.36 | -15.1 | -1.12 |
| 3 | 1.61 | -9.0 | -0.67 |
| 4 | 1.39 | -9.2 | -0.21 |
| ... | ... | ... | ... |
| 99 | 0.0002 | 0.0003 | 2.042 |
reaches global min
Velocity carries the iterate over the local-min barrier near \(x \approx -1.6\).
How big is one momentum step?
Plain SGD takes a step of size \(\varepsilon\,\lVert g\rVert\). With momentum, recent gradients compound:
\[ \text{step} \approx \varepsilon\lVert g_1\rVert + \alpha\,\varepsilon\lVert g_2\rVert + \alpha^2\,\varepsilon\lVert g_3\rVert + \cdots \approx \dfrac{\varepsilon\,\lVert \hat g\rVert}{1 - \alpha} \]
Concretely: \(\alpha = 0.9 \Rightarrow\) effective step size is up to 10\(\times\) the per-step SGD step in a consistent direction.
In oscillating directions, opposite gradients cancel in \(\bm v\) — momentum naturally damps noise.
Nesterov momentum: look ahead first
Idea: step in the direction of accumulated momentum, then evaluate the gradient there as a correction.
Standard momentum:
\( \bm v \leftarrow \alpha \bm v - \varepsilon\, \nabla_\theta L(\theta) \)
Nesterov momentum:
\( \bm v \leftarrow \alpha \bm v - \varepsilon\, \nabla_\theta L({\color{#dc2626}\theta + \alpha \bm v}) \)
\( \theta \leftarrow \theta + \bm v \)
Gradient evaluated at the look-ahead point gives a better local correction.
Why one learning rate isn't enough
Features differ in scale and frequency. A single \(\varepsilon\) is too small for flat directions, too large for steep ones.
Give each parameter its own learning rate, scaled by its gradient history.
AdaGrad: accumulate squared gradients
Per-parameter learning rate, scaled by the running sum of squared gradients.
AdaGrad
Require: global LR \(\varepsilon\), small \(\delta\); init \(\bm r = 0\).
- Sample example \((x^{(i)}, y^{(i)})\); compute \(\hat g \leftarrow \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})\).
- Accumulate: \(\bm r \leftarrow \bm r + \hat g \odot \hat g\)
- Update: \(\Delta\theta \leftarrow -\dfrac{\varepsilon}{\delta + \sqrt{\bm r}} \odot \hat g\)
- Apply: \(\theta \leftarrow \theta + \Delta\theta\)
Catch: \(\bm r\) only grows — the effective LR shrinks to zero, training stalls in non-convex settings.
RMSProp: forget old gradients
Fix AdaGrad's shrink by replacing the running sum with an exponentially decaying moving average.
RMSProp
Require: LR \(\varepsilon\), decay \(\rho\) (typically 0.9), small \(\delta\); init \(\bm r = 0\).
- Compute gradient: \(\hat g \leftarrow \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})\).
- Decay+accumulate: \(\bm r \leftarrow \rho\, \bm r + (1-\rho)\, \hat g \odot \hat g\)
- Update: \(\Delta\theta \leftarrow -\dfrac{\varepsilon}{\delta + \sqrt{\bm r}} \odot \hat g\)
- Apply: \(\theta \leftarrow \theta + \Delta\theta\)
Related: AdaDelta goes one step further and removes the global LR \(\varepsilon\) by also tracking \(\mathbb{E}[\Delta\theta^2]\).
Adam: ADAptive Moments
Track first moment (mean of \(\hat g\)) like momentum, and second moment (mean of \(\hat g^2\)) like RMSProp.
Adam
Require: LR \(\varepsilon\), decay rates \(\rho_1, \rho_2\) (typically 0.9, 0.999), small \(\delta\); init \(\bm s = 0\), \(\bm r = 0\), \(t = 0\).
- Compute gradient \(\hat g\); increment \(t\).
- 1st moment: \(\bm s \leftarrow \rho_1\, \bm s + (1-\rho_1)\, \hat g\)
- 2nd moment: \(\bm r \leftarrow \rho_2\, \bm r + (1-\rho_2)\, \hat g \odot \hat g\)
- Bias correction: \(\hat{\bm s} \leftarrow \dfrac{\bm s}{1 - \rho_1^t},\quad \hat{\bm r} \leftarrow \dfrac{\bm r}{1 - \rho_2^t}\)
- Update: \(\Delta\theta = -\varepsilon\, \dfrac{\hat{\bm s}}{\sqrt{\hat{\bm r}} + \delta}\)
- Apply: \(\theta \leftarrow \theta + \Delta\theta\)
Bias correction fixes the cold-start: early \(\bm s, \bm r\) are biased toward 0 because they start at 0.
Optimizer cheat sheet
| Method |
Update direction |
Per-param LR? |
When to use |
| SGD | \(-\varepsilon\, \hat g\) | No | Simple losses; baseline. |
| + Momentum | \(\bm v = \alpha\bm v - \varepsilon\hat g\) | No | Noisy gradients, ravines, local minima. |
| RMSProp | \(-\dfrac{\varepsilon}{\sqrt{\bm r}+\delta} \hat g\) | Yes | RNNs, non-stationary or sparse gradients. |
| Adam | \(-\varepsilon\, \dfrac{\hat{\bm s}}{\sqrt{\hat{\bm r}}+\delta}\) | Yes | Default choice in deep learning today. |
Tune the learning rate first; pick the optimizer second.
Revisiting learning goals
- Stochastic gradient descent: noisy steps, learning-rate schedules, convergence.
- Momentum and its Nesterov look-ahead variant.
- Adaptive learning rates: AdaGrad and RMSProp.
- Adam: adaptive moments combining the two ideas.
Search
›
Uncertainty
›
Decisions
›
Learning
›
Recap
Next: L22 — course recap & exam prep
- A tour of everything we learned: search, CSPs, probabilistic reasoning, decision theory, MDPs, RL, ML.
- How the pieces fit into modern AI systems.
- Pointers to follow-up courses (CS480/680, CS485) and open research.
- Final exam logistics: what to bring, format, and a notation cheat sheet for your study card.
Thanks for a great term — see you on Tuesday!