Course page
CS 486/686
Neural Networks III

Lecture 21

Optimization for deep learning · slides adapted from CMSC 35246 (Shubhendu & Risi, U. Chicago)

Search Uncertainty Decisions Learning

Learning goals

  • Stochastic gradient descent: noisy steps, learning-rate schedules, convergence.
  • Momentum and its Nesterov look-ahead variant.
  • Adaptive learning rates: AdaGrad and RMSProp.
  • Adam: adaptive moments combining the two ideas.

Roadmap: first-order optimizers

Backprop gives us \(\nabla_\theta L\). The optimizer decides what to do with it.

GD SGD + Momentum + Nesterov AdaGrad RMSProp Adam
  • SGD: trade accuracy of gradient for speed of updates.
  • Momentum: add inertia — escape ravines, damp oscillations.
  • Adaptive: give each parameter its own learning rate.
  • Adam: combine momentum + adaptive in one optimizer.

SGD recap

Batch GD:

\( \hat g \leftarrow \tfrac{1}{N}\, \nabla_\theta \sum_i L(f(x^{(i)}; \theta), y^{(i)}) \)

SGD (one example per step):

\( \hat g \leftarrow \nabla_\theta L(f(x^{(i)}; \theta), y^{(i)}) \)

In both: \( \theta \leftarrow \theta - \varepsilon \hat g \).

Decay the learning rate linearly until iteration \(\tau\):

\( \varepsilon_k = (1-\alpha)\,\varepsilon_0 + \alpha\,\varepsilon_\tau, \quad \alpha = k/\tau \)

Sufficient for convergence:

\( \displaystyle\sum_k \varepsilon_k = \infty, \quad \sum_k \varepsilon_k^2 < \infty \)

Mini-batch SGD (e.g. 32, 64): cheaper than full-batch, less noisy than per-example, scales to huge datasets.

What goes wrong with plain SGD

L \(\theta\) local min global min oscillating + stuck

Local minima. Non-convex losses have many. SGD shrinks gradients to zero there and stalls.

Noisy oscillations. Each step uses only one example — the descent path zig-zags, slowing convergence.

Both issues motivate the next big idea: momentum.

Momentum: add inertia

Introduce a velocity vector \(\bm v\) — an exponentially weighted average of past gradients:

\[ \bm v \leftarrow \alpha\, \bm v - \varepsilon\, \nabla_\theta L(\theta), \qquad \theta \leftarrow \theta + \bm v \]
  • \(\alpha \in [0, 1)\) controls how much past gradients persist. Common: \(\alpha = 0.9\).
  • Recent gradients dominate — older ones decay geometrically.
  • Intuition: a ball rolling down a hill keeps moving in the average direction, smoothing oscillations and powering through small bumps.

Worked example

Minimize \( y = 0.3x^4 - 0.1x^3 - 2x^2 - 0.8x \), with gradient \( g = 1.2x^3 - 0.3x^2 - 4x - 0.8 \). Start at \(x_0 = -2\).

SGD (no momentum)

iter\(g_i\)\(\theta_i\)
1-18.2-1.885
2-2.36-1.76
3-1.2-1.70
4-0.78-1.66
.........
990.0002-1.586

stuck in local min

With momentum (\(\alpha=0.9\))

iter\(g_i\)\(v_i\)\(\theta_i\)
1-18.20-1.885
2-2.36-15.1-1.12
31.61-9.0-0.67
41.39-9.2-0.21
............
990.00020.00032.042

reaches global min

Velocity carries the iterate over the local-min barrier near \(x \approx -1.6\).

How big is one momentum step?

Plain SGD takes a step of size \(\varepsilon\,\lVert g\rVert\). With momentum, recent gradients compound:

\[ \text{step} \approx \varepsilon\lVert g_1\rVert + \alpha\,\varepsilon\lVert g_2\rVert + \alpha^2\,\varepsilon\lVert g_3\rVert + \cdots \approx \dfrac{\varepsilon\,\lVert \hat g\rVert}{1 - \alpha} \]
Concretely: \(\alpha = 0.9 \Rightarrow\) effective step size is up to 10\(\times\) the per-step SGD step in a consistent direction.

In oscillating directions, opposite gradients cancel in \(\bm v\) — momentum naturally damps noise.

Nesterov momentum: look ahead first

Idea: step in the direction of accumulated momentum, then evaluate the gradient there as a correction.

\(\theta\) accumulated \(\alpha\bm v\) correction \(-\varepsilon\nabla L(\theta+\alpha\bm v)\) new step: \(\bm v_{\text{new}}\)

Standard momentum:

\( \bm v \leftarrow \alpha \bm v - \varepsilon\, \nabla_\theta L(\theta) \)

Nesterov momentum:

\( \bm v \leftarrow \alpha \bm v - \varepsilon\, \nabla_\theta L({\color{#dc2626}\theta + \alpha \bm v}) \)

\( \theta \leftarrow \theta + \bm v \)

Gradient evaluated at the look-ahead point gives a better local correction.

Why one learning rate isn't enough

Features differ in scale and frequency. A single \(\varepsilon\) is too small for flat directions, too large for steep ones.

isotropic (easy)
anisotropic (zig-zag)

Give each parameter its own learning rate, scaled by its gradient history.

AdaGrad: accumulate squared gradients

Per-parameter learning rate, scaled by the running sum of squared gradients.

AdaGrad
Require: global LR \(\varepsilon\), small \(\delta\); init \(\bm r = 0\).
  1. Sample example \((x^{(i)}, y^{(i)})\); compute \(\hat g \leftarrow \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})\).
  2. Accumulate: \(\bm r \leftarrow \bm r + \hat g \odot \hat g\)
  3. Update: \(\Delta\theta \leftarrow -\dfrac{\varepsilon}{\delta + \sqrt{\bm r}} \odot \hat g\)
  4. Apply: \(\theta \leftarrow \theta + \Delta\theta\)

Catch: \(\bm r\) only grows — the effective LR shrinks to zero, training stalls in non-convex settings.

RMSProp: forget old gradients

Fix AdaGrad's shrink by replacing the running sum with an exponentially decaying moving average.

RMSProp
Require: LR \(\varepsilon\), decay \(\rho\) (typically 0.9), small \(\delta\); init \(\bm r = 0\).
  1. Compute gradient: \(\hat g \leftarrow \nabla_\theta L(f(x^{(i)};\theta), y^{(i)})\).
  2. Decay+accumulate: \(\bm r \leftarrow \rho\, \bm r + (1-\rho)\, \hat g \odot \hat g\)
  3. Update: \(\Delta\theta \leftarrow -\dfrac{\varepsilon}{\delta + \sqrt{\bm r}} \odot \hat g\)
  4. Apply: \(\theta \leftarrow \theta + \Delta\theta\)

Related: AdaDelta goes one step further and removes the global LR \(\varepsilon\) by also tracking \(\mathbb{E}[\Delta\theta^2]\).

Adam: ADAptive Moments

Track first moment (mean of \(\hat g\)) like momentum, and second moment (mean of \(\hat g^2\)) like RMSProp.

Adam
Require: LR \(\varepsilon\), decay rates \(\rho_1, \rho_2\) (typically 0.9, 0.999), small \(\delta\); init \(\bm s = 0\), \(\bm r = 0\), \(t = 0\).
  1. Compute gradient \(\hat g\); increment \(t\).
  2. 1st moment: \(\bm s \leftarrow \rho_1\, \bm s + (1-\rho_1)\, \hat g\)
  3. 2nd moment: \(\bm r \leftarrow \rho_2\, \bm r + (1-\rho_2)\, \hat g \odot \hat g\)
  4. Bias correction: \(\hat{\bm s} \leftarrow \dfrac{\bm s}{1 - \rho_1^t},\quad \hat{\bm r} \leftarrow \dfrac{\bm r}{1 - \rho_2^t}\)
  5. Update: \(\Delta\theta = -\varepsilon\, \dfrac{\hat{\bm s}}{\sqrt{\hat{\bm r}} + \delta}\)
  6. Apply: \(\theta \leftarrow \theta + \Delta\theta\)

Bias correction fixes the cold-start: early \(\bm s, \bm r\) are biased toward 0 because they start at 0.

Optimizer cheat sheet

Method Update direction Per-param LR? When to use
SGD\(-\varepsilon\, \hat g\)NoSimple losses; baseline.
+ Momentum\(\bm v = \alpha\bm v - \varepsilon\hat g\)NoNoisy gradients, ravines, local minima.
RMSProp\(-\dfrac{\varepsilon}{\sqrt{\bm r}+\delta} \hat g\)YesRNNs, non-stationary or sparse gradients.
Adam\(-\varepsilon\, \dfrac{\hat{\bm s}}{\sqrt{\hat{\bm r}}+\delta}\)YesDefault choice in deep learning today.

Tune the learning rate first; pick the optimizer second.

Revisiting learning goals

  • Stochastic gradient descent: noisy steps, learning-rate schedules, convergence.
  • Momentum and its Nesterov look-ahead variant.
  • Adaptive learning rates: AdaGrad and RMSProp.
  • Adam: adaptive moments combining the two ideas.
Search Uncertainty Decisions Learning Recap

Next: L22 — course recap & exam prep

  • A tour of everything we learned: search, CSPs, probabilistic reasoning, decision theory, MDPs, RL, ML.
  • How the pieces fit into modern AI systems.
  • Pointers to follow-up courses (CS480/680, CS485) and open research.
  • Final exam logistics: what to bring, format, and a notation cheat sheet for your study card.

Thanks for a great term — see you on Tuesday!