Course page PDF
CS 486/686
Neural Networks II: Training

Lecture 20

RN 21.1–21.2 · PM 7.5 · GBC 6.5, 7, 8

Search Uncertainty Decisions Learning

Learning goals

  • Write the cross-entropy loss and the gradient-descent update.
  • Describe back-propagation (forward + backward passes) and compute a weight's gradient.
  • Speed up training with momentum and Adam / AdamW.
  • Fight overfitting with regularization, dropout, and normalization.
  • Decide when to use a neural network vs a decision tree.

The training problem

A 2-layer feedforward network for spam classification — predict \([a^{(2)}_1, a^{(2)}_2]\) from two features (email length, sender trust).

x1 x2 z1(1) z2(1) z1(2) z2(2) inputs hidden outputs W(1) W(2)

Softmax outputs a probability over "spam" vs "ham":

  • Spam → one-hot target \([1, 0]\)
  • Ham → one-hot target \([0, 1]\)

We have paired data \((x_1, x_2, y)\). How do we set the weights \(W^{(1)}, W^{(2)}\)?

Loss function and gradient descent

For classification we use the cross-entropy loss from L16 (softmax output \(\hat{y}^{(i)}\), one-hot target \(y^{(i)}\)):

\[ L(W) = -\sum_{i=1}^{N} \sum_{k} y^{(i)}_k \, \log \hat{y}^{(i)}_k \]

(For regression, swap in squared error \(\tfrac12\sum_i\lVert\hat y^{(i)}-y^{(i)}\rVert^2\).)

Gradient descent: repeat for many iterations:

  1. Compute the gradient \(\nabla_W L(W) = \big[ \partial L / \partial w \big]_{w \in W}\).
  2. Update every weight: \( w \leftarrow w - \eta\, \dfrac{\partial L}{\partial w} \).

\(\eta > 0\) is the learning rate (step size).

Why this works

Direction: steepest descent

\(\nabla_W L\) points in the direction of fastest increase of \(L\).

Moving in \(-\nabla_W L\) decreases \(L\) the fastest.

Step size: \(\eta\) trade-off

Too small \(\Rightarrow\) very slow learning.

Too large \(\Rightarrow\) overshoots the minimum, may diverge.

loss surface good eta eta too large

Three flavors of gradient descent

Batch GD

Gradient uses all \(N\) examples per step.

Stable, accurate gradient.

Slow for large \(N\).

Stochastic GD

One example per step.

Cheap; noise helps escape local minima.

Noisy updates.

Mini-batch GD

A small batch (e.g. 32, 64) per step.

Best of both worlds; GPU-friendly.

Standard choice today.

First, how do we get the gradient efficiently? Then how do we take smarter steps (momentum, Adam)?

Back-propagation: efficient gradients

Two passes per training example \((x, y)\):

→ Forward pass

Push inputs through layers, compute every \(a^{(\ell)}, z^{(\ell)}\), and the loss \(E\).

← Backward pass

Propagate \(\partial E\) back through the layers, compute every \(\partial E / \partial W^{(\ell)}\).

x1 x2 z1(1) z2(1) 1 2 forward (compute E) backward (compute ∂E/∂W)

Forward pass

Compute pre-activations \(a^{(\ell)}\) and activations \(z^{(\ell)} = g(a^{(\ell)})\) layer by layer.

\[ a^{(1)}_j = \sum_i x_i\, W^{(1)}_{i,j}, \qquad z^{(1)}_j = g(a^{(1)}_j) \]
\[ a^{(2)}_k = \sum_j z^{(1)}_j\, W^{(2)}_{j,k}, \qquad z^{(2)}_k = g(a^{(2)}_k) \]
\[ E = E\!\left(z^{(2)}, y\right) \]

Cache every \(a^{(\ell)}\) and \(z^{(\ell)}\) — we'll need them on the way back.

Backward pass

Walk back through the network, computing local errors \(\delta\) via the chain rule.

Layer 2 (output side):

\[ \dfrac{\partial E}{\partial W^{(2)}_{j,k}} = \delta^{(2)}_k\, z^{(1)}_j, \qquad \delta^{(2)}_k = \dfrac{\partial E}{\partial z^{(2)}_k}\, g'\!\bigl(a^{(2)}_k\bigr) \]

Layer 1 (propagate back):

\[ \dfrac{\partial E}{\partial W^{(1)}_{i,j}} = \delta^{(1)}_j\, x_i, \qquad \delta^{(1)}_j = \Bigl( \sum_k \delta^{(2)}_k\, W^{(2)}_{j,k} \Bigr)\, g'\!\bigl(a^{(1)}_j\bigr) \]

Pattern:   gradient = local error \(\times\) input to that weight.

The recursive structure of \(\delta\)

For unit \(j\) in layer \(\ell\), define \(\delta^{(\ell)}_j = \dfrac{\partial E}{\partial a^{(\ell)}_j}\). Then:

\[ \delta^{(\ell)}_j = \begin{cases} \dfrac{\partial E}{\partial z^{(\ell)}_j}\, g'(a^{(\ell)}_j), & \text{output unit (base)} \\[0.8em] \Bigg( \displaystyle \sum_k \delta^{(\ell+1)}_k\, W^{(\ell+1)}_{j,k} \Bigg)\, g'(a^{(\ell)}_j), & \text{hidden unit (recursive)} \end{cases} \]
j (layer \(\ell\)) \(\delta^{(\ell)}_j\) W(\(\ell\)+1)j,1 W(\(\ell\)+1)j,2 k=1 k=2 \(\delta^{(\ell+1)}_1\) \(\delta^{(\ell+1)}_2\) downstream layer \(\ell\!+\!1\)

\(\delta\) for a hidden unit = weighted sum of downstream \(\delta\)s, modulated by \(g'\).

Backprop in matrix form

Stack layer activations into vectors. Let \(\delta_\ell = \partial E / \partial z^{(\ell)}\) and \(W_\ell\) be the weight matrix.

Algorithm.
  1. Initialize weights \(W_\ell\) for every layer.
  2. Forward: push \(x\) through, cache \(z^{(1)}, z^{(2)}, \ldots\)
  3. Output \(\delta\): set \(\delta_n = \partial E / \partial z^{(n)}\).
  4. Backward sweep, for \(\ell = n, n\!-\!1, \ldots, 1\):
    \(\delta_{\ell-1} = \delta_\ell \cdot \dfrac{\partial g(x^{(\ell)})}{\partial x^{(\ell)}} \cdot W_\ell\)
    \(\dfrac{\partial E}{\partial W_\ell} = \delta_\ell \cdot \dfrac{\partial g(x^{(\ell)})}{\partial x^{(\ell)}} \cdot z^{(\ell-1)}\)
  5. Plug \(\partial E / \partial W_\ell\) into gradient descent.

The sigmoid derivative trick

For \(g(x) = \dfrac{1}{1 + e^{-x}}\):

\[ g'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = g(x)\,\big(1 - g(x)\big) \]
Why it matters: during the forward pass we already compute \(g(a)\). Reuse it — no need to redo the exponential during the backward pass.

Similar shortcuts exist for tanh (\(1 - g^2\)) and ReLU (0 or 1) — ReLU's constant gradient is a big reason it beats sigmoid in deep nets.

Beyond plain SGD: momentum and Adam

Plain SGD is noisy and slow in ravines. Two ideas fix it:

Momentum

Accumulate a velocity (an EWMA of past gradients) so consistent directions build speed:

\(v \leftarrow \beta v - \eta\,\nabla_W L, \quad W \leftarrow W + v\)

Adam

Per-parameter step sizes: divide by a running estimate of each gradient's magnitude, plus momentum. Robust default:

\(W \leftarrow W - \eta\,\dfrac{\hat{m}}{\sqrt{\hat{v}}+\epsilon}\)

AdamW (Adam + decoupled weight decay) is the standard optimizer for transformers/LLMs today.  [optional deep-dive deck]

Making training generalize & stay stable

Regularization (fight overfitting)

  • Weight decay: add \(\lambda\lVert W\rVert^2\) to the loss (L16).
  • Dropout: randomly zero some units each step so the net can't rely on any one.
  • Early stopping: halt when validation loss turns up.

Normalization & schedules (stay stable)

  • Batch / layer norm: re-center & scale activations so gradients stay well-behaved.
  • LR warmup + cosine decay: ramp \(\eta\) up, then anneal it down.
  • Gradient clipping: cap the gradient norm to avoid blow-ups.

These are the everyday knobs behind training any modern deep network.

When to use (or not use) a neural network

Reach for an NN when
  • High-dimensional, real-valued, or noisy inputs.
  • Target function form is unknown (no good hand-crafted model).
  • Interpretability is not a priority.
  • Plenty of training data is available.
Avoid an NN when
  • Architecture is hard to choose (layers, units, activations).
  • Weights need to be inspected and explained.
  • Data is tabular and small — overfitting risk is high.

Neural network vs decision tree

Neural network Decision tree
Data typeImages, audio, textTabular data (gradient boosting usually wins here)
Data sizeNeeds a lot; easy to overfit small dataWorks with very little data
Target functionArbitrary functionsNested if/else rules
ArchitectureLayers, units, activations, init, \(\eta\) all criticalA few hyperparameters (depth, pruning)
InterpretabilityBlack boxEasy to explain to humans
Train / inferenceSlow to train and runFast

Revisiting learning goals

  • Write the cross-entropy loss and the gradient-descent update.
  • Describe back-propagation and compute a weight's gradient.
  • Speed up training with momentum and Adam / AdamW.
  • Fight overfitting with regularization, dropout, and normalization.
  • Decide when to use a neural network vs a decision tree.

Next: Transformers & Attention

We can now build and train deep nets. But how do we handle sequences — language, code, audio — at scale?

  • Why RNNs struggle with long sequences.
  • Self-attention: the idea behind every modern LLM.
  • The transformer block, end to end.

L21: Transformers & Attention.