CS 486/686
Neural Networks II: Training
Lecture 20
RN 21.1–21.2 · PM 7.5 · GBC 6.5, 7, 8
Learning goals
- Write the cross-entropy loss and the gradient-descent update.
- Describe back-propagation (forward + backward passes) and compute a weight's gradient.
- Speed up training with momentum and Adam / AdamW.
- Fight overfitting with regularization, dropout, and normalization.
- Decide when to use a neural network vs a decision tree.
The training problem
A 2-layer feedforward network for spam classification: predict the output activations \([z^{(2)}_1, z^{(2)}_2]\) from two features (email length, sender trust).
Softmax outputs a probability over "spam" vs "ham":
- Spam → one-hot target \([1, 0]\)
- Ham → one-hot target \([0, 1]\)
We have paired data \((x_1, x_2, y)\). How do we set the weights \(W^{(1)}, W^{(2)}\)?
Loss function and gradient descent
For classification we use the cross-entropy loss from L16 (softmax output \(\hat{y}^{(i)}\), one-hot target \(y^{(i)}\)):
\[ L(W) = -\sum_{i=1}^{N} \sum_{k} y^{(i)}_k \, \log \hat{y}^{(i)}_k \]
(For regression, swap in squared error \(\tfrac12\sum_i\lVert\hat y^{(i)}-y^{(i)}\rVert^2\).)
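As a minimal sketch of the formula above, the loss can be computed in plain Python. The names `softmax`, `cross_entropy`, and `batch_logits`, the example numbers, and the small `eps` guard against `log(0)` are illustrative assumptions, not part of the lecture.

```python
import math

def softmax(logits):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y_hat, y, eps=1e-12):
    """Cross-entropy for one example: -sum_k y_k * log(y_hat_k)."""
    return -sum(t * math.log(p + eps) for p, t in zip(y_hat, y))

# Hypothetical raw scores for two emails; targets are one-hot [spam, ham].
batch_logits = [[2.0, 0.5], [0.1, 1.2]]
batch_targets = [[1, 0], [0, 1]]

# L(W) sums the per-example losses over the data set.
loss = sum(cross_entropy(softmax(z), y) for z, y in zip(batch_logits, batch_targets))
print(f"total cross-entropy: {loss:.4f}")
```

Because the target is one-hot, only the log-probability of the correct class contributes to each example's loss.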
Gradient descent: repeat for many iterations:
- Compute the gradient \(\nabla_W L(W) = \big[ \partial L / \partial w \big]_{w \in W}\).
- Update every weight: \( w \leftarrow w - \eta\, \dfrac{\partial L}{\partial w} \).
\(\eta > 0\) is the learning rate (step size).
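The update rule can be sketched on a toy problem. The quadratic loss and the names `gradient_descent` and `toy_grad` below are assumptions chosen for illustration; a real network would obtain the gradient from back-propagation instead.

```python
def gradient_descent(grad_fn, w, eta, steps):
    """Repeat w <- w - eta * dL/dw for every weight."""
    for _ in range(steps):
        grad = grad_fn(w)
        w = [wi - eta * gi for wi, gi in zip(w, grad)]
    return w

# Toy loss (an assumption for illustration): L(w) = (w0 - 3)^2 + (w1 + 1)^2
def toy_grad(w):
    return [2 * (w[0] - 3), 2 * (w[1] + 1)]

w_final = gradient_descent(toy_grad, w=[0.0, 0.0], eta=0.1, steps=100)
print(w_final)  # close to the minimizer [3, -1]
```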
Why this works
Direction: steepest descent
\(\nabla_W L\) points in the direction of fastest increase of \(L\).
A small step along \(-\nabla_W L\) therefore decreases \(L\) the fastest (to first order).
Step size: \(\eta\) trade-off
Too small \(\Rightarrow\) very slow learning.
Too large \(\Rightarrow\) overshoots the minimum, may diverge.
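A one-dimensional worked example makes the trade-off concrete. The toy loss \(L(w) = w^2\) is an assumption for illustration, not taken from the lecture.
\[ \frac{\partial L}{\partial w} = 2w \quad\Rightarrow\quad w_{t+1} = w_t - \eta \cdot 2 w_t = (1 - 2\eta)\, w_t \quad\Rightarrow\quad w_t = (1 - 2\eta)^t\, w_0 \]
The iterates converge to the minimum \(w = 0\) exactly when \(\lvert 1 - 2\eta \rvert < 1\), i.e. \(0 < \eta < 1\). With \(\eta = 0.01\) each step shrinks \(w\) by only 2%; with \(\eta = 0.5\) one step lands on the minimum; with \(\eta = 1.1\) the factor is \(-1.2\), so \(\lvert w \rvert\) grows by 20% per step and the iteration diverges.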
Three flavors of gradient descent
Batch GD
Gradient uses all \(N\) examples per step.
Stable, accurate gradient.
Slow for large \(N\).
Stochastic GD
One example per step.
Cheap per step; the gradient noise can help escape shallow local minima.
Noisy updates.
Mini-batch GD
A small batch (e.g. 32, 64) per step.
Balances gradient accuracy against cost per step; GPU-friendly.
Standard choice today.
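The three flavors differ only in how many examples feed each gradient step, which a small batching helper can illustrate. The function name `minibatches` and the stand-in data are hypothetical.

```python
import random

def minibatches(data, batch_size, rng):
    """Yield shuffled mini-batches that together cover the data set once (one epoch)."""
    indices = list(range(len(data)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [data[i] for i in indices[start:start + batch_size]]

# batch_size = len(data) recovers batch GD; batch_size = 1 recovers stochastic GD.
data = list(range(10))          # stand-in for 10 training examples
rng = random.Random(0)          # fixed seed so runs are repeatable
batches = list(minibatches(data, batch_size=4, rng=rng))
print([len(b) for b in batches])  # [4, 4, 2]
```

Reshuffling at the start of every epoch is a common choice, so that each epoch sees the examples grouped differently.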
First, how do we get the gradient efficiently? Then how do we take smarter steps (momentum, Adam)?
Back-propagation: efficient gradients
Two passes per training example \((x, y)\):
→ Forward pass
Push inputs through layers, compute every \(a^{(\ell)}, z^{(\ell)}\), and the per-example loss \(E\) (one term of \(L(W)\)).
← Backward pass
Propagate \(\partial E\) back through the layers, compute every \(\partial E / \partial W^{(\ell)}\).
Forward pass
Compute pre-activations \(a^{(\ell)}\) and activations \(z^{(\ell)} = g(a^{(\ell)})\) layer by layer.
\[ a^{(1)}_j = \sum_i x_i\, W^{(1)}_{i,j}, \qquad z^{(1)}_j = g(a^{(1)}_j) \]
\[ a^{(2)}_k = \sum_j z^{(1)}_j\, W^{(2)}_{j,k}, \qquad z^{(2)}_k = g(a^{(2)}_k) \]
\[ E = E\!\left(z^{(2)}, y\right) \]
Cache every \(a^{(\ell)}\) and \(z^{(\ell)}\) — we'll need them on the way back.
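The forward pass above might be sketched as follows. This assumes a sigmoid for \(g\) at both layers and squared error for \(E\), which keeps the example short; the spam network would use a softmax output with cross-entropy instead. The weights and the names `layer_forward`, `forward`, and `cache` are made up for illustration.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def layer_forward(inputs, W):
    """a_j = sum_i inputs[i] * W[i][j];  z_j = g(a_j)."""
    n_out = len(W[0])
    a = [sum(inputs[i] * W[i][j] for i in range(len(inputs))) for j in range(n_out)]
    z = [sigmoid(v) for v in a]
    return a, z

def forward(x, W1, W2, y):
    a1, z1 = layer_forward(x, W1)
    a2, z2 = layer_forward(z1, W2)
    E = 0.5 * sum((zk - yk) ** 2 for zk, yk in zip(z2, y))  # squared error (assumed loss)
    cache = {"x": x, "a1": a1, "z1": z1, "a2": a2, "z2": z2}  # kept for the backward pass
    return E, cache

# Hypothetical weights: 2 inputs -> 2 hidden units -> 2 outputs.
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.5, -0.5], [-0.3, 0.2]]
E, cache = forward([1.0, 2.0], W1, W2, y=[1, 0])
print(E, cache["z2"])
```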
Backward pass
Walk back through the network, computing local errors \(\delta\) via the chain rule.
Layer 2 (output side):
\[ \dfrac{\partial E}{\partial W^{(2)}_{j,k}} = \delta^{(2)}_k\, z^{(1)}_j, \qquad \delta^{(2)}_k = \dfrac{\partial E}{\partial z^{(2)}_k}\, g'\!\bigl(a^{(2)}_k\bigr) \]
Layer 1 (propagate back):
\[ \dfrac{\partial E}{\partial W^{(1)}_{i,j}} = \delta^{(1)}_j\, x_i, \qquad \delta^{(1)}_j = \Bigl( \sum_k \delta^{(2)}_k\, W^{(2)}_{j,k} \Bigr)\, g'\!\bigl(a^{(1)}_j\bigr) \]
Pattern: gradient = local error \(\times\) input to that weight.
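The two-layer formulas above can be checked against a finite-difference estimate. The sketch below again assumes sigmoid activations and squared error, so \(\partial E / \partial z^{(2)}_k = z^{(2)}_k - y_k\) and \(g' = z(1 - z)\); the weights and function names are illustrative only.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W1, W2):
    a1 = [sum(x[i] * W1[i][j] for i in range(len(x))) for j in range(len(W1[0]))]
    z1 = [sigmoid(v) for v in a1]
    a2 = [sum(z1[j] * W2[j][k] for j in range(len(z1))) for k in range(len(W2[0]))]
    z2 = [sigmoid(v) for v in a2]
    return z1, z2

def loss(x, y, W1, W2):
    _, z2 = forward(x, W1, W2)
    return 0.5 * sum((zk - yk) ** 2 for zk, yk in zip(z2, y))

def backward(x, y, W1, W2):
    z1, z2 = forward(x, W1, W2)
    # Output layer: delta2_k = dE/dz2_k * g'(a2_k), with g' = z * (1 - z) for sigmoid.
    delta2 = [(z2[k] - y[k]) * z2[k] * (1 - z2[k]) for k in range(len(z2))]
    # Hidden layer: weighted sum of downstream deltas, times g'(a1_j).
    delta1 = [sum(delta2[k] * W2[j][k] for k in range(len(delta2))) * z1[j] * (1 - z1[j])
              for j in range(len(z1))]
    # Gradient = local error x input to that weight.
    dW2 = [[delta2[k] * z1[j] for k in range(len(delta2))] for j in range(len(z1))]
    dW1 = [[delta1[j] * x[i] for j in range(len(delta1))] for i in range(len(x))]
    return dW1, dW2

x, y = [1.0, 2.0], [1, 0]
W1 = [[0.1, -0.2], [0.4, 0.3]]
W2 = [[0.5, -0.5], [-0.3, 0.2]]
dW1, dW2 = backward(x, y, W1, W2)

# Finite-difference check on one weight, W1[0][1].
h = 1e-6
W1_plus = [row[:] for row in W1]
W1_plus[0][1] += h
W1_minus = [row[:] for row in W1]
W1_minus[0][1] -= h
numeric = (loss(x, y, W1_plus, W2) - loss(x, y, W1_minus, W2)) / (2 * h)
gap = abs(numeric - dW1[0][1])
print(gap)  # should be tiny if the analytic gradient is right
```

A gradient check like this costs two forward passes per weight, so it is useful for debugging but far too slow for training; back-propagation gets every gradient from one forward and one backward pass.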
The recursive structure of \(\delta\)
For unit \(j\) in layer \(\ell\), define \(\delta^{(\ell)}_j = \dfrac{\partial E}{\partial a^{(\ell)}_j}\). Then:
\[
\delta^{(\ell)}_j =
\begin{cases}
\dfrac{\partial E}{\partial z^{(\ell)}_j}\, g'(a^{(\ell)}_j), & \text{output unit (base)} \\[0.8em]
\Bigg( \displaystyle \sum_k \delta^{(\ell+1)}_k\, W^{(\ell+1)}_{j,k} \Bigg)\, g'(a^{(\ell)}_j), & \text{hidden unit (recursive)}
\end{cases}
\]
\(\delta\) for a hidden unit = weighted sum of downstream \(\delta\)s, modulated by \(g'\).
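The hidden-unit case follows from the chain rule. Using \(a^{(\ell+1)}_k = \sum_j z^{(\ell)}_j\, W^{(\ell+1)}_{j,k}\) and \(z^{(\ell)}_j = g(a^{(\ell)}_j)\), the missing step can be written out as:
\[ \delta^{(\ell)}_j = \frac{\partial E}{\partial a^{(\ell)}_j} = \sum_k \frac{\partial E}{\partial a^{(\ell+1)}_k} \cdot \frac{\partial a^{(\ell+1)}_k}{\partial z^{(\ell)}_j} \cdot \frac{\partial z^{(\ell)}_j}{\partial a^{(\ell)}_j} = \sum_k \delta^{(\ell+1)}_k\, W^{(\ell+1)}_{j,k}\, g'\!\bigl(a^{(\ell)}_j\bigr) \]
The sum runs over every unit \(k\) in the next layer that receives input from unit \(j\).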
Backprop in matrix form
Stack layer activations into row vectors, with \(z^{(0)} = x\) and \(W_\ell\) the weight matrix of layer \(\ell\). Here \(\delta_\ell = \partial E / \partial z^{(\ell)}\) is taken with respect to the activations, so the factor \(g'(a^{(\ell)})\) appears explicitly (unlike the per-unit \(\delta^{(\ell)}_j = \partial E / \partial a^{(\ell)}_j\) above). \(\odot\) is the element-wise product.
Algorithm.
- Initialize weights \(W_\ell\) for every layer.
- Forward: push \(x\) through, cache \(a^{(1)}, z^{(1)}, a^{(2)}, z^{(2)}, \ldots\)
- Output \(\delta\): set \(\delta_n = \partial E / \partial z^{(n)}\).
- Backward sweep, for \(\ell = n, n\!-\!1, \ldots, 1\):
\(\delta_{\ell-1} = \bigl(\delta_\ell \odot g'(a^{(\ell)})\bigr)\, W_\ell^{\top}\)
\(\dfrac{\partial E}{\partial W_\ell} = \bigl(z^{(\ell-1)}\bigr)^{\top} \bigl(\delta_\ell \odot g'(a^{(\ell)})\bigr)\)
- Plug \(\partial E / \partial W_\ell\) into gradient descent.
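The algorithm above might be sketched for any number of layers as follows. The sketch assumes row vectors, sigmoid activations, and squared error, and here `delta` holds the derivative of \(E\) with respect to a layer's activations; the function name `backprop`, the example weights, and the learning rate are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def backprop(x, y, weights):
    """Return dE/dW for every layer; weights[l][i][j] links unit i to unit j."""
    # Forward: cache the activations z^(0) = x, z^(1), ..., z^(n).
    zs = [x]
    for W in weights:
        prev = zs[-1]
        a = [sum(prev[i] * W[i][j] for i in range(len(prev))) for j in range(len(W[0]))]
        zs.append([sigmoid(v) for v in a])

    # Output delta: dE/dz^(n) for squared error E = 1/2 * ||z^(n) - y||^2.
    delta = [zk - yk for zk, yk in zip(zs[-1], y)]

    grads = [None] * len(weights)
    for l in range(len(weights) - 1, -1, -1):
        z_out, z_in, W = zs[l + 1], zs[l], weights[l]
        # delta * g'(a), using the sigmoid shortcut g'(a) = z * (1 - z).
        local = [d * z * (1 - z) for d, z in zip(delta, z_out)]
        grads[l] = [[z_in[i] * local[j] for j in range(len(local))] for i in range(len(z_in))]
        # Pass the error back: delta for the layer below.
        delta = [sum(local[j] * W[i][j] for j in range(len(local))) for i in range(len(z_in))]
    return grads

# Hypothetical network: 2 inputs -> 3 hidden units -> 2 outputs.
weights = [[[0.1, -0.2, 0.3], [0.4, 0.3, -0.1]],
           [[0.5, -0.5], [-0.3, 0.2], [0.1, 0.4]]]
grads = backprop([1.0, 2.0], [1, 0], weights)

# Plug the gradients into one gradient-descent step.
eta = 0.1
weights = [[[w - eta * g for w, g in zip(w_row, g_row)] for w_row, g_row in zip(W, G)]
           for W, G in zip(weights, grads)]
```

Only the activations need to be cached in this sketch because the sigmoid derivative can be recovered from the output; other activation functions may need the pre-activations as well.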
The sigmoid derivative trick
For \(g(x) = \dfrac{1}{1 + e^{-x}}\):
\[ g'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = g(x)\,\big(1 - g(x)\big) \]
Why it matters: during the forward pass we already compute \(g(a)\). Reuse it — no need to redo the exponential during the backward pass.
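Similar shortcuts exist for tanh (\(g' = 1 - g^2\)) and ReLU (\(g'\) is 0 or 1). ReLU's gradient is exactly 1 for positive inputs and does not saturate, so gradients shrink less across many layers; that is a big reason it beats sigmoid in deep nets.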
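The sigmoid identity can be checked numerically. The helper name `sigmoid_grad_from_output` and the test point `x = 0.7` are arbitrary choices for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_grad_from_output(g):
    """g'(x) written in terms of the cached output g = sigmoid(x)."""
    return g * (1.0 - g)

x = 0.7
g = sigmoid(x)                          # computed once in the forward pass
analytic = sigmoid_grad_from_output(g)  # reused in the backward pass, no exp needed

# Central finite difference as an independent estimate of g'(x).
h = 1e-6
numeric = (sigmoid(x + h) - sigmoid(x - h)) / (2 * h)
print(analytic, numeric)
```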
Beyond plain SGD: momentum and Adam
Plain SGD is noisy and slow in ravines. Two ideas fix it:
Momentum
Accumulate a velocity (an exponentially decaying sum of past gradient steps) so consistent directions build speed:
\(v \leftarrow \beta v - \eta\,\nabla_W L, \quad W \leftarrow W + v\)
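A minimal sketch of the momentum update, run on a toy "ravine" loss. The loss, the starting point, and the values of `eta` and `beta` are assumptions chosen for illustration.

```python
def momentum_step(w, v, grad, eta=0.1, beta=0.9):
    """One update: v <- beta*v - eta*grad, then w <- w + v."""
    v = [beta * vi - eta * gi for vi, gi in zip(v, grad)]
    w = [wi + vi for wi, vi in zip(w, v)]
    return w, v

# Toy ravine (an assumption): L(w) = 0.5 * (w0^2 + 25 * w1^2), steep in w1, shallow in w0.
def ravine_grad(w):
    return [w[0], 25.0 * w[1]]

w, v = [5.0, 1.0], [0.0, 0.0]
for _ in range(200):
    w, v = momentum_step(w, v, ravine_grad(w), eta=0.02, beta=0.9)
print(w)  # both coordinates end up near the minimum at [0, 0]
```

Setting `beta=0` recovers plain gradient descent, which makes a side-by-side comparison on the same loss straightforward.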
Adam
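Per-parameter step sizes plus momentum: divide a running average of the gradient, \(\hat{m}\), by the square root of a running average of the squared gradient, \(\hat{v}\) (both bias-corrected; \(\hat{v}\) is unrelated to the momentum velocity \(v\)). Robust default: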
\(W \leftarrow W - \eta\,\dfrac{\hat{m}}{\sqrt{\hat{v}}+\epsilon}\)
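The Adam update can be sketched as below, using the commonly quoted default hyperparameters. The function name `adam_step`, the toy loss, and the learning rate used in the loop are illustrative assumptions.

```python
import math

def adam_step(w, m, v, grad, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a list of parameters; t is the 1-based step count."""
    # Running averages of the gradient and of the squared gradient.
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(m, grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    # Bias correction compensates for initializing m and v at zero.
    m_hat = [mi / (1 - beta1 ** t) for mi in m]
    v_hat = [vi / (1 - beta2 ** t) for vi in v]
    w = [wi - eta * mh / (math.sqrt(vh) + eps) for wi, mh, vh in zip(w, m_hat, v_hat)]
    return w, m, v

# Toy loss (an assumption): L(w) = 0.5 * (w0^2 + 25 * w1^2).
w, m, v = [5.0, 1.0], [0.0, 0.0], [0.0, 0.0]
for t in range(1, 2001):
    g = [w[0], 25.0 * w[1]]
    w, m, v = adam_step(w, m, v, g, t, eta=0.01)
print(w)
```

On the first step the ratio \(\hat{m} / \sqrt{\hat{v}}\) has magnitude close to 1 for every parameter, so each weight moves by roughly \(\eta\) regardless of how large its gradient is. That is the sense in which the step size is per-parameter.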
AdamW (Adam + decoupled weight decay) is the standard optimizer for transformers/LLMs today. [optional deep-dive deck]
Making training generalize & stay stable
Regularization (fight overfitting)
- Weight decay: add \(\lambda\lVert W\rVert^2\) to the loss (L16).
- Dropout: randomly zero some units each step so the net can't rely on any one.
- Early stopping: halt when validation loss turns up.
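Two of these techniques are sketched below. The dropout shown is the "inverted" variant, which rescales surviving units so the expected activation is unchanged; that rescaling and the `patience` parameter for early stopping are common choices assumed here, not details from the lecture.

```python
import random

def dropout(activations, p_drop, rng, training=True):
    """Inverted dropout: zero each unit with probability p_drop and rescale the survivors."""
    if not training or p_drop == 0.0:
        return list(activations)  # no units are dropped at inference time
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]

def early_stopping_epoch(val_losses, patience=2):
    """Return the index of the epoch at which training would halt, or None if it never does."""
    best, since_best = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, since_best = loss, 0
        else:
            since_best += 1
            if since_best >= patience:
                return epoch
    return None

rng = random.Random(0)
print(dropout([1.0, 2.0, 3.0, 4.0], p_drop=0.5, rng=rng))
print(early_stopping_epoch([0.9, 0.7, 0.6, 0.65, 0.7, 0.8]))  # 4
```

In the example, validation loss bottoms out at epoch 2 and then rises for two epochs in a row, so training halts at epoch 4; in practice the weights from the best epoch are usually the ones kept.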
Normalization & schedules (stay stable)
- Batch / layer norm: re-center & scale activations so gradients stay well-behaved.
- LR warmup + cosine decay: ramp \(\eta\) up, then anneal it down.
- Gradient clipping: cap the gradient norm to avoid blow-ups.
These are the everyday knobs behind training any modern deep network.
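A learning-rate schedule and gradient clipping can each be written in a few lines. The linear warmup shape, the function names, and all parameter names below are illustrative assumptions; libraries differ in the exact formulas they use.

```python
import math

def lr_schedule(step, total_steps, warmup_steps, peak_lr, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

def clip_by_norm(grad, max_norm):
    """Rescale the gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    if norm <= max_norm:
        return list(grad)
    return [g * max_norm / norm for g in grad]

# Learning rate at a few points of a hypothetical 110-step run with 10 warmup steps.
print([round(lr_schedule(s, 110, 10, peak_lr=1.0), 3) for s in (0, 9, 60, 110)])
print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```

Clipping by norm keeps the gradient's direction and only shortens it, which is why it is usually preferred over clipping each component separately.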
When to use (or not use) a neural network
Reach for an NN when
- High-dimensional, real-valued, or noisy inputs.
- Target function form is unknown (no good hand-crafted model).
- Interpretability is not a priority.
- Plenty of training data is available.
Avoid an NN when
- Architecture is hard to choose (layers, units, activations).
- Weights need to be inspected and explained.
- Data is tabular and small — overfitting risk is high.
Neural network vs decision tree
| | Neural network | Decision tree |
| --- | --- | --- |
| Data type | Images, audio, text | Tabular data (gradient boosting usually wins here) |
| Data size | Needs a lot; easy to overfit small data | Works with very little data |
| Target function | Arbitrary functions | Nested if/else rules |
| Architecture | Layers, units, activations, init, \(\eta\) all critical | A few hyperparameters (depth, pruning) |
| Interpretability | Black box | Easy to explain to humans |
| Train / inference | Slow to train and run | Fast |
Revisiting learning goals
- Write the cross-entropy loss and the gradient-descent update.
- Describe back-propagation and compute a weight's gradient.
- Speed up training with momentum and Adam / AdamW.
- Fight overfitting with regularization, dropout, and normalization.
- Decide when to use a neural network vs a decision tree.
Next: Transformers & Attention
We can now build and train deep nets. But how do we handle sequences — language, code, audio — at scale?
- Why RNNs struggle with long sequences.
- Self-attention: the idea behind every modern LLM.
- The transformer block, end to end.
L21: Transformers & Attention.