CS 486/686
Neural Networks II
Lecture 20
RN 19.6.2, 21.1, 21.2 · PM 7.5 · GBC 4.3, 6.5
Search
›
Uncertainty
›
Decisions
›
Learning
Learning goals
- Explain the steps of gradient descent.
- Modify GD to speed up learning and ensure convergence.
- Describe the back-propagation algorithm (forward + backward passes).
- Compute the gradient for a weight in a multi-layer NN.
- Decide when to use a neural network vs a decision tree.
The training problem
A 2-layer feedforward network for spam classification — predict \([a^{(2)}_1, a^{(2)}_2]\) from two features (email length, sender trust).
Two outputs encode "spam" vs "ham":
- Spam → target \([1, 0]\)
- Ham → target \([0, 1]\)
- Ambiguous → \([0.5, 0.5]\)
We have paired data \((x_1, x_2, y)\). How do we set the weights \(W^{(1)}, W^{(2)}\)?
Loss function and gradient descent
Squared loss on a dataset \(D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N\) with network output \(\hat{y}^{(i)} = f(x^{(i)}; W)\):
\[ L(W) = \tfrac{1}{2} \sum_{i=1}^{N} \big\| \hat{y}^{(i)} - y^{(i)} \big\|^2 \]
Gradient descent: repeat for many iterations:
- Compute the gradient \(\nabla_W L(W) = \big[ \partial L / \partial w \big]_{w \in W}\).
- Update every weight: \( w \leftarrow w - \eta\, \dfrac{\partial L}{\partial w} \).
\(\eta > 0\) is the learning rate (step size).
Why this works
Direction: steepest descent
\(\nabla_W L\) points in the direction of fastest increase of \(L\).
Moving in \(-\nabla_W L\) decreases \(L\) the fastest.
Step size: \(\eta\) trade-off
Too small \(\Rightarrow\) very slow learning.
Too large \(\Rightarrow\) overshoots the minimum, may diverge.
Three flavors of gradient descent
Batch GD
Gradient uses all \(N\) examples per step.
Stable, accurate gradient.
Slow for large \(N\).
Stochastic GD
One example per step.
Cheap; noise helps escape local minima.
Noisy updates.
Mini-batch GD
A small batch (e.g. 32, 64) per step.
Best of both worlds; GPU-friendly.
Standard choice today.
Plus tricks like momentum, Adam, learning-rate schedules — covered in CS480/680.
Back-propagation: efficient gradients
Two passes per training example \((x, y)\):
→ Forward pass
Push inputs through layers, compute every \(a^{(\ell)}, z^{(\ell)}\), and the loss \(E\).
← Backward pass
Propagate \(\partial E\) back through the layers, compute every \(\partial E / \partial W^{(\ell)}\).
Forward pass
Compute pre-activations \(a^{(\ell)}\) and activations \(z^{(\ell)} = g(a^{(\ell)})\) layer by layer.
\[ a^{(1)}_j = \sum_i x_i\, W^{(1)}_{i,j}, \qquad z^{(1)}_j = g(a^{(1)}_j) \]
\[ a^{(2)}_k = \sum_j z^{(1)}_j\, W^{(2)}_{j,k}, \qquad z^{(2)}_k = g(a^{(2)}_k) \]
\[ E = E\!\left(z^{(2)}, y\right) \]
Cache every \(a^{(\ell)}\) and \(z^{(\ell)}\) — we'll need them on the way back.
Backward pass
Walk back through the network, computing local errors \(\delta\) via the chain rule.
Layer 2 (output side):
\[ \dfrac{\partial E}{\partial W^{(2)}_{j,k}} = \delta^{(2)}_k\, z^{(1)}_j, \qquad \delta^{(2)}_k = \dfrac{\partial E}{\partial z^{(2)}_k}\, g'\!\bigl(a^{(2)}_k\bigr) \]
Layer 1 (propagate back):
\[ \dfrac{\partial E}{\partial W^{(1)}_{i,j}} = \delta^{(1)}_j\, x_i, \qquad \delta^{(1)}_j = \Bigl( \sum_k \delta^{(2)}_k\, W^{(2)}_{j,k} \Bigr)\, g'\!\bigl(a^{(1)}_j\bigr) \]
Pattern: gradient = local error \(\times\) input to that weight.
The recursive structure of \(\delta\)
For unit \(j\) in layer \(\ell\), define \(\delta^{(\ell)}_j = \dfrac{\partial E}{\partial a^{(\ell)}_j}\). Then:
\[
\delta^{(\ell)}_j =
\begin{cases}
\dfrac{\partial E}{\partial z^{(\ell)}_j}\, g'(a^{(\ell)}_j), & \text{output unit (base)} \\[0.8em]
\Bigg( \displaystyle \sum_k \delta^{(\ell+1)}_k\, W^{(\ell+1)}_{j,k} \Bigg)\, g'(a^{(\ell)}_j), & \text{hidden unit (recursive)}
\end{cases}
\]
\(\delta\) for a hidden unit = weighted sum of downstream \(\delta\)s, modulated by \(g'\).
Backprop in matrix form
Stack layer activations into vectors. Let \(\delta_\ell = \partial E / \partial z^{(\ell)}\) and \(W_\ell\) be the weight matrix.
Algorithm.
- Initialize weights \(W_\ell\) for every layer.
- Forward: push \(x\) through, cache \(z^{(1)}, z^{(2)}, \ldots\)
- Output \(\delta\): set \(\delta_n = \partial E / \partial z^{(n)}\).
- Backward sweep, for \(\ell = n, n\!-\!1, \ldots, 1\):
\(\delta_{\ell-1} = \delta_\ell \cdot \dfrac{\partial g(x^{(\ell)})}{\partial x^{(\ell)}} \cdot W_\ell\)
\(\dfrac{\partial E}{\partial W_\ell} = \delta_\ell \cdot \dfrac{\partial g(x^{(\ell)})}{\partial x^{(\ell)}} \cdot z^{(\ell-1)}\)
- Plug \(\partial E / \partial W_\ell\) into gradient descent.
The sigmoid derivative trick
For \(g(x) = \dfrac{1}{1 + e^{-x}}\):
\[ g'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = g(x)\,\big(1 - g(x)\big) \]
Why it matters: during the forward pass we already compute \(g(a)\). Reuse it — no need to redo the exponential during the backward pass.
Similar shortcuts exist for tanh (\(1 - g^2\)) and ReLU (0 or 1).
When to use (or not use) a neural network
Reach for an NN when
- High-dimensional, real-valued, or noisy inputs.
- Target function form is unknown (no good hand-crafted model).
- Interpretability is not a priority.
- Plenty of training data is available.
Avoid an NN when
- Architecture is hard to choose (layers, units, activations).
- Weights need to be inspected and explained.
- Data is tabular and small — overfitting risk is high.
Neural network vs decision tree
|
Neural network |
Decision tree |
| Data type | Images, audio, text | Tabular data |
| Data size | Needs a lot; easy to overfit small data | Works with very little data |
| Target function | Arbitrary functions | Nested if/else rules |
| Architecture | Layers, units, activations, init, \(\eta\) all critical | A few hyperparameters (depth, pruning) |
| Interpretability | Black box | Easy to explain to humans |
| Train / inference | Slow to train and run | Fast |
Revisiting learning goals
- Explain the steps of gradient descent.
- Modify GD to speed up learning and ensure convergence.
- Describe the back-propagation algorithm (forward + backward passes).
- Compute the gradient for a weight in a multi-layer NN.
- Decide when to use a neural network vs a decision tree.
Next: Neural Networks III
- Convolutional networks — weight sharing for images.
- Word embeddings and sequence models.
- A peek at transformers and large language models.
See you on Thursday!