CS 486/686 Lecture 20 — Neural Networks II

CS 486/686
Neural Networks II

Lecture 20

RN 19.6.2, 21.1, 21.2 · PM 7.5 · GBC 4.3, 6.5

Learning goals

Explain the steps of gradient descent.
Modify GD to speed up learning and ensure convergence.
Describe the back-propagation algorithm (forward + backward passes).
Compute the gradient for a weight in a multi-layer NN.
Decide when to use a neural network vs a decision tree.

The training problem

A 2-layer feedforward network for spam classification — predict \([a^{(2)}_1, a^{(2)}_2]\) from two features (email length, sender trust).

Two outputs encode "spam" vs "ham":

Spam → target \([1, 0]\)
Ham → target \([0, 1]\)
Ambiguous → \([0.5, 0.5]\)

We have paired data \((x_1, x_2, y)\). How do we set the weights \(W^{(1)}, W^{(2)}\)?

Loss function and gradient descent

Squared loss on a dataset \(D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N\) with network output \(\hat{y}^{(i)} = f(x^{(i)}; W)\):

\[ L(W) = \tfrac{1}{2} \sum_{i=1}^{N} \big\| \hat{y}^{(i)} - y^{(i)} \big\|^2 \]

Gradient descent: repeat for many iterations:

Compute the gradient \(\nabla_W L(W) = \big[ \partial L / \partial w \big]_{w \in W}\).
Update every weight: \( w \leftarrow w - \eta\, \dfrac{\partial L}{\partial w} \).

\(\eta > 0\) is the learning rate (step size).

Why this works

Direction: steepest descent

\(\nabla_W L\) points in the direction of fastest increase of \(L\).

Moving in \(-\nabla_W L\) decreases \(L\) the fastest.

Step size: \(\eta\) trade-off

Too small \(\Rightarrow\) very slow learning.

Too large \(\Rightarrow\) overshoots the minimum, may diverge.

Three flavors of gradient descent

Batch GD

Gradient uses all \(N\) examples per step.

Stable, accurate gradient.

Slow for large \(N\).

Stochastic GD

One example per step.

Cheap; noise helps escape local minima.

Noisy updates.

Mini-batch GD

A small batch (e.g. 32, 64) per step.

Best of both worlds; GPU-friendly.

Standard choice today.

Plus tricks like momentum, Adam, learning-rate schedules — covered in CS480/680.

Back-propagation: efficient gradients

Two passes per training example \((x, y)\):

→ Forward pass

Push inputs through layers, compute every \(a^{(\ell)}, z^{(\ell)}\), and the loss \(E\).

← Backward pass

Propagate \(\partial E\) back through the layers, compute every \(\partial E / \partial W^{(\ell)}\).

Forward pass

Compute pre-activations \(a^{(\ell)}\) and activations \(z^{(\ell)} = g(a^{(\ell)})\) layer by layer.

\[ a^{(1)}_j = \sum_i x_i\, W^{(1)}_{i,j}, \qquad z^{(1)}_j = g(a^{(1)}_j) \]

\[ a^{(2)}_k = \sum_j z^{(1)}_j\, W^{(2)}_{j,k}, \qquad z^{(2)}_k = g(a^{(2)}_k) \]

\[ E = E\!\left(z^{(2)}, y\right) \]

Cache every \(a^{(\ell)}\) and \(z^{(\ell)}\) — we'll need them on the way back.

Backward pass

Walk back through the network, computing local errors \(\delta\) via the chain rule.

Layer 2 (output side):

\[ \dfrac{\partial E}{\partial W^{(2)}_{j,k}} = \delta^{(2)}_k\, z^{(1)}_j, \qquad \delta^{(2)}_k = \dfrac{\partial E}{\partial z^{(2)}_k}\, g'\!\bigl(a^{(2)}_k\bigr) \]

Layer 1 (propagate back):

\[ \dfrac{\partial E}{\partial W^{(1)}_{i,j}} = \delta^{(1)}_j\, x_i, \qquad \delta^{(1)}_j = \Bigl( \sum_k \delta^{(2)}_k\, W^{(2)}_{j,k} \Bigr)\, g'\!\bigl(a^{(1)}_j\bigr) \]

Pattern: gradient = local error \(\times\) input to that weight.

The recursive structure of \(\delta\)

For unit \(j\) in layer \(\ell\), define \(\delta^{(\ell)}_j = \dfrac{\partial E}{\partial a^{(\ell)}_j}\). Then:

\[ \delta^{(\ell)}_j = \begin{cases} \dfrac{\partial E}{\partial z^{(\ell)}_j}\, g'(a^{(\ell)}_j), & \text{output unit (base)} \\[0.8em] \Bigg( \displaystyle \sum_k \delta^{(\ell+1)}_k\, W^{(\ell+1)}_{j,k} \Bigg)\, g'(a^{(\ell)}_j), & \text{hidden unit (recursive)} \end{cases} \]

\(\delta\) for a hidden unit = weighted sum of downstream \(\delta\)s, modulated by \(g'\).

Backprop in matrix form

Stack layer activations into vectors. Let \(\delta_\ell = \partial E / \partial z^{(\ell)}\) and \(W_\ell\) be the weight matrix.

Algorithm.

Initialize weights \(W_\ell\) for every layer.
Forward: push \(x\) through, cache \(z^{(1)}, z^{(2)}, \ldots\)
Output \(\delta\): set \(\delta_n = \partial E / \partial z^{(n)}\).
Backward sweep, for \(\ell = n, n\!-\!1, \ldots, 1\):
\(\delta_{\ell-1} = \delta_\ell \cdot \dfrac{\partial g(x^{(\ell)})}{\partial x^{(\ell)}} \cdot W_\ell\)
\(\dfrac{\partial E}{\partial W_\ell} = \delta_\ell \cdot \dfrac{\partial g(x^{(\ell)})}{\partial x^{(\ell)}} \cdot z^{(\ell-1)}\)
Plug \(\partial E / \partial W_\ell\) into gradient descent.

The sigmoid derivative trick

For \(g(x) = \dfrac{1}{1 + e^{-x}}\):

\[ g'(x) = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = g(x)\,\big(1 - g(x)\big) \]

Why it matters: during the forward pass we already compute \(g(a)\). Reuse it — no need to redo the exponential during the backward pass.

Similar shortcuts exist for tanh (\(1 - g^2\)) and ReLU (0 or 1).

When to use (or not use) a neural network

Reach for an NN when

High-dimensional, real-valued, or noisy inputs.
Target function form is unknown (no good hand-crafted model).
Interpretability is not a priority.
Plenty of training data is available.

Avoid an NN when

Architecture is hard to choose (layers, units, activations).
Weights need to be inspected and explained.
Data is tabular and small — overfitting risk is high.

Neural network vs decision tree

	Neural network	Decision tree
Data type	Images, audio, text	Tabular data
Data size	Needs a lot; easy to overfit small data	Works with very little data
Target function	Arbitrary functions	Nested if/else rules
Architecture	Layers, units, activations, init, \(\eta\) all critical	A few hyperparameters (depth, pruning)
Interpretability	Black box	Easy to explain to humans
Train / inference	Slow to train and run	Fast

Revisiting learning goals

Explain the steps of gradient descent.
Modify GD to speed up learning and ensure convergence.
Describe the back-propagation algorithm (forward + backward passes).
Compute the gradient for a weight in a multi-layer NN.
Decide when to use a neural network vs a decision tree.

Next: Neural Networks III

Convolutional networks — weight sharing for images.
Word embeddings and sequence models.
A peek at transformers and large language models.

See you on Thursday!