Lecture 19
RN 21.1 · PM 7.5
In vision, speech, translation, and beyond, the relationship between inputs and outputs is too complex to hand-program.
We need a model that can:
Loose inspiration: the human brain — many simple components, densely connected.
The math model abstracts this into weighted sum + non-linear activation.
\(in_j = \sum_{i = 0}^{n} w_{ij}\, a_i\)
\(a_j = g(in_j)\)
\(a_0 = 1\) is the bias; \(w_{0j}\) acts as a threshold. \(g\) is a non-linear activation function.
Complex relationships are non-linear; stacking linear layers stays linear.
Fires (\(\approx 1\)) when input is big; quiet (\(\approx 0\)) otherwise.
So we can train via gradient-based optimization (next lecture!).
\(g(x) = 1\) if \(x > 0\), else \(0\).
Simple but not differentiable; rarely used in practice.
\(g(x) = \dfrac{1}{1 + e^{-x}}\)
Smooth, but vanishing-gradient in saturation regions.
\(g(x) = \max(0, x)\)
Fast, sparse; dying-ReLU when input stays negative.
\(g(x) = \max(0, x) + k\cdot\min(0, x)\)
Small negative slope keeps gradient alive on the left.
Connections form a DAG (no loops). The network is just a function of its inputs.
Outputs feed back as inputs. Supports short-term memory — behavior depends on history.
A single-layer feedforward network: inputs connect directly to outputs. Can represent AND, OR, NOT and many other logical functions.
Q1. The perceptron below uses step activation with weights \(w_0 = 0.5\), \(w_1 = -1\), \(w_2 = -1\). What logical function does it compute?
Q2. Find weights \((w_0, w_1, w_2)\) so the perceptron \(o = g(w_0 + w_1 x_1 + w_2 x_2)\) computes \(x_1 \wedge x_2\). The AND truth table outputs 1 only on \((1, 1)\).
Step activation: \(g(x) = 1\) if \(x > 0\), else \(0\).
Q3. What does \(h_1 = g(x_1 + x_2 - 0.5)\) compute?
Q4. What does \(h_2 = g(-x_1 - x_2 + 1.5)\) compute?
Remember \(h_1\) and \(h_2\) — we'll combine them to build XOR.
A perceptron is a linear classifier. Its decision boundary is a hyperplane in the input space.
Minsky & Papert (1969) showed this — and it triggered the first AI winter.
Rewrite XOR using gates a perceptron can handle: \(x_1 \oplus x_2 = (x_1 \vee x_2) \wedge \neg(x_1 \wedge x_2) = h_1 \wedge h_2\).
Truth-table check:
| \(x_1\) | \(x_2\) | \(h_1\) (OR) | \(h_2\) (NAND) | \(h_1 \wedge h_2\) |
|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 |
| 0 | 1 | 1 | 1 | 1 |
| 1 | 0 | 1 | 1 | 1 |
| 1 | 1 | 1 | 0 | 0 |
Matches XOR exactly. One hidden layer is enough.
When inputs come as a sequence (text, audio, sensor data), the same network reads one element at a time and carries a state across steps.
\(h_t = f_W(h_{t-1},\, x_t), \quad y_t = f_Y(h_t)\)
The state \(h_t\) summarizes everything seen so far.
Same weights at every step — the model can handle any-length sequence with a fixed parameter count.
\(h_t = \tanh\!\bigl(W_{hh}\, h_{t-1} + W_{xh}\, x_t\bigr), \quad y_t = W_{hy}\, h_t\)
The same \(W_{hh}, W_{xh}, W_{hy}\) are reused at every time step.
Feed characters one at a time; predict the next character at each step.
Trained to predict the next character; sample autoregressively to generate text.
L20: how do we actually train these networks? — gradient descent + backpropagation.