Course page
CS 486/686
Neural Networks I

Lecture 19

RN 21.1 · PM 7.5

Search Uncertainty Decisions Learning

Learning goals

  • Describe the simple mathematical model of a neuron.
  • Describe desirable properties of an activation function.
  • Distinguish feedforward from recurrent networks.
  • Learn a perceptron for a simple logical function.
  • Determine the logical function represented by a perceptron.
  • Explain why a perceptron cannot represent XOR.
  • Understand recurrent neural networks.

Learning complex non-linear relationships

In vision, speech, translation, and beyond, the relationship between inputs and outputs is too complex to hand-program.

We need a model that can:

  • Learn complex relationships.
  • Be trained efficiently on lots of data.
  • Generalize without overfitting.

Loose inspiration: the human brain — many simple components, densely connected.

From brain neurons to math neurons

  • Dendrites receive input signals.
  • Soma processes the signals.
  • Axon sends an output signal.
  • Synapses connect neurons together.
  • Neurons fire (or don't) based on the inputs.
dendrite dendrite dendrite soma axon synapses

The math model abstracts this into weighted sum + non-linear activation.

The math model (McCulloch & Pitts, 1943)

1 a1 a2 an w0j w1j w2j wnj Σ inj g(·) aj

\(in_j = \sum_{i = 0}^{n} w_{ij}\, a_i\)

\(a_j = g(in_j)\)

\(a_0 = 1\) is the bias; \(w_{0j}\) acts as a threshold. \(g\) is a non-linear activation function.

Desirable properties of the activation function

Non-linear

Complex relationships are non-linear; stacking linear layers stays linear.

Mimics real neurons

Fires (\(\approx 1\)) when input is big; quiet (\(\approx 0\)) otherwise.

Differentiable

So we can train via gradient-based optimization (next lecture!).

Four common activation functions

Step

\(g(x) = 1\) if \(x > 0\), else \(0\).

Simple but not differentiable; rarely used in practice.

Sigmoid

\(g(x) = \dfrac{1}{1 + e^{-x}}\)

Smooth, but vanishing-gradient in saturation regions.

ReLU

\(g(x) = \max(0, x)\)

Fast, sparse; dying-ReLU when input stays negative.

Leaky ReLU

\(g(x) = \max(0, x) + k\cdot\min(0, x)\)

Small negative slope keeps gradient alive on the left.

Networks: feedforward vs recurrent

Feedforward

Connections form a DAG (no loops). The network is just a function of its inputs.

Recurrent

Outputs feed back as inputs. Supports short-term memory — behavior depends on history.

Perceptron

A single-layer feedforward network: inputs connect directly to outputs. Can represent AND, OR, NOT and many other logical functions.

Identify a perceptron

Q1. The perceptron below uses step activation with weights \(w_0 = 0.5\), \(w_1 = -1\), \(w_2 = -1\). What logical function does it compute?

0.5 −1 −1 1 x1 x2 o
  1. \(x_1 \wedge x_2\)
  2. \(\neg (x_1 \wedge x_2)\)
  3. \(x_1 \vee x_2\)
  4. \(\neg (x_1 \vee x_2)\)  (NOR)
D — NOR. The output is \(g(0.5 - x_1 - x_2)\), which is 1 only when \(0.5 - x_1 - x_2 > 0\), i.e. both \(x_1\) and \(x_2\) are 0.

Learn a perceptron: AND

Q2. Find weights \((w_0, w_1, w_2)\) so the perceptron \(o = g(w_0 + w_1 x_1 + w_2 x_2)\) computes \(x_1 \wedge x_2\). The AND truth table outputs 1 only on \((1, 1)\).

  1. \(w_0 = -1,\ \ w_1 = 0.5,\ \ w_2 = 0.5\)
  2. \(w_0 = 0.5,\ \ w_1 = -1,\ \ w_2 = 1\)
  3. \(w_0 = 1.5,\ \ w_1 = -1,\ \ w_2 = -1\)
  4. \(w_0 = -1.5,\ \ w_1 = 1,\ \ w_2 = 1\)
D. Check: \(-1.5 + x_1 + x_2 > 0\) iff \(x_1 + x_2 > 1.5\), i.e. both equal 1. The other options either misclassify \((1,1)\) or fire on \((0, 0)\).

More perceptron arithmetic

Step activation: \(g(x) = 1\) if \(x > 0\), else \(0\).

Q3. What does \(h_1 = g(x_1 + x_2 - 0.5)\) compute?

  1. \(x_1 \vee x_2\) (OR)
  2. \(x_1 \wedge x_2\)
  3. \(\neg(x_1 \vee x_2)\)
  4. \(\neg(x_1 \wedge x_2)\)
A — OR. Outputs 1 if at least one input is 1.

Q4. What does \(h_2 = g(-x_1 - x_2 + 1.5)\) compute?

  1. \(x_1 \vee x_2\)
  2. \(x_1 \wedge x_2\)
  3. \(\neg(x_1 \vee x_2)\)
  4. \(\neg(x_1 \wedge x_2)\) (NAND)
D — NAND. Outputs 0 only when both inputs are 1.

Remember \(h_1\) and \(h_2\) — we'll combine them to build XOR.

Why a perceptron can't represent XOR

A perceptron is a linear classifier. Its decision boundary is a hyperplane in the input space.

AND

0 1 0 1
Separable

OR

0 1 0 1
Separable

XOR

0 1 0 1
Not separable

Minsky & Papert (1969) showed this — and it triggered the first AI winter.

XOR as a 2-layer network

Rewrite XOR using gates a perceptron can handle:  \(x_1 \oplus x_2 = (x_1 \vee x_2) \wedge \neg(x_1 \wedge x_2) = h_1 \wedge h_2\).

x1 x2 h1 h2 OR NAND o AND

Truth-table check:

\(x_1\)\(x_2\)\(h_1\)
(OR)
\(h_2\)
(NAND)
\(h_1 \wedge h_2\)
00010
01111
10111
11100

Matches XOR exactly. One hidden layer is enough.

Recurrent networks: process a sequence

When inputs come as a sequence (text, audio, sensor data), the same network reads one element at a time and carries a state across steps.

Recurrence

\(h_t = f_W(h_{t-1},\, x_t), \quad y_t = f_Y(h_t)\)

The state \(h_t\) summarizes everything seen so far.

Same weights at every step — the model can handle any-length sequence with a fixed parameter count.

RNN xt yt ht-1 ht (carry to next step)

Vanilla RNN: equations + unrolled

\(h_t = \tanh\!\bigl(W_{hh}\, h_{t-1} + W_{xh}\, x_t\bigr), \quad y_t = W_{hy}\, h_t\)

h1 x1 y1 Whh h2 x2 y2 Whh h3 x3 y3 Whh h4 x4 y4

The same \(W_{hh}, W_{xh}, W_{hy}\) are reused at every time step.

Application: language modeling

Feed characters one at a time; predict the next character at each step.

predictions hidden state input chars h t=1 h1 "e" e t=2 h2 "l" l t=3 h3 "l" l t=4 h4 "o"

Trained to predict the next character; sample autoregressively to generate text.

Learning goals (recap) — Next: training

  • ✓  Describe the math neuron.
  • ✓  Describe desirable properties of an activation function.
  • ✓  Distinguish feedforward from recurrent networks.
  • ✓  Learn / identify a perceptron for a logical function.
  • ✓  Explain why a perceptron cannot represent XOR.
  • ✓  Understand RNNs and how they process sequences.

L20: how do we actually train these networks? — gradient descent + backpropagation.