RL 101 - Lesson 4 - Convolutional Networks

Images are grids of pixels, not flat vectors. Convolutional layers exploit spatial structure by sharing weights across positions — a huge win in parameter efficiency and translation equivariance.

Why convolutions?

A fully connected layer on a 28×28 image needs 784 weights per neuron in the first layer. A 3×3 convolutional filter uses only 9 weights — the same filter slides over every location in the image. This weight sharing means:

  1. Fewer parameters → less overfitting on small datasets.
  2. Translation equivariance: the same filter responds to a “7” whether it appears at the top-left or the bottom-right — the feature map simply shifts along with the input.
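Both points can be checked directly. Below is a minimal NumPy sketch (not the lesson's in-browser code; the edge-detector kernel and blob image are illustrative): one 3×3 filter — just 9 weights — is reused at every position, and shifting the input shifts the response by the same amount.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide a single 2D kernel over the image (valid padding, stride 1)."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            # The same 9 weights score every patch of the image.
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

kernel = np.array([[1., 0., -1.],
                   [1., 0., -1.],
                   [1., 0., -1.]])   # vertical-edge detector

img = np.zeros((8, 8))
img[1:4, 1:4] = 1.0                              # blob near the top-left
shifted = np.roll(img, shift=(3, 3), axis=(0, 1))  # same blob, bottom-right

a = conv2d_valid(img, kernel)
b = conv2d_valid(shifted, kernel)

# Equivariance: the response to the shifted blob is the shifted response.
print(np.allclose(a[:3, :3], b[3:, 3:]))  # True
```

A dense layer would have to learn a separate set of weights for each location; here the single filter detects the pattern wherever it appears.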

Architecture

Our MNIST network is intentionally small to run in-browser:

Conv(3×3, 4 filters) → ReLU → MaxPool(2×2) → Dense(16) → ReLU → Dense(4 classes)

After max-pooling, the spatial resolution halves and we flatten to a dense classifier. We train on synthetic 8×8 digit sketches (4 digit classes) drawn with simple stroke rules — not the full MNIST dataset, but enough to show the CNN concept clearly.
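The whole forward pass fits in a few lines. The NumPy sketch below follows the architecture above on an 8×8 input; padding (valid), stride (1), and the random initialization scale are assumptions, since the lesson does not pin them down.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative parameter shapes, assuming valid padding and stride 1.
W_conv = rng.normal(0, 0.1, (4, 3, 3))    # Conv(3x3, 4 filters)
b_conv = np.zeros(4)
W1 = rng.normal(0, 0.1, (3 * 3 * 4, 16))  # after 2x2 pooling: 6x6 -> 3x3
b1 = np.zeros(16)
W2 = rng.normal(0, 0.1, (16, 4))          # Dense(4 classes)
b2 = np.zeros(4)

def forward(x):                            # x: (8, 8) image
    # Conv(3x3, 4 filters): slide each filter over every position.
    conv = np.zeros((4, 6, 6))
    for f in range(4):
        for i in range(6):
            for j in range(6):
                conv[f, i, j] = np.sum(x[i:i+3, j:j+3] * W_conv[f]) + b_conv[f]
    h = np.maximum(conv, 0)                # ReLU
    # MaxPool(2x2): max of each non-overlapping 2x2 window halves resolution.
    pooled = h.reshape(4, 3, 2, 3, 2).max(axis=(2, 4))   # (4, 3, 3)
    z1 = np.maximum(pooled.reshape(-1) @ W1 + b1, 0)     # Dense(16) + ReLU
    return z1 @ W2 + b2                                  # Dense(4) logits

logits = forward(rng.normal(size=(8, 8)))
print(logits.shape)  # (4,)
```

Note how small the model is: the conv layer contributes only 4 × 9 + 4 = 40 parameters, and most of the weight budget sits in the dense head.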

Backprop through a conv layer

For a filter $W$ of shape $K\times K\times C_{in}$ and output $Y$:

\[\frac{\partial \mathcal{L}}{\partial W} = \sum_{p \in \text{positions}} X_{\text{patch}(p)} \cdot \frac{\partial \mathcal{L}}{\partial Y_p},\]

and the gradient flowing back to the input is a full convolution (transposed) of the error signal with the filter. Max-pooling back-propagates the gradient only through the index that achieved the maximum during the forward pass.
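Both rules can be written as short scatter loops. This single-channel NumPy sketch (function names are ours, not the lesson's) accumulates the weight gradient as the patch-times-output-gradient sum from the equation above, builds the input gradient by scattering the filter — equivalent to the full (transposed) convolution — and routes pooling gradients through the argmax of each window.

```python
import numpy as np

def conv_backward(x, w, d_y):
    """Gradients of a valid 2D convolution y[i,j] = sum(x[i:i+K, j:j+K] * w).

    x: (H, W) input, w: (K, K) filter, d_y: upstream gradient dL/dY.
    """
    K = w.shape[0]
    d_w = np.zeros_like(w)
    d_x = np.zeros_like(x)
    for i in range(d_y.shape[0]):
        for j in range(d_y.shape[1]):
            patch = x[i:i+K, j:j+K]
            d_w += patch * d_y[i, j]            # sum over positions: patch * dL/dY
            d_x[i:i+K, j:j+K] += w * d_y[i, j]  # scatter == full conv with flipped filter
    return d_w, d_x

def maxpool_backward(h, d_p):
    """Route each pooled gradient to the argmax of its 2x2 window."""
    d_h = np.zeros_like(h)
    for i in range(d_p.shape[0]):
        for j in range(d_p.shape[1]):
            win = h[2*i:2*i+2, 2*j:2*j+2]
            r, c = np.unravel_index(np.argmax(win), (2, 2))
            d_h[2*i + r, 2*j + c] = d_p[i, j]   # everything else gets zero gradient
    return d_h
```

The scatter form avoids constructing the transposed convolution explicitly: each output-gradient entry simply adds a filter-shaped stamp back onto the input gradient.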

Live demo

The left panel shows a live sample digit; the right panel tracks training accuracy. After a few hundred steps you should see >80 % accuracy on the 4-class digit problem.

Key takeaways

  • Convolutional layers are a core building block of most modern computer vision networks.
  • Max pooling reduces spatial dimensions and introduces a form of local translation invariance.
  • Even a tiny CNN (4 filters, 1 pooling layer) decisively outperforms a same-size MLP on image data because it exploits spatial locality.

Next up — Lesson 5: we scale up to CIFAR-10-style colour image classification with stacked convolutions.