RL 101 - Lesson 4 - Convolutional Networks
22 Oct 2025
Images are grids of pixels, not flat vectors. Convolutional layers exploit this spatial structure by sharing weights across positions, a huge win in parameter efficiency that also buys translation equivariance.
Why convolutions?
A fully connected first layer on a 28×28 image needs 784 weights per neuron. A 3×3 convolutional filter uses only 9 weights, and the same filter slides over every location in the image. This weight sharing means:
- Fewer parameters → less overfitting on small datasets.
- Translation equivariance: a stroke pattern that a filter detects at the top-left produces the same response when it appears at the bottom-right.
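To make the parameter-efficiency point concrete, here is a quick back-of-the-envelope comparison for the first layer of a 28×28 greyscale classifier. The layer widths (32 neurons, 32 filters) are illustrative choices, not taken from the demo network.

```python
# Parameter-count comparison for the first layer on a 28x28 greyscale image.
# Layer sizes are illustrative assumptions, not the demo network's.

n_pixels = 28 * 28                       # 784 inputs once the image is flattened

# Fully connected: every neuron sees every pixel.
dense_neurons = 32
dense_params = dense_neurons * (n_pixels + 1)   # weights + one bias per neuron

# Convolutional: one 3x3 filter is shared across all spatial positions.
n_filters = 32
conv_params = n_filters * (3 * 3 + 1)           # 9 weights + 1 bias per filter

print(dense_params)   # 25120
print(conv_params)    # 320
```

Roughly two orders of magnitude fewer parameters for the same number of output channels, which is exactly why small conv nets overfit less on small datasets.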
Architecture
Our MNIST-style network is intentionally small enough to run in-browser:
Conv(3×3, 4 filters) → ReLU → MaxPool(2×2)
→ Dense(16) → ReLU → Dense(4 classes)
After max-pooling, the spatial resolution halves and we flatten to a dense classifier. We train on synthetic 8×8 digit sketches (4 digit classes) drawn with simple stroke rules — not the full MNIST dataset, but enough to show the CNN concept clearly.
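The shapes at each stage can be traced with a minimal NumPy forward pass. This is a sketch with randomly initialised weights, not the demo's actual code; it just verifies how an 8×8 input flows through Conv(3×3, 4) → ReLU → MaxPool(2×2) → Dense(16) → Dense(4).

```python
import numpy as np

# Shape walk-through for the lesson's tiny CNN on one 8x8 digit sketch.
# Weights are random; this only demonstrates the shapes, not a trained model.
rng = np.random.default_rng(0)
x = rng.standard_normal((8, 8))

# Conv(3x3, 4 filters), 'valid' padding, stride 1 -> output is 4 x 6 x 6.
W = rng.standard_normal((4, 3, 3)) * 0.1
conv = np.zeros((4, 6, 6))
for f in range(4):
    for i in range(6):
        for j in range(6):
            conv[f, i, j] = np.sum(W[f] * x[i:i+3, j:j+3])
relu = np.maximum(conv, 0)

# MaxPool(2x2) halves each spatial dimension -> 4 x 3 x 3.
pooled = relu.reshape(4, 3, 2, 3, 2).max(axis=(2, 4))

flat = pooled.reshape(-1)                 # 36 features into the dense head
W1 = rng.standard_normal((16, 36)) * 0.1  # Dense(16)
h = np.maximum(W1 @ flat, 0)
W2 = rng.standard_normal((4, 16)) * 0.1   # Dense(4 classes)
logits = W2 @ h

print(conv.shape, pooled.shape, logits.shape)  # (4, 6, 6) (4, 3, 3) (4,)
```

Note the flattened feature count: after pooling, 4 filters × 3 × 3 positions gives just 36 inputs to the dense classifier.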
Backprop through a conv layer
For a filter $W$ of shape $K\times K\times C_{in}$ and output $Y$:
\[\frac{\partial \mathcal{L}}{\partial W} = \sum_{p \in \text{positions}} X_{\text{patch}(p)}\,\frac{\partial \mathcal{L}}{\partial Y_p},\]i.e. each output position contributes its input patch weighted by the error at that position. The gradient flowing back to the input is a full (transposed) convolution of the error signal with the filter. Max-pooling back-propagates the gradient only through the index that achieved the maximum during the forward pass.
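All three backward rules fit in a few lines of NumPy. This is a sketch for a single 3×3 filter on an 8×8 input (the same shapes as the demo, but assumed here; the shown gradients are with respect to randomly chosen upstream errors, not a real loss).

```python
import numpy as np

# Backprop through one 'valid' 3x3 conv (8x8 input -> 6x6 output)
# and through a 2x2 max-pool. Shapes assumed to match the demo network.
rng = np.random.default_rng(1)
x = rng.standard_normal((8, 8))       # input
W = rng.standard_normal((3, 3))       # one 3x3 filter
dL_dY = rng.standard_normal((6, 6))   # upstream gradient on the conv output

# dL/dW: sum of input patches, each scaled by the error at its output position.
dL_dW = np.zeros((3, 3))
for i in range(6):
    for j in range(6):
        dL_dW += x[i:i+3, j:j+3] * dL_dY[i, j]

# dL/dX: full (transposed) convolution, scattering each error back
# over the input window that produced it.
dL_dX = np.zeros((8, 8))
for i in range(6):
    for j in range(6):
        dL_dX[i:i+3, j:j+3] += W * dL_dY[i, j]

# Max-pool backward: route each pooled gradient only through the argmax
# of its 2x2 window; every other entry gets zero gradient.
y = rng.standard_normal((6, 6))       # pre-pool activations
dL_dP = rng.standard_normal((3, 3))   # gradient on the pooled 3x3 output
dL_dy = np.zeros_like(y)
for i in range(3):
    for j in range(3):
        win = y[2*i:2*i+2, 2*j:2*j+2]
        a, b = np.unravel_index(np.argmax(win), (2, 2))
        dL_dy[2*i + a, 2*j + b] = dL_dP[i, j]
```

A quick sanity check on the filter gradient: the (0, 0) entry of `dL_dW` is simply the overlap sum `np.sum(x[:6, :6] * dL_dY)`, which the loop above reproduces exactly.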
Live demo
The left panel shows a live sample digit; the right panel tracks training accuracy. After a few hundred steps you should see >80 % accuracy on the 4-class digit problem.
Key takeaways
- Convolutional layers are the core building block of most modern computer vision networks.
- Max pooling reduces spatial dimensions and introduces a form of local translation invariance.
- Even a tiny CNN (4 filters, 1 pooling layer) decisively outperforms a same-size MLP on image data because it exploits spatial locality.
Next up — Lesson 5: we scale up to CIFAR-10-style colour image classification with stacked convolutions.