RL 101 - Lesson 5 - Deeper Convolutions
29 Oct 2025
Going deeper means stacking convolutional blocks so earlier layers detect edges and textures while later layers assemble those features into object parts. This lesson builds a two-block CNN for colour images.
Why go deeper?
A single conv layer can detect oriented edges and colour blobs. Stacking two conv layers lets the second layer combine edge detectors into corners and curves — a hierarchical feature pyramid that corresponds roughly to how mammalian visual cortex is organized.
Architecture
Conv(3×3, 8 filters) → ReLU
→ Conv(3×3, 8 filters) → ReLU → MaxPool(2×2)
→ Dense(32) → ReLU → Dense(6 classes)
We work on synthetic 12×12 RGB patches (6 shape categories) to keep compute browser-friendly. The key difference from Lesson 4: two successive conv layers before pooling, giving the network more representational power within the same receptive field.
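To see how the 12×12 input flows through this stack, here is a minimal shape walk-through in Python. The padding and stride choices ("valid" convolution, stride 1, non-overlapping 2×2 pooling) are my assumptions, since the lesson does not state them explicitly.

```python
# Shape walk-through of the two-block CNN on a 12x12x3 input.
# Assumes "valid" (no-padding) convolutions with stride 1.

def conv_out(size, k=3, stride=1, pad=0):
    """Spatial size after a square convolution."""
    return (size + 2 * pad - k) // stride + 1

def pool_out(size, k=2, stride=2):
    """Spatial size after a square max pool."""
    return (size - k) // stride + 1

s = 12                  # input: 12x12x3
s = conv_out(s)         # Conv(3x3, 8 filters) -> 10x10x8
s = conv_out(s)         # Conv(3x3, 8 filters) -> 8x8x8
s = pool_out(s)         # MaxPool(2x2)         -> 4x4x8
flat = s * s * 8        # flattened vector fed into Dense(32)
print(s, flat)          # 4 128
```

So Dense(32) receives a 128-dimensional vector — small enough to train comfortably in a browser tab.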
Receptive field growth
After the first conv, each output pixel “sees” a 3×3 input region. After the second conv, each output pixel sees a 5×5 input region. A 2×2 max pool then extends the receptive field to 6×6 and, more importantly, doubles the stride (“jump”) of every layer that follows, so subsequent convolutions grow the receptive field twice as fast. Deeper networks capture global structure through this pyramidal growth.
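The growth above follows the standard receptive-field recurrence: each layer adds (k − 1) × jump to the receptive field, and multiplies the jump by its stride. A quick sketch:

```python
# Receptive-field growth through the stack, via the recurrence
#   rf   += (k - 1) * jump
#   jump *= stride
# Layer list mirrors the lesson's architecture (kernel, stride).

layers = [("conv1", 3, 1), ("conv2", 3, 1), ("pool", 2, 2)]
rf, jump = 1, 1
for name, k, stride in layers:
    rf += (k - 1) * jump
    jump *= stride
    print(name, rf, jump)
# conv1: rf 3, jump 1
# conv2: rf 5, jump 1
# pool:  rf 6, jump 2
```

After the pool, jump = 2, so a third conv layer would add 2 × (k − 1) to the receptive field instead of (k − 1) — that is where the “twice as fast” growth comes from.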
Practical considerations
- Batch normalization (not used here for simplicity) normally sits between conv and ReLU; it accelerates training dramatically in practice.
- Data augmentation (random flips, crops) multiplies effective dataset size and is critical for real CIFAR-10 training.
- Stacking small 3×3 filters is preferred over a single large filter: two 3×3 convolutions cover the same 5×5 receptive field with fewer parameters and an extra non-linearity.
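The parameter-efficiency claim in the last bullet is easy to verify. For a layer mapping C channels to C channels (biases ignored; C = 8 here just to match this lesson's filter count):

```python
# Weight count: two stacked 3x3 convs vs one 5x5 conv,
# both mapping C input channels to C output channels (no biases).

C = 8
two_3x3 = 2 * (3 * 3 * C * C)   # two layers of 3*3*C*C weights
one_5x5 = 5 * 5 * C * C         # one layer of 5*5*C*C weights
print(two_3x3, one_5x5)         # 1152 1600
```

Same 5×5 receptive field, roughly 28% fewer weights, plus the extra ReLU between the two small convolutions.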
Live demo
The left panel visualizes the 8 learned filters after training — you should see edge detectors emerge. The right panel plots per-class accuracy to diagnose which shapes are hardest.
Key takeaways
- Deeper networks capture higher-level features through hierarchical composition.
- Two 3×3 filters are more parameter-efficient than one 5×5 filter while providing an extra ReLU non-linearity.
- Real CIFAR-10 accuracy benefits greatly from deeper architectures (ResNet-style) and regularization techniques not shown here.
Next up — Lesson 6: we flip supervision on its head with autoencoders — learning a compressed latent representation with no labels.