RL 101 - Lesson 5 - Deeper Convolutions

Going deeper means stacking convolutional blocks so earlier layers detect edges and textures while later layers assemble those features into object parts. This lesson builds a two-block CNN for color images.

Why go deeper?

A single conv layer can detect oriented edges and color blobs. Stacking two conv layers lets the second layer combine edge detectors into corners and curves — a hierarchical feature pyramid that corresponds roughly to how mammalian visual cortex is organized.

Architecture

Conv(3×3, 8 filters) → ReLU → Conv(3×3, 8 filters) → ReLU → MaxPool(2×2) → Dense(32) → ReLU → Dense(6 classes)

We work on synthetic 12×12 RGB patches (6 shape categories) to keep compute browser-friendly. The key difference from Lesson 4: two successive conv layers before pooling, giving the network more representational power per receptive field.
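The shape arithmetic behind this architecture is worth checking by hand. The sketch below assumes "valid" (no-padding) 3×3 convolutions with stride 1 and a non-overlapping 2×2 max pool; under those assumptions a 12×12 input flattens to 128 features before the Dense(32) layer:

```python
# Sketch: layer-by-layer output shapes for the two-block architecture.
# Assumes "valid" (no-padding) 3x3 convolutions, stride 1, and a
# non-overlapping 2x2 max pool.

def conv_out(size, kernel=3):          # valid convolution, stride 1
    return size - kernel + 1

def pool_out(size, window=2):          # non-overlapping 2x2 max pool
    return size // window

h = w = 12                             # 12x12 RGB input patch
h, w = conv_out(h), conv_out(w)        # Conv(3x3, 8 filters) -> 10x10x8
h, w = conv_out(h), conv_out(w)        # Conv(3x3, 8 filters) -> 8x8x8
h, w = pool_out(h), pool_out(w)        # MaxPool(2x2)         -> 4x4x8
flat = h * w * 8                       # flatten before Dense(32)
print(h, w, flat)                      # -> 4 4 128
```

If the convolutions used same-padding instead, the spatial size would stay 12×12 until the pool; the hierarchy of features is the same either way.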

Receptive field growth

After the first conv, each output pixel “sees” a 3×3 input region. After the second conv, each output pixel sees a 5×5 input region. Max pooling then enlarges receptive fields further: each 2×2 pooled pixel covers a 6×6 input region, and every layer after the pool samples input positions two apart, so receptive fields grow twice as fast from then on. Deeper networks capture global structure through this pyramidal growth.
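These numbers follow from the standard receptive-field recurrence: each layer adds (kernel − 1) × jump to the receptive field, where the "jump" is the input-pixel spacing between adjacent feature-map cells and is multiplied by each layer's stride. A quick check for this lesson's conv → conv → pool stack:

```python
# Receptive-field growth for conv(3x3, s=1) -> conv(3x3, s=1) -> pool(2x2, s=2).
# rf' = rf + (k - 1) * jump;  jump' = jump * stride.
layers = [(3, 1), (3, 1), (2, 2)]   # (kernel, stride) per layer

rf, jump = 1, 1                     # an input pixel sees itself
history = []
for k, s in layers:
    rf += (k - 1) * jump
    jump *= s
    history.append(rf)

print(history)                      # -> [3, 5, 6]
```

After the pool, jump is 2, so a further 3×3 conv would jump straight from a 6×6 to a 10×10 receptive field.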

Practical considerations

  • Batch normalization (not used here for simplicity) normally sits between conv and ReLU; it accelerates training dramatically in practice.
  • Data augmentation (random flips, crops) multiplies effective dataset size and is critical for real CIFAR-10 training.
  • Stacking small 3×3 filters is preferred over a single large filter: two 3×3 convolutions cover the same 5×5 receptive field with fewer parameters and an extra non-linearity.
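The last bullet can be made concrete by counting weights. Taking C = 8 channels in and out (this lesson's filter count) and ignoring biases:

```python
# Weight count: two stacked 3x3 convs vs one 5x5 conv, both covering a
# 5x5 receptive field. C channels in and out; biases ignored.
C = 8
stacked_3x3 = 2 * (3 * 3 * C * C)   # two 3x3 layers
single_5x5 = 5 * 5 * C * C          # one 5x5 layer
print(stacked_3x3, single_5x5)      # -> 1152 1600
```

The stacked option uses 18 weights per channel pair instead of 25, a 28% saving, and inserts an extra ReLU between the two layers.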

Live demo

The left panel visualizes the 8 learned filters after training — you should see edge detectors emerge. The right panel plots per-class accuracy to diagnose which shapes are hardest.

Key takeaways

  • Deeper networks capture higher-level features through hierarchical composition.
  • Two 3×3 filters are more parameter-efficient than one 5×5 filter while providing an extra ReLU non-linearity.
  • Real CIFAR-10 accuracy benefits greatly from deeper architectures (ResNet-style) and regularization techniques not shown here.

Next up — Lesson 6: we flip supervision on its head with autoencoders — learning a compressed latent representation with no labels.