RL 101 - Lesson 5 - Deeper Convolutions

Going deeper means stacking convolutional blocks so earlier layers detect edges and textures while later layers assemble those features into object parts. This lesson builds a two-block CNN for color images.

Why go deeper?

A single conv layer can detect oriented edges and color blobs. Stacking two conv layers lets the second layer combine edge detectors into corners and curves — a hierarchical feature pyramid that corresponds roughly to how mammalian visual cortex is organized.

Architecture

Conv(3×3, 8 filters) → ReLU → Conv(3×3, 8 filters) → ReLU → MaxPool(2×2) → Dense(32) → ReLU → Dense(6 classes)

We work on synthetic 12×12 RGB patches (6 shape categories) to keep compute browser-friendly. The key difference from Lesson 4: two successive conv layers before pooling, giving the network more representational power per receptive field.
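The shape arithmetic behind this architecture is worth checking by hand. The sketch below assumes "valid" (no-padding) 3×3 convolutions with stride 1 and a non-overlapping 2×2 max pool; under those assumptions a 12×12 input flattens to 128 features before the Dense(32) layer:

```python
# Sketch: layer-by-layer output shapes for the two-block architecture.
# Assumes "valid" (no-padding) 3x3 convolutions, stride 1, and a
# non-overlapping 2x2 max pool.

def conv_out(size, kernel=3):          # valid convolution, stride 1
    return size - kernel + 1

def pool_out(size, window=2):          # non-overlapping 2x2 max pool
    return size // window

h = w = 12                             # 12x12 RGB input patch
h, w = conv_out(h), conv_out(w)        # Conv(3x3, 8 filters) -> 10x10x8
h, w = conv_out(h), conv_out(w)        # Conv(3x3, 8 filters) -> 8x8x8
h, w = pool_out(h), pool_out(w)        # MaxPool(2x2)         -> 4x4x8
flat = h * w * 8                       # flatten before Dense(32)
print(h, w, flat)                      # -> 4 4 128
```

If the convolutions used same-padding instead, the spatial size would stay 12×12 until the pool; the hierarchy of features is the same either way.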

Receptive field growth

After the first conv, each output pixel “sees” a 3×3 input region. After the second conv, each output pixel sees a 5×5 input region. Max pooling then enlarges receptive fields further: each 2×2 pooled pixel covers a 6×6 input region, and every layer after the pool samples input positions two apart, so receptive fields grow twice as fast from then on. Deeper networks capture global structure through this pyramidal growth.
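These numbers follow from the standard receptive-field recurrence: each layer adds (kernel − 1) × jump to the receptive field, where the "jump" is the input-pixel spacing between adjacent feature-map cells and is multiplied by each layer's stride. A quick check for this lesson's conv → conv → pool stack:

```python
# Receptive-field growth for conv(3x3, s=1) -> conv(3x3, s=1) -> pool(2x2, s=2).
# rf' = rf + (k - 1) * jump;  jump' = jump * stride.
layers = [(3, 1), (3, 1), (2, 2)]   # (kernel, stride) per layer

rf, jump = 1, 1                     # an input pixel sees itself
history = []
for k, s in layers:
    rf += (k - 1) * jump
    jump *= s
    history.append(rf)

print(history)                      # -> [3, 5, 6]
```

After the pool, jump is 2, so a further 3×3 conv would jump straight from a 6×6 to a 10×10 receptive field.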

Practical considerations

  • Batch normalization (not used here for simplicity) normally sits between conv and ReLU; it accelerates training dramatically in practice.
  • Data augmentation (random flips, crops) multiplies effective dataset size and is critical for real CIFAR-10 training.
  • Stacking small 3×3 filters is preferred over a single large filter: two 3×3 convolutions cover the same 5×5 receptive field with fewer parameters and an extra non-linearity.
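The last bullet can be made concrete by counting weights. Taking C = 8 channels in and out (this lesson's filter count) and ignoring biases:

```python
# Weight count: two stacked 3x3 convs vs one 5x5 conv, both covering a
# 5x5 receptive field. C channels in and out; biases ignored.
C = 8
stacked_3x3 = 2 * (3 * 3 * C * C)   # two 3x3 layers
single_5x5 = 5 * 5 * C * C          # one 5x5 layer
print(stacked_3x3, single_5x5)      # -> 1152 1600
```

The stacked option uses 18 weights per channel pair instead of 25, a 28% saving, and inserts an extra ReLU between the two layers.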

Live demo

The left panel visualizes the 8 learned filters after training — you should see edge detectors emerge. The right panel plots per-class accuracy to diagnose which shapes are hardest.

Key takeaways

  • Deeper networks capture higher-level features through hierarchical composition.
  • Two 3×3 filters are more parameter-efficient than one 5×5 filter while providing an extra ReLU non-linearity.
  • Real CIFAR-10 accuracy benefits greatly from deeper architectures (ResNet-style) and regularization techniques not shown here.

Next up — Lesson 6: we flip supervision on its head with autoencoders — learning a compressed latent representation with no labels.