RL 101 - Lesson 2 - Decision Boundaries
08 Oct 2025
Regression predicts a number; classification predicts a category. Visualizing how a network carves up 2-D space into regions is one of the most intuitive ways to grasp what “learning a representation” means.
The moon dataset
We generate two interleaved half-moon clouds of 200 points each. No linear boundary can separate them — a straight line will always misclassify a large chunk. A two-hidden-layer network, however, can learn a curved decision boundary that wraps around both moons.
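The half-moon data can be generated in a few lines. The library route is `sklearn.datasets.make_moons`; below is a minimal NumPy stand-in so the lesson stays dependency-free — the arc parameterization follows the usual construction, and the noise level is an illustrative choice:

```python
import numpy as np

def make_moons(n_per_class=200, noise=0.1, seed=0):
    """Two interleaved half-moon clouds, n_per_class points each.

    Minimal NumPy stand-in for sklearn.datasets.make_moons;
    the 0.1 noise level is an illustrative choice.
    """
    rng = np.random.default_rng(seed)
    t = rng.uniform(0.0, np.pi, n_per_class)
    # Upper moon: an arc of the unit circle.
    upper = np.stack([np.cos(t), np.sin(t)], axis=1)
    # Lower moon: mirrored arc, shifted right and up to interleave.
    lower = np.stack([1.0 - np.cos(t), -np.sin(t) + 0.5], axis=1)
    X = np.concatenate([upper, lower])
    X += rng.normal(0.0, noise, X.shape)
    y = np.concatenate([np.zeros(n_per_class, dtype=int),
                        np.ones(n_per_class, dtype=int)])
    return X, y

X, y = make_moons()
```

Because the lower arc threads through the gap under the upper arc, no single line separates the two label sets — which is exactly why this dataset is the standard demo for non-linear classifiers.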
Softmax + cross-entropy
For binary classification we use a [2 → 20 → 20 → 2] network and output class probabilities via softmax:
\[p_k = \frac{e^{z_k}}{\sum_j e^{z_j}}.\]
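In code, softmax is one line plus a stability trick — subtracting the row-wise maximum leaves the result unchanged (softmax is shift-invariant) but prevents `exp` from overflowing on large logits:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)  # shift-invariance: same probs, no overflow
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

Without the shift, logits as small as ~750 already overflow float64.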
We minimize cross-entropy loss:
\[\mathcal{L} = -\frac{1}{N}\sum_{i=1}^N \log p_{y_i}(\mathbf{x}_i).\]
The combined softmax + cross-entropy gradient is elegantly simple:
\[\frac{\partial \mathcal{L}}{\partial z_k} = p_k - \mathbf{1}[k = y].\]
Live demo
The heatmap updates every 20 steps — blue for class 0, orange for class 1. Watch the boundary curve and sharpen as training progresses.
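Here is a sketch of the training loop behind a demo like this, assuming tanh hidden activations, plain full-batch SGD, and a 0.1 learning rate — none of which are specified above. It uses two easy Gaussian clusters as stand-in data; the real demo runs on the moons. The key line is the gradient `p - 1[k = y]` from the formula above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in data: two noisy clusters (the real demo uses the moons).
N = 200
X = np.concatenate([rng.normal(-1, 0.5, (N, 2)), rng.normal(1, 0.5, (N, 2))])
y = np.concatenate([np.zeros(N, dtype=int), np.ones(N, dtype=int)])

# [2 -> 20 -> 20 -> 2] MLP. tanh and lr=0.1 are assumptions.
sizes = [2, 20, 20, 2]
Ws = [rng.normal(0, np.sqrt(1.0 / m), (m, n)) for m, n in zip(sizes, sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]
lr = 0.1

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

for step in range(200):
    # Forward pass, caching activations for backprop.
    acts, h = [X], X
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.tanh(h @ W + b)
        acts.append(h)
    logits = h @ Ws[-1] + bs[-1]
    p = softmax(logits)
    loss = -np.log(p[np.arange(len(y)), y]).mean()

    # Combined softmax + cross-entropy gradient: p_k - 1[k = y],
    # averaged over the batch.
    g = p.copy()
    g[np.arange(len(y)), y] -= 1.0
    g /= len(y)

    # Backward pass through each layer.
    for i in reversed(range(len(Ws))):
        gW = acts[i].T @ g
        gb = g.sum(axis=0)
        if i > 0:
            g = (g @ Ws[i].T) * (1.0 - acts[i] ** 2)  # tanh derivative
        Ws[i] -= lr * gW
        bs[i] -= lr * gb

    if step % 20 == 0:
        pass  # here the demo would re-render the heatmap

acc = (p.argmax(axis=1) == y).mean()
```

Note that backprop never differentiates softmax and the log-loss separately — the combined `p - 1[k = y]` gradient is both simpler and numerically better behaved.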
Key takeaways
- Non-linear activation functions are what allow the network to produce curved decision boundaries.
- Wider or deeper networks can fit more complex boundaries but may overfit on small datasets.
- Visualizing the decision boundary is invaluable for debugging: if the boundary is jagged or wraps too tightly around individual points, reduce model capacity or add regularization.
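The heatmap itself is just the classifier evaluated on a regular grid. A sketch, assuming a hypothetical `predict` function that maps an (N, 2) array of points to class labels — the plot limits and resolution are illustrative:

```python
import numpy as np

def boundary_heatmap(predict, xlim=(-2, 3), ylim=(-2, 2), res=100):
    """Evaluate a classifier on a res-by-res grid of 2-D points.

    `predict` is any callable mapping an (N, 2) array to (N,) class
    labels. The returned (res, res) label grid can be handed to e.g.
    matplotlib's contourf to draw the class regions.
    """
    xs = np.linspace(*xlim, res)
    ys = np.linspace(*ylim, res)
    xx, yy = np.meshgrid(xs, ys)
    pts = np.stack([xx.ravel(), yy.ravel()], axis=1)
    return predict(pts).reshape(res, res)

# Example with a hypothetical linear classifier:
grid = boundary_heatmap(lambda p: (p[:, 0] + p[:, 1] > 0).astype(int))
```

Re-rendering this grid every 20 training steps is all the live demo needs; at 100×100 resolution that is a single 10,000-point forward pass.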
Next up — Lesson 3: we compare SGD, Adagrad, and Adam side-by-side on the same problem.