Three identical networks train on the same dataset simultaneously — one with SGD, one with Adagrad, one with Adam. Watch how quickly each loss falls and how stable the curves are near convergence.

Loss curves

Lower is better. Each optimizer sees the same mini-batches and follows the same learning-rate schedule; only the base learning rate (shown in the legend) and the update rule differ.

SGD (lr 0.05)   Adagrad (lr 0.1)   Adam (lr 0.005)
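The page's actual training code isn't shown, but the three update rules can be sketched on a one-parameter least-squares problem, reusing the learning rates from the legend. The data, model, and step counts below are illustrative assumptions, not the demo's real setup.

```python
import numpy as np

# Toy stand-in for the demo's network: fit a single weight w so that
# w * x approximates y = 3x + noise. (Assumed data, not the page's.)
rng = np.random.default_rng(0)
x = rng.normal(size=256)
y = 3.0 * x + 0.1 * rng.normal(size=256)

def sgd_step(g, state, lr=0.05):
    # Plain gradient descent: a fixed fraction of the raw gradient.
    return lr * g, state

def adagrad_step(g, state, lr=0.1, eps=1e-8):
    # Accumulate squared gradients; the effective step shrinks over time.
    state["G"] = state.get("G", 0.0) + g * g
    return lr * g / (np.sqrt(state["G"]) + eps), state

def adam_step(g, state, lr=0.005, b1=0.9, b2=0.999, eps=1e-8):
    # Bias-corrected moving averages of the gradient (m) and its square (v).
    t = state.get("t", 0) + 1
    m = b1 * state.get("m", 0.0) + (1 - b1) * g
    v = b2 * state.get("v", 0.0) + (1 - b2) * g * g
    state.update(t=t, m=m, v=v)
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return lr * m_hat / (np.sqrt(v_hat) + eps), state

def fit(step_fn, steps):
    w, state = 0.0, {}
    for _ in range(steps):
        g = float(np.mean(2.0 * (w * x - y) * x))  # full-batch MSE gradient
        delta, state = step_fn(g, state)
        w -= delta
    return w

for name, step_fn in [("SGD", sgd_step),
                      ("Adagrad", adagrad_step),
                      ("Adam", adam_step)]:
    print(f"{name}: w after 200 steps = {fit(step_fn, 200):.2f}, "
          f"after 2000 steps = {fit(step_fn, 2000):.2f}")
```

With these settings the 200-step snapshot mirrors the second panel below: SGD's fit already sits near the data's true slope, while Adam's small bounded per-step size leaves it visibly short of the target; given enough steps all three approach the same solution, though Adagrad's accumulated denominator keeps shrinking its effective step along the way.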

Current curve fits

Same data (dots), three different fitted curves after the same number of steps.