Three identical networks train on the same dataset simultaneously: one with SGD, one with Adagrad, one with Adam. Watch how quickly each loss curve falls and how stable it stays near convergence.
Lower is better. Each optimizer uses the same learning rate schedule and the same mini-batches.
Same data (dots), three different fitted curves after the same number of steps.
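The update rules behind the three optimizers can be sketched on a toy problem. This is a minimal scalar sketch, not the demo's actual training loop: the learning rate, beta values, step count, and the quadratic loss below are illustrative assumptions.

```python
import math

# Minimize the toy loss f(w) = w^2 (gradient 2w) with each optimizer.
# LR and STEPS are illustrative assumptions, not the demo's settings.
LR, STEPS = 0.1, 200

def run_sgd(w):
    for _ in range(STEPS):
        w -= LR * 2 * w                        # plain gradient step
    return w

def run_adagrad(w, eps=1e-8):
    cache = 0.0
    for _ in range(STEPS):
        g = 2 * w
        cache += g * g                         # accumulate squared gradients
        w -= LR * g / (math.sqrt(cache) + eps) # per-step size shrinks as cache grows
    return w

def run_adam(w, b1=0.9, b2=0.999, eps=1e-8):
    m = v = 0.0
    for t in range(1, STEPS + 1):
        g = 2 * w
        m = b1 * m + (1 - b1) * g              # first moment (momentum)
        v = b2 * v + (1 - b2) * g * g          # second moment (scale)
        m_hat = m / (1 - b1 ** t)              # bias correction for zero init
        v_hat = v / (1 - b2 ** t)
        w -= LR * m_hat / (math.sqrt(v_hat) + eps)
    return w

for name, run in [("SGD", run_sgd), ("Adagrad", run_adagrad), ("Adam", run_adam)]:
    print(f"{name}: final loss {run(5.0) ** 2:.6f}")
```

Note how Adagrad's denominator only ever grows, so its effective step size decays over training, while Adam's exponential moving averages let its step size adapt without shrinking toward zero.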