The learning rate is too high for this ravine. Plain gradient descent becomes unstable before momentum has a chance to help.
[Figure legend: Gradient descent, Momentum, Nesterov]
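The instability described above can be reproduced on a toy problem. The sketch below is not the visualization's own code. It assumes a hypothetical quadratic ravine `f(x, y) = 0.5 * (x**2 + 100 * y**2)`, a hypothetical starting point, and the common textbook forms of the three update rules, with `beta` as the momentum coefficient. On a quadratic with largest curvature `h`, plain gradient descent multiplies the steep coordinate by `1 - lr * h` at every step, so it diverges once `lr` exceeds `2 / h` (here 0.02).

```python
import numpy as np

# Hypothetical ravine: f(x, y) = 0.5 * (x**2 + 100 * y**2).
# The curvatures 1 and 100 are assumptions chosen for illustration.
CURVATURE = np.array([1.0, 100.0])


def grad(p):
    """Gradient of the assumed quadratic at point p."""
    return CURVATURE * p


def run(method, lr, beta=0.9, steps=50):
    """Run one optimizer and return the final distance to the minimum at (0, 0)."""
    p = np.array([10.0, 1.0])  # assumed starting point
    v = np.zeros_like(p)       # velocity, unused by plain gradient descent
    for _ in range(steps):
        if method == "gd":
            p = p - lr * grad(p)
        elif method == "momentum":
            # Classical (heavy-ball) momentum.
            v = beta * v - lr * grad(p)
            p = p + v
        elif method == "nesterov":
            # Nesterov: evaluate the gradient at the look-ahead point.
            v = beta * v - lr * grad(p + beta * v)
            p = p + v
        else:
            raise ValueError(f"unknown method: {method}")
    return float(np.linalg.norm(p))


# Stability threshold for plain gradient descent on this quadratic: 2 / 100 = 0.02.
for lr in (0.015, 0.021):
    for method in ("gd", "momentum", "nesterov"):
        print(f"lr={lr:<6} {method:<9} final distance = {run(method, lr):.3g}")
```

Running it with a learning rate just below and just above 0.02 shows plain gradient descent switching from shrinking to growing along the steep `y` direction. How the two momentum variants behave at the same learning rate depends on `beta` and on the exact formulation, which may differ from the one used in the visualization.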