Optimization Lab 4 - Adagrad, RMSProp, and Adam
04 May 2026
Plain SGD uses one global learning rate for every parameter. Deep networks rarely behave that uniformly. Some weights receive huge gradients every step; others activate only occasionally. Adaptive optimizers try to compensate by scaling each coordinate separately.
The core idea: per-parameter step sizes
Instead of updating with
\[\theta_{t+1} = \theta_t - \eta g_t,\]we maintain a diagonal preconditioner and use
\[\theta_{t+1} = \theta_t - \eta\, D_t^{-1/2} g_t\]for some coordinate-wise statistic $D_t$ built from past gradients.
If a coordinate has seen consistently large gradients, we shrink its step. If a coordinate is rarely active, we allow larger effective updates when it finally appears.
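A minimal NumPy sketch of that coordinate-wise update; the `precond` array here is only a stand-in for whatever statistic $D_t$ the optimizers below actually accumulate:

```python
import numpy as np

def preconditioned_step(theta, grad, precond, lr=0.1, eps=1e-8):
    # Coordinate-wise update: each gradient component is divided by the
    # square root of its own accumulated statistic before the step is taken.
    return theta - lr * grad / (np.sqrt(precond) + eps)

theta = np.array([1.0, 1.0])
grad = np.array([10.0, 0.1])     # one coordinate sees a 100x larger gradient
precond = grad ** 2              # placeholder for the statistic D_t
print(preconditioned_step(theta, grad, precond))
# with this scaling both coordinates move by roughly lr, despite the gap
```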
Adagrad
Adagrad accumulates squared gradients forever:
\[G_t = \sum_{\tau=1}^{t} g_\tau \odot g_\tau, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{G_t} + \epsilon} \odot g_t\]
This is excellent for sparse features because rarely updated coordinates keep relatively large step sizes. The downside is that $G_t$ only grows, so the effective learning rate keeps shrinking and can become too small late in training.
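A sketch of the update in NumPy, with a toy loop showing the rarely active coordinate keeping a larger effective learning rate. The 10% activity level is an arbitrary choice for illustration:

```python
import numpy as np

def adagrad_step(theta, grad, G, lr=0.1, eps=1e-8):
    # G accumulates squared gradients forever; coordinates with a large
    # gradient history get small steps, rarely active ones keep larger steps.
    G = G + grad ** 2
    theta = theta - lr * grad / (np.sqrt(G) + eps)
    return theta, G

rng = np.random.default_rng(0)
theta, G = np.zeros(2), np.zeros(2)
for t in range(200):
    # coordinate 0 is active every step, coordinate 1 only ~10% of the time
    grad = np.array([1.0, 1.0 if rng.random() < 0.1 else 0.0])
    theta, G = adagrad_step(theta, grad, G)

print(0.1 / (np.sqrt(G) + 1e-8))  # effective per-coordinate learning rates:
                                   # the sparse coordinate's rate stays noticeably larger
```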
RMSProp
RMSProp replaces the cumulative sum with an exponential moving average:
\[S_t = \rho S_{t-1} + (1-\rho) g_t \odot g_t, \qquad \theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{S_t} + \epsilon} \odot g_t\]
Now the optimizer has a finite memory. Large gradients from the distant past do not suppress learning forever.
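The same sketch with the exponential moving average swapped in; the short burst of large gradients is a contrived input chosen to show the effective rate recovering afterwards:

```python
import numpy as np

def rmsprop_step(theta, grad, S, lr=0.01, rho=0.9, eps=1e-8):
    # Exponential moving average of squared gradients: the contribution of a
    # gradient from k steps ago is weighted by rho**k, so old magnitudes fade.
    S = rho * S + (1 - rho) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(S) + eps)
    return theta, S

theta, S = np.zeros(1), np.zeros(1)
for grad in [np.array([10.0])] * 5 + [np.array([0.1])] * 50:
    theta, S = rmsprop_step(theta, grad, S)

print(0.01 / (np.sqrt(S) + 1e-8))  # effective rate starts to recover once the
                                    # large gradients stop (Adagrad's would not)
```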
Adam
Adam combines RMS-style scaling with momentum:
\[m_t = \beta_1 m_{t-1} + (1-\beta_1) g_t, \qquad v_t = \beta_2 v_{t-1} + (1-\beta_2) g_t \odot g_t\]
Because both moving averages start at zero, Adam uses bias correction:
\[\hat m_t = \frac{m_t}{1-\beta_1^t}, \qquad \hat v_t = \frac{v_t}{1-\beta_2^t}\]and updates with
\[\theta_{t+1} = \theta_t - \eta \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon}\]This is why Adam often works well out of the box: it has both direction smoothing and coordinate-wise scaling.
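Putting the pieces together in the same NumPy style; the defaults below are the values commonly quoted for Adam, whether they suit a particular model is a separate question:

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # First moment m smooths the direction (momentum-like), second moment v
    # scales each coordinate (RMS-like); both are corrected for the zero start.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)          # t is the 1-based step count
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```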
Sparse gradients and embeddings
A classic use case is language or recommendation models with embedding tables. For any one example, only a tiny subset of rows may be active, so many parameters receive zero gradient most of the time. Adaptive methods can make those rare coordinates learn much faster than a single fixed global learning rate would allow.
The demo below recreates that situation with a two-parameter regression problem where one feature is intentionally sparse.
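The interactive demo's exact settings aren't reproduced here, so the sketch below is an offline stand-in: the true weights, the 5% activity level for the sparse feature, and the per-optimizer learning rates are all assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
true_w = np.array([2.0, -3.0])            # assumed "true" weights for the toy problem
n = 2000
X = rng.normal(size=(n, 2))
X[:, 1] *= rng.random(n) < 0.05           # feature 1 is active on ~5% of examples
y = X @ true_w + 0.1 * rng.normal(size=n)

def grad(w, xb, yb):
    # gradient of mean squared error for the linear model xb @ w
    return xb.T @ (xb @ w - yb) / len(yb)

def sgd(w, g, state, t, lr):
    return w - lr * g

def adagrad(w, g, state, t, lr, eps=1e-8):
    G = state.get("G", np.zeros_like(w)) + g ** 2
    state["G"] = G
    return w - lr * g / (np.sqrt(G) + eps)

def adam(w, g, state, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * state.get("m", np.zeros_like(w)) + (1 - b1) * g
    v = b2 * state.get("v", np.zeros_like(w)) + (1 - b2) * g ** 2
    state["m"], state["v"] = m, v
    m_hat, v_hat = m / (1 - b1 ** t), v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps)

def run(step, lr, steps=500, batch=32):
    w, state = np.zeros(2), {}
    for t in range(1, steps + 1):
        idx = rng.integers(0, n, size=batch)
        w = step(w, grad(w, X[idx], y[idx]), state, t, lr)
    return w

for name, step, lr in [("SGD", sgd, 0.1), ("Adagrad", adagrad, 0.5), ("Adam", adam, 0.1)]:
    print(f"{name:8s} {run(step, lr)}")   # compare how close each gets to true_w
```

The point of running it is the second coordinate: plain SGD barely touches it because its gradient is zero on most minibatches, while the adaptive methods give it larger steps whenever it does appear.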
Live demo
The parameter-space plot shows the loss surface over both parameters, while the right-hand bars show the current effective learning rate each optimizer applies to each coordinate.
Things to observe
- Lower the sparse-feature activity. Adagrad increasingly emphasizes the rare coordinate.
- Watch the bar chart rather than just the loss curve. That is where the preconditioning behaviour becomes obvious.
- RMSProp forgets old gradient magnitudes, so its effective rates stay more mobile than Adagrad’s.
- Adam usually reaches the optimum fastest because it combines adaptive scaling with momentum.
Key takeaways
- Adaptive optimizers rescale each coordinate separately instead of sharing one learning rate.
- Adagrad is strong for sparse problems, but its effective learning rate decays forever.
- RMSProp adds a finite memory through an exponential moving average.
- Adam layers momentum on top of RMS-style adaptation, which is why it is such a common deep-learning default.