Optimization Lab 5 - Learning Rate Schedules
05 May 2026
Choosing an optimizer is only half the story. The other half is deciding how its learning rate should evolve during training. A good schedule often makes the difference between fast early progress that settles into a stable solution and a run that plateaus or fails outright.
Why a fixed learning rate is rarely ideal
Early in training, gradients are large and parameters are far from a good region. Large steps are useful because we want to move quickly.
Late in training, we are usually refining a solution that is already close to good. Smaller steps are better because:
- the local curvature may be sharper
- mini-batch noise starts to dominate the gradient signal
- we want the optimizer to settle instead of jittering
That already suggests the key principle:
Start relatively large, end relatively small.
Common schedules
Constant
\[\eta_t = \eta_0\]
The baseline. Simple, but it keeps the same aggressiveness forever.
Step decay
\[\eta_t = \eta_0 \gamma^{\lfloor t / T \rfloor}\]
Every $T$ steps or epochs, multiply by a factor $\gamma < 1$. This is one of the oldest and still one of the most common schedules.
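As a concrete reference, here is a minimal Python sketch of step decay; the parameter names and default values (`base_lr`, `gamma`, `drop_every`) are illustrative, not taken from the demo:

```python
def step_decay(step, base_lr=0.1, gamma=0.5, drop_every=30):
    """Multiply the base rate by gamma every `drop_every` steps."""
    return base_lr * gamma ** (step // drop_every)
```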
Cosine decay
\[\eta_t = \eta_0 \cdot \frac{1}{2}\left(1 + \cos\frac{\pi t}{T_{\max}}\right)\]
The rate decreases smoothly instead of dropping abruptly. This tends to make the late stage of training more stable.
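A matching sketch of cosine decay, again with illustrative defaults (`base_lr`, `total_steps`):

```python
import math

def cosine_decay(step, base_lr=0.1, total_steps=200):
    """Anneal smoothly from base_lr down to 0 over total_steps."""
    progress = min(step, total_steps) / total_steps
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```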
Warmup + cosine
Warmup starts from a small learning rate and increases it linearly over the first steps, then hands off to cosine decay. This is now standard in large-batch training and transformers because the earliest gradients can be untrustworthy or badly scaled.
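A minimal sketch combining the two phases; `warmup_steps` and the other defaults are placeholders rather than recommended settings:

```python
import math

def warmup_cosine(step, base_lr=0.1, warmup_steps=10, total_steps=200):
    """Linear warmup to base_lr, then cosine decay down to 0."""
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # ramp up from a small value
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
```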
Schedules vs optimizers
It is tempting to think the choice between Adam and SGD determines everything. In practice:
- Adam with a bad schedule can still plateau or diverge
- SGD with momentum and a good schedule can outperform Adam on many vision tasks
- large models often rely on warmup even when the optimizer itself is unchanged
The schedule controls when the optimizer explores and when it settles.
Live demo
All four runs use the same noisy quadratic objective and the same gradient noise sequence. The only difference is the learning-rate schedule:
- constant
- step decay
- cosine
- warmup + cosine
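The demo itself is interactive, but a minimal sketch of this kind of comparison looks roughly like the following; the quadratic's curvature, the noise scale, the step count, and the base learning rate of 0.05 are assumptions rather than the demo's actual settings, and it reuses the schedule functions sketched above:

```python
import numpy as np

STEPS = 200
rng = np.random.default_rng(0)                      # one shared noise sequence for every run
noise = rng.normal(scale=0.5, size=(STEPS, 2))

def run(schedule):
    """Noisy quadratic: loss = 0.5 * sum(c_i * x_i^2), gradient = c * x + noise."""
    x = np.array([3.0, 2.0])
    curvature = np.array([1.0, 10.0])               # mildly ill-conditioned
    losses = []
    for t in range(STEPS):
        grad = curvature * x + noise[t]
        x = x - schedule(t) * grad
        losses.append(0.5 * float(curvature @ (x * x)))
    return losses

schedules = {
    "constant":        lambda t: 0.05,
    "step decay":      lambda t: step_decay(t, base_lr=0.05),
    "cosine":          lambda t: cosine_decay(t, base_lr=0.05, total_steps=STEPS),
    "warmup + cosine": lambda t: warmup_cosine(t, base_lr=0.05, total_steps=STEPS),
}
results = {name: run(sched) for name, sched in schedules.items()}
```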
Things to observe
- A constant learning rate drives the loss down quickly at first but often keeps bouncing near the minimum.
- Step decay introduces visible changes in behaviour exactly when the drops happen.
- Cosine decay tends to be smoother, especially late in training.
- Warmup is deliberately slow for the first few steps, but it avoids unstable early jumps.
Key takeaways
- A schedule controls the exploration-to-refinement trade-off over time.
- Constant learning rates are often too rigid for long deep-learning runs.
- Step and cosine decay reduce late-stage noise without sacrificing early speed.
- Warmup is especially useful when the very first gradients are large, noisy, or poorly calibrated.