Optimization Lab 1 - Gradient Descent and the Geometry of Learning
01 May 2026
Training a neural network is, at bottom, a problem in numerical optimization. We choose parameters $\theta$, define a loss $L(\theta)$, and then repeatedly move $\theta$ in whatever direction makes the loss go down fastest. The simplest version of that idea is gradient descent.
The optimization problem behind deep learning
Given a dataset $\{(x_i, y_i)\}_{i=1}^N$ and model $f(x;\theta)$, training solves
\[\min_\theta L(\theta), \qquad L(\theta) = \frac{1}{N}\sum_{i=1}^N \ell\!\bigl(f(x_i;\theta), y_i\bigr)\]The loss may be mean-squared error, cross-entropy, contrastive loss, or something more exotic, but the optimization loop is always the same:
- Compute the gradient $\nabla L(\theta_t)$
- Take a step downhill
- Repeat
The gradient points in the direction of steepest local increase, so the negative gradient is the direction of steepest local decrease.
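To make that loop concrete, here is a minimal sketch that runs it on a toy least-squares problem. Everything in it is illustrative: the synthetic dataset, the linear model, the learning rate, and the step count are placeholder choices, not anything prescribed by this lab.

```python
import numpy as np

# Toy dataset (made up for illustration): N points from a noisy linear model.
rng = np.random.default_rng(0)
N = 100
X = rng.normal(size=(N, 2))
theta_true = np.array([2.0, -1.0])
y = X @ theta_true + 0.1 * rng.normal(size=N)

def loss(theta):
    # Mean-squared error: L(theta) = (1/N) * sum_i (f(x_i; theta) - y_i)^2
    return np.mean((X @ theta - y) ** 2)

def grad(theta):
    # Analytic gradient of the MSE above.
    return (2.0 / N) * X.T @ (X @ theta - y)

theta = np.zeros(2)   # initial parameters
eta = 0.1             # learning rate (arbitrary, not tuned)
for t in range(200):
    theta = theta - eta * grad(theta)   # step downhill along the negative gradient

print(loss(theta), theta)  # loss ends near the noise floor, theta near theta_true
```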
Deriving the update rule
Take a first-order Taylor expansion around the current point $\theta_t$:
\[L(\theta_t + \Delta) \approx L(\theta_t) + \nabla L(\theta_t)^\top \Delta\]If we constrain the step size to $\|\Delta\| = \eta$, the linear term is minimized by choosing $\Delta$ opposite the gradient:
\[\Delta = -\eta \frac{\nabla L(\theta_t)}{\|\nabla L(\theta_t)\|}\]Absorbing the normalization into the learning rate gives the standard update
\[\theta_{t+1} = \theta_t - \eta \nabla L(\theta_t)\]where $\eta > 0$ is the learning rate.
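A quick numerical sanity check of the steepest-descent claim: for a step of fixed small length, no unit direction decreases the loss more, to first order, than the normalized negative gradient. The example function, starting point, and step length below are arbitrary choices made just for this check.

```python
import numpy as np

# Arbitrary smooth example function and its gradient, chosen only for this check.
def L(theta):
    return theta[0] ** 2 + 3.0 * theta[1] ** 2 + np.sin(theta[0] * theta[1])

def grad_L(theta):
    c = np.cos(theta[0] * theta[1])
    return np.array([2.0 * theta[0] + theta[1] * c,
                     6.0 * theta[1] + theta[0] * c])

theta = np.array([1.0, -0.5])
eta = 1e-4                     # small enough that the linear term dominates
g = grad_L(theta)

# Decrease along the normalized negative gradient.
steepest = L(theta - eta * g / np.linalg.norm(g)) - L(theta)

# Best decrease found among random unit directions of the same length.
rng = np.random.default_rng(1)
best_random = np.inf
for _ in range(200):
    d = rng.normal(size=2)
    d /= np.linalg.norm(d)
    best_random = min(best_random, L(theta + eta * d) - L(theta))

print(steepest, best_random)   # steepest should be at least as negative as best_random
```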
Why curvature matters
Near a local minimum $\theta^*$, the loss is well approximated by a quadratic:
\[L(\theta^* + \delta) \approx L(\theta^*) + \frac{1}{2}\delta^\top H \delta\]where $H = \nabla^2 L(\theta^*)$ is the Hessian (the linear term is absent because $\nabla L(\theta^*) = 0$ at a minimum). Its eigenvalues measure curvature in different directions:
- Large eigenvalue: steep direction
- Small eigenvalue: flat direction
For a quadratic bowl, gradient descent is stable only if
\[0 < \eta < \frac{2}{\lambda_{\max}(H)}\]Along an eigendirection with curvature $\lambda$, each update scales that coordinate by $1 - \eta\lambda$, whose magnitude stays below one exactly when $0 < \eta < 2/\lambda$; the steepest direction is therefore the binding constraint. If $\eta$ is too small, training crawls. If it is too large, the update overshoots and the loss oscillates or explodes.
The ratio
\[\kappa = \frac{\lambda_{\max}}{\lambda_{\min}}\]is the condition number. When $\kappa$ is large, the bowl becomes a narrow ravine: one direction wants tiny steps, the other wants much larger ones. This is the basic geometric reason optimization in deep learning is difficult.
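To put numbers on this, the snippet below builds the Hessian of the 2D bowl used later in the demo and reads off the condition number, the largest stable step size, and the per-step contraction along each eigendirection. The values of $\kappa$ and $\eta$ are arbitrary examples.

```python
import numpy as np

kappa = 25.0                 # arbitrary example value
H = np.diag([1.0, kappa])    # Hessian of f(x, y) = 0.5 * (x^2 + kappa * y^2)

eigs = np.linalg.eigvalsh(H)             # eigenvalues, ascending
lam_min, lam_max = eigs[0], eigs[-1]

print("condition number  :", lam_max / lam_min)   # equals kappa here
print("largest stable eta:", 2.0 / lam_max)

# Per-step contraction |1 - eta * lambda| along each eigendirection,
# for a stable-but-not-tiny learning rate.
eta = 0.05
print("steep direction   :", abs(1.0 - eta * lam_max))   # shrinks quickly
print("flat direction    :", abs(1.0 - eta * lam_min))   # barely moves
```

The two contraction factors are the ravine in miniature: the steep direction is nearly solved after a handful of steps, while the flat direction sets the overall convergence rate.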
Why this matters for neural networks
Deep-network losses are not convex quadratics, but locally they still behave like curved surfaces. At every training step the optimizer is effectively navigating a changing, noisy quadratic approximation. The same issues already appear:
- badly scaled directions
- sharp vs flat curvature
- exploding updates from overly large learning rates
- slow progress along shallow directions
The toy 2D bowl in this lab is not a neural network, but it isolates the exact mathematical phenomenon that shows up in large models.
Live demo
The canvas shows the loss
\[f(x,y) = \frac{1}{2}(x^2 + \kappa y^2)\]with adjustable learning rate $\eta$ and condition number $\kappa$. The gold path is the parameter vector moving under gradient descent. The right panel plots the loss on a log scale.
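If you want the same picture outside the browser, here is a rough offline version of the demo; the $\kappa$, $\eta$, start point, and step count are placeholders, and you can hand the resulting arrays to matplotlib to recreate the two panels.

```python
import numpy as np

def gd_path(kappa, eta, start=(4.0, 2.0), steps=60):
    """Gradient descent on f(x, y) = 0.5 * (x^2 + kappa * y^2), recording the trajectory."""
    theta = np.array(start, dtype=float)
    path, losses = [theta.copy()], []
    for _ in range(steps):
        grad = np.array([theta[0], kappa * theta[1]])   # analytic gradient of the bowl
        theta = theta - eta * grad
        path.append(theta.copy())
        losses.append(0.5 * (theta[0] ** 2 + kappa * theta[1] ** 2))
    return np.array(path), np.array(losses)

path, losses = gd_path(kappa=20.0, eta=0.08)

print(np.log10(losses[::10]))   # log-scale decay, as in the right panel
print(path[:8, 1])              # zig-zag shows up as sign flips in the steep (y) coordinate
```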
Things to observe
- Increase $\kappa$ while keeping $\eta$ fixed. The path begins to zig-zag because the steep direction forces repeated overshoot corrections.
- Set $\eta$ near $2/\kappa$. The trajectory becomes marginally stable and starts to bounce; the sweep after this list reproduces the effect numerically.
- Reduce $\eta$ drastically. The path becomes safe but painfully slow.
- Compare the path geometry against the loss plot: visible zig-zag on the left appears as slower decay on the right.
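These regimes are easy to confirm without the canvas. The sweep below, with an arbitrary $\kappa$ and start point, runs a fixed number of steps at a crawling rate, a comfortable rate, a rate just under $2/\kappa$, and a rate just over it, then prints the final loss in each case.

```python
import numpy as np

kappa = 50.0                     # arbitrary example value
threshold = 2.0 / kappa          # lambda_max = kappa for f(x, y) = 0.5 * (x^2 + kappa * y^2)

def final_loss(eta, steps=200):
    theta = np.array([4.0, 2.0])                       # arbitrary start point
    for _ in range(steps):
        grad = np.array([theta[0], kappa * theta[1]])  # gradient of the bowl
        theta = theta - eta * grad
    return 0.5 * (theta[0] ** 2 + kappa * theta[1] ** 2)

for label, eta in [("crawl      ", 0.001),
                   ("comfortable", 0.5 * threshold),
                   ("near limit ", 0.99 * threshold),
                   ("over limit ", 1.05 * threshold)]:
    print(f"{label}  eta = {eta:.4f}  final loss = {final_loss(eta):.3e}")
```

The crawling rate leaves the loss almost untouched, the rate near the threshold converges but slowly and with bouncing in the steep direction, and the rate above the threshold diverges.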
Key takeaways
- Gradient descent is just repeated application of $\theta \leftarrow \theta - \eta \nabla L(\theta)$.
- The Hessian controls which learning rates are stable.
- Ill-conditioning, not just non-convexity, is a major source of optimization difficulty.
- Most of the optimizers used in deep learning are attempts to fix plain gradient descent’s sensitivity to curvature.