Optimization Lab 6 - Newton's Method and Curvature
07 May 2026
Gradient descent uses only first-order information: the slope. Newton’s method goes one level deeper and asks for the local curvature as well. In principle that is exactly what we want. In practice, deep learning makes the price of curvature brutally high.
From first-order to second-order optimization
Around the current iterate $\theta_t$, take a second-order Taylor expansion:
\[
L(\theta_t + p) \approx L(\theta_t) + g_t^\top p + \frac{1}{2} p^\top H_t p,
\]
where
\[
g_t = \nabla L(\theta_t), \qquad H_t = \nabla^2 L(\theta_t).
\]
To minimize this local quadratic model, differentiate with respect to $p$ and set the derivative to zero:
\[
g_t + H_t p = 0,
\]
so the Newton step is
\[
p_t = -H_t^{-1} g_t.
\]
Then
\[
\theta_{t+1} = \theta_t + p_t = \theta_t - H_t^{-1} g_t.
\]
This is gradient descent with the gradient rescaled by curvature.
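As a minimal NumPy sketch, one raw Newton update looks like this; `grad_fn` and `hess_fn` are hypothetical callables returning the gradient and Hessian of the loss at the current parameters:

```python
import numpy as np

def newton_step(theta, grad_fn, hess_fn):
    """One undamped Newton update: theta - H^{-1} g.

    grad_fn and hess_fn are hypothetical callables returning the
    gradient vector and Hessian matrix of the loss at theta.
    """
    g = grad_fn(theta)
    H = hess_fn(theta)
    # Solve the linear system H p = -g rather than forming H^{-1}
    # explicitly: cheaper and numerically better behaved.
    p = np.linalg.solve(H, -g)
    return theta + p
```

Solving the linear system instead of explicitly inverting $H_t$ is the standard implementation choice; forming the inverse is both slower and less stable.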
Why Newton can be dramatically faster
On a quadratic objective,
\[
L(\theta) = \frac{1}{2}\theta^\top H \theta + b^\top \theta + c,
\]
with $H$ positive definite, Newton’s method jumps directly to the minimizer in one step, because the quadratic model is exact.
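A quick numerical sanity check of the one-step claim (a toy example, not part of the lab’s demo): for a random positive-definite $H$, a single Newton step from any starting point lands exactly on the minimizer $-H^{-1}b$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
H = A @ A.T + 5.0 * np.eye(5)     # random symmetric positive-definite Hessian
b = rng.standard_normal(5)

# L(theta) = 0.5 * theta^T H theta + b^T theta + c, so grad = H theta + b
theta0 = rng.standard_normal(5)           # arbitrary starting point
g0 = H @ theta0 + b                       # gradient at theta0
theta1 = theta0 - np.linalg.solve(H, g0)  # one Newton step
theta_star = np.linalg.solve(H, -b)       # exact minimizer of the quadratic

print(np.allclose(theta1, theta_star))    # True: one step lands on the minimum
```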
That sounds unbeatable. The catch is that neural networks are not small quadratics:
- the Hessian is enormous
- forming it explicitly is expensive
- inverting it is even more expensive
- away from minima, it may be indefinite or nearly singular
So even though the formula is elegant, naive Newton updates are rarely practical for deep nets.
Damping and trust-region thinking
When the Hessian is unstable, a standard fix is damping:
\[
p_t = -(H_t + \lambda I)^{-1} g_t,
\]
with $\lambda > 0$.
Interpretation:
- if $\lambda$ is large, $H_t + \lambda I \approx \lambda I$, so the step approaches a short gradient step, $p_t \approx -g_t / \lambda$
- if $\lambda$ is small and $H_t$ is well-conditioned, the step is close to the pure Newton step
This is the basic idea behind Levenberg-Marquardt and trust-region methods.
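In code, damping is a one-line change to the Newton solve. A sketch, with `grad_fn` and `hess_fn` the same hypothetical callables as before:

```python
import numpy as np

def damped_newton_step(theta, grad_fn, hess_fn, lam):
    """One damped Newton step with damping strength lam > 0."""
    g = grad_fn(theta)
    H = hess_fn(theta)
    # Adding lam * I shifts every eigenvalue of H up by lam, so the
    # system is guaranteed positive definite once lam exceeds the
    # magnitude of the most negative eigenvalue.
    p = np.linalg.solve(H + lam * np.eye(theta.size), -g)
    return theta + p
```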
What deep learning uses instead
Because full Newton is too expensive, large-scale training usually relies on approximations:
- quasi-Newton methods such as L-BFGS
- Gauss-Newton approximations
- natural-gradient and Fisher-matrix approximations
- Kronecker-factored methods such as K-FAC
These methods try to capture some curvature information without storing the full Hessian.
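A key enabler for all of these is that curvature can be probed without ever materializing $H$: a Hessian-vector product costs roughly two gradient evaluations. One standard way to approximate it is a central difference of the gradient, sketched here with a hypothetical `grad_fn`:

```python
import numpy as np

def hvp(grad_fn, theta, v, eps=1e-5):
    """Approximate the Hessian-vector product H v by a central
    difference of the gradient:

        H v ~ (g(theta + eps v) - g(theta - eps v)) / (2 eps)

    Two gradient calls and O(n) memory; the n-by-n Hessian is
    never formed. grad_fn is a hypothetical gradient callable.
    """
    g_plus = grad_fn(theta + eps * v)
    g_minus = grad_fn(theta - eps * v)
    return (g_plus - g_minus) / (2.0 * eps)
```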
Live demo
The demo compares:
- gradient descent
- pure Newton
- damped Newton
on the classic Rosenbrock function, whose banana-shaped valley makes first-order methods work hard. The blue arrow is the raw negative gradient; the gold arrow is the Newton direction.
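If you want to reproduce the comparison offline, here is a minimal NumPy stand-in for the demo (not the demo’s own code), using the analytic gradient and Hessian of the Rosenbrock function $f(x, y) = (1 - x)^2 + 100\,(y - x^2)^2$:

```python
import numpy as np

def rosen_grad(p):
    """Gradient of f(x, y) = (1 - x)^2 + 100 (y - x^2)^2."""
    x, y = p
    return np.array([-2.0 * (1 - x) - 400.0 * x * (y - x**2),
                     200.0 * (y - x**2)])

def rosen_hess(p):
    """Analytic Hessian of the Rosenbrock function."""
    x, y = p
    return np.array([[2.0 - 400.0 * (y - x**2) + 800.0 * x**2, -400.0 * x],
                     [-400.0 * x, 200.0]])

theta = np.array([-1.2, 1.0])  # the classic starting point
lam = 1.0                      # fixed damping, purely illustrative; 0 gives pure Newton
for t in range(30):
    g = rosen_grad(theta)
    H = rosen_hess(theta) + lam * np.eye(2)
    # The eigenvalues of H are what the demo's eigenvalue readout shows:
    # when the smallest one is tiny, the undamped step is untrustworthy.
    theta = theta + np.linalg.solve(H, -g)

print(theta)  # approaches the minimizer (1, 1); plain gradient descent
              # needs vastly more iterations from this starting point
```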
Things to observe
- Gradient descent keeps crossing the valley because the gradient points mostly sideways.
- Newton often turns sharply toward the valley floor because the Hessian rescales directions by curvature.
- When the Hessian becomes awkward, the damped method behaves more conservatively than pure Newton.
- The eigenvalue readout explains why some steps are trustworthy and others are dangerous.
Key takeaways
- Newton’s method comes from minimizing the second-order Taylor approximation of the loss.
- Curvature-aware steps can be far more direct than plain gradient descent.
- The obstacle in deep learning is not theory but scale: Hessians are huge and expensive.
- Modern second-order methods are mostly clever approximations to the ideal update $-H^{-1}g$.