RL 101 - Lesson 2

Lesson 2 — Policy Gradients for Precision Lunar Landings

To move beyond balancing tasks, we will control a simulated Lunar Lander that must modulate its main and side engine thrust to touch down softly on a landing pad set amid rough terrain. The dynamics are planar and underactuated, so discrete Q-learning struggles. This makes it a great playground for policy gradients, which directly optimize thrust decisions without enumerating actions.

Environment recap

The OpenAI Gym LunarLanderContinuous-v2 environment exposes an eight-dimensional state (pose, velocities, and two leg-contact flags) and two continuous actions:

\[\mathbf{s} = \begin{bmatrix} x & y & v_x & v_y & \theta & \dot\theta & c_L & c_R \end{bmatrix}^\top, \qquad \mathbf{a} = \begin{bmatrix} u_{main} & u_{side} \end{bmatrix}^\top,\]

where $c_L, c_R \in \{0, 1\}$ indicate left/right leg contact and both actions span $[-1, 1]$. Rewards blend proximity, velocity damping, leg contacts, and fuel usage:

\[R = r_{pos} + r_{vel} + r_{legs} - 0.3\,|u_{main}| - 0.03\,|u_{side}|.\]

Touching down within the target box while keeping $|v|$ small yields an episode return of roughly $+200$; crashing or drifting away incurs a large negative return.
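
Before wiring up any learning, it helps to confirm the raw interface. The sketch below rolls out a random policy and accumulates the reward defined above; it assumes the classic Gym step API, whereas recent Gymnasium releases return `(obs, info)` from `reset` and split `done` into `terminated`/`truncated`.

```python
import gym
import numpy as np

env = gym.make("LunarLanderContinuous-v2")
obs = env.reset()            # 8-dim state: x, y, v_x, v_y, theta, theta_dot, c_L, c_R
total_return, done = 0.0, False
while not done:
    action = np.random.uniform(-1.0, 1.0, size=2)   # [u_main, u_side], each in [-1, 1]
    obs, reward, done, info = env.step(action)
    total_return += reward
env.close()
print(f"random-policy return: {total_return:.1f}")  # far below the ~200 of a soft landing
```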

MDP fundamentals revisited

The lander behaves like any Markov Decision Process with tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. We optimize the discounted objective

\[J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \bigg[ \sum_{t=0}^{T} \gamma^t r(\mathbf{s}_t, \mathbf{a}_t) \bigg].\]

The Bellman equation still holds for the state-value function:

\[V^{\pi}(\mathbf{s}) = \mathbb{E}_{\mathbf{a}\sim\pi, \mathbf{s}'\sim P}\left[r(\mathbf{s}, \mathbf{a}) + \gamma V^{\pi}(\mathbf{s}')\right],\]

but instead of bootstrapping a Q-table we will differentiate $J$ directly with respect to policy parameters.
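
To ground the notation, here is a tiny helper (purely illustrative) that evaluates the inner sum $\sum_{t=0}^{T} \gamma^t r_t$ for one sampled trajectory; averaging it over many rollouts gives a Monte-Carlo estimate of $J(\pi_\theta)$:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single trajectory."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# Two shaped steps followed by a landing bonus:
print(discounted_return([-0.5, -0.2, 100.0]))   # ~97.31
```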

Vanilla policy gradient

Given a differentiable policy $\pi_\theta(\mathbf{a}\mid\mathbf{s})$, the gradient estimator is

\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t\mid\mathbf{s}_t)\, G_t \right],\]

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the empirical return. In practice we subtract a baseline (typically a learned $V_\phi(\mathbf{s})$) to reduce variance and obtain the advantage $A_t = G_t - V_\phi(\mathbf{s}_t)$.

For Gaussian policies, our actor outputs mean $\mu_\theta(\mathbf{s})$ and standard deviation $\sigma_\theta$, giving

\[\pi_\theta(\mathbf{a}\mid\mathbf{s}) = \mathcal{N}\big(\mu_\theta(\mathbf{s}), \operatorname{diag}(\sigma_\theta^2)\big),\]

and the log-probability term becomes

\[\log \pi_\theta = -\frac{1}{2} \sum_i \left[ \frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2\log \sigma_i + \log 2\pi \right].\]
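
In PyTorch, `torch.distributions.Normal` evaluates exactly this log-density, so the actor only has to produce $\mu_\theta(\mathbf{s})$ and $\sigma_\theta$. A minimal sketch follows; the layer widths and the state-independent log-std are illustrative choices, not requirements.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, obs_dim=8, act_dim=2, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # keep means inside [-1, 1]
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent sigma

    def dist(self, obs):
        return torch.distributions.Normal(self.mu_net(obs), self.log_std.exp())

def policy_gradient_loss(actor, obs, actions, advantages):
    """-E[ log pi_theta(a|s) * A ], minimized by gradient descent."""
    logp = actor.dist(obs).log_prob(actions).sum(dim=-1)  # sum over the two action dims
    return -(logp * advantages).mean()
```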

Advantage Actor-Critic (A2C) blueprint

  1. Sample rollouts. Run the actor in the environment for $T$ steps, recording $\mathbf{s}_t, \mathbf{a}_t, r_t$.
  2. Estimate advantages. Use the generalized advantage estimator $\hat A_t = \delta_t + (\gamma\lambda) \hat A_{t+1}$ with TD residual $\delta_t = r_t + \gamma V_\phi(\mathbf{s}_{t+1}) - V_\phi(\mathbf{s}_t)$.
  3. Update critic. Minimize the regression loss $L_V = \frac{1}{N}\sum_t \big(V_\phi(\mathbf{s}_t) - G_t\big)^2$.
  4. Update actor. Ascend the policy gradient with entropy regularization $H[\pi_\theta]$ to keep exploration wide.

Pseudo-code:

```python
for iteration in range(num_iters):
    # Collect a length-T rollout with the current policy.
    traj = rollout(env, policy, T)
    # GAE advantages and return targets (see the blueprint above).
    returns, adv = estimate_advantages(traj, value_fn, gamma, lam)
    # Actor: policy gradient with an entropy bonus weighted by beta.
    policy_loss = -(log_prob(traj.actions) * adv + beta * entropy(traj)).mean()
    # Critic: regress V_phi toward the empirical returns.
    value_loss = 0.5 * (value_fn(traj.states) - returns).pow(2).mean()
    optimize(policy, policy_loss)
    optimize(value_fn, value_loss)
```
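
The `estimate_advantages` call is the only non-trivial piece. A minimal NumPy sketch of step 2's recursion is shown below; it takes precomputed rewards and value predictions rather than the `traj`/`value_fn` objects of the pseudo-code, assumes `values` carries one extra bootstrap entry for $\mathbf{s}_{T+1}$, and real code would also zero the bootstrap at episode boundaries.

```python
import numpy as np

def estimate_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE: A_t = delta_t + gamma * lam * A_{t+1}, computed backwards in time."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    returns = adv + values[:T]   # regression targets G_t for the critic
    return returns, adv
```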

Reading the landing math

During landing, we care about minimizing a quadratic form that penalizes deviations from the pad center and high vertical velocity:

\[\mathcal{L}(\mathbf{s}) = (x - x_\star)^2 + 0.5(y - y_\star)^2 + 0.1 v_x^2 + 0.6 v_y^2 + 0.05\theta^2.\]
Reward shaping adds $-\mathcal{L}(\mathbf{s})$ at each step, plus leg-contact bonuses when $|v_y|$ and $|\theta|$ are small. This keeps the gradient signal informative while staying consistent with the sparse terminal reward.
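
As code, the shaping term is a few lines. The coefficients mirror the quadratic form above, and the state indexing assumes the layout from the environment recap:

```python
def shaping_penalty(s, x_star=0.0, y_star=0.0):
    """Quadratic landing loss L(s); the shaped reward adds -L(s) each step."""
    x, y, vx, vy, theta = s[0], s[1], s[2], s[3], s[4]
    return ((x - x_star) ** 2
            + 0.5 * (y - y_star) ** 2
            + 0.1 * vx ** 2
            + 0.6 * vy ** 2
            + 0.05 * theta ** 2)
```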

Practical tips

  • Normalize observations. Standardize each state component using running mean/variance before feeding the neural networks (see the sketch after this list).
  • Clip gradients. Policy gradients can explode when a single trajectory earns large return—apply global norm clipping (e.g., 0.5).
  • Curriculum. Start the lander slightly above the pad with low velocity, then widen the random initialization range as the policy stabilizes.
  • Evaluation protocol. Track both mean and worst-case return across seeded environments to ensure gentle touchdowns aren’t rare events.
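
For the first two tips, here is a minimal running normalizer (a Welford-style update; the class name and interface are illustrative) together with PyTorch's built-in global-norm clipping:

```python
import numpy as np

class RunningNorm:
    """Tracks per-dimension running mean/variance and standardizes observations."""
    def __init__(self, dim, eps=1e-8):
        self.mean, self.var, self.count = np.zeros(dim), np.ones(dim), eps

    def update(self, batch):
        b_mean, b_var, n = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta, total = b_mean - self.mean, self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + b_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)

# Global-norm gradient clipping, applied between backward() and optimizer.step():
# torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
```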

Hands-on training visualization

Launch the in-browser REINFORCE trainer below to see returns climb as the policy learns throttle timing. The simulator mirrors the reward shaping we just derived, yet runs entirely client-side so you can iterate without leaving the post.

Prefer a dedicated tab? Open Train Deep Learning → Lunar Lander for a full-width view.

What’s next?

Lesson 3 will turn this stochastic policy view into trust-region methods—ensuring each update stays within a KL divergence budget so your lander doesn’t suddenly somersault toward the lunar surface.