RL 101 - lesson 2
01 Nov 2025
Lesson 2 — Policy Gradients for Precision Lunar Landings
To move beyond balancing tasks, we will control a simulated Lunar Lander that must throttle its main and side engines to touch down softly on a landing pad surrounded by rough terrain. The dynamics are two-dimensional and underactuated, so discrete Q-learning struggles. This makes it a great playground for policy gradients, which optimize thrust decisions directly without enumerating actions.
Environment recap
The OpenAI Gym LunarLanderContinuous-v2 environment exposes an eight-dimensional state and two continuous actions:
\[\mathbf{s} = (x,\, y,\, v_x,\, v_y,\, \theta,\, \omega,\, c_L,\, c_R), \qquad \mathbf{a} = (u_{main},\, u_{side}),\]
where $(x, y)$ is the position relative to the pad, $(v_x, v_y)$ the velocity, $\theta$ and $\omega$ the tilt angle and angular rate, $c_L, c_R$ the leg-contact flags, and the actions span $[-1, 1]$. Rewards blend proximity, velocity damping, leg contacts, and fuel usage:
\[R = r_{pos} + r_{vel} + r_{legs} - 0.3\,|u_{main}| - 0.03\,|u_{side}|.\]Touching down within the target box while keeping $|v|$ small yields $R \approx 200$; crashing or drifting away incurs a negative return.
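To get a feel for the spaces and the reward scale, you can poke at the environment directly. A minimal sketch, assuming the classic Gym step/reset API (newer Gymnasium releases return `(obs, info)` from `reset` and a five-element tuple from `step`):

```python
import gym  # requires the Box2D extra: pip install "gym[box2d]"

env = gym.make("LunarLanderContinuous-v2")
print(env.observation_space.shape)                  # (8,)
print(env.action_space.low, env.action_space.high)  # [-1. -1.] [1. 1.]

obs = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()              # random throttle commands
    obs, reward, done, info = env.step(action)      # classic 4-tuple API
    total_reward += reward
print("random-policy return:", total_reward)        # random policies usually crash and score far below zero
```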
MDP fundamentals revisited
The lander behaves like any Markov Decision Process with tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. We optimize the discounted objective
\[J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \bigg[ \sum_{t=0}^{T} \gamma^t r(\mathbf{s}_t, \mathbf{a}_t) \bigg].\]The Bellman equation still holds for the state-value function:
\[V^{\pi}(\mathbf{s}) = \mathbb{E}_{\mathbf{a}\sim\pi, \mathbf{s}'\sim P}\left[r(\mathbf{s}, \mathbf{a}) + \gamma V^{\pi}(\mathbf{s}')\right],\]but instead of bootstrapping a Q-table we will differentiate $J$ directly with respect to policy parameters.
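To make the discounted objective concrete, here is the backward recursion that turns one finished episode's reward list into returns $G_t$ (plain Python, no framework assumed):

```python
def discounted_returns(rewards, gamma=0.99):
    """Backward recursion G_t = r_t + gamma * G_{t+1} over one finished episode."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Tiny example: two fuel-cost steps followed by a landing bonus.
print(discounted_returns([-0.3, -0.3, 100.0]))   # ≈ [97.41, 98.7, 100.0]
```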
Vanilla policy gradient
Given a differentiable policy $\pi_\theta(\mathbf{a}\mid\mathbf{s})$, the gradient estimator is
\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t\mid\mathbf{s}_t)\, G_t \right],\]where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the empirical return. In practice we subtract a baseline (typically a learned $V_\phi(\mathbf{s})$) to reduce variance and obtain the advantage $A_t = G_t - V_\phi(\mathbf{s}_t)$.
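In code, the baselined estimator reduces to a small surrogate loss whose gradient matches the expression above. A sketch in PyTorch, assuming `log_probs`, `returns`, and `values` are tensors collected from a batch of rollouts:

```python
import torch

def policy_gradient_loss(log_probs, returns, values):
    """Surrogate loss whose gradient equals the baselined policy gradient.

    log_probs: log pi_theta(a_t | s_t) per step, shape (N,)
    returns:   empirical discounted returns G_t, shape (N,)
    values:    critic estimates V_phi(s_t), shape (N,)
    """
    advantages = (returns - values).detach()   # the baseline must not receive policy gradients
    return -(log_probs * advantages).mean()    # negate so that gradient descent ascends J
```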
For Gaussian policies, our actor outputs mean $\mu_\theta(\mathbf{s})$ and standard deviation $\sigma_\theta$, giving
\[\pi_\theta(\mathbf{a}\mid\mathbf{s}) = \mathcal{N}\big(\mu_\theta(\mathbf{s}), \operatorname{diag}(\sigma_\theta^2)\big),\]and the log-probability term becomes
\[\log \pi_\theta = -\frac{1}{2} \sum_i \left[ \frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2\log \sigma_i + \log 2\pi \right].\]
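A compact actor that realizes this Gaussian parameterization might look as follows. This is a sketch, not a prescribed architecture: the two tanh hidden layers and the state-independent log-standard-deviation are common but arbitrary choices.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    """Maps a lander state to a diagonal Gaussian over the two engine commands."""

    def __init__(self, obs_dim=8, act_dim=2, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent sigma_theta

    def forward(self, state):
        mu = self.mu_net(state)
        dist = torch.distributions.Normal(mu, self.log_std.exp())
        action = dist.sample()                    # score-function estimator: no reparameterization needed
        log_prob = dist.log_prob(action).sum(-1)  # sum over the two action dimensions
        return action.clamp(-1.0, 1.0), log_prob  # clip to the [-1, 1] action range
```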
Advantage Actor-Critic (A2C) blueprint
- Sample rollouts. Run the actor in the environment for $T$ steps, recording $\mathbf{s}_t, \mathbf{a}_t, r_t$.
- Estimate advantages. Use $\hat A_t = \delta_t + (\gamma\lambda) \hat A_{t+1}$ with TD residual $\delta_t = r_t + \gamma V_\phi(\mathbf{s}_{t+1}) - V_\phi(\mathbf{s}_t)$ (a concrete sketch of this recursion follows the pseudo-code below).
- Update critic. Minimize the regression loss $L_V = \frac{1}{N}\sum_t \big(V_\phi(\mathbf{s}_t) - G_t\big)^2$.
- Update actor. Ascend the policy gradient with entropy regularization $H[\pi_\theta]$ to keep exploration wide.
Pseudo-code:
```python
for iteration in range(num_iters):
    # 1. Sample a length-T rollout with the current policy.
    traj = rollout(env, policy, T)
    # 2. GAE advantages and discounted-return targets.
    returns, adv = estimate_advantages(traj, value_fn, gamma, lam)
    # 3. Entropy-regularized policy-gradient loss (negated for gradient descent).
    policy_loss = -(log_prob(traj.actions) * adv + beta * entropy(traj)).mean()
    # 4. Critic regression toward the empirical returns.
    value_loss = 0.5 * (value_fn(traj.states) - returns).pow(2).mean()
    optimize(policy, policy_loss)
    optimize(value_fn, value_loss)
```
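The `rollout`, `log_prob`, `entropy`, and `optimize` helpers are placeholders, and so is `estimate_advantages`. One possible realization of the GAE recursion from the blueprint (with a flattened argument list instead of the `traj` object, assuming `rewards`, `values`, and `dones` are 1-D tensors from a single rollout and `last_value` bootstraps the final state):

```python
import torch

def estimate_advantages(rewards, values, dones, last_value, gamma=0.99, lam=0.95):
    """GAE: A_t = delta_t + gamma * lam * A_{t+1}, computed backwards over one rollout."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    next_value, next_advantage = last_value, 0.0
    for t in reversed(range(T)):
        nonterminal = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * next_value * nonterminal - values[t]
        next_advantage = delta + gamma * lam * nonterminal * next_advantage
        advantages[t] = next_advantage
        next_value = values[t]
    returns = advantages + values    # regression targets for the critic
    return returns, advantages
```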
Reading the landing math
During landing, we care about minimizing a quadratic form that penalizes deviations from the pad center and high vertical velocity:
\[\mathcal{L}(\mathbf{s}) = (x - x_\star)^2 + 0.5(y - y_\star)^2 + 0.1 v_x^2 + 0.6 v_y^2 + 0.05\theta^2.\]Reward shaping adds $-\mathcal{L}(\mathbf{s})$ at each step, plus leg-contact bonuses when $|v_y|$ and $|\theta|$ are small. This keeps the gradient signal informative while staying consistent with the sparse terminal reward.
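For intuition, the shaping term is cheap to evaluate at every step. A sketch assuming the state component order introduced earlier and a pad center $(x_\star, y_\star)$ at the origin:

```python
def landing_loss(state, x_star=0.0, y_star=0.0):
    """Quadratic shaping penalty L(s); the per-step shaping reward is -L(s)."""
    x, y, vx, vy, theta = state[:5]   # drop angular velocity and the leg-contact flags
    return ((x - x_star) ** 2
            + 0.5 * (y - y_star) ** 2
            + 0.1 * vx ** 2
            + 0.6 * vy ** 2
            + 0.05 * theta ** 2)
```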
Practical tips
- Normalize observations. Standardize each state component with a running mean/variance estimate before feeding it to the neural networks (see the normalizer sketch after this list).
- Clip gradients. Policy gradients can explode when a single trajectory earns an unusually large return; apply global norm clipping (e.g., a max norm of 0.5).
- Curriculum. Start the lander slightly above the pad with low velocity, then widen the random initialization range as the policy stabilizes.
- Evaluation protocol. Track both mean and worst-case return across seeded environments to ensure gentle touchdowns aren’t rare events.
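A minimal sketch of the running normalizer mentioned in the first tip, using a Welford-style parallel update; the comment at the end shows the global-norm clipping call from the second tip (PyTorch's `torch.nn.utils.clip_grad_norm_`):

```python
import numpy as np

class RunningNorm:
    """Running mean/variance tracker used to standardize observations online."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, batch):
        """batch: array of shape (n, *shape) of freshly collected observations."""
        b_mean, b_var, n = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta = b_mean - self.mean
        total = self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + b_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)

# Global-norm gradient clipping, applied between loss.backward() and optimizer.step():
# torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
```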
Hands-on training visualization
Launch the in-browser REINFORCE trainer below to see returns climb as the policy learns throttle timing. The simulator mirrors the reward shaping we just derived, yet runs entirely client-side so you can iterate without leaving the post.
Prefer a dedicated tab? Open Train Deep Learning → Lunar Lander for a full-width view.
What’s next?
Lesson 3 will turn this stochastic policy view into trust-region methods—ensuring each update stays within a KL divergence budget so your lander doesn’t suddenly somersault toward the lunar surface.