RL 101 - Lesson 2

Lesson 2 — Policy Gradients for Precision Lunar Landings

To move beyond balancing tasks, we will control a simulated Lunar Lander that must modulate its main and side engine thrust to touch down softly on a landing pad set amid rough terrain. The dynamics are planar and underactuated, so discrete Q-learning struggles. This makes it a great playground for policy gradients, which directly optimize thrust decisions without enumerating actions.

Environment recap

The OpenAI Gym LunarLanderContinuous-v2 environment exposes an eight-dimensional state (pose, velocities, and two leg-contact flags) and two continuous actions:

\[\mathbf{s} = \begin{bmatrix} x & y & v_x & v_y & \theta & \dot\theta & c_L & c_R \end{bmatrix}^\top, \qquad \mathbf{a} = \begin{bmatrix} u_{main} & u_{side} \end{bmatrix}^\top,\]

where $c_L, c_R \in \{0, 1\}$ indicate left/right leg contact and both actions span $[-1, 1]$. Rewards blend proximity, velocity damping, leg contacts, and fuel usage:

\[R = r_{pos} + r_{vel} + r_{legs} - 0.3\,|u_{main}| - 0.03\,|u_{side}|.\]

Touching down within the target box while keeping $|v|$ small yields an episode return of roughly $+200$; crashing or drifting away incurs a large negative return.
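
Before wiring up any learning, it helps to confirm the raw interface. The sketch below rolls out a random policy and accumulates the reward defined above; it assumes the classic Gym step API, whereas recent Gymnasium releases return `(obs, info)` from `reset` and split `done` into `terminated`/`truncated`.

```python
import gym
import numpy as np

env = gym.make("LunarLanderContinuous-v2")
obs = env.reset()            # 8-dim state: x, y, v_x, v_y, theta, theta_dot, c_L, c_R
total_return, done = 0.0, False
while not done:
    action = np.random.uniform(-1.0, 1.0, size=2)   # [u_main, u_side], each in [-1, 1]
    obs, reward, done, info = env.step(action)
    total_return += reward
env.close()
print(f"random-policy return: {total_return:.1f}")  # far below the ~200 of a soft landing
```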

MDP fundamentals revisited

The lander behaves like any Markov Decision Process with tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$. We optimize the discounted objective

\[J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta} \bigg[ \sum_{t=0}^{T} \gamma^t r(\mathbf{s}_t, \mathbf{a}_t) \bigg].\]

The Bellman equation still holds for the state-value function:

\[V^{\pi}(\mathbf{s}) = \mathbb{E}_{\mathbf{a}\sim\pi, \mathbf{s}'\sim P}\left[r(\mathbf{s}, \mathbf{a}) + \gamma V^{\pi}(\mathbf{s}')\right],\]

but instead of bootstrapping a Q-table we will differentiate $J$ directly with respect to policy parameters.
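
To ground the notation, here is a tiny helper (purely illustrative) that evaluates the inner sum $\sum_{t=0}^{T} \gamma^t r_t$ for one sampled trajectory; averaging it over many rollouts gives a Monte-Carlo estimate of $J(\pi_\theta)$:

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for a single trajectory."""
    g, discount = 0.0, 1.0
    for r in rewards:
        g += discount * r
        discount *= gamma
    return g

# Two shaped steps followed by a landing bonus:
print(discounted_return([-0.5, -0.2, 100.0]))   # ~97.31
```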

Vanilla policy gradient

Given a differentiable policy $\pi_\theta(\mathbf{a}\mid\mathbf{s})$, the gradient estimator is

\[\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\pi} \left[ \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(\mathbf{a}_t\mid\mathbf{s}_t)\, G_t \right],\]

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the empirical return. In practice we subtract a baseline (typically a learned $V_\phi(\mathbf{s})$) to reduce variance and obtain the advantage $A_t = G_t - V_\phi(\mathbf{s}_t)$.

For Gaussian policies, our actor outputs mean $\mu_\theta(\mathbf{s})$ and standard deviation $\sigma_\theta$, giving

\[\pi_\theta(\mathbf{a}\mid\mathbf{s}) = \mathcal{N}\big(\mu_\theta(\mathbf{s}), \operatorname{diag}(\sigma_\theta^2)\big),\]

and the log-probability term becomes

\[\log \pi_\theta = -\frac{1}{2} \sum_i \left[ \frac{(a_i - \mu_i)^2}{\sigma_i^2} + 2\log \sigma_i + \log 2\pi \right].\]
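
In PyTorch, `torch.distributions.Normal` evaluates exactly this log-density, so the actor only has to produce $\mu_\theta(\mathbf{s})$ and $\sigma_\theta$. A minimal sketch follows; the layer widths and the state-independent log-std are illustrative choices, not requirements.

```python
import torch
import torch.nn as nn

class GaussianActor(nn.Module):
    def __init__(self, obs_dim=8, act_dim=2, hidden=64):
        super().__init__()
        self.mu_net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, act_dim), nn.Tanh(),   # keep means inside [-1, 1]
        )
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # state-independent sigma

    def dist(self, obs):
        return torch.distributions.Normal(self.mu_net(obs), self.log_std.exp())

def policy_gradient_loss(actor, obs, actions, advantages):
    """-E[ log pi_theta(a|s) * A ], minimized by gradient descent."""
    logp = actor.dist(obs).log_prob(actions).sum(dim=-1)  # sum over the two action dims
    return -(logp * advantages).mean()
```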

Advantage Actor-Critic (A2C) blueprint

  1. Sample rollouts. Run the actor in the environment for $T$ steps, recording $\mathbf{s}_t, \mathbf{a}_t, r_t$.
  2. Estimate advantages. Use the generalized advantage estimator $\hat A_t = \delta_t + (\gamma\lambda) \hat A_{t+1}$ with TD residual $\delta_t = r_t + \gamma V_\phi(\mathbf{s}_{t+1}) - V_\phi(\mathbf{s}_t)$.
  3. Update critic. Minimize the regression loss $L_V = \frac{1}{N}\sum_t \big(V_\phi(\mathbf{s}_t) - G_t\big)^2$.
  4. Update actor. Ascend the policy gradient with entropy regularization $H[\pi_\theta]$ to keep exploration wide.

Pseudo-code:

```python
for iteration in range(num_iters):
    # Collect a length-T rollout with the current policy.
    traj = rollout(env, policy, T)
    # GAE advantages and return targets (see the blueprint above).
    returns, adv = estimate_advantages(traj, value_fn, gamma, lam)
    # Actor: policy gradient with an entropy bonus weighted by beta.
    policy_loss = -(log_prob(traj.actions) * adv + beta * entropy(traj)).mean()
    # Critic: regress V_phi toward the empirical returns.
    value_loss = 0.5 * (value_fn(traj.states) - returns).pow(2).mean()
    optimize(policy, policy_loss)
    optimize(value_fn, value_loss)
```
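
The `estimate_advantages` call is the only non-trivial piece. A minimal NumPy sketch of step 2's recursion is shown below; it takes precomputed rewards and value predictions rather than the `traj`/`value_fn` objects of the pseudo-code, assumes `values` carries one extra bootstrap entry for $\mathbf{s}_{T+1}$, and real code would also zero the bootstrap at episode boundaries.

```python
import numpy as np

def estimate_advantages(rewards, values, gamma=0.99, lam=0.95):
    """GAE: A_t = delta_t + gamma * lam * A_{t+1}, computed backwards in time."""
    T = len(rewards)
    adv = np.zeros(T, dtype=np.float32)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running
        adv[t] = running
    returns = adv + values[:T]   # regression targets G_t for the critic
    return returns, adv
```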

Reading the landing math

During landing, we care about minimizing a quadratic form that penalizes deviations from the pad center and high vertical velocity:

\[\mathcal{L}(\mathbf{s}) = (x - x_\star)^2 + 0.5(y - y_\star)^2 + 0.1 v_x^2 + 0.6 v_y^2 + 0.05\theta^2.\]
Reward shaping adds $-\mathcal{L}(\mathbf{s})$ at each step, plus leg-contact bonuses when $|v_y|$ and $|\theta|$ are small. This keeps the gradient signal informative while staying consistent with the sparse terminal reward.
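
As code, the shaping term is a few lines. The coefficients mirror the quadratic form above, and the state indexing assumes the layout from the environment recap:

```python
def shaping_penalty(s, x_star=0.0, y_star=0.0):
    """Quadratic landing loss L(s); the shaped reward adds -L(s) each step."""
    x, y, vx, vy, theta = s[0], s[1], s[2], s[3], s[4]
    return ((x - x_star) ** 2
            + 0.5 * (y - y_star) ** 2
            + 0.1 * vx ** 2
            + 0.6 * vy ** 2
            + 0.05 * theta ** 2)
```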

Practical tips

  • Normalize observations. Standardize each state component using running mean/variance before feeding the neural networks (see the sketch after this list).
  • Clip gradients. Policy gradients can explode when a single trajectory earns large return—apply global norm clipping (e.g., 0.5).
  • Curriculum. Start the lander slightly above the pad with low velocity, then widen the random initialization range as the policy stabilizes.
  • Evaluation protocol. Track both mean and worst-case return across seeded environments to ensure gentle touchdowns aren’t rare events.
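
For the first two tips, here is a minimal running normalizer (a Welford-style update; the class name and interface are illustrative) together with PyTorch's built-in global-norm clipping:

```python
import numpy as np

class RunningNorm:
    """Tracks per-dimension running mean/variance and standardizes observations."""
    def __init__(self, dim, eps=1e-8):
        self.mean, self.var, self.count = np.zeros(dim), np.ones(dim), eps

    def update(self, batch):
        b_mean, b_var, n = batch.mean(axis=0), batch.var(axis=0), batch.shape[0]
        delta, total = b_mean - self.mean, self.count + n
        self.mean = self.mean + delta * n / total
        self.var = (self.var * self.count + b_var * n
                    + delta ** 2 * self.count * n / total) / total
        self.count = total

    def __call__(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)

# Global-norm gradient clipping, applied between backward() and optimizer.step():
# torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=0.5)
```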

Hands-on training visualization

Launch the in-browser REINFORCE trainer below to see returns climb as the policy learns throttle timing. The simulator mirrors the reward shaping we just derived, yet runs entirely client-side so you can iterate without leaving the post.

Prefer a dedicated tab? Open Train Deep Learning → Lunar Lander for a full-width view.

What’s next?

Lesson 3 will turn this stochastic policy view into trust-region methods—ensuring each update stays within a KL divergence budget so your lander doesn’t suddenly somersault toward the lunar surface.