Train a tiny policy-gradient agent to pilot a minimalist lunar lander. Everything happens in the browser: the physics simulator, REINFORCE updates, and the visualization of the lander firing its thrusters toward the landing pad.
The canvas replays the best policy discovered so far. As training improves, the lander should learn to flare just before touchdown.
After each episode, training computes Monte Carlo returns and nudges the policy via REINFORCE, subtracting an exponential moving-average baseline from the returns to reduce the variance of the update.
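As a rough illustration of that update, here is a minimal TypeScript sketch. It is not the demo's actual source: the linear softmax policy, the `Step` shape, the function names, and the hyperparameters `gamma` (discount), `beta` (baseline smoothing), and `lr` (learning rate) are all assumptions made for the example.

```typescript
// Hypothetical sketch of REINFORCE with an exponential moving-average baseline.
// The policy form (linear softmax) and every name here are assumptions.

interface Step {
  state: number[]; // feature vector observed at this step
  action: number;  // index of the action that was taken
  reward: number;  // reward received after taking the action
}

/** Discounted Monte Carlo returns G_t = r_t + gamma * G_{t+1}, computed backwards. */
function discountedReturns(rewards: number[], gamma: number): number[] {
  const returns = new Array<number>(rewards.length).fill(0);
  let running = 0;
  for (let t = rewards.length - 1; t >= 0; t--) {
    running = rewards[t] + gamma * running;
    returns[t] = running;
  }
  return returns;
}

/** Exponential moving average of episode returns, used as the baseline. */
function updateBaseline(baseline: number, episodeReturn: number, beta: number): number {
  return beta * baseline + (1 - beta) * episodeReturn;
}

/** Numerically stable softmax over action logits. */
function softmax(logits: number[]): number[] {
  const maxLogit = Math.max(...logits);
  const exps = logits.map((z) => Math.exp(z - maxLogit));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

/**
 * One REINFORCE update for a linear softmax policy.
 * weights[k][i] is the weight of feature i for action k; it is modified in place.
 * The gradient is accumulated over the whole episode before being applied.
 */
function reinforceUpdate(
  weights: number[][],
  episode: Step[],
  baseline: number,
  gamma: number,
  lr: number
): void {
  const returns = discountedReturns(episode.map((s) => s.reward), gamma);
  const grad = weights.map((row) => row.map(() => 0));

  episode.forEach((step, t) => {
    const logits = weights.map((row) =>
      row.reduce((acc, w, i) => acc + w * step.state[i], 0)
    );
    const probs = softmax(logits);
    const advantage = returns[t] - baseline;
    for (let k = 0; k < weights.length; k++) {
      // Gradient of log pi(a|s) for a softmax policy: (1[k == a] - p_k) * state.
      const indicator = k === step.action ? 1 : 0;
      for (let i = 0; i < step.state.length; i++) {
        grad[k][i] += advantage * (indicator - probs[k]) * step.state[i];
      }
    }
  });

  // Gradient ascent on expected return.
  for (let k = 0; k < weights.length; k++) {
    for (let i = 0; i < weights[k].length; i++) {
      weights[k][i] += lr * grad[k][i];
    }
  }
}
```

In a training loop along these lines, each finished episode would be passed to `reinforceUpdate` with the current baseline, and the baseline would then be refreshed with `updateBaseline` using that episode's total return. A `beta` close to 1 makes the baseline change slowly, which tends to smooth out noisy episodes at the cost of reacting later to real improvement.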