A DQN agent learns to navigate a grid world to reach the goal (green) while avoiding hazards (red). Watch the Q-value heatmap update as the replay buffer fills and the target network stabilizes.
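The interplay described above (replay buffer, online network, periodically synced target network) can be sketched as follows. This is a minimal illustration using a tabular Q-array in place of the neural network a real DQN trains; all names (`GRID`, `SYNC_EVERY`, `train_step`, the transition format) are assumptions, not the demo's actual code.

```python
import random
import numpy as np

# Minimal sketch of the DQN update loop on a small grid world. A tabular
# Q-array stands in for the neural network; the replay-buffer sampling and
# target-network sync are the parts being illustrated.
GRID, ACTIONS = 5, 4            # 5x5 grid; actions: up/down/left/right
GAMMA, ALPHA = 0.99, 0.1        # discount factor, learning rate
SYNC_EVERY = 100                # target-network sync period (in updates)

q = np.zeros((GRID * GRID, ACTIONS))   # online Q-values
target_q = q.copy()                    # frozen target copy
replay = []                            # buffer of (s, a, r, s', done)

def train_step(batch_size=32):
    """Sample a minibatch from the replay buffer and apply the TD update."""
    batch = random.sample(replay, min(batch_size, len(replay)))
    for s, a, r, s2, done in batch:
        # Bootstrap from the *target* network, not the online one,
        # so the regression target moves slowly and training stabilizes.
        bootstrap = 0.0 if done else GAMMA * target_q[s2].max()
        q[s, a] += ALPHA * (r + bootstrap - q[s, a])

# Fill the buffer with (synthetic) transitions, then alternate updates
# with periodic hard syncs of the target network.
random.seed(0)
replay = [(random.randrange(GRID * GRID), random.randrange(ACTIONS),
           random.random(), random.randrange(GRID * GRID), False)
          for _ in range(500)]
for t in range(1, 501):
    train_step()
    if t % SYNC_EVERY == 0:
        target_q = q.copy()            # hard sync
```

In the demo, early Q-values are near zero everywhere; as the buffer fills and syncs accumulate, the heatmap sharpens toward the goal.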

Grid world

Background: max Q-value per cell (brighter = higher expected return). Agent shown in blue.
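The per-cell brightness is the greedy state value, V(s) = max over actions of Q(s, a). A sketch of how that heatmap could be computed and normalized for rendering; the array shapes are assumptions:

```python
import numpy as np

# Heatmap derivation: brightness of each cell is the greedy state value
# V(s) = max_a Q(s, a), then min-max normalized to [0, 1] for display.
rng = np.random.default_rng(0)
q = rng.normal(size=(5, 5, 4))          # (rows, cols, actions), illustrative

heatmap = q.max(axis=-1)                # reduce over the action axis
bright = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min())
```

Brighter cells therefore mark states from which the agent expects a higher return under its current greedy policy.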

Episode return

Each episode runs until the goal is reached or the step limit expires. Returns rise as the policy improves.
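One episode of this loop can be sketched as below: step until the goal or the step cap, summing rewards into the episode return. The toy environment (`step`, the -0.01 step cost, +1 goal reward, `STEP_LIMIT`) is a hypothetical stand-in for the demo's grid world.

```python
# Sketch of one episode with a step limit; the environment and rewards
# here are illustrative assumptions, not the demo's exact values.
STEP_LIMIT = 50
GOAL = (4, 4)

def step(state, action):
    """Toy deterministic grid transition: -0.01 per step, +1.0 at the goal."""
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]   # up, down, left, right
    r, c = state
    dr, dc = moves[action]
    nxt = (min(max(r + dr, 0), 4), min(max(c + dc, 0), 4))
    done = nxt == GOAL
    return nxt, (1.0 if done else -0.01), done

def run_episode(policy, start=(0, 0)):
    state, ep_return = start, 0.0
    for _ in range(STEP_LIMIT):
        state, reward, done = step(state, policy(state))
        ep_return += reward
        if done:
            break                                # goal reached early
    return ep_return                             # step limit expired otherwise

# A hand-coded shortest-path policy (go down, then right) earns the best
# possible return here, mirroring the curve's rise as the policy improves.
good = run_episode(lambda s: 1 if s[0] < 4 else 3)
```

The step cost means shorter paths score higher, which is why the return curve climbs as the learned policy finds more direct routes to the goal.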