RL 101 - Lesson 8 - Deep Q-Networks
12 Nov 2025
Supervised learning requires labeled examples. Reinforcement learning replaces labels with rewards from experience. Deep Q-Networks (DQN) combine Q-learning with a neural function approximator to let agents learn policies directly from raw observations.
The Q-function
For a policy $\pi$ in a Markov Decision Process, the action-value function is
\[Q^\pi(s, a) = \mathbb{E}_\pi\!\left[\sum_{t=0}^\infty \gamma^t r_t \,\Big|\, s_0 = s,\, a_0 = a\right].\]The optimal $Q^*(s, a)$ satisfies the Bellman optimality equation:
\[Q^*(s, a) = \mathbb{E}_{s'}\!\left[r + \gamma \max_{a'} Q^*(s', a')\right].\]DQN approximates $Q^*$ with a neural network $Q_\theta$ and minimizes the TD loss:
\[\mathcal{L} = \bigl(r + \gamma \max_{a'} Q_{\bar\theta}(s', a') - Q_\theta(s, a)\bigr)^2,\]where $Q_{\bar\theta}$ is a target network — a periodically copied snapshot of $Q_\theta$ — that stabilizes the regression targets.
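To make the loss concrete, here is a minimal sketch in NumPy. It stands in for a real network with a hypothetical linear Q-function (one weight row per state), which is an assumption for illustration, not the lesson's actual model:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2

# Hypothetical linear Q-function: Q(s, .) = W[s], one row of action-values per state.
W = rng.normal(size=(n_states, n_actions))   # online parameters theta
W_target = W.copy()                          # target parameters theta-bar (frozen copy)

def td_loss(s, a, r, s_next, gamma=0.99):
    """Squared TD error for one transition, bootstrapping from the target network."""
    target = r + gamma * W_target[s_next].max()  # r + gamma * max_a' Q_thetabar(s', a')
    return (target - W[s, a]) ** 2
```

Note that the gradient of this loss flows only through $Q_\theta(s, a)$; the target term is treated as a constant, which is exactly what the frozen copy `W_target` enforces.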
Two key tricks
- Experience replay: transitions $(s, a, r, s')$ are stored in a buffer and sampled randomly for each gradient step. This breaks the temporal correlations that destabilize gradient descent.
- Target network: using $Q_{\bar\theta}$ rather than $Q_\theta$ for the bootstrap target prevents the “chasing a moving target” problem that causes divergence in naive Q-learning.
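The replay half of this recipe can be sketched in a few lines of standard-library Python (capacity and field names are illustrative choices, not the demo's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s_next, done) transitions, sampled uniformly."""

    def __init__(self, capacity=10_000):
        # deque with maxlen evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement decorrelates consecutive
        # transitions before they reach the gradient step.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

Each gradient step then draws `buffer.sample(batch_size)` instead of consuming transitions in the order the agent experienced them.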
Live demo
The grid world has a goal (green) and obstacles (red). The agent (blue) starts at a random cell each episode. Watch the episode return climb and the action arrows converge to a consistent policy.
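For readers who want to reproduce the environment offline, a minimal sketch follows. The grid size, goal cell, obstacle cells, and reward values here are assumptions chosen for illustration; the demo's exact settings may differ:

```python
import random

class GridWorld:
    """Toy 4x4 grid: random start each episode, +1 at the goal,
    -1 for stepping onto an obstacle, small step penalty otherwise."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=4, goal=(3, 3), obstacles=((1, 1), (2, 3))):
        self.size, self.goal, self.obstacles = size, goal, set(obstacles)

    def reset(self):
        # Start at a random cell that is neither the goal nor an obstacle.
        free = [(r, c) for r in range(self.size) for c in range(self.size)
                if (r, c) != self.goal and (r, c) not in self.obstacles]
        self.pos = random.choice(free)
        return self.pos

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        # Moves that would leave the grid are clipped to the border.
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        if self.pos == self.goal:
            return self.pos, 1.0, True
        if self.pos in self.obstacles:
            return self.pos, -1.0, True
        return self.pos, -0.01, False
```

The small negative step reward (-0.01) nudges the learned policy toward short paths, which is why the action arrows in the demo converge rather than wander.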
Key takeaways
- DQN makes deep RL practical by decoupling gradient updates (sampled from the replay buffer) from environment interaction (ε-greedy rollouts).
- The target network is a simple but critical stabilization trick — copying weights every 50 gradient steps works well for small problems.
- ε-greedy decay balances exploration early on with exploitation once a policy starts to form.
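A common way to implement that decay is a linear anneal; the start value, floor, and horizon below are illustrative hyperparameters, not prescribed ones:

```python
import random

def epsilon(step, eps_start=1.0, eps_end=0.05, decay_steps=5_000):
    """Linearly anneal epsilon from eps_start to eps_end over decay_steps,
    then hold it at eps_end."""
    frac = min(step / decay_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def act(q_values, step):
    """Epsilon-greedy action selection over a list of action-values."""
    if random.random() < epsilon(step):
        return random.randrange(len(q_values))      # explore
    return max(range(len(q_values)), key=lambda a: q_values[a])  # exploit
```

Early on the agent acts almost uniformly at random; by `decay_steps` it follows its Q-estimates 95% of the time, matching the explore-then-exploit pattern visible in the demo's episode returns.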
Next up — Lessons 9 & 10: we leave grid worlds behind and tackle continuous control with an inverted pendulum and a lunar lander.