Robonaissance

The Journey of RL, Part 4: The Exploration Question

Before an agent can optimize anything, it has to figure out what is worth measuring. Exploration is the part of reinforcement learning that admits the agent does not yet know what to optimize.

Hugo
Jun 19, 2026

By 2015, the deep Q-network could play Atari at a level comparable to a professional human games tester across a set of forty-nine games. It learned to dig tunnels in Breakout, to line up rows of invaders in Space Invaders, to drive the race cars in Enduro past hundreds of competitors. On one game, it scored zero. Not a low score. Zero. After two hundred million frames of training, on a game where a human expert routinely scored over four thousand points, the best reinforcement learning agent in the world had learned nothing at all.

The game was Montezuma’s Revenge, a 1984 platformer in which a character named Panama Joe navigates a pyramid of rooms filled with ladders, ropes, locked doors, keys, and enemies. To a human player, it is a recognizable kind of game. To DQN, it was a wall. The same algorithm that had mastered dozens of other games could not, on this one, get past the first room.

Something about Montezuma’s Revenge broke the assumption that had made DQN work everywhere else. Understanding what that assumption was, and what the field did to get around its failure, is the subject of Part 4.


The Sparse Reward Wall

To understand why DQN failed, look at what the agent has to do to score its first point. The character begins on a platform near the top of the first room. To get the game’s first reward, a golden key, the agent must climb down a ladder, jump onto a rope, jump from the rope onto a raised platform, climb down a second ladder, walk left along the floor while timing a jump to avoid a patrolling enemy, and then climb a final ladder to reach the key. This is a sequence of roughly a dozen precise actions, each of which must be executed in order, with no reward of any kind until the entire sequence is complete.

DQN learns by temporal-difference updates. It adjusts its estimate of a state’s value based on the reward it receives plus the estimated value of the next state. The entire machinery depends on reward signals arriving frequently enough to propagate backward through the sequence of states the agent visits. In Breakout, the agent breaks a brick and gets a point within seconds of almost any reasonable action. The reward signal is dense. It arrives constantly, and the temporal-difference updates have something to work with at every step.
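
To make the update concrete, here is a minimal sketch of a single temporal-difference step. It is a simplification, not DQN itself: DQN estimates values with a neural network, while this sketch uses a plain lookup table, and the state names, the step size `alpha`, and the discount `gamma` are illustrative assumptions.

```python
def td_update(q, state, action, reward, next_state, alpha=0.1, gamma=0.99):
    """One tabular Q-learning (temporal-difference) update.

    q is a dict of dicts: q[state][action] -> estimated value.
    alpha (step size) and gamma (discount) are illustrative defaults.
    """
    # Target: the reward just received plus the discounted value of the
    # best action available in the next state.
    best_next = max(q[next_state].values())
    target = reward + gamma * best_next
    # Move the current estimate a fraction alpha toward the target.
    q[state][action] += alpha * (target - q[state][action])
    return q[state][action]


def fresh_table():
    # Hypothetical two-state, two-action table, all estimates starting at zero.
    return {s: {"left": 0.0, "right": 0.0} for s in ("s0", "s1")}


# Dense reward: a point arrives, so the estimate moves.
dense = td_update(fresh_table(), "s0", "right", reward=1.0, next_state="s1")

# Sparse reward: nothing arrives, so the estimate stays where it was.
sparse = td_update(fresh_table(), "s0", "right", reward=0.0, next_state="s1")

print(dense, sparse)
```

With a reward of one, the estimate for that action moves off zero. With a reward of zero and a next state that is also valued at zero, the update changes nothing.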

In Montezuma’s Revenge, the reward signal is sparse. The agent can wander the first room for an entire episode, executing thousands of actions, and receive a reward of exactly zero. Without a reward signal, the temporal-difference update has nothing to propagate. Every state looks as worthless as every other state, because every state has yielded the same reward, which is none. The value function the agent learns is flat. It is a perfect map of an environment in which nothing matters, because the agent has never experienced anything mattering.
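
The flat value function can be reproduced in a toy setting. The sketch below is not Montezuma’s Revenge. It is a hypothetical "combination lock" in which the agent must pick one designated action twelve times in a row, out of eighteen, and any wrong action sends it back to the start. The reset rule, the episode counts, and the hyperparameters are all assumptions chosen for illustration.

```python
import random

N_STEPS_TO_KEY = 12  # assumed length of the required action sequence
N_ACTIONS = 18       # size of the action set used in the text
CORRECT = 0          # hypothetical index of the single "right" action


def run(episodes=200, steps_per_episode=500, alpha=0.1, gamma=0.99, seed=0):
    rng = random.Random(seed)
    # q[s][a]: value estimate, where s counts correct actions taken so far.
    q = [[0.0] * N_ACTIONS for _ in range(N_STEPS_TO_KEY + 1)]
    total_reward = 0.0
    for _ in range(episodes):
        s = 0
        for _ in range(steps_per_episode):
            a = rng.randrange(N_ACTIONS)           # purely random behavior
            s_next = s + 1 if a == CORRECT else 0  # a wrong move resets progress
            done = s_next == N_STEPS_TO_KEY
            r = 1.0 if done else 0.0               # reward only at the very end
            # Temporal-difference target; no bootstrapping past the terminal state.
            target = r + (0.0 if done else gamma * max(q[s_next]))
            q[s][a] += alpha * (target - q[s][a])
            total_reward += r
            if done:
                break
            s = s_next
    return q, total_reward


q, total_reward = run()
print("total reward:", total_reward)
print("largest value estimate:", max(max(row) for row in q))
```

After a hundred thousand random steps, the agent in this toy has essentially no chance of having seen a reward, so every entry in the table is still exactly where it started.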

DQN’s only mechanism for trying new things was a strategy called epsilon-greedy. Most of the time, the agent took the action its value function rated highest. A small fraction of the time, set by a parameter called epsilon, it took a random action instead. This was the agent’s entire exploration strategy: occasionally, do something random. In a dense-reward game, epsilon-greedy is enough, because the agent only needs to stumble onto a rewarding action once and the value function will start to learn. In Montezuma’s Revenge, epsilon-greedy is hopeless. The probability that a sequence of a dozen specific actions will be produced by occasional random deviations, in the right order, before the episode ends, is so close to zero that in two hundred million frames it never happened.
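
Epsilon-greedy fits in a few lines. This is a minimal sketch of the selection rule described above, assuming the value estimates for the current state arrive as a plain list; the function name and the tie-breaking behavior are illustrative choices, not details taken from the DQN implementation.

```python
import random


def epsilon_greedy(q_values, epsilon, rng=random):
    """Pick an action index from a list of value estimates.

    With probability epsilon, choose uniformly at random (explore).
    Otherwise choose the highest-valued action (exploit).
    """
    if rng.random() < epsilon:
        return rng.randrange(len(q_values))
    # max() returns the first maximal index, so ties go to the lowest index.
    return max(range(len(q_values)), key=lambda a: q_values[a])


# With informative values, the greedy choice is meaningful.
informed = epsilon_greedy([0.1, 0.9, 0.3], epsilon=0.0)

# With a flat value function, the "greedy" choice is just a tie-break.
flat = epsilon_greedy([0.0] * 18, epsilon=0.0)

print(informed, flat)
```

When the value function is flat, the greedy branch carries no information. In this sketch it simply returns the first action every time, so all of the agent’s variety has to come from the random branch.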

It is worth sitting with the scale of the problem. Suppose the agent needs twelve specific actions in sequence, out of eighteen possible actions at each step, to reach the first key. If the agent were choosing uniformly at random, the chance of producing exactly that sequence would be one in eighteen to the twelfth power, or about one in 1.16 quadrillion. That is millions of times more possible sequences than the two hundred million frames DQN trained on. Real exploration is not quite that bad, because the value function biases the agent toward some actions and many near-correct sequences make partial progress. But the order of magnitude is the point. Random deviation does not find a dozen-step solution in a space this large. And the first key is only the beginning. The first level of Montezuma’s Revenge has twenty-four rooms arranged in a pyramid, each requiring its own sequence of keys, doors, and hazards. The first reward is the easy part. The agent that has not solved the first room has not even begun to face the game.
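
The arithmetic is easy to check. The snippet below uses only the figures from the paragraph above (eighteen actions, a twelve-step sequence, two hundred million frames); the variable names are illustrative.

```python
n_actions = 18          # possible actions at each step
sequence_length = 12    # assumed number of specific actions needed
frames = 200_000_000    # training budget quoted in the text

# Number of distinct action sequences of that length.
n_sequences = n_actions ** sequence_length

# How many times larger the sequence space is than the frame budget.
ratio = n_sequences / frames

print(f"{n_sequences:,} possible sequences")
print(f"about {ratio / 1e6:.1f} million sequences per training frame")
```

The sequence count comes out a little above one quadrillion, which works out to several million candidate sequences for every frame of training.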

This is the exploration problem, and it is distinct from everything Parts 1 through 3 examined. Those parts asked how an agent learns the value of states and actions, how it represents a policy, and how it follows the gradient of expected reward. All of that machinery assumes the agent is receiving a reward signal it can learn from. The exploration problem is what happens before any of that machinery can engage. It is the problem of an agent that has been given an objective but cannot find a single example of making progress toward it. The optimization methods of the first three parts are answers to the question of how to climb a hill. Exploration is the question of what to do when you cannot find the hill, when the entire landscape is flat as far as you can sense, and the one peak that matters is a dozen blind steps away in a direction you have no reason to prefer.
