The RL Spiral, Part 4: When RL Learned to See
For decades, every RL system needed a human to tell it what to look at. Then one opened its own eyes. The headlines went to the neural network. The less celebrated ingredient came from a sleeping rat.
This is the fourth article in The RL Spiral, an eight-part series on reinforcement learning. The previous article, The Curse Bellman Couldn’t Break, explained why RL needs so much experience. This one is about the moment that changed.
In December 2013, a team at a small London startup called DeepMind posted a paper to arXiv, the online preprint archive. The paper described a system that could learn to play seven different Atari 2600 video games, using the same algorithm, the same network architecture, and the same settings for all seven. The system received no information about the rules of any game. It saw only what a human player would see: the pixels on the screen and the score. From that, it learned to play. On six of the seven games it outperformed every previous algorithm. On three of them it surpassed a human expert.
The games ranged from Pong, where a paddle bounces a ball back and forth, to Space Invaders, where rows of aliens descend while the player shoots upward. The system knew nothing about paddles, balls, aliens, or shooting. It saw a grid of colored dots, took actions, and observed what happened to the score. Over the course of roughly ten million frames of play, patterns emerged. In Pong, the system learned to keep its paddle aligned with the ball. In Breakout, where the player bounces a ball upward to destroy rows of colored bricks, the system initially learned the obvious strategy: aim for the nearest bricks. Then, after hundreds more episodes, it discovered something subtler. It drilled a narrow tunnel through the side of the brick wall and sent the ball through to the space behind. Once behind the wall, the ball bounced back and forth, destroying entire rows from above without any further input from the paddle. It is a strategy that takes human players significant practice to discover. The system found it from pixels alone, without anyone suggesting it was possible.
No one on the team programmed this strategy. No one on the team expected it. The system had found it through the same process Thorndike’s cat used in 1897: trying things, seeing what produced reward, and doing more of what worked. But the cat had a simple latch. The Breakout agent had discovered a multi-step exploit in a complex visual environment, starting from nothing but a score and a screen full of dots.
The result was striking, but the method was more striking. Previous RL systems that played games had been given carefully designed features: the position of the ball, the location of the paddle, the coordinates of the enemies. Researchers who understood the game decided in advance which variables mattered, measured those variables, and handed them to the RL algorithm as input. The features did the seeing. The RL did the deciding. This division of labor worked, but it meant that every new game required a new set of hand-built features. The representation was not learned. It was engineered.
The DeepMind system eliminated that step entirely. It used a convolutional neural network, the kind of network that had recently revolutionized image recognition in supervised learning, and wired it directly to an RL algorithm called Q-learning, where Q stands for the quality of an action: how good is this move, from this position, in the long run? Raw pixels went in one end. A value estimate for each possible action came out the other. The network had to learn, simultaneously, what to look at and what to do about it. Representation and decision-making, learned together, from scratch, from nothing but pixels and a score.
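To make the shape of that mapping concrete, here is a minimal sketch in PyTorch of a pixels-to-Q-values network. The layer sizes and names are illustrative, in the spirit of the published system rather than a faithful reproduction of it: a stack of preprocessed frames goes in one end, one value estimate per action comes out the other.

```python
import torch
import torch.nn as nn

class QNetwork(nn.Module):
    """Pixels in, one Q-value per action out. Layer sizes are illustrative,
    in the spirit of the DQN papers, not a faithful reproduction."""

    def __init__(self, n_actions: int, in_frames: int = 4):
        super().__init__()
        # The learned "eyes": convolutions that compress raw frames
        # into a compact set of features.
        self.features = nn.Sequential(
            nn.Conv2d(in_frames, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        # The learned "decider": a value estimate for each possible action.
        self.head = nn.Sequential(
            nn.LazyLinear(512), nn.ReLU(),
            nn.Linear(512, n_actions),
        )

    def forward(self, pixels: torch.Tensor) -> torch.Tensor:
        # pixels: (batch, in_frames, height, width) -> (batch, n_actions)
        return self.head(self.features(pixels))
```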
This was the answer, or at least the beginning of an answer, to the problem the previous article described. The curse of dimensionality makes RL impractical in high-dimensional spaces because the algorithm must spread its experience across too many possible states. The brain circumvents the curse by compressing high-dimensional input into low-dimensional representations before making decisions. DQN, short for Deep Q-Network, as the system came to be called, did the same thing with a convolutional network. The network compressed raw pixel input into a compact set of learned features. The RL algorithm learned a policy, a strategy for choosing what to do in each situation, on top of those features. The compression and the learning happened simultaneously, each shaping the other. For the first time, an RL system could build its own eyes.
Two years later, in February 2015, DeepMind published an expanded version of the work in Nature, testing the same system on forty-nine Atari games. Using the same algorithm, architecture, and settings for every game, it matched or surpassed a professional human player on more than half of the forty-nine games. The achievement was not that a computer could play one game well. Computers had been doing that since Deep Blue beat Kasparov at chess in 1997. The achievement was that a single system, with no game-specific knowledge, could learn dozens of different games to human level or beyond. The Nature paper became one of the most cited in artificial intelligence. Google had acquired DeepMind shortly after the original 2013 result. The field of deep reinforcement learning was born.
But the achievement had a technical ingredient that was at least as important as the neural network, and far less celebrated. Without it, the system did not learn. With it, it did. The ingredient was called experience replay. Its story goes back two decades before DQN, and its roots go deeper still, into the biology of how the brain consolidates memory and simulates futures.
What Experience Replay Actually Solved
The problem experience replay solved is subtle but fundamental.
When an RL agent learns from its own experience in real time, the data it learns from has a dangerous property: consecutive experiences are correlated. A Breakout agent bouncing a ball off the left wall generates a long run of training data that all looks similar: ball on the left, paddle on the left, reward for hitting bricks on the left. The neural network adjusts its weights to get better at this particular situation. Then the ball shifts right, and the character of the data changes abruptly. The network adjusts again, overwriting what it just learned. It gets better at the right side and worse at the left side. Then the ball shifts again.
Why does learning one thing erase another? Because a neural network has one shared set of internal settings. The same weights that determine how it handles the ball on the left also determine how it handles the ball on the right. There is no separate drawer for left-wall knowledge and right-wall knowledge. When a long stream of left-wall data pushes the weights in one direction, the network’s ability to handle right-wall situations degrades, because those abilities depend on the same weights that just moved. It is like adjusting a camera lens to focus on something nearby: the background blurs, not because the lens broke, but because the same glass serves both purposes.
The result is catastrophic. The network oscillates between competing patterns, never settling into a stable policy. In machine learning, this is called catastrophic forgetting: a neural network trained on a sequence of correlated data overwrites old knowledge with new, losing what it learned before. Imagine a student studying for exams by reading all of biology on Monday, all of chemistry on Tuesday, and all of physics on Wednesday. By Wednesday night, biology is gone. The information was not integrated. It was replaced. A neural network learning from sequential, correlated RL data suffers the same fate. In supervised learning, where you train a network to recognize images, you solve this by shuffling the training data. Show the network dogs and cats in random order, and it learns to recognize both. Show it all the dogs first and then all the cats, and it forgets the dogs.
But an RL agent generates its own data by acting in the world, and the world delivers experiences in the order they happen. A game of Breakout produces experiences in the order the game unfolds. You cannot shuffle reality.

You can, however, build a bigger network. More parameters, more capacity, more room for separate internal structures to handle different situations without stepping on each other. Would that be enough? No: scale helps with storage, not with direction. A bigger network can, in principle, hold more knowledge at once, but the problem is not that the network lacks room. The problem is that every training step pushes the weights in whatever direction the current data demands. A hundred consecutive frames of left-wall data produce a hundred consecutive weight updates that all say the same thing: get better at the left wall. It does not matter how many weights there are. If every update pulls them the same way, they all move together. Scale and varied data solve different problems: one gives the network room to hold knowledge, the other keeps training from pushing all of that knowledge in one direction. You need both. Today’s largest language models have hundreds of billions of parameters, and their training data is still shuffled. Even at that scale, shuffling is not optional. It is enforced more carefully than ever.
In 1992, a researcher named Long-Ji Lin found a way to have it both ways: let the agent learn from its own stream of experience, yet train on data that looks shuffled. Instead of learning from each experience once and discarding it, the agent stores its experiences in a memory buffer. Each experience is a snapshot: the state the agent was in, the action it took, the reward it received, and the state it ended up in. Four pieces of information, frozen in time. The buffer accumulates thousands of these snapshots over the course of training, building up a diverse archive of the agent’s history. When it is time to update the network, the agent does not train on the most recent experience. It reaches into the buffer and pulls out a random batch. An experience from the left wall sits next to one from the right wall, next to one from the moment the ball broke through the bricks. The random sampling breaks the temporal correlation. The learning stabilizes.
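A minimal sketch of such a buffer, assuming nothing beyond the four-part snapshot described above; the class and parameter names (`ReplayBuffer`, `capacity`) are illustrative, not Lin's.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores (state, action, reward, next_state) snapshots and hands back
    random batches, breaking the temporal correlation in the data."""

    def __init__(self, capacity: int = 1_000_000):
        self.snapshots = deque(maxlen=capacity)  # oldest experiences fall out first

    def store(self, state, action, reward, next_state):
        self.snapshots.append((state, action, reward, next_state))

    def sample(self, batch_size: int):
        # A random batch mixes left-wall moments with right-wall moments,
        # early-game moments with late-game moments.
        return random.sample(self.snapshots, batch_size)
```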
This solves two problems at once. The first is the correlation problem: random sampling mixes experiences from different times and situations, so the network sees a varied diet instead of a monotonous stream. The second is data efficiency. Without a buffer, each experience is used once and thrown away. Every interaction with the environment, every frame of the game, teaches the network a single lesson and then vanishes. With a buffer, the same experience can be sampled again and again across many training updates. A rare but important moment, like the first time the ball breaks through the brick wall, is not learned from once and forgotten. It stays in the buffer and keeps teaching. In environments where collecting experience is slow or expensive, as it is in nearly every real-world application, this reuse matters enormously. The agent squeezes more learning from less interaction.
Lin published this idea in a paper titled “Self-Improving Reactive Agents Based on Reinforcement Learning, Planning and Teaching.” It received modest attention at the time. Experience replay was one of several ideas in a dense paper, and in 1992 the neural networks available were too small and the computers too slow to demonstrate the technique’s full power. For two decades, the idea sat in the literature, cited occasionally, waiting for the rest of the field to catch up.
When the DQN team built their Atari system, they found that experience replay was not optional. Without it, the combination of a deep neural network and Q-learning was unstable. The network would learn for a while, then oscillate, then collapse. With experience replay, the same system converged reliably. The DQN team stored one million recent experiences in the buffer and sampled random batches for each training update. The result was a system that could learn stable policies directly from pixels, across dozens of different games, with no hand-built features at all. The successful combination of reinforcement learning with deep neural networks was critically dependent on experience replay. The neural network got the headlines. Experience replay made it work.
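Put together, one training update looks roughly like the sketch below: pull a random batch from a buffer like the one above, form a one-step Q-learning target, and nudge the network toward it. This is a simplified illustration, not the full published procedure; it omits frame preprocessing, the exploration schedule, episode-termination handling, and the other stabilizing details of the complete system.

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, optimizer, replay, batch_size=32, gamma=0.99):
    """One simplified Q-learning update on a random batch from the buffer."""
    batch = replay.sample(batch_size)
    states, actions, rewards, next_states = zip(*batch)

    states = torch.stack(states)                 # preprocessed pixel tensors
    next_states = torch.stack(next_states)
    actions = torch.tensor(actions)              # which button was pressed
    rewards = torch.tensor(rewards, dtype=torch.float32)

    # The value the network currently assigns to the action that was taken.
    q_taken = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # One-step target: the reward received now, plus the discounted value
    # of the best action available in the state that followed.
    with torch.no_grad():
        target = rewards + gamma * q_net(next_states).max(dim=1).values

    loss = F.smooth_l1_loss(q_taken, target)     # regress toward the target
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```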
The Sleeping Rat
Experience replay is more than a clever engineering trick. The brain has its own version, built into the hippocampus, and it predates computers by roughly two hundred million years.
During sleep, and during quiet rest while awake, the hippocampus replays sequences of neural activity that correspond to recent experiences. Most of what we know about this comes from rats, because their hippocampal neurons can be recorded one at a time while the animal moves freely. In rats running through a maze, researchers can record the activity of place cells, neurons in the hippocampus that each fire when the animal is in a specific location. As the rat moves through the maze, place cells fire in sequence, one after another, tracing the animal’s path like a trail of lights on a map. Cell A fires when the rat is at the start. Cell B fires when it reaches the first turn. Cell C fires at the second turn. The sequence is a neural recording of the journey.
When the rat stops to rest at the end of the maze, or falls asleep in its home cage, something happens. Those same place cells fire again, in the same sequence: A, B, C. But compressed. A path that took several seconds to travel replays in roughly a tenth of a second. The replay occurs during sharp-wave ripples, brief synchronized bursts of electrical activity that sweep through the hippocampus. During sleep, these ripples coincide with slow oscillations in the cortex, as if the hippocampus is transmitting a rapid summary of the day’s experiences to a different part of the brain.
This phenomenon was first observed in the 1980s and has been studied intensively for decades. The evidence that it matters is direct: blocking sharp-wave ripples during sleep impairs the rat’s spatial learning. The animal runs the maze again the next day and performs as if it never learned it. The replay is not incidental neural noise left over from the day. It is part of the learning process itself.
Two roles have been identified. The first is memory consolidation. The hippocampus stores experiences quickly but temporarily. During replay, those experiences are transferred to the cortex, which stores them permanently but learns slowly. The replay serves as a bridge between fast acquisition and slow integration. The second role is planning. Replay sequences do not always trace paths the animal has already taken. Sometimes they run forward from the animal’s current position into paths it has not yet explored, simulating possible futures. Rats sitting at a choice point in a maze show forward replay into both possible paths before choosing. The hippocampus is not just recording the past. It is running simulations.
The parallel with DQN’s experience replay is direct. Both systems store recent experiences. Both replay those experiences offline, detached from the flow of real-time interaction. Both use replay to stabilize and accelerate learning that would otherwise be unstable or slow. The brain replays during sleep and quiet rest. DQN replays by sampling from a memory buffer during training. The mechanism differs. The function is the same: extracting more learning from the same experience by revisiting it.
Whether Lin was inspired by hippocampal replay when he proposed experience replay in 1992 is unclear. His paper did not cite the neuroscience literature. The explicit connection between the two was made later, most influentially by Mattar and Daw in 2018, who built a formal model linking hippocampal replay to reinforcement learning. The DQN team at DeepMind included researchers deeply versed in both fields. Demis Hassabis, DeepMind’s co-founder, had done his doctoral research on memory and imagination in the hippocampus. Whether Hassabis’s neuroscience background directly shaped the decision to use experience replay, or whether Lin’s 1992 idea and the brain’s replay mechanism are genuinely independent solutions to the same problem, the result is the same: a technique that works in silicon and a mechanism that works in biology, solving the same learning problem in the same way. If Hassabis’s background shaped the design, it is a case of neuroscience inspiring RL. If the two arrived at the same solution independently, it is another instance of the convergence this series tracks. Either way, the parallel is structural, not metaphorical.
DQN’s replay was the basic version: store the last million experiences, sample at random. The brain does two things that DQN’s buffer did not. Later RL systems adopted one. The other points to a deeper limitation, one that defines the boundary of what DQN can do.
The brain does not replay randomly. It prioritizes. Experiences associated with reward, or with surprising outcomes, are replayed more frequently than mundane ones. The hippocampus selects for replay the experiences that carry the most learning value. This is the mechanism that Mattar and Daw formalized in 2018: their model showed that a replay system prioritizing the most informative experiences reproduces both the neuroscience data and the performance gains seen in RL. RL researchers had independently developed the same idea under the name prioritized experience replay, which samples experiences with larger prediction errors more often than those that have already been well learned. The improvement in learning speed was substantial. Another convergence: two fields, working separately, arriving at the same refinement.
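The core of that sampling rule can be sketched in a few lines, assuming each stored experience carries the magnitude of its most recent prediction error. The published method also corrects for the statistical bias that non-uniform sampling introduces; that correction is omitted here.

```python
import numpy as np

def prioritized_sample(td_errors, batch_size, alpha=0.6, eps=1e-6):
    """Sample buffer indices in proportion to how surprising each experience
    still is (the magnitude of its last prediction error)."""
    priorities = (np.abs(td_errors) + eps) ** alpha  # surprising moments get more weight
    probs = priorities / priorities.sum()
    return np.random.choice(len(td_errors), size=batch_size, p=probs)

# An experience the network already predicts well (error near zero) is rarely
# revisited; one that still surprises it is replayed often.
indices = prioritized_sample(np.array([0.01, 0.02, 2.5, 0.03]), batch_size=2)
```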
The gap that remains is directional. The brain replays in both directions. After reaching a goal, hippocampal replay traces the path backward, from the goal to the starting point, propagating reward information to earlier states in the sequence. Before starting a new trial, replay traces forward, simulating possible outcomes from the current position. Backward replay for learning from what just happened. Forward replay for planning what to do next. DQN’s TD learning does propagate reward information backward, but implicitly, one step at a time, over many rounds of training. The brain does it in a single compressed burst. And forward simulation, the ability to mentally test a path before committing to it, DQN cannot do at all. It has no model of the world to simulate with.
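A toy tabular example makes the directional point concrete. It is an illustration of the arithmetic, not a claim about what DQN or the hippocampus literally computes: with a reward only at the end of a four-state chain, one forward TD pass barely moves the earlier states, while one pass over the same stored trajectory in reverse order carries value all the way back to the start.

```python
# A four-state chain A -> B -> C -> D, with reward only on the final step.
GAMMA, LEARNING_RATE = 1.0, 0.5
trajectory = [("A", 0.0), ("B", 0.0), ("C", 0.0), ("D", 1.0)]

def td_sweep(values, steps):
    """Apply one one-step TD update per visited state, in the given order."""
    for state, reward in steps:
        i = "ABCD".index(state)
        next_value = values[i + 1] if i + 1 < len(values) else 0.0
        values[i] += LEARNING_RATE * (reward + GAMMA * next_value - values[i])
    return values

print(td_sweep([0.0] * 4, trajectory))            # forward order: [0, 0, 0, 0.5]
                                                  # reward has only reached D
print(td_sweep([0.0] * 4, reversed(trajectory)))  # reverse order: value flows
                                                  # all the way back toward A
```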
Eyes Without Imagination
The combination of deep learning and reinforcement learning that DQN demonstrated became the foundation of nearly every major RL achievement that followed. AlphaGo, which defeated the world champion at Go in 2016, used deep neural networks trained with RL. The language models shaped by RLHF, the technique discussed in the first article of this series, use deep networks throughout. Deep RL became the default paradigm.
And yet, for all its power, DQN revealed the limits of what representation learning alone can solve.
The forty-nine Atari games DQN was tested on vary widely in structure. Some, like Pong and Breakout, reward the agent frequently: every time the ball hits a brick, the score goes up. The feedback loop between action and reward is tight. DQN mastered these completely. Others are different. Montezuma’s Revenge requires the player to navigate a pyramid of connected rooms, each filled with enemies, ladders, ropes, and locked doors. The player must collect keys to open doors, avoid skulls that patrol on fixed paths, and time jumps across moving platforms. Hundreds of actions must be taken in the correct sequence before the first point is scored. One mistake kills the player and resets the room. An agent pressing buttons at random will almost never stumble into the first key, much less pick it up, carry it to the right door, open the door, and reach the next room. Without ever encountering a positive prediction error, the TD learning machinery knows what kills the agent but has no signal pointing toward the goal. DQN scored close to zero on Montezuma’s Revenge. It was blind in a game that required foresight.
This failure was diagnostic. Montezuma’s Revenge did not expose a bug in DQN. It exposed the boundary of what DQN could do. The game requires not just seeing the screen but planning a long sequence of actions toward a distant goal. It requires exploring rooms that yield no immediate reward, in the hope that they lead somewhere useful eventually. It requires something like curiosity: an internal drive to seek out the unknown even when no external reward is in sight.
DQN had none of this. It had eyes, built from a convolutional network that could look at a screen and extract meaningful features. It had memory, built from an experience replay buffer that stored and replayed past situations. It had the TD learning algorithm at its core, updating its estimates of value one step at a time. What it did not have was a model of the world. It could not ask “what would happen if I went left instead of right?” It could not simulate a sequence of actions and evaluate the outcome without actually performing them. It could only react to what it saw, moment by moment, and learn from the reward that followed. In a game where reward arrives every few seconds, this works. The agent explores randomly, stumbles into reward, and the prediction error signal pulls it toward better behavior. In a game where reward arrives after hundreds of correct decisions, random exploration is a lottery ticket with odds so long that the agent will never win.
This distinction, between systems that learn by reacting and systems that learn by imagining, is one of the oldest in RL. Researchers call it model-free versus model-based. DQN is model-free: it learns a direct mapping from what it sees to how good each action is, without any internal model of how the world works. A model-based system builds an internal simulation of the environment and uses it to plan ahead, to test actions mentally before committing to them physically. The brain does both. When you brake at a red light, you are using a model-free motor policy refined over years of practice. When you plan a route through an unfamiliar city, you are using a model-based simulation of streets and landmarks. DQN gave RL the first half. The second half remained unsolved.
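The distinction can be caricatured in a few lines of code. Everything here is hypothetical scaffolding (`q_values`, `world_model`, `reward_fn` stand in for whatever a real system would learn); the point is only the difference in shape: one function looks up a value and reacts, the other rolls actions forward through an internal model before committing.

```python
def act_model_free(state, q_values):
    """React: pick the action whose learned value estimate is highest.
    No simulation, no lookahead -- this is DQN's style."""
    return max(q_values[state], key=q_values[state].get)

def act_model_based(state, world_model, reward_fn, actions, depth=3):
    """Imagine: roll each action forward through a learned model of the
    world and pick the one whose simulated future scores best."""
    def best_future(s, d):
        if d == 0:
            return 0.0
        return max(reward_fn(s, a) + best_future(world_model(s, a), d - 1)
                   for a in actions)
    return max(actions,
               key=lambda a: reward_fn(state, a)
                             + best_future(world_model(state, a), depth - 1))
```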
The real world, almost always, looks more like Montezuma’s Revenge than like Pong. A robot learning to set a table must pick up a plate, carry it without dropping it, place it in the right position, then do the same for the silverware, the glass, and the napkin. No reward until the table is set. A self-driving car navigating a complex intersection receives no score for a correct lane change, a well-timed merge, or a cautious decision to wait. The reward, arriving home safely, sits at the end of a journey composed of thousands of decisions. A medical treatment plan produces outcomes over months: a dosage adjustment today, a blood test next week, a symptom change next month. In each case, the reward is sparse, delayed, and ambiguous. The feedback loop that makes Pong learnable in hours makes these problems essentially unlearnable by the same method. DQN’s breakthrough gave RL the ability to see. It did not give RL the ability to imagine what it has not yet seen.
The years after DQN saw the field split in two directions. One branch pursued bigger networks, more compute, and more refined training, pushing deep RL to superhuman performance in games of increasing complexity. This branch bet that representation plus reward plus scale would be enough. It produced AlphaGo, and then AlphaZero, which mastered chess, Go, and shogi by playing against itself millions of times, with no human data at all. These were closed games with clear rules and dense feedback from self-play. Deep RL conquered them decisively.
The other branch asked a different question: instead of learning only a policy, can we build agents that learn a model of how the world works, so they can plan ahead, explore efficiently, and act sensibly even when reward is nowhere in sight? This branch acknowledged that DQN’s eyes were not enough. The agent also needed imagination. That branch leads to world models, the subject of Article 6.
But the first branch raises a question it cannot answer on its own. AlphaGo proved that self-play can produce superhuman intelligence in closed games. Then RLHF proved that human preferences are essential for aligning AI systems that interact with people. If self-play can exceed humans, why do we still need human feedback? And what does it mean that the two most powerful training methods in RL point in opposite directions? That is the subject of the next article.
Next in The RL Spiral: The Self-Play Paradox.


