The Journey of RL, Part 5: The Internal Model

Self-play and world models converge on a single idea. When an agent has a model of the thing it is optimizing against, the reward signal can come from inside.

Jun 23, 2026

For thirty-six moves, the second game looked ordinary. Lee Sedol, the strongest Go player of his generation, had spent the opening trading territory in the corners with his opponent, the patient back-and-forth that begins every high-level game. Then, on move thirty-seven, his opponent placed a black stone on the fifth line, in a part of the board that convention said it was far too early to contest. The commentators assumed it was a mistake. One of them said aloud that he was sure a human professional would never have played it. Lee Sedol stared at the board, stood up, and left the room.

He was gone for nearly fifteen minutes. His opponent could not see him leave, could not feel the weight of the moment, could not have cared if it tried. It was AlphaGo, a program built by a team at DeepMind led by David Silver. By the program’s own estimate, revealed later, the chance that a human would have chosen the move it had just played was roughly one in ten thousand. Fan Hui, the European champion who had been advising the DeepMind team, looked at the stone and said it was not a human move. He also said it was beautiful. By the end of the game, the move that looked like a mistake had decided it. AlphaGo won the match four games to one.

The question that hung over the result was not whether a machine could play Go. It was where move thirty-seven had come from. No human had taught it. It had not appeared in any record of human play, because no human played that way. The move had come out of the only place it could have come from: the millions of games AlphaGo had played against itself. Part 5 is about what it means for an agent to learn by optimizing against a model of the world, including a model of itself, and where the reward comes from when no one outside is providing it.

The Sample-Efficiency Wall

To see why self-play mattered, start with what it replaced. The reinforcement learning of the first four parts of this series was, with few exceptions, model-free. The agent did not have a model of how its environment worked. It learned by acting in the world, observing the reward, and adjusting its value estimates or its policy in response. DQN learned to play Atari this way, and so did the policy gradient methods of Part 3. The approach worked, but it had a cost that grew more conspicuous the more ambitious the task became. Model-free learning is expensive in experience. The agent has to live through the consequences of its actions, many millions of times, before its estimates converge.

The numbers make the cost concrete. DQN needed two hundred million frames of Atari, the equivalent of more than five weeks of continuous play, to master a single game. That was acceptable when the environment was an emulator that could run faster than real time. It became a wall the moment the task was something that could not be sped up so cheaply. A robot learning to walk cannot fall over two hundred million times. A self-driving system cannot crash its way to competence. For any problem where experience is slow, dangerous, or expensive to gather, the sample inefficiency of model-free learning was not an inconvenience. It was a barrier between reinforcement learning and most of the problems anyone actually wanted to solve.
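The conversion from frames to wall-clock time is easy to check. The sketch below assumes the console's real-time rate is sixty frames per second, a figure this post does not state, and the function name is purely illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def frames_to_days(frames, fps=60):
    """Convert a frame count to days of real-time play.

    fps=60 is an assumption about the console's native frame rate.
    """
    return frames / fps / SECONDS_PER_DAY

days = frames_to_days(200_000_000)
print(f"{days:.1f} days, or about {days / 7:.1f} weeks")  # 38.6 days, or about 5.5 weeks
```

Under that assumption, two hundred million frames is a little over five weeks of nonstop play, spent on one game.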

There was an obvious question buried in this, and it had been asked since the earliest days of the field. Was there a smarter way to learn than experiencing everything? A human learning to drive does not need a million crashes. They build a mental model of how the car behaves and rehearse against that model, imagining what would happen if they turned too sharply, without actually doing it. The model lets them practice without paying the full price of every mistake. Richard Sutton had formalized a version of this idea as early as 1990, in an architecture he called Dyna, where an agent learned a model of its environment and used that model to generate simulated experience, learning from imagined transitions alongside real ones. The intuition was old. What was missing for most of the field’s history was a way to make it work at scale, on problems hard enough to matter.
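A tabular sketch makes the Dyna idea concrete. What follows is a minimal Dyna-Q-style loop on a made-up six-cell corridor, not Sutton's original code: the environment, the hyperparameters, and every function name are assumptions chosen for illustration. The thing to watch is the inner loop, where each real step is followed by a batch of updates replayed from the agent's remembered model.

```python
import random
from collections import defaultdict

# Hypothetical toy world: a corridor of N_STATES cells. The agent starts in
# cell 0 and earns reward 1 when it reaches the last cell.
N_STATES = 6
ACTIONS = (-1, +1)  # step left, step right

def step(state, action):
    """The real environment: returns (next_state, reward, done)."""
    nxt = min(max(state + action, 0), N_STATES - 1)
    done = nxt == N_STATES - 1
    return nxt, (1.0 if done else 0.0), done

def dyna_q(episodes=30, planning_steps=20, alpha=0.5, gamma=0.95, eps=0.1, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # value estimates, keyed by (state, action)
    model = {}              # learned model: (state, action) -> (next_state, reward, done)

    def update(s, a, r, s2, done):
        # One-step Q-learning backup, used for real and imagined transitions alike.
        target = r if done else r + gamma * max(q[(s2, b)] for b in ACTIONS)
        q[(s, a)] += alpha * (target - q[(s, a)])

    def choose(s):
        # Epsilon-greedy with random tie-breaking.
        if rng.random() < eps:
            return rng.choice(ACTIONS)
        best = max(q[(s, b)] for b in ACTIONS)
        return rng.choice([b for b in ACTIONS if q[(s, b)] == best])

    real_steps = 0
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            a = choose(s)
            s2, r, done = step(s, a)          # one real transition
            real_steps += 1
            update(s, a, r, s2, done)         # learn from real experience
            model[(s, a)] = (s2, r, done)     # remember what the world did
            for _ in range(planning_steps):   # learn from imagined experience
                ps, pa = rng.choice(list(model))
                ps2, pr, pdone = model[(ps, pa)]
                update(ps, pa, pr, ps2, pdone)
            s = s2
    return q, real_steps

q, real_steps = dyna_q()
print(real_steps, [q[(s, +1)] > q[(s, -1)] for s in range(N_STATES - 1)])
```

With `planning_steps=20`, every step taken in the corridor buys twenty further updates that cost no real experience at all. Setting it to zero turns the same loop back into plain model-free Q-learning.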

Board games occupied a peculiar position in this landscape. A board game is a problem where the model is free. The rules of Go are known exactly, which means a Go-playing agent does not need to learn how its environment behaves. It can compute the consequence of any move precisely, because the rules tell it. This changes the economics of experience completely. If the agent has a perfect model of the game, it does not need to play against humans, or against a fixed opponent, or against anything external at all. It can play against itself, as many times as it has the compute to run, each game generating training signal at the speed of the processor rather than the speed of the world.

This is the move that broke the sample-efficiency wall for board games, and it is subtler than it first appears. The agent that plays against itself is not drawing on any external supply of games. It is generating its own experience, scoring it against its own knowledge of the rules, and improving by playing a copy of itself that improves in lockstep. The reward, the win or the loss, comes from inside the system. The opponent’s skill is always exactly matched to the agent’s own, because the opponent is the agent.
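The loop described here is small enough to write down for a toy game. The sketch below is not AlphaGo's method, which paired neural networks with tree search. It is a tabular stand-in on a hypothetical subtraction game (remove one to three stones from a pile, and whoever takes the last stone wins), with every name and hyperparameter chosen for illustration. One value table plays both sides, and the only reward is the win that the rules themselves declare.

```python
import random
from collections import defaultdict

MAX_TAKE = 3  # toy rules: remove 1 to 3 stones; taking the last stone wins

def legal_moves(pile):
    return range(1, min(MAX_TAKE, pile) + 1)

def self_play(start_pile=10, games=5000, alpha=0.5, eps=0.2, seed=0):
    rng = random.Random(seed)
    # q[(pile, take)] estimates the result for whoever is to move:
    # +1 for a win, -1 for a loss.
    q = defaultdict(float)

    def best_value(pile):
        return max(q[(pile, a)] for a in legal_moves(pile))

    def choose(pile):
        # Epsilon-greedy with random tie-breaking.
        moves = list(legal_moves(pile))
        if rng.random() < eps:
            return rng.choice(moves)
        top = best_value(pile)
        return rng.choice([a for a in moves if q[(pile, a)] == top])

    for _ in range(games):
        pile = start_pile
        while pile > 0:
            take = choose(pile)  # the same table plays both sides
            rest = pile - take
            # The rules act as referee: an empty pile means the mover just won.
            # Otherwise the opponent (also this agent) moves next, so the
            # mover's value is the negative of the opponent's best value.
            target = 1.0 if rest == 0 else -best_value(rest)
            q[(pile, take)] += alpha * (target - q[(pile, take)])
            pile = rest
    return q

def greedy_move(q, pile):
    return max(legal_moves(pile), key=lambda a: q[(pile, a)])

q = self_play()
print({n: greedy_move(q, n) for n in (5, 6, 7, 10)})
```

No human game and no fixed opponent enters the loop at any point. In this game the known winning strategy is to leave the opponent a multiple of four stones, and the table has to find that by playing itself, because nothing else is available to teach it.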

The scale this unlocks is hard to overstate. A human professional might play fifty thousand serious games of Go in a lifetime of study. A self-play agent can play that many before lunch, and tens of millions more by the end of a training run, each one scored, each one feeding back into the network. The bottleneck of human experience, the slow accumulation of games across a career, simply dissolves when the agent is both players and the referee. The question that opens the rest of this part is what happens when you build a learner this way: an agent whose teacher, whose opponent, and whose source of experience are all versions of itself. The agent needs something to practice against, and the only thing fast enough, and always exactly as good as it is, is itself.
