The RL Spiral, Part 5: The Self-Play Paradox
Self-play produced the strongest Go player in history. Human feedback produced ChatGPT. RL’s two greatest successes pull in opposite directions.
This is the fifth article in The RL Spiral, an eight-part series on reinforcement learning. The previous article, When RL Learned to See, traced how deep learning gave RL the ability to build its own representations. This one is about what happened when RL stopped needing human data entirely, and why it then needed humans more than ever.
On March 10, 2016, in a conference room at the Four Seasons Hotel in Seoul, a Go program made a move that no human would have played.
It was move 37 of the second game. AlphaGo, built by DeepMind, was playing against Lee Sedol, winner of eighteen world titles and widely considered the strongest Go player alive. The board was still open, the positions still developing. AlphaGo placed a stone on the fifth line: a shoulder hit that no professional would have played in that position. The commentators, both experienced professionals, paused. One of them laughed, uncertainly. The other said it looked like a mistake.
Lee Sedol stood up from the table and left the room.
When he returned, he played on. He lost. AlphaGo’s move 37 turned out not to be a mistake. It was a strategic play of a kind the commentators had never seen, a move that sacrificed conventional position for long-term influence in a way that contradicted centuries of accumulated Go wisdom. AlphaGo’s own analysis estimated that a human player would make that move roughly one time in ten thousand. The move was not in any database of professional games. It was new. AlphaGo had invented it through self-play: millions of games against copies of itself, each game producing slightly better judgment, each generation of the system serving as both student and teacher.
AlphaGo won the match four games to one. Over 200 million people watched. But the result that mattered most for the field of reinforcement learning came a year and a half later, when DeepMind published a successor system called AlphaGo Zero.
AlphaGo, the version that defeated Lee Sedol, had been trained in two stages. First, it studied a database of 160,000 human games to learn how professionals play. Then it played against itself to improve beyond that human baseline. AlphaGo Zero skipped the first stage entirely. It started from nothing: random play, no human games, no human knowledge. Only the rules of Go and a reward signal for winning.
In three days, AlphaGo Zero played 4.9 million games against itself and surpassed the version that had beaten Lee Sedol, winning 100 games to zero. Given more training time, it surpassed every previous version, becoming what was plausibly the strongest Go player in history. All from self-play. No human data. No human guidance. Just rules, reward, and compute.
This was a proof of concept for something researchers had long theorized but never demonstrated at this scale: that an RL system, given a well-defined game with clear rules and a clear win condition, could discover strategies beyond human understanding through the simple mechanism of playing against itself. Self-play had produced superhuman intelligence. Not intelligence in general, but intelligence in a domain where the goal was unambiguous and the rules were fixed.
The question was whether those conditions would hold outside the game board.
The Closed World
Self-play as a training method did not begin with AlphaGo. Its origins trace to a researcher named Gerald Tesauro at IBM’s Thomas J. Watson Research Center in Yorktown Heights, New York.
In the early 1990s, Tesauro was experimenting with neural networks and temporal difference learning, the algorithm Rich Sutton had described in 1988, the equation at the center of Article 2. He applied both to backgammon. His approach was simple in concept: a neural network would evaluate board positions, and the system would improve by playing against itself, using the TD prediction error to update its evaluations after every game. No human teacher. No database of expert play. Just a network, a game, and the learning signal that comes from discovering whether your predictions were right.
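The shape of that loop is small enough to sketch. What follows is a toy illustration rather than Tesauro’s system: a made-up race-to-five game stands in for backgammon, a lookup table of position values stands in for the neural network, and every name in it is invented for the example. But the learning rule is the one described above. After each move, the value of the previous position is nudged toward the value of the position that followed it, and both sides choose their moves with the same improving evaluator.

```python
# A toy sketch of TD learning from self-play, in the spirit of TD-Gammon.
# The game: on your turn, take 1 point for sure ("safe") or 3 points with
# probability one half ("gamble"); the first player to reach 5 points wins.
import random
from collections import defaultdict

V = defaultdict(lambda: 0.5)      # estimated P(player 0 wins) for each position
ALPHA, EPSILON, TARGET = 0.1, 0.1, 5
ACTIONS = ("safe", "gamble")

def terminal(state):
    return state[0] >= TARGET or state[1] >= TARGET

def value(state):
    """Position value from player 0's point of view; finished games are known."""
    if state[0] >= TARGET:
        return 1.0
    if state[1] >= TARGET:
        return 0.0
    return V[state]

def outcomes(state, action):
    """Possible next positions and their probabilities for the chosen action."""
    p0, p1, mover = state
    gains = [(1, 1.0)] if action == "safe" else [(3, 0.5), (0, 0.5)]
    return [(((p0 + g, p1, 1) if mover == 0 else (p0, p1 + g, 0)), pr)
            for g, pr in gains]

def choose(state):
    """Both sides pick moves with the same value table: that is the self-play."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    def expected(action):
        return sum(pr * value(s) for s, pr in outcomes(state, action))
    pick = max if state[2] == 0 else min   # player 1 wants player 0's value low
    return pick(ACTIONS, key=expected)

def play_one_game():
    state = (0, 0, 0)                      # (player 0 score, player 1 score, mover)
    while not terminal(state):
        results = outcomes(state, choose(state))
        next_state = random.choices([s for s, _ in results],
                                    [pr for _, pr in results])[0]
        # TD learning: move the prediction toward the prediction one move later.
        V[state] += ALPHA * (value(next_state) - V[state])
        state = next_state

for _ in range(20_000):
    play_one_game()

print(round(V[(0, 0, 0)], 2))   # learned estimate of the first mover's winning chances
```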
The result surprised everyone, including Tesauro. Starting from random play, the system taught itself backgammon at a level that no previous computer program had reached. After 300,000 self-play games, TD-Gammon version 1.0 played well enough to compete with world-class human players. After 1.5 million games, version 2.1 was nearly indistinguishable from the best humans alive at the game. In a 1998 match against the reigning world champion, it lost by a margin of just eight points over a hundred games.
More remarkable than the level of play was what the system discovered along the way. Backgammon has opening strategies that had been refined by human experts over decades. One standard play, called “slotting,” involved moving a single checker to an advanced position early in the game, accepting some risk for a chance at a strong formation. TD-Gammon evaluated this differently. It preferred a more conservative play that human experts had dismissed. Tournament players, skeptical at first, began experimenting with TD-Gammon’s recommendation. Within a few years, the conventional wisdom had reversed. The machine’s opening play, learned entirely from self-play, replaced the human standard.
TD-Gammon was an early demonstration of a principle that AlphaGo would confirm at vastly larger scale: in a game with fixed rules and a clear outcome, self-play can produce expertise that surpasses human knowledge. The mechanism is straightforward. When a system plays against itself, it faces an opponent at exactly its own level. If the opponent is too weak, the games teach nothing. If the opponent is too strong, the system cannot learn. Self-play keeps the two matched: as the player improves, so does the opponent. Each generation pushes the next.
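AlphaGo Zero turned this into an explicit generational loop: the current best player generates games by playing itself, a new player is trained on those games, and the newcomer replaces the champion only if it wins a clear majority of evaluation games against it. The sketch below is a toy version of that structure, not DeepMind’s algorithm: the game of Nim stands in for Go, a table of move scores stands in for the network and its search, and all the names are made up for the example.

```python
# A toy sketch of the generational self-play loop. Nim: take 1-3 counters from
# a pile of 10; whoever takes the last counter wins.
import random
from collections import defaultdict

PILE, MOVES = 10, (1, 2, 3)

def make_policy():
    # score[pile][move]: a rough preference for taking `move` counters at `pile`
    return defaultdict(lambda: {m: 1.0 for m in MOVES})

def pick(policy, pile, explore=0.2):
    legal = [m for m in MOVES if m <= pile]
    if random.random() < explore:
        return random.choice(legal)
    return max(legal, key=lambda m: policy[pile][m])

def play(first, second, explore=0.2):
    """Play one game; return the winning seat (0 or 1) and the moves played."""
    players, pile, turn, history = (first, second), PILE, 0, []
    while True:
        move = pick(players[turn], pile, explore)
        history.append((turn, pile, move))
        pile -= move
        if pile == 0:
            return turn, history            # taking the last counter wins
        turn = 1 - turn

best = make_policy()
for generation in range(30):
    challenger = make_policy()
    # 1. The current best player generates games by playing against itself.
    for _ in range(2000):
        winner, history = play(best, best)
        # 2. The challenger learns from those games: winning moves are reinforced.
        for seat, pile, move in history:
            challenger[pile][move] += 1.0 if seat == winner else -0.5
    # 3. Gatekeeping: the challenger is promoted only if it beats the current
    #    best in evaluation games (here, more than 55 percent of 200).
    wins = 0
    for g in range(200):
        order = (challenger, best) if g % 2 == 0 else (best, challenger)
        winner, _ = play(*order, explore=0.05)
        wins += (winner == 0) == (g % 2 == 0)
    if wins > 110:
        best = challenger

# Perfect play always hands the opponent a multiple of four; a well-trained
# policy tends toward that (for example, taking 2 when the pile is 10).
print({pile: pick(best, pile, explore=0.0) for pile in range(1, PILE + 1)})
```

The gatekeeping step is what keeps the curriculum honest: the student only becomes the teacher once it has demonstrably surpassed it.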
AlphaZero, published in late 2017, generalized this to three games simultaneously. Using the same algorithm and architecture, with no game-specific tuning, AlphaZero mastered chess, Go, and shogi, the Japanese relative of chess, all from scratch through self-play alone. In chess, it defeated Stockfish, the strongest conventional chess engine, playing a style that grandmasters described as creative and, at times, alien. It would sacrifice material for positional advantages in ways no program or human had systematically explored. In Go, it surpassed AlphaGo Zero. In shogi, it defeated the best existing program. Three games, one algorithm, no human knowledge. The principle was general.
But the principle came with a condition that was easy to overlook in the excitement. Every one of these successes shared three properties.
First, the rules were fixed and known. Go has a rulebook. Chess has a rulebook. Backgammon has a rulebook. The system could simulate the game perfectly because the game was perfectly defined.
Second, the outcome was unambiguous. A game of Go ends in a win or a loss. There is no debate about what counts as winning. The reward signal is clean.
Third, the system could generate unlimited experience at negligible cost. A game of Go can be simulated in milliseconds. AlphaGo Zero played 4.9 million games in three days. No physical world, no real stakes, no limit on how many times you can try.
These three properties define what researchers sometimes call a closed world. Self-play thrives in closed worlds. The question is what happens when any one of those properties breaks down.
The Open World
In 2017, researchers at OpenAI and DeepMind published a paper titled “Deep Reinforcement Learning from Human Preferences.” The paper described a method for training RL systems not on a fixed reward function but on human judgments about which behaviors looked better. Show a person two video clips of a simulated robot. Ask which one is doing a better job. Record the preference. Train a reward model on those preferences. Use RL to optimize the agent’s behavior according to the reward model.
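The reward-modeling step at the center of that recipe is compact enough to sketch. The version below is a minimal illustration, not the paper’s implementation: the human is simulated, the behaviors are random feature vectors, and a linear model stands in for the learned reward. But the loss is the standard logistic comparison used to fit a reward function to pairwise preferences.

```python
# A toy sketch of fitting a reward model to pairwise human preferences.
import numpy as np

rng = np.random.default_rng(0)
DIM, N = 8, 2000
true_w = rng.normal(size=DIM)          # hidden "human values" behind the labels

# Each pair compares two behaviors, represented here as feature vectors; the
# simulated human prefers whichever scores higher on the hidden values.
A = rng.normal(size=(N, DIM))
B = rng.normal(size=(N, DIM))
labels = (A @ true_w > B @ true_w).astype(float)   # 1.0 means A was preferred

# Fit a linear reward model so that P(A preferred) = sigmoid(reward(A) - reward(B)).
w = np.zeros(DIM)
diff = A - B
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))   # predicted preference probabilities
    grad = diff.T @ (p - labels) / N        # gradient of the cross-entropy loss
    w -= 0.5 * grad

# The learned model now scores behaviors it has never seen; an RL algorithm
# would optimize the agent against these scores instead of a hand-written reward.
A_test, B_test = rng.normal(size=(500, DIM)), rng.normal(size=(500, DIM))
agree = np.mean((A_test @ w > B_test @ w) == (A_test @ true_w > B_test @ true_w))
print(f"reward model matches the simulated human on {agree:.0%} of held-out pairs")
```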
The method was called RLHF: reinforcement learning from human feedback. It worked, in a modest way, for teaching simulated robots to do backflips and other physical tasks where the desired behavior was clear to a human viewer but hard to specify as a mathematical function.
Five years later, RLHF transformed the AI industry. Applied to large language models, it was the technique that turned GPT from an autocomplete engine into ChatGPT: a system that follows instructions, declines harmful requests, explains its reasoning, and holds conversations. The first article in this series described RLHF in detail. Here, the question is different. Why was it necessary?
If self-play can produce superhuman Go from nothing but rules and compute, why can it not produce a helpful, honest assistant the same way?
The answer is that a conversation is not a game of Go. None of the three closed-world properties hold.
The rules are not fixed and known. In Go, the rules define what counts as a legal move, how stones are captured, and when the game ends. These rules are complete and unambiguous. A conversation has no equivalent. There is no fixed set of legal moves. A response can be any length, any tone, any topic. The space of possible actions is essentially unbounded, and what constitutes a reasonable action shifts with every exchange. No one can write down the rules of conversation in a form a machine can simulate, because the rules themselves change depending on context.
The outcome is not unambiguous. What counts as a “good” response depends on who is asking, what they need, what they already know, the social context, the cultural context, the stakes of the situation. A response that is perfect for one person is useless for another. Helpfulness is subjective. Honesty is contextual. Harmlessness is contested. There is no equivalent of “you won” or “you lost.” Two thoughtful people can disagree about whether a response was appropriate, and both can be right, because they are applying different values.
The system cannot generate unlimited experience at negligible cost. Self-play in Go works because the game can be perfectly simulated. Language does not have a simulator. A language model playing against itself generates text, but there is no ground truth against which to measure that text. A Go position can be objectively evaluated by playing the game to completion. A conversation cannot. If two copies of a language model debate whether a response is helpful, neither one has access to the truth of the matter. They are exchanging opinions, not computing outcomes.
This is the self-play paradox. The technique that produced the most powerful single achievement in RL history, superhuman Go without human knowledge, is precisely the technique that cannot, by itself, produce the most widely deployed RL application in history, a language model aligned to human values. The power of self-play comes from the fact that the system can be its own judge. The limitation of self-play is that for open-ended tasks, there is no judge without a human.
The paradox is not a failure of self-play. It is a feature of the problem. Self-play works when the goal can be defined formally: maximize the score, win the game, follow the rules to a measurable outcome. It fails when the goal is a human judgment that cannot be formalized without losing what makes it meaningful. “Be helpful” is not a reward function. It is a value.
What Humans Know That Algorithms Don’t
That divide between verifiable goals and human judgment has roots in the brain.
The brain does not learn only from its own experience. It also learns from other people. This seems obvious, but its implications for RL are not.
A child learning to use a knife does not learn purely by trial and error. The child watches a parent. The child notices how the parent holds the handle, how the blade angles against the bread, where the other hand goes to stay safe. The child imitates. The parent corrects. The correction is not a reward signal in the RL sense. It is a transmitted understanding of what “correct” means, delivered through gesture, through language, through demonstration, through the accumulated knowledge of how knives work and why fingers matter.
This capacity, learning from others, is not a minor add-on to the brain’s learning systems. It is a distinct cognitive faculty that emerged through its own evolutionary trajectory. The circuits the brain uses for understanding other people’s actions, intentions, and mental states are among the most distinctive features of human cognition.
In the early 1990s, a team led by Giacomo Rizzolatti at the University of Parma discovered neurons in the premotor cortex of macaque monkeys that fired both when the monkey performed an action and when the monkey watched another individual perform the same action. These cells, which the team called mirror neurons, blurred the line between acting and observing. When you watch someone reach for a cup, some of the same neurons fire that would fire if you were reaching yourself. The brain, in a sense, rehearses other people’s actions internally.
The discovery was controversial and remains debated in its details. But the broader finding, that the brain has dedicated circuitry for understanding the actions and goals of others, has been confirmed through multiple lines of evidence. Humans are, from infancy, exquisitely attuned to other people’s behavior. By twelve months, children follow an adult’s gaze to the object of the adult’s attention. By eighteen months, they infer what an adult intends, even when the adult fails: a child who watches an adult try and fail to pull apart a toy will pull the toy apart herself, reproducing the intended action rather than the observed failure.
This capacity enables something no self-play system can replicate: the transmission of goals that have never been formalized.
A parent does not write a reward function for “use a knife safely.” The parent demonstrates, corrects, and explains. The child does not optimize a numerical signal. The child builds an internal model of what the parent wants and why, using circuits that evolved specifically for this purpose. The knowledge of what “correct” means is transmitted socially, through observation and interaction, not discovered through individual trial and error.
This is relevant to RL because it maps directly onto the problem RLHF was built to solve. A language model cannot discover “be helpful” through self-play, because helpfulness is a human judgment. The model needs access to human preferences, just as the child needs access to the parent’s demonstrations. RLHF is, in this light, a crude approximation of social learning: the human evaluator plays the role of the demonstrating parent, and the preference comparison plays the role of the corrective gesture. The mechanism is vastly simpler than what the brain does. But the functional role is the same. It provides the system with a signal about human values that the system cannot generate on its own.
The brain’s social learning machinery has features that current RLHF lacks entirely. A child learning from a parent does not just record preferences. The child builds a model of the parent’s mind: what the parent knows, what the parent wants, what the parent would approve of in a new situation. Developmental psychologists call this theory of mind. By age four, most children can reason about what another person believes, even when that belief is false. This is not imitation. It is inference about an internal mental state that cannot be directly observed. The child is modeling the teacher, not just recording the teacher’s outputs.
RLHF records outputs. A human evaluator says “this response is better than that one.” The reward model learns to predict which responses humans will prefer. But it has no model of why the human preferred it, no representation of the evaluator’s goals or knowledge or values, no ability to predict what the evaluator would prefer in a genuinely novel situation. It is social learning with the social understanding stripped out. It works, within limits, because the statistical patterns in human preferences are regular enough to approximate. But it breaks down at the edges, precisely where the implicit values become complex, contextual, or contested.
The brain’s solution to the alignment problem, if we can call it that, is not reward optimization. It is social cognition: the ability to build models of other minds and use those models to infer goals that have never been stated explicitly. Evolution invested heavily in this capacity. The human prefrontal cortex, the brain region most associated with understanding other people’s mental states, expanded dramatically over the last two million years. The investment suggests the problem is hard, and that the solution is not a minor refinement of reward learning but a distinct cognitive architecture layered on top of it.
The Hybrid Frontier
The field has not resolved the paradox. It is navigating it.
The most direct attempt to reduce dependence on human feedback is a technique called RLAIF: reinforcement learning from AI feedback. Instead of asking a human which response is better, you ask another AI. The AI evaluator judges responses according to a set of written principles, a “constitution,” and generates preference data that can train a reward model the same way human data would. Anthropic published this approach in 2022 under the name Constitutional AI. The system writes a response, critiques its own response against the principles, revises, and repeats. In the reinforcement learning phase, an AI evaluator replaces the human annotator.
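The structure of the loop is easy to lay out, even if the substance lives inside the model. The sketch below is a schematic, not Anthropic’s implementation: the generate function is a hypothetical stand-in for a language model call, the two principles are invented, and the parsing is deliberately crude. What it shows is the shape of the two phases: self-critique and revision against written principles, then AI-generated preference labels in place of human ones.

```python
# A structural sketch of the Constitutional AI loop. `generate` is a stand-in
# for a language model call; it returns placeholder text so the control flow
# runs end to end, but in practice every call here would be a model generation.
CONSTITUTION = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest about its own uncertainty.",
]

def generate(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"   # stand-in, not a real model

def critique_and_revise(prompt: str) -> str:
    """Supervised phase: the model critiques its own draft against the principles."""
    draft = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(f"Critique this response against the principle "
                            f"'{principle}':\n{draft}")
        draft = generate(f"Rewrite the response to address this critique:\n"
                         f"{critique}\n\nOriginal response:\n{draft}")
    return draft

def ai_preference(prompt: str, response_a: str, response_b: str) -> str:
    """RL phase: an AI evaluator, not a human, labels which response better
    follows the constitution; these labels train the reward model."""
    verdict = generate(f"Given the principles {CONSTITUTION}, which response to "
                       f"'{prompt}' is better?\nA: {response_a}\nB: {response_b}")
    return "A" if "A:" in verdict else "B"           # crude stand-in for parsing

prompt = "Explain how to pick a lock."
revised, raw = critique_and_revise(prompt), generate(prompt)
print(ai_preference(prompt, revised, raw))   # preference data with no human in the loop
```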
Constitutional AI reduces the need for human feedback. It does not eliminate it. The constitution itself is written by humans. The principles are human values, translated into language, and the AI evaluator is judging against those principles using a model that was itself trained, in part, on human preferences. The technique shifts the locus of human input from individual preference labels to higher-level principles. The humans are still in the loop. They are just operating at a higher level of abstraction.
A more ambitious approach is debate: two AI systems argue for opposing positions, and a human judge evaluates the arguments. The theory, developed by researchers at OpenAI in 2018, is that it is easier for a human to judge an argument than to generate one. If the AI systems are incentivized to expose flaws in each other’s reasoning, the human can make a reliable judgment even when the topic exceeds the human’s expertise. Debate is self-play applied to persuasion. The two systems play against each other, and the “winner” is determined by the human judge’s verdict. It imports the scalability of self-play into the evaluation of open-ended tasks.
Debate is an active research area. It has not been deployed at scale. The theoretical promise is clear: if it works, it would allow human oversight of AI systems that are more capable than the humans overseeing them. The practical challenges are substantial. The AI debaters might collude, producing arguments that sound good to the judge but are misleading. The judge might be systematically fooled by persuasive but incorrect reasoning. The format might favor rhetorical skill over truthfulness. These are the same problems that plague human institutions built on adversarial argument, from courtrooms to academic peer review. Importing self-play into evaluation does not eliminate the need for human judgment. It restructures it.
Meanwhile, a different strand of research is pushing self-play back into language model training, not for alignment but for reasoning. Multiple research groups in 2025 demonstrated that language models can improve their problem-solving abilities by playing structured games against themselves, with one copy generating problems and another solving them. The approach works for tasks with verifiable answers: mathematics, coding, logic puzzles. These are closed-world problems embedded in an open-world system. The math has a right answer. The code either runs or it does not. Self-play improves the reasoning. But the moment the task requires judgment rather than verification, the technique hits the same wall. You can self-play your way to better arithmetic. You cannot self-play your way to better values.
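What makes this work is that the reward comes from a verifier rather than a judge. The sketch below makes the distinction concrete; the candidate solutions are hard-coded stand-ins for what a solver model would generate, and the function and test names are invented for the example. The point is the reward: a candidate program either passes the tests or it does not, and that binary outcome is a signal no human has to provide.

```python
# A minimal sketch of verification as a reward signal for self-play on code.
def verify(candidate_source: str, tests: list[tuple[tuple, int]]) -> float:
    """Binary reward: 1.0 if the candidate passes every test case, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)     # the code either runs or it does not
        fn = namespace["solution"]
        return float(all(fn(*args) == expected for args, expected in tests))
    except Exception:
        return 0.0

# A problem the "proposer" copy might pose, with test cases that define success.
tests = [((2, 3), 5), ((10, -4), 6), ((0, 0), 0)]

# Two attempts the "solver" copy might generate.
attempt_good = "def solution(a, b):\n    return a + b"
attempt_bad  = "def solution(a, b):\n    return a - b"

print(verify(attempt_good, tests))   # 1.0: positive reward, no human judgment needed
print(verify(attempt_bad, tests))    # 0.0: negative signal, again fully automatic

# There is no equivalent verify() for "was this response helpful?", which is
# where the same trick stops working.
```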
The emerging picture is that the most capable AI systems will use both methods, but for different things. Self-play and its variants will handle tasks where success can be verified: formal reasoning, game-playing, coding, mathematical proof. Human feedback, in some form, will remain necessary for tasks where success is defined by human values: helpfulness, honesty, safety, appropriateness, fairness. The boundary between these two categories is the central question of AI alignment. Where does verification end and judgment begin? For a mathematical proof, the answer is clear. For a medical recommendation, it is not. For a policy proposal, even less so. For a casual conversation, the entire notion of “correct” is dissolved in context.
The deepest lesson of the self-play paradox is about the nature of goals. In a closed game, the goal is part of the rules. Win. Maximize the score. The agent can discover how to achieve the goal without anyone explaining what the goal means. In the open world, the goal is not part of the rules. “Be helpful” is not a mathematical function. It is a human judgment, shaped by culture, context, and values that shift over time. No amount of self-play can discover a goal that lives outside the system playing the game.
This is why the most celebrated achievement of self-play, AlphaGo’s move 37, and the most important application of human feedback, RLHF in language models, are not in competition. They are solving different problems. One discovers optimal strategies within a defined goal. The other discovers what the goal should be. The first can be done by a system playing against itself. The second requires access to the species that has the goals.
The brain, as usual, does both. It learns from its own experience, through the reward prediction system traced in Article 2. And it learns from other minds, through a social cognitive system that evolution built for exactly the purpose that RLHF approximates: acquiring goals and values that cannot be discovered alone. The two systems are not redundant. Each does something the other cannot. The RL agent that plays against itself gets stronger. The RL agent that listens to humans gets aligned. The challenge for the field is building systems that do both.
The next article turns to a question those systems will eventually need to answer. Self-play learns strategies. Human feedback learns values. But neither one learns how the world works. For that, the agent needs something else: an internal model of the environment, a representation of what will happen next. The question of how to build that model, and what happens when you get it right, leads to the current frontier of the field.
Next in The RL Spiral: The World Inside.


