Agents That Act
Beyond Prediction
This is Chapter 8 of A Brief History of Artificial Intelligence.
On March 9, 2016, in a Seoul hotel conference room converted into a television studio, Lee Sedol sat down at a Go board. Across from him sat Aja Huang, a DeepMind engineer and accomplished amateur player who served as the human hand for an artificial opponent. In an adjacent room, other DeepMind engineers monitored a cluster of computers running a system they called AlphaGo.
Lee Sedol was one of the world’s greatest Go players. He had won eighteen international titles. In the weeks before the match, he predicted he would win either 5-0 or 4-1. “I don’t think it will be a very close match,” he told reporters. The confidence was understandable. Go had long been considered AI’s Everest—a game so complex that brute-force search was hopeless, requiring instead the kind of intuition that only humans possessed. The number of possible board positions exceeds the number of atoms in the observable universe.
By the evening of March 12, Lee Sedol had lost three games in a row. The match was over.
But it was Game Two, on March 10, that shattered assumptions. Move 37. AlphaGo placed a stone on the fifth line on the right side of the board, in a position that made no sense to the commentators. The move was so unexpected that Fan Hui, the European champion who had lost to AlphaGo months earlier and now watched the match as a DeepMind advisor, at first thought there had been a mistake. The professional Go players providing television commentary fell silent, then laughed nervously. One said the move was “very strange.” Another said it was “not a human move.”
It wasn’t. It was a move that emerged from a fundamentally different kind of learning—not the pattern recognition of earlier chapters, but learning through action and consequence. AlphaGo hadn’t learned to classify positions or predict next moves. It had learned to win. And in learning to win, it had discovered strategies that no human had imagined in three thousand years of playing the game.
This chapter is about reinforcement learning—a paradigm that creates agents capable of taking actions in the world to achieve goals. It’s the difference between knowing and doing. And it represents a crucial expansion of what learning machines can accomplish.
Prediction: Learning from Labels
Before diving into reinforcement learning, let’s step back and look at how machines have learned so far. The previous chapters described learning paradigms that share a common structure: learning from labeled examples. Understanding this structure helps clarify what’s fundamentally different about reinforcement learning.
Supervised learning is learning with a teacher who provides the right answers. You show the network an image and tell it “this is a cat.” You show it another image and say “this is a dog.” Over millions of examples, the network learns to distinguish cats from dogs—and, given enough labeled categories, to recognize thousands of kinds of objects in images it has never seen before. The key feature is that every training example comes with a label: the correct answer is known in advance.
The ImageNet breakthrough that launched the deep learning revolution was supervised learning at massive scale: the system learned to classify images because humans had painstakingly labeled millions of photographs. Medical diagnosis systems learn to detect tumors because radiologists have marked which scans contain cancers. The supervision—the labels provided by humans—is what makes learning possible.
Self-supervised learning removes the need for human-provided labels by creating supervision from the data itself. The clever trick is to use part of the data as the “question” and another part as the “answer.” Instead of labeling images, you might hide a portion of an image and train the network to predict what’s missing. Instead of labeling sentences, you train the network to predict the next word given all the previous words. The “label” is just another part of the data—no human annotation required.
This is how modern language models learn. GPT and its descendants are trained to predict the next token in a sequence. A token is the basic unit that language models work with—typically a word or a piece of a word. The word “learning” might be one token; a longer word like “reinforcement” might be split into two tokens like “reinforce” and “ment.” When a language model reads “The cat sat on the,” it learns to predict that “mat” (or “floor” or “couch”) is likely to come next. No human needs to provide labels—the next token in the text is the label, automatically. Self-supervised learning made it possible to train on the entire internet, billions of words of text, without any human annotation.
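To see how the data supplies its own labels, here is a minimal sketch in Python. Whole words stand in for tokens, a simplification; real systems use subword tokenizers.

```python
# A minimal sketch of self-supervision: carve (context, next-token) training
# pairs out of raw text. Whole words stand in for tokens here; real systems
# use subword tokenizers.
text = "The cat sat on the mat"
tokens = text.split()

training_pairs = []
for i in range(1, len(tokens)):
    context = tokens[:i]   # everything read so far
    target = tokens[i]     # the "label" is simply whatever comes next
    training_pairs.append((context, target))

for context, target in training_pairs:
    print(" ".join(context), "->", target)
```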
Both supervised and self-supervised learning share something fundamental: they’re about prediction. Given this input, what should the output be? Given this image, what category? Given these words, what word comes next? The network learns to map inputs to outputs, minimizing the gap between its predictions and reality. Whether the labels come from human annotators or from the data itself, the learning process is the same: predict, compare to the right answer, adjust, repeat.
But some problems can’t be framed as prediction with known answers.
When Prediction Isn’t Enough
Consider teaching a robot to walk. What would the training examples look like? You’d need to specify exactly which motor commands to send for every possible configuration of the robot’s body, every possible terrain, every possible obstacle. But the “right” command depends on what has happened before and what will happen next. Walking isn’t a single prediction—it’s a sequence of actions that unfold over time, each one depending on the consequences of the last.
Or consider games. In Go, the “right move” depends entirely on how the game will develop. A move that looks weak might set up a devastating attack twenty moves later. A move that looks strong might create a vulnerability that a skilled opponent will exploit. You can’t simply label positions with correct moves, because correctness only reveals itself through the entire sequence of play.
These problems require a different kind of learning. Not learning to predict, but learning to act.
Action: Learning from Consequences
Reinforcement learning is learning through interaction. An agent takes actions in an environment, observes what happens, and receives rewards or penalties. Over time, it learns which actions lead to good outcomes.
What is an agent? Simply a system that perceives its situation, chooses actions, and learns from their consequences. A Go program that decides where to place a stone is an agent. A robot that navigates a warehouse is an agent. A dialogue system that chooses how to respond is an agent. The word “agent” just means something that acts—that takes actions to achieve goals rather than simply mapping inputs to outputs.
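The whole arrangement fits in a few lines of Python. The environment below is a deliberate toy, an agent wandering along a number line toward a goal, and this particular agent just acts at random; but the loop of acting, observing, and collecting reward is the loop every system in this chapter runs, with learning inserted where the final comment indicates.

```python
import random

class LineWorld:
    """A toy environment: the agent walks a number line and succeeds at position 5."""
    def __init__(self):
        self.position = 0

    def step(self, action):                    # action is -1 (left) or +1 (right)
        self.position = max(0, self.position + action)
        done = self.position >= 5              # reaching 5 ends the episode
        reward = 1.0 if done else -0.01        # small cost per step, prize at the end
        return self.position, reward, done

env = LineWorld()
total_reward, done = 0.0, False
while not done:
    action = random.choice([-1, +1])           # choose an action (here, at random)
    position, reward, done = env.step(action)  # the environment responds
    total_reward += reward                     # a learning agent would adjust here

print(f"reached position {position} with total reward {total_reward:.2f}")
```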
The core idea of reinforcement learning traces back to psychologists studying animal behavior. In the early twentieth century, researchers like Edward Thorndike and B.F. Skinner discovered that animals learn from the consequences of their actions. A rat that presses a lever and receives food will press the lever again. A cat that escapes a puzzle box through trial and error will escape faster next time. Behavior is shaped by reward and punishment.
In the 1950s, mathematician Richard Bellman developed the theoretical foundations for this kind of learning. His key insight was about how to think about sequential decisions—situations where what you do now affects what you can do later, and where rewards may come long after the actions that caused them.
Consider a simple example: you’re a robot navigating a grid, trying to reach a goal. At each step, you can move up, down, left, or right. Some squares give you points; others are dead ends. The naive strategy is to always move toward the nearest points. But this is often wrong. Sometimes the path to maximum total reward requires walking away from nearby points to reach a richer area.
Bellman’s insight was that the value of being somewhere isn’t just about what you can get there—it’s about everything you can get from there. Imagine you’re three steps from the goal, and you can see two paths: one passes through a square worth 10 points, the other through a square worth 2 points. The obvious choice seems to be the 10-point path. But what if the 2-point square leads to a 50-point square, while the 10-point square leads to a dead end? Suddenly the “worse” choice is actually better.
Bellman’s equation captures this: the value of any square equals its immediate reward plus the value of the best square you can reach from it. The “best” square simply means the neighboring square with the highest total value—the one that leads to the most rewards down the line. This creates a chain of reasoning. If you know the value of squares near the goal, you can calculate the value of squares one step further away. Then two steps. Then three. Work backward from the destination, and eventually you know the value of every square—including the one you’re standing on.
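Bellman’s backward chaining fits in a few lines of Python. The 3×3 grid of rewards below is invented for illustration, and one ingredient not mentioned above, a discount factor, keeps the numbers finite because this toy grid has no terminal goal square. After enough sweeps, the squares that lead toward the 50-point corner end up worth more than the 10-point square itself, exactly the judgment the naive strategy misses.

```python
# Value iteration on a tiny grid: the value of each square is its immediate
# reward plus the (discounted) value of the best neighboring square.
rewards = {
    (0, 0): 0, (0, 1): 0,  (0, 2): 2,
    (1, 0): 0, (1, 1): 10, (1, 2): 0,
    (2, 0): 0, (2, 1): 0,  (2, 2): 50,
}
gamma = 0.9  # discount factor: rewards further in the future count a little less
values = {square: 0.0 for square in rewards}

def neighbors(square):
    row, col = square
    steps = [(row - 1, col), (row + 1, col), (row, col - 1), (row, col + 1)]
    return [s for s in steps if s in rewards]

# Sweep repeatedly; high values propagate backward from the rich squares.
for _ in range(100):
    values = {
        square: rewards[square] + gamma * max(values[n] for n in neighbors(square))
        for square in rewards
    }

for square in sorted(values):
    print(square, round(values[square], 1))
```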
The mathematics was elegant, but it had a practical limitation: you needed to know the value of every possible state. For a small grid, this is fine—you can compute and store values for every square. But for a game like Go, with more possible positions than atoms in the observable universe, it’s impossible. You can’t enumerate all states; you can’t store a value for each one.
This is where neural networks enter the story. Instead of computing exact values for every state, you train a network to approximate values. Show it many positions and their eventual outcomes; it learns patterns that generalize to positions it hasn’t seen. The network doesn’t need to experience every possible Go position—it learns features that apply across positions. This combination—Bellman’s mathematical framework for reasoning about sequential rewards, plus neural network approximation—is deep reinforcement learning.
But making it work at scale required solving hard problems. Go would be the proving ground.
Why Go Was Impossible
Go is one of humanity’s oldest games—invented in China more than 2,500 years ago, where it is known as 围棋 (wéiqí), meaning “encircling game.” The game spread to Korea and Japan, where it became deeply embedded in culture, studied by scholars and warriors alike as a path to strategic wisdom. Its rules are deceptively simple. Two players take turns placing black and white stones on a 19×19 grid. Stones that get surrounded are captured. The player who controls more territory wins.
Simple rules. Astronomical complexity.
A chess position might have 30 or 40 legal moves. A Go position typically has around 250. A chess game lasts about 80 moves in total; a Go game, about 150. The number of possible Go games exceeds 10^700—a number so large it makes the number of atoms in the observable universe (a mere 10^80) seem trivial.
Chess programs had conquered human champions by 1997, when IBM’s Deep Blue defeated Garry Kasparov. But chess programs worked through brute force: Deep Blue searched roughly 200 million positions per second, evaluated them with hand-crafted rules, and selected the best line of play. This approach was hopeless for Go. The branching factor was too high, the games too long, and nobody knew how to write rules for evaluating positions. Expert Go players couldn’t articulate why one position was better than another—they just knew. It was intuition.
Before AlphaGo, the best Go programs played at the level of a weak amateur. The consensus among AI researchers was that human-level Go was at least a decade away, possibly more. Some thought it might require fundamental breakthroughs in AI that hadn’t been made yet.
DeepMind achieved it in less than two years.
Inside AlphaGo: Two Networks Working Together
AlphaGo’s architecture combined two deep neural networks with a classical search algorithm. Understanding how these pieces fit together reveals the power of the approach.
The Policy Network answers the question: given this position, what moves are worth considering? Rather than evaluating all 250-odd legal moves equally, the policy network outputs a probability distribution over moves. It might say: this move has a 30% chance of being played by a strong player, this one has 15%, this one has 0.1%. The highest-probability moves are worth exploring; the low-probability moves can usually be ignored.
How did DeepMind train this network? First, through supervised learning. They collected 30 million positions from games played by strong human players on online Go servers. For each position, the network learned to predict what move the human actually played. After this training, the policy network could predict expert human moves about 57% of the time—impressive, given that a typical position offers hundreds of legal choices.
But predicting human moves isn’t the same as playing optimally. Humans make mistakes. Humans follow conventions. So the policy network was further refined through reinforcement learning: it played millions of games against earlier versions of itself, adjusting its weights to favor moves that led to wins. After this self-play training, it became significantly stronger than any human-imitation version could be.
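A minimal sketch of the supervised stage, written in Python with the PyTorch library, looks like the following. The tiny network, the random stand-in data, and the layer sizes are all invented for illustration; AlphaGo’s actual policy network was a much deeper convolutional network fed with richer board features.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyNet(nn.Module):
    """Toy policy network: a board goes in, one score per board point comes out."""
    def __init__(self):
        super().__init__()
        self.hidden = nn.Linear(19 * 19, 256)
        self.out = nn.Linear(256, 19 * 19)   # one logit for each of the 361 points

    def forward(self, board):                # board: (batch, 361) of -1 / 0 / +1
        return self.out(F.relu(self.hidden(board)))

policy = PolicyNet()
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Stand-in "expert" data: random boards and the moves a human supposedly played.
boards = torch.randint(-1, 2, (32, 19 * 19)).float()
expert_moves = torch.randint(0, 19 * 19, (32,))

logits = policy(boards)
loss = F.cross_entropy(logits, expert_moves)  # learn to predict the expert's move
optimizer.zero_grad()
loss.backward()
optimizer.step()

# The softmax over the logits is the probability distribution over moves that
# later steers the search toward a handful of promising candidates.
move_probabilities = F.softmax(policy(boards[:1]), dim=-1)
print(move_probabilities.topk(3))
```

The reinforcement learning stage keeps the same network but changes the objective: instead of matching an expert’s choice, the weights shift toward moves that ended in wins during self-play.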
The Value Network answers a different question: given this position, who is winning? It takes a board position as input and outputs a single number between -1 and +1, representing the expected outcome. A value near +1 means black is almost certainly winning; near -1 means white is. Near 0 means the game is close.
This network was also trained through self-play. AlphaGo played games against itself, recorded the positions that arose, and noted who eventually won. Each position became a training example: given this configuration of stones, black won (or lost) the game. Over millions of games, the value network learned to predict outcomes from any position—developing the kind of intuitive positional judgment that human experts spend years acquiring.
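The bookkeeping behind those training examples is simple enough to sketch directly in Python; the game records below are placeholders.

```python
# Each finished self-play game contributes every one of its positions as a
# training example, labeled with the game's final outcome: +1 if black won,
# -1 if white won. (The positions here are placeholder strings.)
self_play_games = [
    {"positions": ["position_a1", "position_a2", "position_a3"], "outcome": +1},
    {"positions": ["position_b1", "position_b2"], "outcome": -1},
]

value_training_examples = []
for game in self_play_games:
    for position in game["positions"]:
        value_training_examples.append((position, game["outcome"]))

for position, outcome in value_training_examples:
    print(position, "->", outcome)
```

The value network is then fitted by ordinary gradient descent to predict a number near +1 for positions drawn from games black went on to win, and near -1 for positions from games black went on to lose.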
Monte Carlo Tree Search (MCTS) is the algorithm that uses these networks to actually play. The idea is to explore possible futures by simulation, building a tree of moves and countermoves.
Here’s how it works, step by step:
Imagine you’re AlphaGo, staring at a board position. You need to choose a move. You begin building a search tree, with the current position at the root.
Selection: Starting from the root, you descend through the tree, at each node choosing the child that looks most promising. “Promising” balances two factors: moves that the policy network likes (high probability of being good) and moves that haven’t been explored much yet (might reveal something new). This balance between exploitation and exploration is crucial.
Expansion: When you reach a node that hasn’t been fully explored, you add a new child node—a new position that could arise from a legal move.
Evaluation: You evaluate this new position using the value network. What does it think about who’s winning?
Backpropagation: You send this value estimate back up through the tree, updating all the nodes on the path. If this new position looks good for black, all the moves leading to it become slightly more attractive.
Repeat this process thousands of times. The tree grows, concentrating on the most promising branches. After enough simulations, you look at the children of the root node and play the move that was visited most often—the one that the search process deemed best.
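That loop can be sketched in well under a hundred lines of Python. Everything specific below is invented for illustration: the toy stand-in for Go, the stubbed-out policy and value “networks” that return made-up numbers, and the exploration constant. The sketch also ignores the flip in perspective between the two players that a real implementation has to handle; only the four-step rhythm of the search is the point.

```python
import math

# A toy stand-in for Go: a state is the tuple of moves played so far, every
# state offers three legal moves, and games end after six moves.
LEGAL_MOVES = [0, 1, 2]

def is_terminal(state):
    return len(state) >= 6

def policy_network(state):
    # Pretend prior over moves; a real network would look at the position.
    return {0: 0.5, 1: 0.3, 2: 0.2}

def value_network(state):
    # Pretend evaluation in [-1, 1], deterministic for a given state.
    return (hash(state) % 200) / 100.0 - 1.0

class Node:
    def __init__(self, state, prior):
        self.state = state
        self.prior = prior      # the policy network's probability for this move
        self.children = {}      # move -> Node
        self.visits = 0
        self.value_sum = 0.0

    def value(self):
        return self.value_sum / self.visits if self.visits else 0.0

def select_child(node, c_explore=1.5):
    # Balance exploitation (average value so far) against exploration
    # (policy prior, scaled down the more a child has already been visited).
    def score(child):
        bonus = c_explore * child.prior * math.sqrt(node.visits) / (1 + child.visits)
        return child.value() + bonus
    return max(node.children.items(), key=lambda item: score(item[1]))

def simulate(root):
    node, path = root, [root]
    # Selection: walk down the tree toward the most promising leaf.
    while node.children and not is_terminal(node.state):
        move, node = select_child(node)
        path.append(node)
    # Expansion: add children for the legal moves, tagged with policy priors.
    if not is_terminal(node.state):
        priors = policy_network(node.state)
        for move in LEGAL_MOVES:
            node.children[move] = Node(node.state + (move,), priors[move])
    # Evaluation: ask the value network who is winning from here.
    value = value_network(node.state)
    # Backpropagation: update every node on the path back to the root.
    for visited in path:
        visited.visits += 1
        visited.value_sum += value

root = Node(state=(), prior=1.0)
for _ in range(2000):
    simulate(root)

# Play the move at the root that was visited most often.
best_move, best_child = max(root.children.items(), key=lambda item: item[1].visits)
print("chosen move:", best_move, "visits:", best_child.visits)
```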
The beauty of the system is how the networks and search complement each other. The policy network focuses the search on sensible moves, preventing it from wasting time on obvious blunders. The value network provides fast position evaluation, so the search doesn’t need to play out entire games to assess a position. And the search process, by exploring many variations, catches cases where the networks’ quick judgments might be wrong.
This is what produced Move 37. The policy network gave that unexpected placement a low probability—it didn’t look like a move human experts would make. But the search process explored it anyway, the value network evaluated the resulting positions favorably, and after thousands of simulations, the unusual move rose to the top. The system found something that pure imitation of humans would never have discovered.
AlphaZero: Learning Purely Through Self-play
A year and a half after the Lee Sedol match, DeepMind released something even more striking. AlphaZero started with no human knowledge whatsoever—just the rules of Go. No database of expert games. No human-derived features. No opening theory accumulated over millennia. Just: these are legal moves, this is how you win.
The architecture was simpler than the original AlphaGo. A single neural network served both functions: given a position, it output both a policy (probability distribution over moves) and a value (who’s winning). The network was trained entirely through self-play, starting from random initial weights. Two copies of the current network played against each other, and the network learned from the outcomes.
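In sketch form, again in Python with the PyTorch library and invented data, the single two-headed network looks like this. The real network was a deep residual convolutional network, and its policy target was the search’s full visit-count distribution rather than a single chosen move; small stand-ins keep the sketch short.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PolicyValueNet(nn.Module):
    """One shared body, two heads: where to play, and who is winning."""
    def __init__(self):
        super().__init__()
        self.body = nn.Linear(19 * 19, 256)         # shared representation
        self.policy_head = nn.Linear(256, 19 * 19)  # distribution over moves
        self.value_head = nn.Linear(256, 1)         # expected outcome in [-1, 1]

    def forward(self, board):
        h = F.relu(self.body(board))
        return self.policy_head(h), torch.tanh(self.value_head(h))

net = PolicyValueNet()
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3, weight_decay=1e-4)

# Placeholder self-play batch: boards, the moves the search preferred,
# and the eventual game outcomes.
boards = torch.randn(32, 19 * 19)
search_moves = torch.randint(0, 19 * 19, (32,))
outcomes = torch.empty(32, 1).uniform_(-1, 1)

policy_logits, values = net(boards)
loss = F.cross_entropy(policy_logits, search_moves) + F.mse_loss(values, outcomes)
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"combined policy + value loss: {loss.item():.3f}")
```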
Within three days of training on specialized hardware, AlphaZero exceeded the playing strength of the version of AlphaGo that had defeated Lee Sedol. It then kept improving.
The Go world was amazed not just by the strength but by the style. AlphaZero played moves that contradicted centuries of accumulated human wisdom about how Go should be played. It favored influence over territory in ways that upended conventional joseki, the standard corner sequences refined over centuries of play. It made moves that looked like mistakes but turned out to be profound.
Professional Go players began studying AlphaZero’s games the way they had previously studied ancient masters. New opening variations entered human tournament play, derived from the machine’s self-play discoveries. The system hadn’t just mastered human Go—it had expanded the boundaries of how the game could be played.
Here was intelligence that emerged entirely from the rules of a game. No human teaching. No human examples. Just the feedback of winning and losing, amplified through millions of games against itself. The system discovered three thousand years of Go knowledge—and then went beyond it—in seventy-two hours.
Beyond Games
AlphaGo and AlphaZero proved that reinforcement learning could master the most complex board game ever devised. But games, however difficult, are still simplified worlds with clear rules, perfect information, and unambiguous outcomes. The real test was whether these techniques could work in messier domains—the physical world, with its noise and uncertainty, and the social world, with its ambiguous goals and shifting preferences.
They could.
Autonomous Vehicles
Teaching a car to drive seems straightforward: stay in the lane, don’t hit anything, obey traffic laws. But driving is actually a continuous stream of decisions under uncertainty. When should you merge? How close is too close? Should you slow down for the pedestrian who might step into the street? The “right” action depends on predicting what other drivers, cyclists, and pedestrians will do—and they’re all predicting what you’ll do.
Early autonomous vehicle systems relied heavily on hand-crafted rules: if obstacle detected within X meters, brake. But rules can’t anticipate every situation. A plastic bag blowing across the road looks like an obstacle to sensors but shouldn’t trigger emergency braking. A cyclist signaling a turn might not actually turn. Human drivers navigate these ambiguities through intuition developed over years of experience.
Reinforcement learning offers another path. Companies like Tesla have trained driving systems end-to-end: camera images go in, steering and acceleration commands come out, with the system learning from millions of miles of human driving data. Imitating human drivers does much of the work; reward signals that combine safety metrics, passenger comfort, and progress toward the destination can then refine the behavior. The system learns not just to follow rules, but to drive—to make the countless small adjustments that experienced human drivers make unconsciously.
The challenge is that mistakes during training can be catastrophic. You can’t learn to avoid crashes by crashing. So these systems typically train first in simulation—millions of virtual miles where the consequences of errors are merely computational—then transfer that knowledge to real vehicles with careful human oversight.
Robotics
Robots face an even harder version of the driving problem. A self-driving car operates in two dimensions on relatively predictable roads. A robot manipulating objects operates in three dimensions with infinite possible configurations. Pick up a coffee mug. Simple for a human; extraordinarily complex for a robot. The mug could be anywhere, in any orientation. The gripper must approach at the right angle, apply the right pressure, lift smoothly, and not collide with anything along the way.
Traditional robotics relied on precise programming: move to coordinates (x, y, z), close gripper to width w, lift to height h. This works in factories where everything is in known positions. It fails in homes where mugs end up in random places.
Reinforcement learning lets robots learn manipulation through trial and error—thousands of attempts at grasping, each one providing feedback about what works. Research labs have trained robotic arms to pick up objects they’ve never seen before, fold laundry, and even flip pancakes. The robots aren’t following scripts; they’re adapting their behavior based on what they perceive.
The breakthrough enabling this was simulation combined with transfer learning. Train a robot in a physics simulator where it can attempt millions of grasps per day, failing safely. Then transfer that learned skill to a physical robot. The sim-to-real gap—differences between simulated and real physics—remains a challenge, but techniques for bridging it have improved dramatically.
Scientific Discovery
Perhaps most surprisingly, reinforcement learning has begun to accelerate scientific research itself. DeepMind’s AlphaFold, the system that solved the decades-old problem of protein structure prediction, is not primarily a reinforcement learning system; it learned from databases of known protein structures. But the same lab, applying the same appetite for learning through search and feedback, soon turned reinforcement learning loose on problems far from the Go board.
In mathematics, DeepMind’s AlphaTensor discovered new algorithms for fundamental operations like matrix multiplication—algorithms that human mathematicians had missed despite decades of research. The system explored the vast space of possible algorithms, guided by rewards for finding faster solutions, and improved on procedures that had stood as the best known for half a century.
In fusion energy research, reinforcement learning systems have learned to control the superheated plasma in tokamak reactors—a task so complex and fast-changing that human operators and traditional control systems struggle. The AI learns to make thousands of micro-adjustments per second, keeping the plasma stable in ways that weren’t achievable before.
Shaping Language Models: RLHF
The application with perhaps the widest impact has been in language models themselves. The base models described in earlier chapters—systems trained through self-supervised learning to predict the next word—had developed impressive capabilities but also significant problems. They could generate fluent text but sometimes said things that were harmful, false, or unhelpful. They had learned to mimic human text, including the toxic, misleading, and offensive parts.
Self-supervised learning alone couldn’t fix this. The training objective was prediction: given this text, what comes next? A model that perfectly predicted text would also perfectly predict harmful text. The objective was misaligned with what people actually wanted.
Reinforcement learning offered a solution. Rather than training models only to predict text, train them to produce text that humans prefer. The technique is called RLHF: Reinforcement Learning from Human Feedback.
The process works in stages. First, human evaluators compare pairs of model outputs and indicate which they prefer. “This response is more helpful than that one.” “This explanation is clearer.” “This answer avoids the harmful content.” These preferences become training data for a reward model—a neural network that predicts which outputs humans will rate higher.
Then, the language model is fine-tuned using reinforcement learning to maximize the reward model’s score. Each response the model generates is scored by the reward model. Responses that score higher are reinforced; responses that score lower are discouraged. The language model becomes the agent; generating a response is its action; the reward model’s score is its reward.
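A minimal sketch of the first stage, in Python with the PyTorch library, looks like the following. The feature vectors standing in for a real model’s internal representation of each response, the layer sizes, and the data are all invented; the point is the pairwise loss, which simply pushes the preferred response’s score above the rejected one’s.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: a response's features go in, a single score comes out."""
    def __init__(self, feature_dim=512):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(feature_dim, 128), nn.ReLU(), nn.Linear(128, 1)
        )

    def forward(self, response_features):
        return self.score(response_features).squeeze(-1)

reward_model = RewardModel()
optimizer = torch.optim.Adam(reward_model.parameters(), lr=1e-4)

# Placeholder batch of preference pairs collected from human evaluators.
chosen = torch.randn(16, 512)    # features of the responses humans preferred
rejected = torch.randn(16, 512)  # features of the responses humans rejected

# Pairwise preference loss: push the chosen score above the rejected score.
loss = -F.logsigmoid(reward_model(chosen) - reward_model(rejected)).mean()
optimizer.zero_grad()
loss.backward()
optimizer.step()

# In the second stage (not shown), the language model generates responses, the
# reward model scores them, and a reinforcement learning algorithm such as PPO
# nudges the language model toward responses that score higher.
```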
OpenAI deployed RLHF to transform GPT-3 into InstructGPT, and later into ChatGPT. The results were dramatic. The RLHF-trained models were more helpful, less likely to produce harmful content, more willing to refuse inappropriate requests. They followed instructions better. They admitted uncertainty more readily.
This combination—self-supervised learning to acquire language capabilities, reinforcement learning to shape behavior—has become the standard approach for deploying language models. The base model learns to predict words; RLHF teaches it to be useful.
The Difference Between Knowing and Doing
Why does reinforcement learning produce such different results from prediction-based learning? The answer lies in the objectives.
A language model trained only on prediction learns the statistical patterns of human text. It learns that certain words follow certain other words, that certain phrases express certain ideas, that certain topics tend to appear in certain contexts. It becomes, in effect, a detailed model of how humans write.
But modeling text and using text well are different things. A model that perfectly captured human writing patterns would also capture human errors, biases, and harmful tendencies. It would reproduce the average of what humans write—including content that no one would actually want.
Reinforcement learning allows you to specify what you want directly. Not “predict what humans would write” but “write things that are helpful and harmless.” The objective shifts from imitation to optimization. The system no longer tries to be typical—it tries to be good, according to whatever definition of “good” the reward signal provides.
This is the crucial distinction between prediction and action. Supervised and self-supervised learning create systems that model patterns—they learn to predict. Reinforcement learning creates systems that pursue outcomes—they learn to act. Both forms of learning are essential. Prediction provides knowledge; reinforcement shapes behavior.
The most capable AI systems combine both. Deep knowledge of the world comes from massive prediction—learning the patterns in images, text, and data. That knowledge gets shaped into useful behavior through reinforcement—learning which actions lead to desired outcomes. Understanding plus action produces capability.
The Risks of Learned Behavior
Learning to act creates new risks that prediction alone doesn’t face.
If a system is optimizing for a goal, it will find ways to achieve that goal—including ways its designers didn’t anticipate. The history of reinforcement learning is full of examples where systems achieved their objectives while violating their designers’ intentions.
In one famous case, a simulated robot was trained to move across terrain as fast as possible. Instead of developing a walking gait, it learned to make itself very tall, then fall over, catching itself and repeating the process. It moved quickly. It didn’t walk.
In another case, a game-playing agent discovered a bug in the game code that allowed it to accumulate points without actually playing. It exploited the bug. It maximized its score. It didn’t play the game.
These examples are amusing in research settings, but the principle is serious. A system powerful enough to take consequential actions in the world is dangerous if its objective doesn’t quite match what its designers intended. Specifying exactly what you want turns out to be extraordinarily difficult.
This is why alignment—the subject of a later chapter—matters so much. Reinforcement learning creates systems that pursue goals. Making sure those goals actually reflect human values is the central challenge of building beneficial AI.
Closing: From Prediction to Action
This chapter began with a single move in a Go game—a move that emerged from a system that had learned not just to model the game, but to win it. It ends with a recognition that something fundamental has expanded in what machines can learn to do.
The neural networks of earlier chapters learned to see, to read, to generate text. They learned to model patterns in the world. Reinforcement learning extended this capability into the domain of action. Systems learned not just to predict outcomes, but to achieve them. Not just to recognize patterns, but to act on them. Not just to know, but to do.
AlphaGo didn’t just understand Go positions—it played to win. ChatGPT doesn’t just model human language—it tries to be helpful. These systems take actions to achieve objectives, learning from consequences which actions work best.
They have learned not just to answer questions, but to pursue goals. The difference is profound.
Notes and Further Reading
On AlphaGo and AlphaZero
David Silver and colleagues at DeepMind published the AlphaGo research in Nature in 2016. The paper, “Mastering the game of Go with deep neural networks and tree search,” details the architecture and training process. The follow-up AlphaZero paper appeared in Science in 2018 under the title “A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play,” demonstrating the power of pure self-play learning without human knowledge. For a more accessible account, Cade Metz’s article “The Sadness and Beauty of Watching Google’s AI Play Go” (Wired, March 2016) captures the emotional impact of the Lee Sedol match.
On Reinforcement Learning Foundations
Richard Sutton and Andrew Barto’s Reinforcement Learning: An Introduction remains the definitive textbook, now in its second edition (2018). It covers both classical foundations and modern deep reinforcement learning. For historical context, Sutton’s essay “The Bitter Lesson” (2019) argues that general methods with massive computation consistently outperform approaches that rely on human-designed features—a theme that the AlphaZero results powerfully confirm.
On RLHF and Language Models
OpenAI’s “Training language models to follow instructions with human feedback” (2022) introduces the InstructGPT system and the RLHF methodology. Anthropic’s “Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback” (2022) provides a complementary perspective focused on safety properties. For those interested in the technical details, the “Proximal Policy Optimization” algorithm paper from OpenAI (2017) describes the reinforcement learning method commonly used in RLHF.
On the Lee Sedol Match
The 2017 documentary AlphaGo, directed by Greg Kohs, offers an intimate view of both the DeepMind team and Lee Sedol during the historic match. It captures the human story behind the technical achievement—including Lee Sedol’s famous Game Four victory, when he found a brilliant move that exploited a subtle weakness in AlphaGo’s self-play training. That game, like Move 37, revealed something about intelligence: the machine found moves no human had imagined, but the human found a move the machine had missed. Both were learning.


