The RL Spiral: The Equation That Explains Your Brain
Every advanced AI runs on an equation. Your brain has been running it for 500 million years. A neuroscientist proved it by accident. That accident changed two fields at once.
This is the second article in The RL Spiral, an eight-part series on reinforcement learning. The first article, The Reward Trap, explored why reward specification is so hard. This one explains where the reward signal came from.
Every mainstream reinforcement learning algorithm shares a single mathematical core. The algorithm that taught a computer to play Atari uses it. The algorithm behind AlphaGo uses it. The reward model training behind RLHF, the technique that shaped ChatGPT and every other major conversational AI, uses it. Strip away the architectural complexity of any modern deep RL system and you find the same quantity underneath: temporal difference prediction error.
A computer scientist named Rich Sutton wrote it down in 1988, working at a telephone company’s research lab, publishing in a journal that had existed for barely two years. Almost nobody noticed.
A few years later, a neurophysiologist in Switzerland named Wolfram Schultz was recording the electrical activity of individual dopamine neurons in the brains of macaque monkeys. He had been studying movement, not reward. But the firing pattern he was seeing matched Sutton’s equations with a precision that could not be coincidental. The brain, it turned out, had been running the same algorithm. Not something like it. The same mathematical structure, arrived at through completely different means, in a completely different medium.
That convergence changed two fields simultaneously. It gave neuroscience a formal language for what dopamine was actually doing. It gave reinforcement learning biological validation from a source that knew nothing about computer science. And it opened a research program that is still running: computational psychiatry, where RL models now explain addiction, depression, and psychosis as specific failure modes of the brain’s prediction error machinery.
This is the center of the series. Understanding it requires going back to 1988, to a paper almost nobody read.
The Paper Nobody Read
Rich Sutton was thirty-one years old when he submitted the paper that would eventually be recognized as one of the foundational documents of reinforcement learning. It was 1988. He was working at GTE Laboratories, a telephone company’s research division in Waltham, Massachusetts. Not at a university. Not at a place the field was watching. The journal he chose was Machine Learning, which had been publishing for barely two years. The title was dry even by academic standards: “Learning to Predict by the Methods of Temporal Differences.”
It landed quietly. Not because it was bad, but because most of the field was not yet ready to recognize what it contained.
Sutton had been thinking about learning and prediction since his PhD at the University of Massachusetts in the early 1980s. He had worked with Andrew Barto on some of the earliest formal models of reinforcement learning, and their collaboration had shaped his conviction that the key to machine learning was not feeding a system labeled examples and telling it the right answers, but letting it learn from its own experience in real time. The 1988 paper was the formal culmination of ideas he had been developing for nearly a decade.
The problem Sutton was working on was deceptively simple: how do you teach a system to make accurate predictions about the future, using only information available in the present? The standard answer at the time was to wait for the outcome. An episode plays out, a result arrives, and you use the gap between what you expected and what actually happened to update your model. Researchers called this Monte Carlo learning, after the statistical tradition of estimating a quantity by sampling complete trials and averaging the results. It works. It is also slow. Every prediction update requires waiting for the whole sequence to finish before you can learn anything.
Sutton saw a different path. You do not have to wait for the final outcome to learn. You can update at every step, using not the final outcome but the difference between your current prediction and your previous prediction. Your estimate at step two corrects your estimate at step one. Your estimate at step three corrects step two. The signal that drives learning is not the distance between your prediction and the final truth. It is the distance between adjacent predictions.
Think of a chess player assessing a position mid-game. A Monte Carlo approach would mean reserving judgment until the game ends, then updating based on whether you won or lost. Sutton’s approach lets the player update after every move, using the new position’s assessment to correct the old one. Next time a similar position appears, the assessment starts closer to the truth. You learn continuously, not episodically. Each new prediction is both an estimate of the current position and a correction to the last.
Sutton’s approach is not just faster. It also solves a deeper problem. In a long sequence of decisions, which specific decision deserves credit for the final outcome? Monte Carlo updates every step from the same final result. A chess game lasting two hundred moves ends in a win. Every move in the game receives the same learning signal: the game was won. The brilliant sacrifice on move forty and the irrelevant shuffle on move one hundred and fifty are credited equally. Sutton’s method assigns credit incrementally. After the sacrifice, the position improves sharply, and the estimate jumps. That jump is the learning signal: this move mattered. After the shuffle, the estimate barely changes. No jump, no credit. Value propagates backward through the chain, and each decision carries weight proportional to how much it actually shifted the trajectory.
Say the system estimates a mid-game position at 0.60: a 60 percent chance of winning from here. A sacrifice is made. The new position gets estimated at 0.78. The gap between 0.60 and 0.78 is the correction signal. The previous estimate was too pessimistic. The system revises upward. Next time it sees a similar position, it will estimate closer to 0.78. Play enough games, and the estimates converge toward something accurate. The signal driving all this learning is never the final outcome of the game. It is always the difference between adjacent predictions, one step apart in time.
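The arithmetic of that update fits in a few lines. A minimal sketch using the numbers from the text; the learning rate is an illustrative assumption, not part of the example above:

```python
# One TD update using the numbers from the text. The learning rate
# (alpha) controls how far the old estimate moves toward the
# correction, rather than jumping all the way in a single step.
old_estimate = 0.60   # what the system thought before the sacrifice
new_estimate = 0.78   # what it thinks one move later
alpha = 0.5           # learning rate (assumption of this sketch)

correction = new_estimate - old_estimate   # the gap between adjacent predictions
old_estimate += alpha * correction         # revise the earlier estimate upward

print(round(correction, 2), round(old_estimate, 2))  # 0.18 0.69
```

Play enough games and these half-steps accumulate: the estimate for this kind of position drifts from 0.60 toward 0.78, without the final result of any single game ever entering the calculation.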
He called this temporal difference learning, because the signal came from the difference between predictions at different points in time.
Sutton had not invented this from nothing. Its roots ran back to Richard Bellman, a mathematician working at the RAND Corporation in Santa Monica in the early 1950s. Bellman had been thinking about optimal decision-making: how should a system choose actions to maximize long-term value? His key insight was that you never need to see the whole future at once. The value of where you are now equals the reward you get right now, plus the value of where you end up next. That next position is itself just its immediate reward, plus the value of the position after that. Follow this chain far enough, and you have a complete account of long-term value.
In most real-world problems involving delayed rewards, including most games, the immediate reward at each intermediate step is zero. Only the final outcome carries a non-zero signal. The chain is still valid. It just means that the only non-zero signal sits at the end, and the chain carries it backward, step by step, until every position along the way carries a number: from here, this is how likely you are to win.
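That backward flow can be written out as a toy backward pass. A minimal sketch, assuming a five-step sequence that pays off only at the end; the discount factor is an assumption of this sketch (standard in Bellman's formulation, though the prose above omits it) and makes earlier positions worth slightly less, so the propagation is visible in the numbers:

```python
# Bellman's chain on a five-step sequence where only the last step
# pays off. Each position's value is its immediate reward plus the
# discounted value of the next position: V(s) = r(s) + gamma * V(next).
rewards = [0, 0, 0, 0, 1]   # intermediate steps pay nothing; the end pays 1
gamma = 0.9                 # discount factor (illustrative assumption)

values = []
next_value = 0.0
for r in reversed(rewards):           # walk the chain backward
    next_value = r + gamma * next_value
    values.append(next_value)
values.reverse()

# The single non-zero signal at the end has propagated to every position:
print([round(v, 2) for v in values])  # [0.66, 0.73, 0.81, 0.9, 1.0]
```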
The problem was that Bellman’s formula required a complete map of the environment: every possible state, every possible transition, every possible outcome. For a game as simple as tic-tac-toe, that is feasible. For chess, with more possible positions than any computer could enumerate in the lifetime of the universe, it is not. For the real world, it is not even close. Bellman named this difficulty himself. He called it the curse of dimensionality. The name stuck.
Sutton’s temporal difference methods were a way of approximating Bellman’s solution from experience, without needing a map. Instead of solving Bellman’s formula exactly, a temporal difference learner, or TD learner, estimates it incrementally, updating each prediction as new experience arrives. The estimates start rough. They improve with every correction. Given enough experience, they converge toward the right answer. No map required.
The quantity that drives all of this is the TD prediction error. Take the reward you got immediately from this step, add what you now expect from here, and subtract what you had predicted this position was worth. That difference is the error signal. When your prediction was too low, the error is positive. When your prediction was too high, the error is negative. When you were exactly right, the error is zero and nothing changes.
None of this was visible in 1988. What was visible was a technically careful paper in a mid-tier journal, making a contribution that most readers catalogued as incremental. It would take years for the field to appreciate what Sutton had found. By then, it would be confirmed from a completely unexpected direction.
The Monkey and the Juice
Wolfram Schultz had not set out to study reward. He was a neurophysiologist at the University of Fribourg in Switzerland, and in the late 1980s his focus was the basal ganglia, a set of structures deep in the brain known to be involved in movement. The clearest evidence for their importance was negative: when they were damaged, as in Parkinson’s disease, voluntary movement deteriorated dramatically. Schultz wanted to understand what they actually contributed to normal function.
To study the basal ganglia in awake, behaving animals, he implanted fine electrodes in the brains of macaque monkeys. The electrodes were sensitive enough to detect the electrical activity of single neurons. He could listen to one neuron at a time, recording its firing pattern as the animal moved and learned.
For the conditioning experiments, the setup was standard. A light or a tone would appear. A few seconds later, a small drop of apple juice would arrive in the monkey’s mouth. Simple, repeatable, controlled. The monkey would learn the association, and Schultz could watch what happened in the brain as the learning unfolded.
The early results were unsurprising. When the juice arrived, neurons in the brain’s dopamine-producing regions fired. The ventral tegmental area and the substantia nigra, two clusters of dopamine-producing cells deep in the midbrain, showed brief, sharp bursts of activity at the moment of reward. Dopamine was associated with reward. This was already known. The firing seemed to confirm it.
Then, as the monkey learned, the pattern changed.
After many repetitions of the light-then-juice sequence, the dopamine neurons stopped responding to the juice itself. Instead, they fired when the light appeared. The signal had moved backward in time: from the reward to the event that predicted the reward.
Schultz kept watching. When the light appeared and the juice arrived on schedule, the neurons fired at the light and stayed flat at the juice. But when the light appeared and no juice followed, something unexpected happened at the precise moment the juice would have arrived. The dopamine neurons did not simply stay quiet. Their activity dropped below baseline. The absence of an expected reward produced an active downward signal.
Schultz had found a three-part pattern. When something better than predicted happened: a burst. When something exactly as predicted happened: silence. When something worse than predicted happened, including outright absence: a dip. Both the timing and the direction of the signal carried information. And crucially, the signal tracked prediction, not reward itself. If a reward arrived but had not been predicted, it triggered a burst. If a reward arrived exactly on cue, it triggered nothing. The dopamine neurons were not reporting what happened. They were reporting the gap between what happened and what was expected.
This three-part pattern is not just a feature of the dopamine system. It is the signature of learning in progress. The burst marks what the system has not yet learned to predict. The silence marks what it has. The dip marks where a prediction needs to be revised. Over enough repetitions, the bursts shrink, the silences grow, and the system converges on an accurate model of what to expect and when. The dopamine neurons do not just resemble a TD learner. They learn the way a TD learner learns.
He published these findings across several papers through the early 1990s, reporting each observation precisely. One detail was particularly striking. If a second predictor was introduced earlier in the sequence, a bell that reliably preceded the light that reliably preceded the juice, the dopamine response migrated again. It moved from the light to the bell. The signal kept propagating backward through the chain to the earliest reliable predictor. This is exactly what a TD learner should do: once you can predict a prediction, that earlier predictor absorbs the value signal, and the later one falls silent. The movement of the signal through time was not just consistent with TD learning. It was one of TD learning’s most distinctive predictions.
That connection came from Cambridge, Massachusetts.
The Same Equation
Peter Dayan was a computational neuroscientist working at MIT’s Center for Biological and Computational Learning. Read Montague was a neuroscientist at Baylor College of Medicine, collaborating with Terrence Sejnowski at the Salk Institute. Both Dayan and Montague were fluent in the mathematics of reinforcement learning and in the neuroscience of reward. They had been working for several years on the question of whether the brain’s learning mechanisms could be understood in formal, computational terms.
When they encountered Schultz’s data, they recognized something that Schultz himself had not named.
The three-part pattern was not reminiscent of TD prediction error. It was TD prediction error. Burst when reward was better than predicted: a positive TD error. Silence when reward arrived as predicted: a zero TD error. Dip when reward failed to arrive: a negative TD error. The temporal structure matched. The directional structure matched. The relationship to prediction, rather than to reward itself, matched. The mathematical structure Sutton had worked out on paper in 1988 was present in the firing of dopamine neurons in the primate midbrain.
In 1996, Montague, Dayan, and Sejnowski published the formal connection in the Journal of Neuroscience. The paper showed that a model using TD prediction errors to update value estimates produced firing patterns that matched Schultz’s dopamine data in quantitative detail. The manuscript had been rejected seven times before it was accepted. Reviewers did not dispute the math. They disputed the premise: the idea that a theoretical framework from computer science could explain, in quantitative detail, what specific neurons were doing in a primate brain. In 1994 and 1995, this was not an easy sell.
A year later, in March 1997, Schultz, Dayan, and Montague published a joint paper in Science. The paper was blunt: the activity of midbrain dopamine neurons during learning is consistent with the temporal difference prediction error signal used in reinforcement learning algorithms. It became one of the most cited papers in neuroscience. Montague later described the core of the recognition with characteristic directness. They had, he said, tripped and guessed a basically correct setting.
The implications spread in two directions at once.
For neuroscience, the finding gave dopamine a precise job description for the first time. Dopamine had long been associated with pleasure, or motivation, or salience. These descriptions were not wrong, but they were not specific enough to generate predictions. TD prediction error was specific. It said exactly what dopamine should do in any given experimental condition, and researchers could test it.
For reinforcement learning, the validation was different in character. RL researchers had built their algorithms on mathematical and engineering grounds, without any reference to the brain. The convergence with dopamine neuroscience did not prove those algorithms were correct. But it was evidence that they were tracking something real about the nature of learning under uncertainty. Evolution, working over hundreds of millions of years with selection pressures that had nothing to do with computer science, had arrived at the same solution. That kind of independent confirmation, when it occurs, tends to matter.
What neither the 1996 nor the 1997 paper settled was why. Was the convergence because TD learning is the uniquely optimal solution to the prediction problem under biological constraints, so that any sufficiently powerful learning system would converge on it? Or was there something about the specific architecture of neural circuits that happened to be well-suited to TD-like computation for reasons having nothing to do with optimality? The question is open. The fact of the convergence does not explain the convergence. It simply establishes that the same mathematical structure appeared twice, in two completely different media, arrived at through completely different routes.
What the papers did settle, practically, was the research agenda for the following two decades. Neuroscientists began testing the framework in humans, not just monkeys. They measured what happened to the prediction error signal when dopamine was artificially raised or lowered. They looked for TD-like signals in other brain structures. The framework kept generating predictions, and the predictions kept holding up. That is the usual sign that something real has been found.
The two threads of this series, the RL thread and the neuroscience thread, became the same thread in March 1997. Everything that follows in this series lives in the aftermath of that moment.
When the Algorithm Breaks
The most productive consequence of the Schultz-Dayan-Montague synthesis was not the convergence itself. It was what happened when researchers started examining the places where the dopamine system deviated from a perfect TD learner.
The dopamine system is not a perfect TD learner. It is close enough that the framework applies with real precision, but it has consistent quirks. Each one is a solution evolution found for running a prediction error system in a world where mistakes can kill you and resources are scarce.
It is chemically tunable: the brain can raise or lower dopamine levels to recalibrate the system’s sensitivity as conditions change. It processes losses more strongly than equivalent gains, because in a world where one bad decision can be fatal, overweighting negative outcomes keeps you alive. And it fires not just on confirmed predictions but on novel or uncertain stimuli, flagging anything that might matter before the system knows whether it does. In a complex, unpredictable environment, erring on the side of attention is worth the cost.
Each of these properties, under normal conditions, is a feature. Under extreme conditions, each one becomes a specific pathology. The chemical tunability that allows recalibration can be hijacked by drugs. The asymmetry that protects against fatal mistakes can tip into depression. The novelty-flagging that keeps you alert can become psychosis. Three adaptive solutions. Three failure modes. Each one revealing, in its breakdown, what the system was built to do.
The clearest demonstration is addiction.
Cocaine and amphetamine work, in part, by flooding the dopamine system with activity that has nothing to do with actual outcomes in the world. Normally, after a dopamine neuron fires, the signal is brief: the dopamine is quickly cleared away and recycled, and the concentration drops back to baseline before the next firing. Cocaine blocks that cleanup. Each new burst of dopamine lands on top of what has not yet been cleared. The concentration builds far higher than any natural event would produce. The downstream circuits that read this signal have no way to distinguish between dopamine elevated by a genuine surprise and dopamine elevated because a chemical prevented its removal. They read the concentration, and the concentration says: something far better than expected just happened. The prediction error signal is not reporting the world. It is being inflated at the source. The appropriate response, from the perspective of the circuits that implement TD learning, is to update strongly: build a model in which whatever preceded this state is extremely valuable, and orient behavior toward obtaining it again.
The algorithm runs correctly. It does exactly what a TD learner should do in response to a large positive prediction error. The problem is that the signal is disconnected from any actual outcome in the world. The brain builds a model in which the drug is the most rewarding thing it has ever encountered, because that is precisely what the TD signal is reporting. Every other source of reward, food, friendship, work, accomplishment, looks pale by comparison. The gap between the brain’s learned model and what the person’s actual life delivers widens with every dose. This is not a failure of willpower. It is specification gaming enacted in biological hardware: the proxy signal has been decoupled from the underlying goal by a chemical that manipulates the signal directly.
Depression tells a different story. In addiction, the prediction error signal is too strong. In depression, it is lopsided. When something bad happens, the negative prediction error fires normally, or even fires harder than it should. The brain updates: the world is worse than I thought. But when something good happens, the positive prediction error is muted. The brain barely updates. A promotion, a compliment, a beautiful day: the signal that should say “things are better than expected” arrives faint, or not at all. Over time, the TD learner constructs a persistently pessimistic model. Not because the world is uniformly bad, but because the machinery for revising predictions upward is impaired while the machinery for revising them downward works fine. The algorithm keeps running. It is running with a broken dial.
Schizophrenia adds a third pattern. In addiction, the signal is inflated. In depression, it is lopsided. In certain psychotic states, the signal fires on things that carry no predictive value at all. A stranger’s glance, a pattern in the wallpaper, a stray thought: the prediction error system flags them as intensely meaningful. The brain treats them the way it would treat a genuine surprise, assigning significance, building associations, orienting attention. The subjective experience of ordinary perceptions feeling charged with hidden importance may be what it feels like from the inside when the prediction error signal misattributes relevance at scale. This is not a metaphor. It is a testable hypothesis, and some of its predictions about patient behavior in reward-learning tasks have held up in experiments.
These three cases turned the Schultz-Dayan-Montague framework from a finding into a research program. Researchers in computational psychiatry now design experiments derived directly from RL theory, measuring how individual patients update their predictions in response to outcomes. In some cases, the patterns correlate with symptom severity. In others, they predict which treatments will work. The framework is earning its keep.
The broader point is what it means for RL. The brain’s TD learner is imperfect, but it operates under constraints no silicon system has to face: limited neural resources, the need to sleep, social context that shifts the reward landscape, environments that change without warning, and the fact that bad decisions can be fatal. The quirks are the solutions evolution found for those constraints. They are also, implicitly, a map of problems that silicon RL has not yet solved.
RL systems do not face predators or need sleep. They also cannot do what a macaque monkey does effortlessly: walk into a new environment and figure out, within minutes, what is worth pursuing. A modern deep RL system given a comparable problem often requires millions of trials. AlphaGo Zero, one of the most celebrated RL systems ever built, needed 4.9 million games of self-play over three days to master Go. A human professional learns the same game through a few thousand games over a lifetime. Fewer games. More time. But vastly more learning extracted from each experience. Understanding that gap is one of the field’s deepest open questions.
The convergence Schultz, Dayan, and Montague established set the terms for that question. It showed that RL and neuroscience are not parallel fields that occasionally borrow vocabulary from each other. They are tracking the same underlying problem. Each revolution in one field reshapes what the other knows how to ask. That spiral is not finished.
The brain’s quirks, its adaptive solutions, are the field’s next research agenda. Efficient generalization, real-time adaptation, learning from sparse experience: the problems silicon RL has not yet solved are precisely the ones evolution spent hundreds of millions of years working on.
The deepest of those problems has a name. Richard Bellman gave it one in 1957. It is still unsolved.
Next in The RL Spiral: The Curse Bellman Couldn’t Break.


