The RL Spiral, Part 8: The Open Questions
Transfer, adaptation, curiosity, and the nature of reward itself. Four open problems define RL’s frontier. Evolution has been working on them for 500 million years.
This is the final article in The RL Spiral, an eight-part series on reinforcement learning. The previous article, RL Meets the Physical World, showed how RL is reshaping robotics through the same division of labor the brain uses. This one maps the frontier: four open problems, each connected to neuroscience, each pointing to what comes next.
This series has traced a spiral. An algorithm designed by a computer scientist in 1988 turned out to match the firing pattern of dopamine neurons recorded by a neurophysiologist in 1997. A technique for stabilizing neural network training, experience replay, invented in 1992 and essential to DQN’s breakthrough in 2013, turned out to mirror what the hippocampus does during sleep. A division of labor between model-free reaction and model-based imagination in silicon turned out to parallel the division between basal ganglia and prefrontal cortex in the brain. The fork between self-play and human feedback turned out to map onto the brain’s own distinction between individual learning and social cognition. At every turn, RL and neuroscience have arrived at the same solutions through completely different routes, and each convergence has opened the next research frontier.
The spiral is not finished. The places where RL falls short are, with striking regularity, the places where the brain does something RL has not yet learned to approximate. The four problems that follow are not predictions about the future. They are a map of the present frontier, drawn from the gap between what RL can do and what biology already does.
Transfer
A child who learns to pour water from a jug into a glass can, the next day, pour juice from a bottle into a bowl. The skill transfers. Not perfectly, not without some adjustment, but without starting over. The child does not need ten thousand more pouring episodes to learn that bottles work like jugs and bowls work like glasses. Something about the structure of “pouring” has been extracted from the first task and applied to the second.
RL cannot do this reliably. A policy trained to pour water from a specific jug into a specific glass, in a specific position on a specific table, does not transfer to a bottle and a bowl without retraining. Article 3 traced this to the curse of dimensionality: the policy was learned in a narrow region of the state space and has nothing to say about the region where bottles and bowls live. Article 7 showed that domain randomization helps, by training across varied conditions so the policy generalizes better. But even with extensive randomization, the generalization is shallow. It handles variations within a task. It does not handle transfer across tasks.
The neuroscience question is: what is a skill, neurologically? When a child learns pouring, what exactly is stored? Not a muscle sequence, because the sequence changes with every new container. Not a visual template, because jugs and bottles look different. Something more abstract: a causal model of how tilting a container controls the flow of liquid, combined with a control strategy for regulating the tilt based on visual feedback. This abstract representation is what transfers. The new container activates the same representation. Only the specific parameters need updating.
Neuroscience research has shown that this abstraction has a specific neural signature. When subjects perform a familiar skill with a new tool, brain imaging shows strong activation in premotor and parietal regions associated with the abstract action plan, and relatively weak activation in regions associated with specific object features. The brain does not re-learn the action from scratch. It retrieves the abstract plan and adapts the details. Motor cortex adjusts the force profile. Visual cortex adjusts the tracking. But the high-level structure of the action (tilt, monitor flow, stop when full) remains unchanged. The skill lives at a level of abstraction that is invariant to the specific objects involved.
Building this kind of transferable abstraction is the deepest challenge in representation learning for RL. The brain does it through hierarchical compression across a lifetime of experience, as Article 3 described. Each new skill is not learned from scratch. It is attached to an existing scaffold of previously learned abstractions. The two-year-old who pours for the first time is not starting from zero. The two-year-old has already spent two years grasping, tilting, dropping, and watching liquids behave. The pouring skill is the last layer on top of that stack.
RL agents have no stack. Each task starts from a randomly initialized network. The field’s response (foundation models for robotics, pre-trained on diverse experience and fine-tuned for specific tasks) is an explicit attempt to build the stack artificially. The approach is early. The results are promising but narrow. The gap between a foundation model’s transfer and a toddler’s transfer remains measured in orders of magnitude.
Adaptation
Standard RL assumes the world stays the same. The math behind every algorithm in this series, from Bellman’s equation in Article 3 to TD learning in Article 2 to DreamerV3’s world model in Article 6, assumes that the rules of the environment do not change while the agent is learning. This assumption is called stationarity. It is mathematically necessary for convergence guarantees. It rarely holds in practice.
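In symbols, stationarity is the absence of a time index on the environment’s dynamics. In the standard textbook form of the Bellman optimality equation, the transition probabilities P and rewards R are assumed to be the same at every step:

```latex
V^{*}(s) \;=\; \max_{a}\Big[\, R(s, a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \,\Big]
```

Nothing in the equation lets P or R depend on time. If the environment’s dynamics drift, P changes underneath the agent, and the fixed point it converged to is the solution to a problem that no longer exists.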
Markets shift. Seasons change. User preferences evolve. A robot deployed in a warehouse learns the layout, and then the warehouse is reorganized. A language model is trained on 2024 data and deployed in 2026, where the slang, the politics, and the cultural references it learned are already drifting out of date. A self-driving car learns to navigate a city, and then a construction project closes a highway for eighteen months. In each case, the policy the agent learned is optimized for a world that no longer exists. The agent must adapt. Standard RL has no principled mechanism for deciding when to adapt, how fast to adapt, or how much of the old policy to keep.
The brain adapts constantly, and in 2002, Kenji Doya proposed a framework for how. The same Doya whose 1999 work on the basal ganglia and cerebellum appeared in Article 7 went deeper in a follow-up paper, arguing that the brain adjusts its own learning parameters in real time through four chemical systems, each controlled by a different neurotransmitter.
Dopamine, the signal Article 2 traced in detail, controls reward sensitivity: how strongly the brain responds to positive and negative outcomes. Serotonin controls time horizon: how far into the future the brain looks when evaluating consequences. This is the biological equivalent of the discount factor, the parameter in every RL algorithm that determines whether the agent prioritizes immediate reward or long-term return. Acetylcholine controls learning rate: how quickly the brain updates its model when new evidence arrives. Under high uncertainty, acetylcholine levels rise, and the brain learns faster from each observation. Under stability, levels drop, and learning slows. Noradrenaline controls exploration: how much the brain deviates from its current best strategy to try new approaches. Under environmental volatility, noradrenaline rises, pushing the system toward exploration. Under stability, it drops, favoring exploitation of known strategies.
Four parameters. Four chemical systems. Each one adjusting in real time based on the current state of the environment. This is not a metaphor for hyperparameter tuning. It is literal biological hyperparameter tuning, implemented in neurochemistry, operating continuously, without any external scheduler or human engineer deciding when to change the learning rate.
AI has spent decades developing meta-learning as a research agenda: algorithms that learn to learn, that adjust their own parameters based on experience. The brain has been doing it with neuromodulators for hundreds of millions of years. The contrast is the point. The most sophisticated meta-learning systems in AI adjust a handful of parameters across training runs. The brain adjusts four fundamental learning parameters at every moment, in real time, based on signals about uncertainty, volatility, time horizon, and reward magnitude. The gap between these two systems is enormous. Closing it is one of the most productive directions in RL research, and Doya’s framework remains the clearest map of what a closed gap would look like.
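To make the analogy concrete, here is a deliberately small sketch, my own illustration rather than an implementation of Doya’s model: a two-armed bandit learner that raises its learning rate and exploration rate when recent prediction errors turn volatile, the way acetylcholine and noradrenaline are proposed to, and lowers them again when the world is stable. All names and constants are invented for the example.

```python
import random
import statistics

class NeuromodulatedBandit:
    """Toy bandit learner whose hyperparameters adapt online, loosely
    inspired by Doya's neuromodulator framework (an illustrative sketch,
    not his model): learning rate rises with recent prediction-error
    volatility (acetylcholine analogue), and so does exploration
    (noradrenaline analogue)."""

    def __init__(self, n_arms=2, window=20):
        self.q = [0.0] * n_arms
        self.errors = []          # recent prediction errors, for volatility
        self.window = window
        self.alpha = 0.1          # learning rate, adapted online
        self.epsilon = 0.1        # exploration rate, adapted online

    def act(self, rng):
        if rng.random() < self.epsilon:
            return rng.randrange(len(self.q))
        return max(range(len(self.q)), key=lambda a: self.q[a])

    def update(self, arm, reward):
        delta = reward - self.q[arm]          # prediction error (dopamine analogue)
        self.q[arm] += self.alpha * delta
        self.errors = (self.errors + [delta])[-self.window:]
        if len(self.errors) == self.window:
            volatility = statistics.pstdev(self.errors)
            # High volatility -> learn faster, explore more; clamp to sane ranges.
            self.alpha = min(0.5, 0.05 + volatility)
            self.epsilon = min(0.3, 0.02 + 0.5 * volatility)

rng = random.Random(0)
agent = NeuromodulatedBandit()
for t in range(2000):
    # Nonstationary environment: the better arm switches at t = 1000.
    p_good = [0.8, 0.2] if t < 1000 else [0.2, 0.8]
    arm = agent.act(rng)
    reward = 1.0 if rng.random() < p_good[arm] else 0.0
    agent.update(arm, reward)

print(agent.q)  # by the end, the post-switch arm should dominate
```

The point of the sketch is the feedback loop: no external scheduler touches `alpha` or `epsilon`; the agent’s own error statistics drive both, which is the shape of the mechanism Doya attributes to neuromodulation.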
Curiosity
The first article in this series opened with a boat on fire, spinning in circles: an RL agent that maximized its reward function without doing what its designers intended. Every article since has circled the same problem from a different angle. Reward specification is hard. Reward functions contain gaps. Sufficiently capable optimizers find those gaps. The entire field of AI alignment exists because this problem has not been solved.
One response, explored in Article 5, is to get the reward signal from humans rather than from a function: RLHF, Constitutional AI, debate. These approaches shift the problem from “how do you write the right reward function” to “how do you elicit the right human judgments.” They help. They do not eliminate the specification problem, because human judgments are themselves proxies for values that are never fully articulated.
A more radical response is to ask whether agents need external reward at all.
Children explore without being rewarded for it. A toddler who opens and closes a cabinet door fifty times in a row is not receiving external reinforcement. Nobody is awarding points. The behavior is driven by something internal: a satisfaction in discovering how things work, a pull toward the novel and the surprising. Developmental psychologists call this intrinsic motivation. It is one of the most robust and universal features of human development.
The neuroscience of intrinsic motivation connects directly to the dopamine system Article 2 described, but with a twist. Dopamine neurons do not fire only on external reward. They also fire on novelty, on unexpected information, on stimuli whose relevance is not yet known. The prediction error signal that Schultz discovered is not exclusively a reward signal. It is a learning signal that flags anything the system has not yet predicted, whether or not that thing is rewarding. The dopamine burst on a novel stimulus says “pay attention” before the system has determined whether the stimulus is good or bad.
RL researchers have built on this observation to create curiosity-driven agents: systems that receive internal reward for encountering states they have not predicted well. The agent is rewarded not for achieving a goal but for reducing its own prediction error, for exploring the parts of the environment it understands least. These agents can learn complex behaviors in environments with no external reward at all. They explore mazes, discover tools, and build models of their environment purely from the drive to understand it. In Montezuma’s Revenge, the game that defeated DQN in Article 4, curiosity-driven agents discovered rooms and collected keys that a pure reward-seeking agent would never find, because the curiosity signal rewarded exploration of the unknown rather than waiting for a score that might never come.
The limitation is that curiosity without goals can be aimless. An agent driven purely by prediction error will be fascinated by static on a television screen, because random noise is maximally unpredictable. This is called the “noisy TV problem” in the literature, and it is more than a quirk. It reveals that raw prediction error is not the same thing as meaningful novelty. Biological curiosity does not work this way. It is shaped and directed by the organism’s needs, its developmental stage, and its social context. A toddler does not find random noise interesting. The toddler finds cabinet doors interesting, because doors are at the right level of complexity for a brain that is currently building models of how physical objects move. The toddler finds faces interesting, because social information is high-value for a species that depends on cooperation. The curiosity is targeted, not random. It is tuned to the frontier of what the organism is ready to learn next. Building agents whose internal drives are similarly structured, curious about the right things at the right time, remains an open problem.
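The core mechanism is small enough to sketch. The following tabular toy is inspired by prediction-error curiosity methods such as Pathak et al.’s ICM, but is not that algorithm; all names are illustrative. It keeps visit counts as a crude forward model and pays intrinsic reward equal to the model’s surprise at each observed transition.

```python
from collections import defaultdict

class TabularCuriosity:
    """Minimal sketch of prediction-error curiosity: the agent keeps a
    forward model p(s' | s, a) as visit counts, and its intrinsic reward
    for a transition is the model's surprise at the observed outcome."""

    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(int))  # (s, a) -> {s': n}

    def intrinsic_reward(self, s, a, s_next):
        seen = self.counts[(s, a)]
        total = sum(seen.values())
        # Predicted probability of the observed next state (0 if never seen).
        p = seen[s_next] / total if total else 0.0
        self.counts[(s, a)][s_next] += 1   # update the forward model
        return 1.0 - p                     # high surprise -> high reward

curiosity = TabularCuriosity()
first = curiosity.intrinsic_reward("room1", "right", "room2")
again = curiosity.intrinsic_reward("room1", "right", "room2")
print(first, again)  # -> 1.0 0.0
```

The noisy TV problem is visible even in this sketch: if `s_next` is fresh random noise on every call, `p` never rises and the intrinsic reward never decays, so the agent is paid forever for staring at static.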
The Nature of Reward
Every article in this series has assumed that learning is driven by reward. The agent does something. A consequence follows. The consequence is evaluated. The evaluation updates the agent’s behavior. This is the framework Thorndike described in 1898, Bellman formalized in 1957, Sutton operationalized in 1988, and Schultz found in dopamine neurons in 1997. It is the foundation of everything this series has discussed.
Karl Friston thinks it might be wrong.
Friston, a neuroscientist at University College London, has developed a framework called active inference that proposes a fundamentally different account of what the brain is doing. In Friston’s framework, the brain does not maximize reward. It minimizes surprise. The brain maintains a generative model of the world: a set of predictions about what sensory input to expect. When the predictions match reality, nothing happens. When they do not, the brain has two options: update the model to better predict the world, or act on the world to make it match the predictions. Perception is the first option. Action is the second. Both serve the same objective: minimizing the gap between what the brain expects and what actually arrives.
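In its standard variational form (the textbook decomposition, with notation not used elsewhere in this series), the quantity being minimized is the free energy F, an upper bound on surprise about observations o under the brain’s beliefs q(s) about hidden states s:

```latex
F \;=\; \mathbb{E}_{q(s)}\!\left[\ln q(s) - \ln p(o, s)\right]
  \;=\; \underbrace{-\ln p(o)}_{\text{surprise}}
  \;+\; \underbrace{D_{\mathrm{KL}}\!\left[\,q(s)\,\|\,p(s \mid o)\,\right]}_{\text{divergence}}
  \;\ge\; -\ln p(o)
```

Perception reduces F by adjusting q, shrinking the divergence term; action reduces F by changing o itself, steering the organism toward observations its model already predicts. Both operations minimize the same quantity.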
In this framework, reward is not the driving force. It is a special case. Consider a simple example. An organism that expects to be fed at noon and is not fed experiences a prediction error. The resulting behavior, seeking food, can be described as reward-seeking. But in Friston’s account, the behavior is driven not by a desire for food but by a drive to resolve the prediction error: the expectation of food was violated, and the system acts to restore the prediction. The organism does not say “I want food.” The organism says, in effect, “my model predicted food and there is no food, and I must act until my sensory experience matches my model again.” The distinction sounds subtle. Its implications are not.
If Friston is right, then the entire RL framework, with its reward functions, value estimates, and policy optimization, is describing a downstream consequence of a more fundamental process. The brain is not a reward maximizer that happens to predict. It is a prediction machine that happens to seek reward when predictions about bodily states are violated. Hunger is not a reward signal. It is a prediction error: the body expects certain metabolic inputs, those inputs are absent, and the resulting discrepancy drives behavior until the prediction is satisfied. The reframing is total. Every behavior RL explains through reward, active inference explains through prediction error. The question is whether the reframing buys anything that RL cannot already provide.
Article 6 introduced Friston’s predictive coding framework as a theory of how the brain builds world models. Active inference extends this to action. The agent does not just predict the world. It acts to make the world conform to its predictions. A hungry organism does not just predict that food is absent. It acts to make food present, because the absence of expected food is a prediction error that the system is driven to resolve. Action and perception are unified under a single objective.
The debate between RL and active inference is live and unresolved. Some theorists argue that RL is a special case of active inference: the reward-maximizing behavior RL describes emerges naturally from free energy minimization, Friston’s mathematical formalization of the drive to reduce prediction error, under certain conditions. Others argue the two frameworks make genuinely different predictions about behavior, particularly in situations involving exploration, uncertainty, and novel environments. Active inference agents explore in order to reduce uncertainty about their model. RL agents explore in order to find reward. In many situations, these drives produce the same behavior. In some, they diverge. The divergent cases are where the debate will ultimately be settled.
For this series, the significance of active inference is not in the technical details. It is in the question it asks. The RL tradition, from Thorndike’s puzzle box through Bellman’s equation through TD learning through AlphaGo, has always assumed that the right question is: how does a system learn to maximize reward? Friston’s tradition asks: is maximizing reward even what the system is doing? If the answer turns out to be no, then the next generation of AI architectures may need a fundamentally different objective function. Not reward maximization, but surprise minimization. Not optimization of an external signal, but maintenance of an internal model.
This is the most radical open question in the series. The first three questions ask how to fix RL: how to make it transfer, adapt, and explore without external reward. The fourth asks whether RL’s foundational assumption is correct. It is the question where the spiral opens, rather than closes.
This series began with a boat on fire and ends with a question about whether reward is real.
The path between them has traced three neuroscience traditions, each one deeper than the last.
Wolfram Schultz’s tradition, the subject of Article 2, answers the question: how does the brain learn from reward? The answer, temporal difference prediction error encoded in dopamine, gave RL biological validation and gave neuroscience a formal language for what dopamine does. This tradition confirms that the brain and RL algorithms are solving the same problem with the same mathematical structure.
Kenji Doya’s tradition, the subject of Article 7 and the Adaptation section above, answers a different question: how does the brain adjust learning itself? The answer, four neuromodulatory systems that tune reward sensitivity, time horizon, learning rate, and exploration in real time, reveals a level of sophistication that RL meta-learning has not yet matched. This tradition does not challenge RL’s framework. It deepens it. It shows that the brain does not just learn from reward. It learns how to learn from reward, and it adjusts that process continuously.
Karl Friston’s tradition, introduced in Article 6 and developed above, asks the deepest question: is reward the right framework at all? The answer, still contested, is that the brain may not maximize reward but minimize surprise, with reward-seeking emerging as a consequence of prediction error minimization rather than as a driving force. This tradition goes beneath RL’s foundations.
Three questions. Each goes deeper. How the brain learns from reward. How the brain adjusts learning. Whether the brain learns from reward at all.
The RL-neuroscience dialogue has been a spiral from the beginning. Thorndike observed that consequences shape behavior. Bellman formalized the mathematics of sequential decision-making. Sutton turned the mathematics into a learning algorithm. Schultz found the algorithm in the brain. Each revolution moved between engineering and biology, each one informing the next. The spiral has not stopped. It has accelerated. The world model research Article 6 described draws explicitly on neuroscience. The embodied RL Article 7 described maps onto Doya’s neural architecture. The open questions above are defined by the gaps between what silicon systems can do and what biological systems already do.
The next major insight in RL will likely come from biology. Not because the brain is a perfect system to imitate, it is not, but because it is the only system that has operated under the constraints RL now faces: limited experience, changing environments, ambiguous goals, physical embodiment, and the need to learn quickly from sparse evidence. Evolution has had hundreds of millions of years to work on these problems. The solutions it found are not blueprints. They are evidence about what works. And the track record of the spiral suggests that when RL researchers look carefully at what biology does differently, they find not just inspiration but structure: mathematical structures that transfer, computational principles that scale, and design patterns that solve exactly the problems the engineering is stuck on.
This is how the field has always made its biggest leaps. Sutton did not study the brain to write TD learning. Schultz did not study RL to record dopamine neurons. The convergence was accidental. But every major advance since, from experience replay to world models to the neuroscience of meta-learning, has come from someone on one side of the spiral recognizing a solution that already existed on the other side.
The spiral continues. We are still inside it.
This is the final article in The RL Spiral. The series will be expanded into a book, with additional character work, scene-setting, and technical depth. If you have followed the series this far, thank you. You are now exploring the deepest questions in reinforcement learning. The answers are still being written.


