The RL Spiral, Part 6: The World Inside
DQN gave RL eyes. AlphaGo gave it ambition. Neither gave it imagination. The most important research direction in RL today is fixing that.
This is the sixth article in The RL Spiral, an eight-part series on reinforcement learning. The previous article, The Self-Play Paradox, showed why self-play produces superhuman game players but cannot align AI to human values. This one is about what happens when RL agents stop reacting and start imagining.
In 2023, a reinforcement learning agent collected diamonds in Minecraft.
This does not sound like a milestone. But consider what collecting a diamond in Minecraft requires. You start in a randomly generated world with nothing. To reach diamond ore, you must first punch a tree to get wood. Craft the wood into planks. Craft the planks into a crafting table. Use the crafting table to make a wooden pickaxe. Mine stone with the wooden pickaxe. Use the stone to make a stone pickaxe. Mine iron with the stone pickaxe. Smelt the iron in a furnace, which itself must be crafted from stone. Make an iron pickaxe. Dig deep underground, navigating caves and avoiding lava. Find diamond ore. Mine it with the iron pickaxe. Each step depends on the last. Skip one, and the chain breaks. The entire sequence takes a human player roughly ten minutes. No reward arrives until the diamond is collected.
Previous RL systems had failed at this task. The problem was the one Article 4 identified in Montezuma’s Revenge: when reward is sparse and delayed, an agent that learns only by reacting to what happens next has no signal to follow. Random exploration in a space this large will never stumble into the right sequence. The agent needs to plan. And planning requires something DQN never had: a model of how the world works.
The system that solved it was called DreamerV3. Built by Danijar Hafner and colleagues at DeepMind, it learned not by reacting to the environment moment by moment but by building an internal model of the environment and then practicing inside that model. The agent would observe the world, update its internal model, and then imagine hundreds of possible futures before choosing an action. Most of its learning happened inside its own head. It rehearsed futures that never occurred, evaluated strategies it never executed, and arrived at decisions informed by experience it never physically had.
DreamerV3 collected the diamond from scratch. No human demonstrations. No hand-built curriculum. No game-specific engineering. The same algorithm, with the same settings, was then tested on over 150 different tasks spanning robot control, Atari games, and 3D navigation environments. It outperformed specialized methods designed for each domain. In April 2025, the results were published in Nature.
The idea behind DreamerV3, that an agent should learn a model of the world and use it to simulate possible futures, is not new. It is one of the oldest ideas in the field. Juergen Schmidhuber proposed it in 1990, and for three decades almost nobody could make it work at scale. Understanding why it took so long, and why it is now the shared frontier of robotics, language models, and game-playing AI, requires going back to a fork in the road that happened at the very beginning of reinforcement learning.
The Fork
Article 3 described how Richard Bellman formalized sequential decision-making in the 1950s, and how Norbert Wiener’s cybernetics offered a parallel approach built on feedback and control. That fork produced two traditions. Control theory assumed you had a model of the system. Reinforcement learning assumed you did not.
Inside RL itself, the fork reappeared as a methodological split between model-free and model-based approaches.
A model-free agent learns a direct mapping from situations to actions. It does not know why an action works. It knows only that it works, because past experience produced reward when that action was taken in that situation. DQN is model-free. It looks at the screen, estimates how good each possible action is, and picks the best one. It has no internal representation of how Breakout’s physics work, no concept of ball trajectory or brick structure. It has learned, through millions of frames of trial and error, which buttons to press when certain patterns appear on the screen. This is powerful. It is also, as Montezuma’s Revenge demonstrated, limited. Without a model of what happens next, the agent cannot plan.
A model-based agent learns how the world works, then uses that knowledge to plan. It asks: if I take this action, what will happen? If I take that action instead, what will happen then? It simulates possible futures, evaluates them, and chooses the path that looks best. A chess player thinking five moves ahead is doing model-based reasoning. The player has an internal model of how pieces move, how captures work, and what the board will look like after each sequence. The thinking happens before the move, not after.
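The fork can be made concrete with a toy corridor environment. This is a minimal sketch: the environment, hyperparameters, and function names are all illustrative, not taken from any system discussed here. The model-free learner needs hundreds of episodes of trial and error to find the answer the model-based planner reads straight off its model.

```python
import random

# Toy corridor: states 0..4, reward only on reaching state 4.
# Actions: 0 = left, 1 = right. Everything here is illustrative.
GOAL = 4

def step(state, action):
    nxt = max(0, min(GOAL, state + (1 if action == 1 else -1)))
    return nxt, (1.0 if nxt == GOAL else 0.0)

def greedy(Q, s):
    best = max(Q[s])
    return random.choice([a for a in (0, 1) if Q[s][a] == best])

# Model-free: cache "how good was this action here" from raw experience.
# The agent never predicts what happens next.
def q_learning(episodes=500, alpha=0.5, gamma=0.9, eps=0.2):
    Q = [[0.0, 0.0] for _ in range(GOAL + 1)]
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            a = random.randrange(2) if random.random() < eps else greedy(Q, s)
            s2, r = step(s, a)
            Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])
            s = s2
            if s == GOAL:
                break
    return Q

# Model-based: given a model of the world (here, `step` itself),
# simulate both actions a few moves ahead before committing.
def plan(state, depth=4, gamma=0.9):
    if depth == 0 or state == GOAL:
        return 0.0, None
    best_val, best_act = float("-inf"), None
    for a in (0, 1):
        s2, r = step(state, a)
        future, _ = plan(s2, depth - 1, gamma)
        if r + gamma * future > best_val:
            best_val, best_act = r + gamma * future, a
    return best_val, best_act

random.seed(0)
Q = q_learning()
print(greedy(Q, 3))   # learned from 500 episodes of trial and error
print(plan(3)[1])     # derived from zero experience, just lookahead
```

Both agents choose "right" at state 3, but they get there differently: the Q-table had to visit the goal many times for the value to propagate backward, while the planner simply simulated the consequences.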
The model-free approach dominated RL for decades, not because researchers believed it was better in principle, but because building accurate models of complex environments turned out to be extraordinarily hard. In simple environments with a handful of states, model-based methods worked well. In environments with high-dimensional sensory input, continuous physics, and partial observability, the models were too inaccurate to be useful. A bad model is worse than no model. An agent that plans using a model full of errors will confidently execute plans that fail, because the simulated futures it evaluated bore no resemblance to what actually happens. The model-free approach avoided this problem by skipping the model entirely. Learn from what happens. Do not try to predict it.
The cost was sample efficiency. As Article 3 explained, a model-free agent must visit states to learn about them. A model-based agent can reason about states it has never visited, by simulating them in its internal model. The brain circumvents the curse of dimensionality partly through hierarchical compression, as Article 3 described, and partly through model-based reasoning: mentally simulating actions before performing them. A model-free agent has the first, if it uses deep learning for representation. It does not have the second. It has eyes but no imagination.
The question was whether the models could get good enough. For thirty years, the answer was mostly no.
Learning to Dream
Juergen Schmidhuber first proposed using a neural network as a learned world model for RL planning in 1990, in a technical report at the Technical University of Munich. The idea was explicit: train a recurrent neural network to predict what happens next, then use that network to simulate possible futures, then train the agent inside those simulations. The concept was sound. The neural networks of 1990 were too small and too unstable to learn accurate enough models of anything but the simplest environments. The idea waited.
In 2018, David Ha and Schmidhuber published “World Models,” a paper that demonstrated the concept at a scale the 1990 report could not. They trained an agent to play a simple car racing game by first learning a compressed visual representation of the environment, then learning a predictive model of how that representation changed over time, then training a small controller entirely inside the learned model. The agent could be trained “in its own dream,” as the authors put it: practicing in a simulated world that existed only in the neural network’s learned representations. The paper was influential, but the environments were still simple.
The Dreamer line of work, led by Hafner, turned this into a general method. The core innovation was a shift in what the model learns to predict. Previous model-based approaches tried to predict the next observation directly: given the current image and an action, produce the next image. This is hard. Images are high-dimensional, full of irrelevant detail, and small errors compound rapidly across multiple prediction steps. A model that slightly misplaces a shadow on step one will produce an increasingly distorted scene by step ten. Plans built on these distorted futures fail.
Dreamer avoids this by predicting in latent space. The agent first learns to compress raw observations into a compact, abstract representation, stripping away visual details that do not matter for the task. The world model then learns to predict how these compressed representations change in response to actions. The predictions stay in the compressed space. The agent never needs to reconstruct what the next frame will look like. It only needs to predict whether the next compressed state will be better or worse for earning reward.
This changes the economics of imagination. A single step of latent prediction is fast and cheap. The agent can imagine hundreds of future trajectories, each dozens of steps long, in the time it takes to collect one real experience. It evaluates each imagined trajectory against a learned value function, selects the action that leads to the best imagined outcomes, and only then acts in the real world. Most of the learning happens in imagination. Real experience is collected primarily to keep the world model accurate, not to train the policy directly.
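The decision step can be sketched in a few lines. This is not Dreamer's actual code: the encoder, dynamics model, reward head, critic, and actor below are hand-rolled stand-ins for what are, in the real system, learned neural networks. But the control flow (compress, imagine many latent trajectories, score them, then act) is the same shape.

```python
import random

# Stand-in components; in DreamerV3 each is a learned neural network.
# Here they are tiny functions so the control flow is visible.
def encode(observation):
    return observation % 10             # "encoder": compress to a latent state

def dynamics(latent, action):
    return (latent + action) % 10       # "world model": predict the NEXT latent

def reward_head(latent):
    return 1.0 if latent == 7 else 0.0  # "reward model"

def value_head(latent):
    return -abs(latent - 7) * 0.1       # "critic": long-run value estimate

def imagine_return(latent, first_action, horizon=5, gamma=0.95):
    """Roll one imagined trajectory forward, entirely in latent space."""
    total, discount, action = 0.0, 1.0, first_action
    for _ in range(horizon):
        latent = dynamics(latent, action)
        total += discount * reward_head(latent)
        discount *= gamma
        action = random.choice([0, 1, 2])   # stand-in for the learned actor
    return total + discount * value_head(latent)

def act(observation, n_rollouts=50):
    """Imagine many futures per candidate first action, then commit."""
    latent = encode(observation)
    scores = {a: sum(imagine_return(latent, a) for _ in range(n_rollouts)) / n_rollouts
              for a in (0, 1, 2)}
    return max(scores, key=scores.get)

random.seed(1)
print(act(15))   # latent state 5; action 2 steps straight toward the rewarding state 7
```

Note what never happens: no raw observation is ever predicted. The 150 imagined trajectories here cost a few hundred cheap latent-space function calls, while only one real action is taken.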
DreamerV1, published in 2020, demonstrated this for continuous control tasks. DreamerV2, in 2021, extended it to discrete environments and surpassed human-level performance on the Atari benchmark, the same benchmark DQN had opened in 2013. DreamerV3, in 2023, made the approach general: one algorithm, one set of settings, over 150 tasks. The Minecraft diamond was the headline result, but the generality was the real achievement. No other RL algorithm had crossed so many domains without task-specific tuning.
The approach has moved beyond games. In a project called DayDreamer, Hafner’s team demonstrated something the Dreamer line had been building toward: learning directly on a physical robot, in real time, with no simulation at all. A quadruped robot was placed on the ground with no prior knowledge of how its legs worked. It fell. It observed the consequences. It updated its world model. Then it imagined what would happen if it moved its legs differently, evaluated hundreds of these imagined futures, and tried the most promising one. It fell again, but differently. Within roughly an hour, by alternating between brief real-world attempts and extended sessions of imagined practice, the robot was walking. The world model did not need to be perfect. It needed to be good enough that the imagined futures pointed in approximately the right direction. Each real-world correction made the model slightly better. Each round of imagination made the policy slightly better. No simulation environment. No pre-training. One robot, one world model, learning from its own clumsiness.
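This alternating loop, act briefly in the world, update a model, practice extensively in imagination, is the classic Dyna pattern that Richard Sutton described in 1990. A tabular sketch shows its shape; this is Dyna-Q on a toy corridor, not DayDreamer's actual algorithm, which replaces every table below with a neural network.

```python
import random

# Toy corridor: states 0..7, reward only on reaching state 7.
N, GOAL = 8, 7

def real_step(s, a):                      # one step in the "physical" world
    s2 = max(0, min(GOAL, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0)

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N)]
model = {}                                # learned model: (s, a) -> (s', r)

def greedy(s):
    best = max(Q[s])
    return random.choice([a for a in (0, 1) if Q[s][a] == best])

def update(s, a, r, s2, alpha=0.5, gamma=0.95):
    Q[s][a] += alpha * (r + gamma * max(Q[s2]) - Q[s][a])

for _ in range(30):
    s = 0
    while s != GOAL:
        a = random.randrange(2) if random.random() < 0.1 else greedy(s)
        s2, r = real_step(s, a)           # brief real experience...
        update(s, a, r, s2)
        model[(s, a)] = (s2, r)           # ...keeps the model calibrated,
        for _ in range(20):               # ...then extended imagined practice
            ps, pa = random.choice(list(model))
            ps2, pr = model[(ps, pa)]
            update(ps, pa, pr, ps2)
        s = s2

print([greedy(s) for s in range(GOAL)])   # the learned policy: right everywhere
```

For every real step, the agent takes twenty imagined ones. Real experience exists mainly to keep `model` honest; the value propagation that makes the policy good happens almost entirely in replayed imagination.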
The Brain’s World Model
The brain runs both systems.
The model-free versus model-based fork that separates DQN from DreamerV3 maps directly onto a distinction neuroscientists have studied for decades: habitual versus goal-directed behavior. The dorsal striatum, a region within the basal ganglia, drives habitual responding. See a stimulus, produce a response. No deliberation, no simulation, no consideration of what might happen next. A driver on a familiar commute turns left at the intersection without thinking. The turn is triggered by the visual cue of the intersection, not by a mental simulation of the route. This is model-free control: fast, automatic, and blind to changes. If the usual road is closed, the habitual system will still try to turn left.
Goal-directed behavior runs on different hardware. The prefrontal cortex and the hippocampus work together to simulate possible outcomes before an action is taken. A driver encountering a road closure stops, considers alternatives, mentally traces each route, and chooses. This is model-based control: slower, effortful, and flexible. It requires an internal model of the environment, the map of streets the driver holds in memory, and the ability to simulate the consequences of actions within that model.
The brain does not choose one system. It runs both, and which one dominates depends on the situation. Novel tasks and high-stakes decisions engage the model-based system. Familiar, well-practiced behaviors default to the model-free system. Stress, fatigue, and time pressure push the brain toward model-free responding, even in situations where deliberation would produce better outcomes. This is why people under pressure fall back on habits, even bad ones. The model-based system is more accurate but more expensive. The model-free system is cheaper but inflexible. The brain manages the trade-off continuously, allocating cognitive resources where they are most needed.
The evolutionary history of this trade-off mirrors the history of RL itself. Model-free learning is ancient. The basic mechanism (do something, experience a consequence, adjust the behavior) is present in animals with the simplest nervous systems. Fruit flies learn to avoid odors associated with shock. Sea slugs learn to retract their gills from noxious touch. The dopamine system Article 2 described has chemical ancestors in invertebrates dating back over 500 million years. Stimulus-response learning is one of the oldest cognitive functions on Earth.
Model-based reasoning appeared much later. The capacity to simulate outcomes before acting depends on brain structures that emerged in vertebrates and became increasingly elaborate in mammals. The clearest evidence comes from a simple experiment. Train a rat to press a lever for food. Then, separately, make the food nauseating by pairing it with illness. The rat never experiences nausea while pressing the lever. The two events are learned in different contexts. Now put the rat back in front of the lever. A model-free system would press the lever, because the lever was associated with reward and nothing has happened at the lever to change that. A model-based system would stop, because it can combine two pieces of knowledge learned separately: the lever produces food, and the food is now bad, therefore the lever now leads to something bad. Rats stop pressing. Insects do not. The difference is an internal model that supports inference beyond direct experience.
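The logic of the devaluation experiment fits in a few lines. The names and values below are illustrative stand-ins, not data from any study; the point is which pieces of knowledge each system consults.

```python
# Two pieces of knowledge, learned in separate contexts:
cached_action_value = {"press_lever": 1.0}   # habit: the lever paid off in the past
outcome_of = {"press_lever": "food"}         # model: the lever produces food
current_value_of = {"food": -1.0}            # food was later paired with illness

def model_free_choice():
    # Reads only the cached value; never consults what the lever produces.
    return "press" if cached_action_value["press_lever"] > 0 else "refrain"

def model_based_choice():
    # Chains the two facts: lever -> food, food -> bad, so lever -> bad.
    return "press" if current_value_of[outcome_of["press_lever"]] > 0 else "refrain"

print(model_free_choice())    # the insect's answer
print(model_based_choice())   # the rat's answer
```

The model-based system never needed to experience the lever and the nausea together. The inference comes from composing two separately learned facts, which is exactly what a cached stimulus-response mapping cannot do.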
In primates, the model-based system expanded further. The prefrontal cortex, which evaluates simulated outcomes against current goals, grew dramatically over the last several million years. In humans, it occupies roughly a third of the cortical surface. But the capacity for future-oriented planning is not exclusive to mammals. Scrub jays, members of the corvid family, cache food in locations chosen based on what they expect to need tomorrow, not what they need now. They adjust their caching strategy based on whether other birds were watching, suggesting a model of other agents’ knowledge and intentions. Corvids achieve this with different brain architecture, through structures in the avian pallium that are anatomically distinct from the mammalian prefrontal cortex but functionally parallel. Model-based reasoning, it appears, has been independently invented more than once.
The parallel to RL’s history is direct. RL started model-free: TD learning, Q-learning, DQN. Model-based was the harder problem, proposed early by Schmidhuber in 1990 but impractical for decades. The field added model-based capability on top of model-free foundations, exactly as evolution did. DreamerV3 is, in a sense, the prefrontal cortex of RL: imagination layered on top of reactive competence, arriving late, expensive to run, but transformatively powerful once it works.
Article 4 described how the hippocampus replays past experiences during sleep, compressing recent memories and transferring them to long-term storage. But hippocampal replay does more than consolidate the past. It also simulates the future.
When a rat sits at a decision point in a maze, place cells in the hippocampus fire in sequences that correspond not to where the rat has been but to where the rat could go. The sequences run forward, down paths the rat has not yet taken, at the same compressed speed as backward replay: a journey of several seconds plays out in a tenth of a second. The hippocampus is generating candidate futures and evaluating them before the rat commits to a direction. Crucially, the sequences are not random. They are biased toward paths associated with reward. A rat that found food at the end of the left arm of a maze will replay the left path more frequently when sitting at the choice point. The forward replay is not just simulation. It is goal-directed simulation.
Researchers have tested this directly. When sharp-wave ripples, the neural events associated with replay, are disrupted during waking rest at choice points, rats make worse decisions on subsequent trials. The disruption does not impair the rat’s ability to run or to perceive the maze. It impairs the rat’s ability to use imagined futures to guide its choices. This is model-based planning, implemented in biological hardware, observed at the level of individual neurons.
In humans, the system is more elaborate. The hippocampus and the prefrontal cortex work together to construct mental simulations of possible futures. Brain imaging studies show that when people imagine future events, the same network of regions activates as when they remember past events. The brain uses the same machinery for memory and for imagination. This makes functional sense: both require assembling a coherent scene from stored components. Remembering your last birthday and imagining your next one draw on the same representational infrastructure. The difference is the direction: one reconstructs the past, the other constructs a possible future. Patients with hippocampal damage who cannot form new memories also struggle to imagine novel future scenarios. They can describe the concept of a future event but cannot generate the vivid, spatially coherent scene that healthy subjects produce effortlessly. The imagination is not a separate faculty. It is memory running forward.
The prefrontal cortex contributes the evaluative component. It assesses the imagined futures against current goals, weighs costs and benefits, and selects the plan that best serves the organism’s needs. Damage to the prefrontal cortex leaves the ability to imagine intact but impairs the ability to use imagination for planning. Patients can describe possible futures but struggle to choose between them or to organize a sequence of actions toward a goal.
This architecture, a generative model that produces candidate futures plus an evaluative system that selects among them, is structurally parallel to DreamerV3. The world model generates imagined trajectories. The critic evaluates them. The actor selects the best action. The parallel is not a metaphor. It is a shared computational structure, arrived at by evolution in one case and by engineering in the other, solving the same problem: how to act well in a world you have not yet fully experienced.
The neuroscience adds a dimension the engineering has not yet matched. The brain’s world model is not one thing. It is a collection of models at different levels of abstraction, updated at different timescales, serving different functions. Low-level motor models in the cerebellum predict what will happen in the next fraction of a second when a muscle contracts: fast, precise, and operating below conscious awareness. Mid-level spatial models in the hippocampus and parietal cortex predict what lies around the corner in a familiar environment: slower, more flexible, and accessible to conscious thought. High-level causal models in the prefrontal cortex predict what will happen if you change jobs or move to a new city: slow, deliberate, and drawing on abstract knowledge that may have nothing to do with spatial navigation. These models interact, inform each other, and can be deployed flexibly depending on the task at hand. When you catch a ball, the fast motor model handles the arm trajectory while the spatial model tracks the ball’s arc. When you plan a vacation, the causal model evaluates destinations while the spatial model simulates what it would feel like to walk through a foreign city. The brain selects and combines models on the fly, without any explicit instruction about which level of abstraction the current situation requires.
Karl Friston, a neuroscientist at University College London, has pushed this logic to its furthest point. His framework, active inference, built on the free energy principle and closely related to predictive coding, proposes that the entire brain is a generative model. All perception is prediction: the brain continuously generates expectations about incoming sensory data and updates only when the predictions fail. Action is not a separate process from perception. It is another way of reducing prediction error: instead of updating your model to match the world, you change the world to match your model. In Friston’s framework, the distinction between seeing, thinking, imagining, and acting dissolves into a single process of minimizing the gap between what the brain predicts and what actually happens.
This is the same computational structure as DreamerV3, taken seriously as a theory of everything the brain does. The full implications of Friston’s framework, and the radical challenge it poses to RL’s foundational assumptions, arrive in Article 8.
The Frontier
World models are no longer a research curiosity. They are the convergence point of multiple fields.
In robotics, world models address the sample efficiency crisis that Article 3 described. A physical robot cannot afford millions of trials. Every real-world attempt takes time, wears hardware, and risks damage. A world model changes the equation. The robot collects a modest amount of real experience, builds an internal model from that experience, and then practices extensively in imagination. The real-world trials serve primarily to calibrate the model. The learning happens inside. DayDreamer demonstrated this with walking: one hour of real experience, augmented by continuous imagined practice. The same logic extends to manipulation, where a robot arm learning to grasp diverse objects can simulate thousands of grasps in its world model for every one it attempts physically. The model will be imperfect. The objects in imagination will not behave exactly like real objects. But a slightly wrong imagined grasp still teaches more than no grasp at all, and each real correction sharpens the model for the next round of imagination.
In language models, the connection is less obvious but increasingly important. Large language models trained on text have, through their training, absorbed an implicit model of how the world works. They can predict that a glass pushed off a table will fall, not because they understand gravity but because their training data contains countless descriptions of objects falling. Whether this implicit world knowledge constitutes a “world model” in the RL sense, one that can be used for planning and reasoning, is an open and actively debated question. Yann LeCun argued in a 2022 position paper that the path to more capable AI requires building explicit world models that learn from observation and predict in abstract representation spaces, not in raw pixels or text. His proposed architecture, called JEPA, shares deep structural similarities with the Dreamer approach: learn compressed representations, predict at the level of those representations, plan in latent space. LeCun left Meta in late 2025 to found AMI Labs, focused on building exactly these systems. By March 2026, AMI Labs had raised over a billion dollars.
In game-playing AI, MuZero, published by DeepMind in 2019, demonstrated that an agent could master chess, Go, shogi, and Atari without being told the rules, by learning an internal model of the game dynamics and planning within that model. MuZero’s model is implicit: it does not predict observable states, only the quantities needed for planning. It never learns what a chessboard looks like. It learns what matters for winning. This is a different approach from Dreamer’s explicit generative model, but the functional role is the same. The agent imagines before it acts. AlphaGo, from Article 5, used self-play within a known game. MuZero uses self-play within a learned game. The rules themselves are part of the model.
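The contrast between Dreamer's explicit generative model and MuZero's implicit one can be reduced to two interfaces. The class and method names below are illustrative, not either system's actual API; what matters is which predictions each model is asked to make.

```python
class DreamerStyleModel:
    """Explicit generative model: it can reconstruct what the agent will
    see, so its predictions can be checked against real observations."""
    def next_latent(self, latent, action): ...
    def predict_reward(self, latent): ...
    def decode_observation(self, latent): ...   # can render an imagined frame

class MuZeroStyleModel:
    """Implicit model: it predicts only the quantities planning needs.
    There is no decoder; it never learns what a chessboard looks like."""
    def next_hidden(self, hidden, action): ...
    def predict_reward(self, hidden): ...
    def predict_value(self, hidden): ...
    def predict_policy(self, hidden): ...

print(hasattr(MuZeroStyleModel, "decode_observation"))
```

The missing decoder is the whole design choice: MuZero's hidden states are free to encode only what is useful for winning, at the cost of never being checkable against what the game actually looks like.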
The convergence is striking. Researchers in robotics, language models, and game AI, working on different problems with different methods, are all arriving at the same conclusion: agents that model the world outperform agents that merely react to it. The advantage grows with the complexity of the task and the sparsity of the reward. From every corner of the field, the movement is toward agents that build internal representations of how the world works and use those representations to act before the world forces them to react. The agent that imagines outperforms the agent that merely remembers.
The next question is what happens when these imagining agents meet the physical world. Not a simulated maze or a Minecraft block but a factory floor, a hospital room, a kitchen counter with real objects that break. That is the subject of Article 7.
Next in The RL Spiral: RL Meets the Physical World.