Roads to a Universal World Model, Part 1: The Dreamer’s Road
The reinforcement learning path: learning by imagining
“The main idea of Dyna is the old, commonsense idea that planning is ‘trying things in your head.’” — Richard Sutton, SIGART Bulletin (1991)
Rich Sutton studied psychology at Stanford in the late 1970s. Not computer science. Psychology. He was fascinated by how animals learn, how a rat navigating a maze seems to build a map in its head, how a child learns that reaching for a hot stove produces pain without needing to touch it every time. When he arrived at the University of Massachusetts Amherst to work with Andrew Barto, the pair set out to formalize something that most AI researchers considered a dead end: learning from rewards rather than from labeled examples. Trial and error. The most unfashionable idea in the field.
“When we started, it was extremely unfashionable to do what we were doing,” Barto told Axios. “It had been dismissed, actually, by many people.”
They persisted. Over the next decade, they built the mathematical foundations of what would become reinforcement learning: an agent takes actions in an environment, observes the consequences, receives a reward signal, and learns to act better. The framework was clean. The problem was efficiency. A reinforcement learning agent had to try everything in the real world, or in a real simulation, one painful step at a time. Every mistake cost time. Every exploration required interaction. Learning was slow because experience was expensive.
Sutton, the psychologist turned computer scientist, kept thinking about what brains do differently. A rat does not need to physically walk every corridor of a maze to figure out the shortest path. After enough exploration, it seems to simulate the maze internally, to plan by imagining trajectories it has never actually taken. The neuroscience was suggestive. The question was whether a machine could do the same thing.
In 1991, Sutton published his answer. He called it Dyna.
Trying Things in Your Head
The idea behind Dyna was disarmingly simple. An agent interacting with the world does two things at once. First, it learns from real experience, the way any reinforcement learning agent does: take an action, observe what happens, update your policy. Second, it learns a model of the environment: given state S and action A, what state S’ comes next, and what reward do you get? Once the agent has even a rough model, it can use that model to generate imaginary experiences and learn from those too.
The real world gives you one experience per time step. The model gives you as many as you want.
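The mechanics fit in a few lines. The sketch below is a minimal tabular Dyna-Q, not Sutton's original code; the five-state `corridor` environment is invented here for illustration. Each real step is followed by ten imagined ones, replayed from the learned model:

```python
import random

def dyna_q(env_step, start, actions, episodes=50, planning_steps=10,
           alpha=0.1, gamma=0.95, eps=0.1):
    """Tabular Dyna-Q: learn from each real step, then from imagined ones."""
    Q, model = {}, {}                       # model maps (s, a) -> (reward, s')
    q = lambda s, a: Q.get((s, a), 0.0)

    def update(s, a, r, s2):                # one Q-learning backup
        best = 0.0 if s2 is None else max(q(s2, b) for b in actions)
        Q[(s, a)] = q(s, a) + alpha * (r + gamma * best - q(s, a))

    for _ in range(episodes):
        s = start
        while s is not None:                # None marks the terminal state
            # epsilon-greedy action choice, with random tie-breaking
            a = (random.choice(actions) if random.random() < eps
                 else max(actions, key=lambda b: (q(s, b), random.random())))
            r, s2 = env_step(s, a)          # one step of real experience
            update(s, a, r, s2)
            model[(s, a)] = (r, s2)         # record what the world did
            for _ in range(planning_steps): # "trying things in your head"
                ps, pa = random.choice(list(model))
                update(ps, pa, *model[(ps, pa)])
            s = s2
    return Q

# A hypothetical five-state corridor: reach state 4 on the right for reward 1.
def corridor(s, a):
    s2 = s + 1 if a == "right" else max(s - 1, 0)
    return (1.0, None) if s2 == 4 else (0.0, s2)

random.seed(0)
Q = dyna_q(corridor, start=0, actions=["left", "right"])
```

The planning loop is where the speedup lives: every real transition gets squeezed for ten extra updates, which is where the order-of-magnitude gains in Sutton's grid worlds came from.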
Sutton cited Kenneth Craik’s 1943 insight directly in his paper. The idea of planning as “trying things in your head, using an internal model of the world” was the exact mechanistic hypothesis Craik had proposed nearly half a century earlier. Dyna was an attempt to implement it.
In Sutton’s experiments, the difference was stark. A standard reinforcement learning agent, dropped into a simple grid world, needed thousands of interactions to find a good path to the goal. A Dyna agent, using even a crude model to generate imaginary experiences alongside the real ones, converged an order of magnitude faster. The agent was, in effect, dreaming its way to competence.
But Dyna worked in grid worlds. Small, discrete, fully observable grid worlds where the model could be stored as a lookup table: in state 47 taking action “north” leads to state 48 with certainty. That is not really learning dynamics. It is recording them. The real world is none of those things. It is continuous, high-dimensional, partially observable, and governed by dynamics too complex to store in any table. Extending Dyna’s insight to anything resembling reality would require a model that could generalize: predict what happens in states it has never seen, from observations it has never encountered. That is a fundamentally harder problem, and it raised a question no one could yet answer: how good does such a model need to be? Does it need to predict the future perfectly, or just well enough to support good decisions?
That question would take three decades to resolve.
The Compounding Error Problem
The core promise of model-based reinforcement learning is seductive: learn a model, then plan inside it. Use the model to simulate trajectories, evaluate strategies, and train policies without ever touching the real environment. In theory, this is a massive advantage. In practice, for nearly two decades after Dyna, it was a trap.
The problem is compounding errors. A learned model is never perfect. It might predict the next state with 95% accuracy, which sounds impressive until you try to roll it forward twenty steps. Each step’s small errors feed into the next step’s inputs. By step five, the simulated trajectory has drifted noticeably from reality. By step twenty, the agent is planning inside a hallucination.
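A back-of-the-envelope model makes the arithmetic concrete. Treating each imagined step as an independent success or failure is a simplification (real model errors are continuous and correlated), but it captures the trend:

```python
# If each imagined step stays faithful to reality with probability p,
# a k-step rollout stays faithful with probability roughly p ** k.
def rollout_fidelity(p, k):
    return p ** k

for k in (1, 5, 10, 20):
    print(k, round(rollout_fidelity(0.95, k), 2))
```

A model that is right 95% of the time per step keeps barely a third of its twenty-step rollouts on track. That is the trap in one line.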
This is not a minor technical difficulty. It is an architectural pathology. The better your model, the more you are tempted to plan further ahead. But the further you plan, the more errors compound, and the worse your plans become. Model-free methods, which learn directly from experience without building any explicit model, avoided this trap entirely. They were sample-inefficient but at least they were grounded in reality. By the early 2000s, model-free approaches dominated the field.
The dream of learning by imagining was not dead. But it was shelved, filed under “elegant in theory, disastrous in practice.” Reinforcement learning exploded through the 2010s. Deep Q-Networks mastered Atari games. Proximal Policy Optimization became the workhorse of practical RL. The dominant paradigm was model-free: learn to act directly from experience, without building any internal model of the world. Even AlphaGo, which used a model of the game’s rules for tree search, learned its value and policy networks through model-free methods. The lesson the field took away was that raw experience plus powerful function approximation could solve problems that had seemed impossible. Who needed to dream?
And none of these systems were sample-efficient. Training an Atari agent required millions of frames. Training AlphaGo required millions of self-play games. The cost of relying purely on real experience was an insatiable appetite for it. A human child learns from orders of magnitude less data. Something was missing.
The Car That Learned to Drive in a Dream
In March 2018, David Ha and Jürgen Schmidhuber posted a paper with a title that would have seemed audacious for any other pair of authors. They called it “World Models.”
Ha was a former Goldman Sachs trader turned AI researcher at Google Brain in Tokyo. Schmidhuber, based at the Swiss AI lab IDSIA, had spent decades arguing for the importance of learning world models and training agents inside them, publishing foundational work on this idea as early as 1990. The paper they wrote together was technically rigorous but also unusually playful. It came with an interactive website where visitors could watch the agent learn, see through its eyes, and explore its dreams.
The task was deceptively simple: drive a car around a randomly generated race track. The agent saw only raw pixels. It controlled steering, acceleration, and braking. The twist was the architecture. Instead of training a single network end-to-end, Ha and Schmidhuber decomposed the agent into three parts.
First, a vision module compressed each raw image frame into a compact latent representation, a vector of just 32 numbers. All the visual richness of the scene, the track curves, the grass, the car’s position, distilled into a thumbnail sketch. Second, a memory module learned to predict the next latent state given the current state and the chosen action. This was the world model: not a model of pixels, but a model of the compressed representation. Third, a tiny controller, just a linear layer with 867 parameters, learned to select actions based on the current latent state and the memory’s hidden state.
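The division of labor is easiest to see as shapes. The sketch below uses random projections as stand-ins for the paper's actual components (a variational autoencoder for vision, a mixture-density RNN for memory), with the memory's hidden size of 256 chosen to make the controller's parameter count come out to the paper's 867:

```python
import numpy as np

rng = np.random.default_rng(0)
Z, H, A = 32, 256, 3   # latent size, memory hidden size, action dimensions

# V: vision. The paper used a variational autoencoder; a random linear
# projection stands in here, mapping a flattened 64x64 RGB frame to 32 numbers.
W_v = rng.normal(0, 0.01, (Z, 64 * 64 * 3))
def encode(frame):
    return W_v @ frame

# M: memory. The paper used a mixture-density RNN over latents; here, one
# generic recurrent step over (hidden state, latent, action).
W_h = rng.normal(0, 0.01, (H, H + Z + A))
def memory_step(h, z, a):
    return np.tanh(W_h @ np.concatenate([h, z, a]))

# C: controller. A single linear layer, as in the paper.
W_c = rng.normal(0, 0.01, (A, Z + H))
b_c = np.zeros(A)
def act(z, h):
    return np.tanh(W_c @ np.concatenate([z, h]) + b_c)

print(W_c.size + b_c.size)  # 3 * (32 + 256) + 3 = 867, the paper's count
```

Everything upstream of the controller can be large; the policy itself stays tiny because the hard work, compression and prediction, has already been done.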
The revelation was the training procedure. After learning the vision module and the memory module from random exploration data, Ha and Schmidhuber trained the controller entirely inside the dream. The agent never drove on the real track during policy training. It drove inside the world model’s imagination, on hallucinated tracks generated by the memory module. Then they transferred the resulting policy back to the real environment.
It worked. The agent scored 906 out of 1000 on the CarRacing benchmark, surpassing all previous deep RL methods, which had scored in the 590-840 range. And the controller that produced those results had fewer than a thousand parameters. The intelligence was not in the policy. It was in the model.
But the paper’s most revealing moment was not the score. It was what happened when the dream went wrong. Ha and Schmidhuber noticed that agents trained entirely in the dream sometimes learned to exploit glitches in the world model. In one case, the agent discovered it could gain reward by driving off the track in a specific way that the model could not accurately simulate. The agent had found a bug in its own imagination and was gaming it.
This was the compounding error problem wearing new clothes. The model was good enough to train in, most of the time. But it was not good enough to trust completely. The dream, it turned out, needed to stay honest.
Dreaming in Abstract
Danijar Hafner had been circling the same question from a different angle. After studying at UCL’s Gatsby Computational Neuroscience Unit, he joined Google Brain and developed PlaNet in 2018: a model-based agent that learned to plan entirely in latent space, without ever decoding back to pixels. Where Ha and Schmidhuber’s world model still produced blurry video frames that humans could inspect, PlaNet operated in the dark. Its world model predicted future latent states, and the agent planned in that abstract space, never rendering what the predicted future would “look like.”
This was a conceptual shift. PlaNet treated the world model as an instrument, not a simulator. You did not need to reconstruct every pixel of the future to act wisely in it. You needed to predict the features that mattered for your decisions.
Hafner continued to push this idea through his PhD at the University of Toronto with Jimmy Ba, developing Dreamer in 2019. Where PlaNet used online planning, searching for good action sequences at every decision point, Dreamer learned a policy by backpropagating value gradients through imagined trajectories. The agent would start from a real state, roll the world model forward fifteen steps in latent space, estimate the value of the resulting trajectory, and update the policy to produce better trajectories. Imagination became differentiable. The agent did not just dream. It learned from the texture of the dream, from the gradient signal flowing backward through the imagined future.
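A one-dimensional toy, invented here for illustration, shows the shape of the loop: the world model is the known map s' = s + a, the reward is -s², the policy is a single gain k, and finite differences stand in for the actual backward pass. Roll the model forward fifteen steps, score the imagined trajectory, nudge the policy:

```python
# Toy version of learning inside the dream. Everything here is invented
# for illustration: dynamics s' = s + a, reward -s^2, policy a = -k * s.
def imagined_return(k, s0=1.0, horizon=15):
    s, total = s0, 0.0
    for _ in range(horizon):
        a = -k * s          # policy acts on the imagined state
        s = s + a           # world model predicts the next state
        total += -s * s     # reward collected inside the dream
    return total

k, lr, eps = 0.1, 0.01, 1e-4
for _ in range(200):
    # finite-difference gradient, standing in for backpropagation
    grad = (imagined_return(k + eps) - imagined_return(k - eps)) / (2 * eps)
    k += lr * grad          # nudge the policy so future dreams score better
print(round(k, 2))          # k approaches 1.0, which zeroes the state in one step
```

The real Dreamer does the same thing in a learned latent space with a neural policy, and gets the gradient analytically rather than numerically, but the loop is this loop.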
DreamerV2, in 2020, achieved human-level performance on the Atari benchmark using discrete latent representations. The world model encoded each image as 32 categorical variables with 32 classes each, producing a sparse binary vector of length 1024 with just 32 active bits. This compact encoding was enough to capture the dynamics of dozens of different Atari games without any game-specific tuning.
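The encoding itself is simple to picture. This sketch shows only the representation's shape; DreamerV2 samples each categorical and passes gradients through with a straight-through estimator, where this toy takes the argmax:

```python
import numpy as np

# DreamerV2-style latent: 32 categorical variables with 32 classes each.
# One class per variable, one-hot encoded, yields a length-1024 binary
# vector with exactly 32 active bits.
def encode_latent(logits):                  # logits: shape (32, 32)
    classes = logits.argmax(axis=-1)        # winning class per variable
    one_hot = np.zeros_like(logits)
    one_hot[np.arange(32), classes] = 1.0
    return one_hot.reshape(-1)              # flatten: 32 * 32 = 1024 bits

rng = np.random.default_rng(0)
z = encode_latent(rng.normal(size=(32, 32)))
print(z.shape, int(z.sum()))  # (1024,) 32
```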
Then came DreamerV3.
One Algorithm, 150 Tasks
In January 2023, Hafner and his collaborators published a paper with a claim that would have been hard to believe five years earlier. DreamerV3, they reported, solved over 150 diverse tasks using a single set of hyperparameters. No tuning. No domain-specific adjustments. The same algorithm, with the same configuration, learned continuous robot control, discrete Atari games, procedurally generated navigation challenges, and 3D exploration tasks.
The headline result was Minecraft. Specifically, the long-standing challenge of mining diamonds from scratch, starting from nothing in a randomly generated world, without human demonstrations or curricula. Previous attempts had either relied on 70,000 hours of human gameplay videos for pre-training or required hand-designed reward shaping to guide the agent through the twelve intermediate steps between starting empty-handed and holding a diamond. DreamerV3 did it from scratch. Sparse rewards only: plus one for each milestone reached. No human data. No curriculum.
It took about nine days of gameplay before the agent collected its first diamond. That is a long time. But it solved the task from raw pixels and sparse rewards in an open, procedurally generated 3D world, something no prior algorithm had accomplished without human scaffolding. The agent had learned to chop trees, craft tools, mine stone, build a furnace, smelt iron, and dig deep underground, all through imagined experience.
The architectural philosophy behind DreamerV3 was a set of design choices aimed at one goal: robustness across domains without tuning. The key innovations were not any single breakthrough but a collection of stabilization techniques. A symmetrical logarithmic transform squashed extreme values in observations and rewards, preventing any single domain’s scale from destabilizing training. Percentile-based return normalization let the same algorithm handle environments where rewards ranged from 0.01 to 10,000. Discrete latent representations prevented posterior collapse. Each technique was individually modest. Together, they produced an algorithm that simply worked, out of the box, across a wider range of problems than any previous method.
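The symlog transform, at least, can be stated exactly: symlog(x) = sign(x) * ln(|x| + 1), with symexp(x) = sign(x) * (e^|x| - 1) as its inverse. A minimal version:

```python
import math

# DreamerV3's symlog transform: identity-like near zero, logarithmic for
# large magnitudes, and symmetric in sign.
def symlog(x):
    return math.copysign(math.log(abs(x) + 1.0), x)

def symexp(x):
    """Inverse of symlog: sign(x) * (exp(|x|) - 1)."""
    return math.copysign(math.exp(abs(x)) - 1.0, x)

# Round-trips across six orders of magnitude of reward scale.
for v in (-10_000.0, -1.0, 0.0, 0.01, 10_000.0):
    assert abs(symexp(symlog(v)) - v) <= 1e-9 * max(1.0, abs(v))

print(round(symlog(10_000.0), 2))  # 9.21: four orders of magnitude, single digits
```

Rewards of 0.01 and 10,000 land within about nine units of each other after the transform, which is what lets a single set of hyperparameters survive both scales.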
The DreamerV3 paper was published in Nature in 2025, an unusual venue for a reinforcement learning result and a signal of how seriously the broader scientific community was taking world models.
But DreamerV3 still required online interaction. Every task meant playing the actual game, collecting real experience, and building the model from that experience. For robotics, where real-world interaction is slow, expensive, and sometimes dangerous, this was a serious constraint. Hafner’s next step addressed it directly.
Learning to Dream from Watching
In September 2025, Hafner published Dreamer 4. The architecture was entirely new: a transformer-based world model trained with a “shortcut forcing” objective borrowed from the diffusion model literature, replacing the recurrent state-space model that had powered every previous Dreamer. But the real shift was not architectural. It was philosophical.
Dreamer 4 learned its world model primarily from unlabeled video. It absorbed large quantities of recorded Minecraft gameplay, with no action labels, no reward signals, just raw footage of the world behaving. A small amount of action-labeled data taught the model how actions relate to outcomes. The vast majority of its knowledge came from watching.
This changed what the Dreamer could do. DreamerV3 had collected diamonds in Minecraft through online play, interacting with the game for nine days. Dreamer 4 collected diamonds from a fixed offline dataset, without touching the environment at all. The agent practiced entirely inside its own imagination, and the imagination was built from video it had watched, not experience it had lived. It outperformed OpenAI’s VPT offline agent while using a hundred times less data.
The world model also ran in real time. A human could interact with Dreamer 4’s simulation at 21 frames per second on a single GPU, steering through a Minecraft world that existed only in the model’s predictions. The dreams, for the first time in the Dreamer lineage, were fast enough and accurate enough to be interactive.
What the Dreamer Taught Us
The Dreamer lineage, from Sutton’s grid worlds to Hafner’s offline Minecraft agent, spans thirty-four years. The core idea never changed: learn a model, dream inside it, get better faster. What changed was our understanding of what a world model needs to be.
The first lesson is about fidelity. Early model-based RL assumed that a good model meant an accurate model, one that could predict the next state as precisely as possible. The compounding error problem seemed to confirm this: imperfect models produce useless plans. But the Dreamer line of work revealed a subtler truth. The model does not need to reproduce every pixel of the future. It needs to capture the structure that matters for the task. DreamerV3’s dreams look nothing like the environments they represent. They are abstract, compressed, operating in a latent space that no human could interpret visually. Yet they contain enough structure for the agent to learn effective behavior.
This is the central insight of the Dreamer’s Road: structural fidelity, not perceptual fidelity. A world model is good not when it produces convincing images, but when it supports good decisions.
The second lesson is about robustness. For twenty years, model-based RL was bottlenecked by the gap between simple environments where learned models worked and complex environments where they did not. The gap looked theoretical. It was not. DreamerV3 closed it through engineering: normalizing, balancing, and stabilizing the training process so that the same architecture could absorb very different kinds of experience. No single technique was the breakthrough. The combination was.
The third lesson is about where the knowledge comes from. DreamerV3 learned everything from scratch, building a new world model for every task through online interaction. Dreamer 4 began to loosen this constraint: it learned a world model mostly from watching, absorbing physics and game mechanics from unlabeled video, then trained its policy entirely offline. This is a step toward the kind of general knowledge absorption that biological learners take for granted. But only a step. Dreamer 4’s knowledge is still specific to Minecraft. It does not transfer what it learned about block physics to robot manipulation. A child’s world model works differently. It is not rebuilt for every new toy. It builds on everything the child already knows about gravity, surfaces, friction, and the behavior of objects.
A universal world model, the destination this series is tracing, would need to go further: learn general physical knowledge from diverse experience and carry it forward into every new task, without starting over.
The Road Ahead
Rich Sutton received the 2024 Turing Award, alongside his mentor Andrew Barto, for developing the conceptual and algorithmic foundations of reinforcement learning. In accepting the award, Sutton returned to the themes that had animated his career since his psychology days at Stanford: prediction, planning, internal models. The ideas Craik proposed in 1943, that Sutton formalized in Dyna in 1991, that Ha and Schmidhuber demonstrated in 2018, that Hafner scaled across domains in 2023 and to offline learning in 2025, had become central to the field’s ambitions.
Shortly after publishing Dreamer 4, Hafner left Google DeepMind. “It really feels like a chapter coming to an end,” he wrote, looking back on nearly a decade that had taken the Dreamer from a simple continuous control agent to a system that could master Minecraft by watching videos. The Dreamer series established something the field will not unlearn: you can build competent agents by teaching them to imagine.
The Dreamer’s Road proved that world models work. It also revealed their limits. Even Dreamer 4, for all its advances, builds a world model specific to one domain. It absorbs Minecraft physics from Minecraft video. For a world model that understands physics in general, not just the physics of one game or one robot, you might need to start from a different place entirely.
You might need to start from the equations themselves.
Next: Part 2, “The Physicist’s Road,” traces the simulation path to world models, from flight simulators to NVIDIA’s Omniverse, and asks: what happens when you try to build the world from first principles?