The World Model Research Landscape
Who is building the machine that understands physics, and why they can’t agree on how
In September 2025, Danijar Hafner published a paper showing that an AI agent could mine diamonds in Minecraft without ever playing the game. It learned entirely from watching videos, then practiced inside its own imagination. Three months later, Yann LeCun left Meta to launch a startup that just closed a $1.03 billion seed round at a $3.5 billion valuation, built on the thesis that Hafner’s entire approach is wrong. In January 2026, NVIDIA announced that its Cosmos world foundation models had been downloaded over two million times by robotics developers who disagreed with both of them.
Three groups. Three irreconcilable ideas about how machines should learn physics. Each backed by enough capital and talent to reshape the field if they turn out to be right. And they are not the only ones.
Five Roads, One Destination
Most coverage of world models sorts the field in one of two ways. Sort by company, and you get a leaderboard. Sort by date, and you get a timeline. Both are useful. Neither explains why Hafner and LeCun, working toward the same goal, would look at each other’s work and see a fundamental error.
The sort that explains this is by research tradition: the inherited set of assumptions, architectures, and data sources that determines what a team considers a world model, what counts as evidence that it works, and what “understanding physics” means for a machine. These are not stylistic preferences. They are different theories of understanding.
When you organize the landscape by research tradition rather than by company or chronology, five distinct roads emerge: the Dreamer’s Road builds world models through imagination and reinforcement learning. The Physicist’s Road encodes known physics into simulations. The Cinematographer’s Road extracts physical intuition from watching video. The Robot’s Road learns from real physical experience. The Architect’s Road predicts in abstract representation space, discarding pixels entirely.
The framework is not an arbitrary grouping. It is the organizing principle that makes three otherwise invisible patterns legible. First, the traditions with the most training data have never controlled a physical robot, while the traditions closest to working robots have the least data. Scale and groundedness are currently inversely correlated. Second, the word “world model” means something different in each tradition, and the definitions are not converging. Third, techniques are flowing between traditions in selective, asymmetric ways: video pretraining is entering robot learning, but robot data is not entering video generation. The roads are merging. They are not merging into one road.
One structural note: Google DeepMind appears in multiple rows of the table below. This is not an error. It is the only organization with serious efforts across all five traditions, a diversified portfolio that hedges against any single road failing. Where DeepMind contributes to a tradition, it is noted under that tradition rather than treated as a separate entry.
The Map
* NVIDIA’s stated figure for Cosmos training data. + Physical Intelligence’s stated figure for pi-zero training corpus. ++ Meta’s stated figure for V-JEPA 2 training data.
The most striking column is the last one. Every tradition’s key limitation is something another tradition has solved. None has solved the whole problem.
The Dreamer’s Road: Learning by Imagining
The oldest intellectual lineage on this map traces back to Rich Sutton’s Dyna architecture in 1991: an agent that learns a model of its environment and practices inside that model. For two decades, the approach remained elegant in theory and fragile in practice. Learned models accumulated errors with each prediction step, and plans built on those predictions collapsed.
Danijar Hafner’s Dreamer series broke through by accepting a counterintuitive trade: the model does not need to predict what the world looks like in full detail. It needs to predict, given a specific action, what happens next. This makes it an action-conditioned simulator: feed it a state and a proposed action, and it rolls forward to the next state. The model learns what matters through reward. If predicting some aspect of the environment helps the agent score higher, that aspect gets learned. If not, it gets discarded. Understanding, on this road, emerges from trying to act.

DreamerV3, published in Nature in 2025, became the first single algorithm to master over 150 diverse tasks with no task-specific tuning, including collecting diamonds in Minecraft. Then Dreamer 4, published in September 2025, changed the rules entirely. It met the same diamond challenge purely from offline video, without ever interacting with the game environment. Previous Dreamers learned by doing. Dreamer 4 learns by watching, then practices in its own imagination.
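The shape of that loop can be made concrete in a few lines. This is a deliberately tiny sketch, not the actual Dreamer implementation: the latent state, dynamics, and reward function below are invented stand-ins. What it illustrates is the interface, a latent state rolled forward by a proposed action, with candidate plans scored entirely in imagination.

```python
# A minimal sketch of the Dreamer-style loop: an action-conditioned
# world model rolls a plan forward in latent space, and plans are
# scored by imagined reward. All names and dynamics are hypothetical
# stand-ins, not the actual Dreamer architecture.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class LatentState:
    """Compact state the model maintains instead of raw pixels."""
    features: List[float]


def toy_dynamics(state: LatentState, action: float) -> LatentState:
    """Stand-in for the learned transition: next latent from (state, action)."""
    return LatentState([f * 0.9 + action for f in state.features])


def toy_reward(state: LatentState) -> float:
    """Stand-in for the learned reward head: prefers features near 1.0."""
    return -sum(abs(f - 1.0) for f in state.features)


def imagine_return(start: LatentState,
                   actions: List[float],
                   dynamics: Callable[[LatentState, float], LatentState],
                   reward: Callable[[LatentState], float]) -> float:
    """Roll a candidate action sequence forward entirely in imagination."""
    state, total = start, 0.0
    for a in actions:
        state = dynamics(state, a)
        total += reward(state)
    return total


# Pick the better of two candidate plans without touching the real world.
start = LatentState([0.0, 0.0])
plans = {"stay": [0.0, 0.0, 0.0], "push": [0.5, 0.5, 0.5]}
best = max(plans, key=lambda name: imagine_return(start, plans[name],
                                                  toy_dynamics, toy_reward))
print(best)  # → push
```

The point of the sketch is that nothing in the loop touches the real environment until a plan has already been chosen, which is exactly what makes imagination-based training cheap and the model's accuracy load-bearing.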
The detail that makes this tangible: Dreamer 4’s world model is accurate enough that a human player can control an imagined robot arm in real time, picking up objects and flipping bowls in a simulation generated entirely by the learned model. It has learned enough about physics to serve as an interactive simulator, not just a planning tool.
The limitation is equally clear. Each Dreamer builds a world model from scratch for each domain. Minecraft physics do not transfer to robot locomotion. The knowledge is deep but narrow. For a world model that carries understanding across domains, this road needs something it does not yet have: a way to accumulate and reuse knowledge.
The Physicist’s Road: Building the World from Equations
If the Dreamer’s Road discovers physics through experience, the Physicist’s Road encodes it by design. Flight simulators have done this for decades. NVIDIA’s Omniverse extends the principle to arbitrary physical environments, and its Cosmos platform, launched at CES 2025 and updated through early 2026, represents the most ambitious bet on this road.
Cosmos is not a single model but a family. Cosmos Predict generates physics-based video of future states. Cosmos Transfer bridges simulation and reality. Cosmos Reason adds chain-of-thought planning for physical tasks. The newest addition, Cosmos Policy, announced in February 2026, fine-tunes the prediction model to directly output robot actions, achieving state-of-the-art performance on manipulation benchmarks according to NVIDIA. Adopters span humanoid robotics (1X, Agility, Figure AI, Skild AI) and autonomous vehicles (Uber, Waabi).
The telling detail from this road: NVIDIA’s Newton physics engine, co-developed with Google DeepMind and Disney Research and announced in September 2025, can simulate a character walking through snow or gravel, handling deformable objects that change shape under contact. This is the frontier of what simulation can encode: not rigid objects bouncing off each other, but the messy, yielding physics of the real world.
The strength of simulation is control: infinite data, perfect resets, no risk of breaking hardware. The weakness is the boundary of the known. You can simulate a factory floor you have already modeled. You cannot simulate a kitchen you have never seen. Every robot trained purely in simulation must eventually cross the sim-to-real gap, and the crossing is never free.
The Cinematographer’s Road: Physics from Pixels
The question this tradition asks is the most seductive in the field: if the internet contains billions of hours of video showing objects falling, water flowing, hands grasping, and cars turning, can a model trained on all that video learn how the physical world works?
When OpenAI’s Sora generated video with correct shadows, reflections, and physical interactions in early 2024, it looked like the answer might be yes. Google DeepMind’s Genie 3, released in August 2025, pushed further, generating navigable 3D environments at 24 frames per second from text prompts. Users could type a description and walk through a generated world that maintained visual consistency for minutes. World Labs, founded by Fei-Fei Li, shipped its commercial product Marble and in February 2026 raised $1 billion from investors including NVIDIA, AMD, and Autodesk.
But data volume does not resolve the core question: whether seeing is understanding. A video model can generate a plausible next frame. “Plausible” is not “correct.” The model optimizes for pixel-level prediction, which means it learns to produce images that look right. Whether it has learned the causal structure underneath, whether it knows why unsupported objects fall rather than that they fall in videos, remains genuinely unresolved. A 2025 benchmark paper found what the authors called “striking limitations” in vision-language models’ basic world-modeling abilities, including near-random accuracy on motion trajectory tasks.
More critically, these models cannot act. They predict what will happen next. They cannot predict what will happen next if I do something. This gap between observation and agency is what separates a video generator from a world model that can control a robot.
The Robot’s Road: Learning from the Real World
Every other tradition on this map learns about the physical world at one remove: through imagination, simulation, video, or abstract representation. The Robot’s Road insists that physical intelligence requires physical experience.
Physical Intelligence, founded in 2024 by Chelsea Finn, Sergey Levine, and Karol Hausman from Stanford and Google’s robotics teams, embodies this bet most directly. Their model, pi-zero, was trained on over 10,000 hours of robot data across seven robot configurations and 68 tasks. By November 2025, pi-zero 0.6 introduced reinforcement learning from experience using RECAP, a method that trains through demonstration, coaches through corrections, and improves from autonomous practice. The company raised $600 million, with CapitalG leading the round.
The vivid proof of concept: pi-zero 0.6 inserting a filter into an espresso machine, folding previously unseen laundry, assembling cardboard boxes. Throughput doubled on these tasks compared to earlier versions. Failure rates decreased over sustained operation. These are not spectacular demos. They are commercially relevant manipulation tasks performed reliably enough to measure in throughput, not just success rate.
Google DeepMind’s Gemini Robotics models followed a parallel path, and the Open X-Embodiment collaboration assembled data from 22 robot types across 21 institutions, creating the largest cross-robot dataset in the field.
This road has something no other tradition can claim: its models have controlled physical robots doing useful work. The bottleneck is not architectural. It is economic. Every hour of robot data requires a physical robot, a physical environment, a human operator, and the risk of hardware damage. Ten thousand hours is a massive investment that produces a tiny dataset compared to what video or simulation can generate.
The Architect’s Road: Predicting in Abstract Space
Yann LeCun has argued for years that every other tradition is making the same fundamental error. Generative models, whether they predict pixels, video frames, or physics states, try to reconstruct the future in detail. This includes models that predict in latent space, like Dreamer: they may discard pixel-level detail, but they still couple representation learning to action and reward from day one. The model learns to represent whatever helps the agent score higher. LeCun’s claim is that this coupling is backwards. Learn the representation first, without reference to any task or reward. Then plug in action and planning as separate modules on top.
His alternative is JEPA: Joint Embedding Predictive Architecture. LeCun’s full blueprint, laid out in his 2022 position paper, does include action conditioning: a world model module that takes a state representation and an action, and predicts the next state representation. In that sense, the complete architecture can answer “if I do X, what happens next,” just like Dreamer. The difference is not whether action enters the picture, but when. In Dreamer, representation and action-conditioning are learned jointly, shaped by reward from the first training step. In JEPA, the representation is learned first through self-supervised prediction on raw sensory data, with no actions and no rewards. Only after the representation is established does the architecture introduce action-conditioned planning as a separate layer.
The implementations built so far reflect only the first phase. V-JEPA and V-JEPA 2 are purely self-supervised: trained on video, they mask a region of space-time and predict the representation of the masked region from the visible context. No action input, no reward signal. V-JEPA 2, published in 2025, trained on over one million hours of internet video and demonstrated that this action-free representation could adapt to robot arm tasks with limited robot-specific data. The action-conditioned planning layer that LeCun’s blueprint calls for has not yet been built. This is both the Architect’s Road’s promise and its risk: the first phase looks viable, but the complete architecture remains theoretical.
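The masked-prediction objective at the heart of that first phase can be sketched compactly. The encoders below are toy functions, not the V-JEPA architecture, and the numbers are invented; the sketch only shows the defining property: the loss is measured between predicted and target representations, never between pixels.

```python
# A minimal sketch of the JEPA training signal: mask one patch, encode
# the visible context, and predict the *representation* of the masked
# patch. Toy encoders throughout -- in V-JEPA the target encoder is an
# EMA copy of the context encoder, not an identity map.

import numpy as np


def context_encoder(patches: np.ndarray) -> np.ndarray:
    """Stand-in encoder: one vector summarizing the visible patches."""
    return patches.mean(axis=0)


def target_encoder(patch: np.ndarray) -> np.ndarray:
    """Stand-in target encoder for the masked patch."""
    return patch  # identity keeps the toy example transparent


def predictor(context_vec: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Learned map from context representation to predicted target."""
    return weights @ context_vec


def jepa_loss(patches: np.ndarray, mask_idx: int, weights: np.ndarray) -> float:
    """Error in representation space, not pixel space."""
    visible = np.delete(patches, mask_idx, axis=0)
    pred = predictor(context_encoder(visible), weights)
    target = target_encoder(patches[mask_idx])
    return float(np.sum((pred - target) ** 2))


# Four toy "space-time patches"; mask the last one and predict its representation.
patches = np.array([[1., 0., 0.],
                    [0., 1., 0.],
                    [0., 0., 1.],
                    [1., 1., 1.]])
good = 3 * np.eye(3)      # happens to invert the toy context encoding exactly
bad = np.zeros((3, 3))    # a predictor that ignores the context entirely
print(jepa_loss(patches, 3, good), jepa_loss(patches, 3, bad))  # → 0.0 3.0
```

Because the target lives in representation space, the model is never penalized for failing to reconstruct unpredictable pixel detail, which is precisely LeCun's argument against generative objectives.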
In December 2025, LeCun left Meta after twelve years to found AMI Labs in Paris. In March 2026, AMI closed a $1.03 billion seed round at a $3.5 billion pre-money valuation, nearly doubling the 500 million euros it initially sought. The round was co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions, with NVIDIA and Temasek among the strategic backers. That NVIDIA invested is worth noting: the company that built the Physicist’s Road is also placing a bet on the Architect’s. The team LeCun assembled reflects the ambition: Saining Xie, formerly of Google DeepMind, as chief science officer; Michael Rabbat, formerly of Meta FAIR, as VP of world models; Laurent Solly, Meta’s former VP for Europe, as COO.
LeCun’s critique of the field at Davos in January 2026 was pointed: companies building humanoid robots produce impressive demos, all precomputed, and none has any idea how to make those robots smart enough to be useful. AMI Labs CEO Alexandre LeBrun was equally direct about the timeline: the company starts with fundamental research, not products; it is not the typical applied AI startup; and it could take years for world models to go from theory to commercial applications. LeBrun’s other prediction carries a different kind of edge: “world models” will be the next buzzword, and within six months every company will claim the label to raise funding.
The gap between promise and proof is the widest on this road. No JEPA-based system has controlled a robot doing commercially useful work. AMI Labs has four hubs, over a billion dollars, and LeCun’s intellectual framework. What it does not yet have is a product. The theoretical elegance is real. The engineering validation is still ahead.
What the Map Reveals
Five traditions, organized by their core assumptions rather than by company or chronology, make patterns visible that no single-tradition analysis can show.
The scale-groundedness inversion. The Cinematographer’s Road has billions of hours of training data. The Robot’s Road has thousands. Yet the Robot’s Road controls actual robots, and the Cinematographer’s Road does not. This is not a temporary lag. It reflects a structural tension: the data that is cheapest to collect, internet video, is furthest from the task that matters most, physical manipulation. If scaling video data produced manipulation capability, the Cinematographer’s Road would have won already.
The definitional fracture. Each tradition defines “world model” differently. For the Dreamer’s Road, it means an action-conditioned simulator that predicts what happens when you act. For the Physicist’s Road, a physics engine. For the Cinematographer’s Road, a video prediction model. For the Robot’s Road, an action-conditioned policy trained on physical experience. For the Architect’s Road, a task-independent representation of how the world works, learned before any action or reward enters the picture. The Dreamer and the Architect both operate in abstract spaces, which makes them look similar from a distance. The difference is foundational: one builds a simulator driven by reward, the other builds a representation that precedes any task. These are not implementations of the same concept. They are different concepts sharing a name. This fracture means “progress in world models” refers to five things simultaneously, making it nearly impossible to assess the field’s trajectory without a framework like this one.
Selective, asymmetric convergence. Techniques are flowing between traditions, but the flow is directional. Video pretraining is entering robot learning: Dreamer 4 absorbs knowledge from unlabeled video, and Cosmos uses video generation to create robot training data. Physics priors are entering video models: Cosmos Reason adds physical reasoning to video prediction. But robot data is not flowing into video generation. JEPA’s abstract representations have not been integrated into any other tradition. The convergence is real but selective, producing two or three hybrid approaches rather than one unified method.
The transfer question decides the winner. Can a world model trained on internet video transfer useful knowledge to physical manipulation? If yes, the Cinematographer’s Road wins by data volume. Billions of hours of video trump thousands of hours of teleoperation. If no, if video understanding and manipulation understanding require fundamentally different representations, the Robot’s Road wins by relevance, and the Architect’s Road becomes the critical bridge. Dreamer 4’s ability to learn from offline video and then act is the strongest evidence yet that transfer is possible. But it transfers within a single domain (Minecraft), not across the video-to-robot boundary. That boundary remains uncrossed.
Institutional capital reveals uncertainty. Physical Intelligence raised $600 million for the Robot’s Road. LeCun raised $1.03 billion for the Architect’s Road. World Labs raised $1 billion for the Cinematographer’s Road. NVIDIA invested billions in its own Physicist’s and Cinematographer’s infrastructure, then also backed both AMI Labs and World Labs, hedging across traditions with its own capital. If any tradition had clearly won, capital would have concentrated into it. Instead it has spread across all five roads. The distribution of investment is itself a map of the field’s unresolved questions.
What the framework does not capture. This map organizes by research tradition, which illuminates intellectual lineage and core assumptions. It does not capture speed of iteration, quality of engineering teams, or the role of proprietary datasets that are invisible to outside observers. It also treats each tradition as relatively coherent, when in practice there are significant internal disagreements within each road. The Cinematographer’s Road alone spans pure video generation (Sora), interactive 3D worlds (Genie 3), and spatial intelligence (World Labs), approaches that share a data source but diverge on architecture and ambition. The map is a simplification that earns its keep by making the five-way definitional fracture visible. It is not the territory.
What This Means
Return to the three figures from the opening. Hafner built an agent that learns physics by watching and then imagining. LeCun left a twelve-year position to bet that Hafner’s entire paradigm, and everyone else’s, is solving the wrong problem. NVIDIA’s developers downloaded Cosmos two million times because they believe neither pure imagination nor pure abstraction will work without engineered physics as a foundation.
Each of them is solving a piece of a problem that no one has solved whole. The most likely future is not that one road wins and four lose. It is that the roads continue to merge selectively, producing hybrid systems that combine video pretraining with robot fine-tuning, simulation with learned dynamics, and possibly, if LeCun’s bet pays off, abstract representation bridging all of the above.
For anyone making decisions in this space, the framework matters more than the leaderboard. The question is not “who is ahead.” It is: which assumptions about how machines should understand the physical world will prove correct, and which systems will combine the right ones? That question has at least five credible answers. It does not yet have a consensus.
This landscape maps the five research traditions building world models. For a deeper look at the intellectual history, technical foundations, and open questions behind each tradition, see the six-part series Roads to a Universal World Model.



