Roads to a Universal World Model, Part 6: The Convergence
Where the roads meet, and what we find there
"If the organism carries a 'small-scale model' of external reality and of its own possible actions within its head, it is able to try out various alternatives, conclude which is the best of them, react to future situations before they arise." — Kenneth Craik, The Nature of Explanation, 1943
In the final weeks of 2025, Yann LeCun left Meta after twelve years to launch a startup called Advanced Machine Intelligence Labs. The company had no product, no revenue, and no public demo. It was seeking a $3.5 billion valuation. Its mission: build world models using the JEPA architecture that LeCun had spent years developing inside Meta’s FAIR lab. The pitch was that large language models are structurally incapable of understanding the physical world, and that the next generation of AI would need to predict reality, not just text.
Around the same time, Fei-Fei Li’s World Labs shipped Marble, a system that generates explorable 3D worlds from text and images. NVIDIA announced that its Cosmos world foundation models had been downloaded over two million times. Google DeepMind released Genie 3, the first real-time interactive world model running at 24 frames per second. Physical Intelligence continued scaling its robotics foundation models. The term “world model” appeared in investor pitch decks with the frequency that “large language model” had two years earlier.
Something had converged. Five research traditions that had developed independently, each with different assumptions and different architectures, were suddenly landing on the same problem at the same time. But the convergence was not as clean as it appeared.
The Shared Ingredients
Trace the technical details of each road, and a pattern emerges. Despite starting from different places, all five traditions converged on the same small set of ingredients.
The first is the transformer. The Dreamer lineage moved from recurrent networks to transformer-based world models. Video generation models are built on diffusion transformers. Vision-language-action models are transformers from end to end. Even JEPA uses transformer encoders. The attention mechanism that revolutionized language processing turns out to be equally powerful for learning the structure of the physical world. This is not because transformers are uniquely suited to physics. It is because they are uniquely good at learning patterns in high-dimensional data, and physics, as captured by sensors, is high-dimensional data.
The second ingredient is internet video. Dreamer 4 learned to play Minecraft by watching YouTube videos it never acted in. V-JEPA 2 pretrained on over a million hours of internet video before seeing a single robot. Video generation models train on the same vast pools of online footage. NVIDIA’s Cosmos uses video alongside simulation data. The explosion of video on the internet created something no one planned: the largest corpus of implicit physics lessons ever assembled. Every clip of a ball bouncing, a door swinging, a liquid pouring contains information about how the world works. No one labeled it. No one annotated the forces involved. But the information was there, and by the mid-2020s, models were powerful enough to extract it.
The third ingredient is action conditioning. This is what separates a world model from a movie. The Dreamer’s Road understood this from the start: prediction conditioned on intervention is fundamentally different from passive prediction. The Robot’s Road arrived at the same conclusion from the opposite direction, discovering that vision-language models without action grounding could recognize a coffee cup but could not predict what would happen if you tipped it. V-JEPA 2 added action conditioning as a second training stage. Even video generation models, originally passive, are being retrofitted with action inputs to enable interactive simulation. The field has converged on a simple truth that Kenneth Craik articulated in 1943: a useful model of the world must let you ask “what happens if I do this?”
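The distinction between passive prediction and prediction conditioned on intervention can be made concrete with a toy sketch. Everything here is invented for illustration: the function names, the one-dimensional "cup near a table edge" dynamics, and the hand-coded rules stand in for what real world models learn from data.

```python
# Toy contrast between passive and action-conditioned prediction.
# All names and dynamics are hypothetical, invented for this sketch;
# no real world-model architecture is implied.

def passive_predict(history):
    """Extrapolate the next state from past observations alone."""
    if len(history) < 2:
        return history[-1]
    # Linear extrapolation: assume the last observed motion continues.
    return history[-1] + (history[-1] - history[-2])

def conditioned_predict(state, action):
    """Predict the next state given the state AND an intervention."""
    # Toy dynamics: a cup sits `state` cm from the table edge.
    # A push of `action` cm moves it toward the edge; past 0, it falls.
    pushed = state - action
    return "falls" if pushed <= 0 else pushed

# A cup at rest, 2 cm from the edge. Its history alone predicts that
# nothing changes; only a model that takes the push as an input can
# answer "what happens if I do this?"
history = [2.0, 2.0, 2.0]
print(passive_predict(history))        # 2.0 (the cup just sits there)
print(conditioned_predict(2.0, 3.0))   # "falls"
```

The point of the sketch is the function signatures, not the dynamics: a passive model maps history to future, while a world model in Craik's sense maps state *and* action to future.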
The fourth ingredient is multi-level prediction. Part 5 argued that the pixel-versus-representation debate would likely resolve as “both, at different levels.” The convergence supports this. The most capable systems emerging in 2025 and 2026 combine concrete, high-fidelity prediction at the sensory level with abstract, compressed prediction at higher levels. The brain does this. The question is whether current architectures can learn to do it without being explicitly designed that way.
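What "prediction at multiple levels" means can also be sketched in a few lines. The example below is hypothetical: it hand-codes an encoder and two predictors that a real system would learn, using a one-dimensional "frame" as a stand-in for sensory input.

```python
# Sketch of prediction at two levels of abstraction. All names and
# dynamics are invented for illustration; real systems learn both
# levels rather than hard-coding them.

def encode(frame):
    """Compress a 1D 'frame' to an abstract state: the object's position."""
    return frame.index(1)

def predict_low(frame, velocity=1):
    """Concrete prediction: the full next frame, value by value."""
    pos = encode(frame)
    nxt = [0] * len(frame)
    nxt[min(pos + velocity, len(frame) - 1)] = 1
    return nxt

def predict_high(abstract_state, velocity=1):
    """Abstract prediction: just the next position, no pixels."""
    return abstract_state + velocity

frame = [0, 0, 1, 0, 0]              # object at index 2
print(predict_low(frame))            # [0, 0, 0, 1, 0]
print(predict_high(encode(frame)))   # 3

# The levels are consistent: encoding the concrete prediction
# yields the abstract one.
assert encode(predict_low(frame)) == predict_high(encode(frame))
```

The open question the text raises is precisely whether this consistency between levels can be *learned* end to end, rather than enforced by construction as it is here.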
Four ingredients. Transformers, video data, action conditioning, multi-level prediction. Every road discovered them independently. The convergence is real.
But real convergence on ingredients is not the same as convergence on a solution.
The Definition Problem
In January 2026, if you searched for “world model” across the websites of the major AI labs, you would find at least five distinct meanings.
For NVIDIA, a world model is a physics-aware video generator that produces synthetic training data for robots and autonomous vehicles. For World Labs, it is a system that creates explorable 3D environments from sparse inputs. For Google DeepMind, it is an interactive simulator that generates novel game-like worlds in real time. For Physical Intelligence, it is a policy backbone that predicts the consequences of robotic actions. For AMI Labs, it is a latent prediction architecture that learns abstract physical relationships without generating pixels.
These are not minor terminological differences. They reflect fundamentally different bets about what matters most. A physics-aware video generator and a latent prediction architecture share a name, but they share little else. They are optimized for different objectives, evaluated on different metrics, and aimed at different applications.
The field has no shared benchmark. This is not an accident. It is a structural problem. For language models, the evaluation question was relatively straightforward: does the model produce correct, coherent, useful text? Benchmarks like MMLU, HumanEval, and eventually Chatbot Arena provided imperfect but widely accepted yardsticks. For world models, the evaluation question is not even agreed upon. Should a world model be evaluated on the visual quality of its predictions? On their physical plausibility? On whether a robot using the model can complete a task? On whether the model can generalize to environments it has never seen?
The research community is starting to grapple with this. At CVPR 2025, an entire workshop was dedicated to world model benchmarks: WorldModelBench, WorldSimBench, WorldScore, each proposing different evaluation frameworks. Meta released three new benchmarks alongside V-JEPA 2, including IntPhys 2, which measures whether models can distinguish physically plausible from implausible scenarios. Humans scored between 85 and 95 percent on Meta’s tests. The best models lagged far behind. Separately, PAI-Bench evaluated state-of-the-art video generation models on physical reasoning tasks grounded in real-world footage. Its finding was stark: even models like Veo 3, capable of producing visually stunning output, consistently failed to obey basic physical laws or model complex real-world dynamics.
This is the field’s pre-ImageNet moment. Before Fei-Fei Li and her team created ImageNet in 2009, computer vision had no shared evaluation standard. Researchers developed clever algorithms, published papers, and argued about whose approach was better, with no common ground for comparison. ImageNet did not just provide a benchmark. It revealed which ideas actually worked, which ones merely seemed to work, and which scaling laws governed progress. The world model field needs something equivalent. Until it has one, progress will be measured in demos, not in science.
The Bets
The corporate landscape of early 2026 can be read as a map of which roads each organization believes will arrive first.
LeCun’s AMI Labs is the purest bet on the Architecture Debate. Its founding thesis is that generative models are solving the wrong problem, that prediction must happen in representation space, and that the path to physical understanding runs through JEPA, not through pixels. The $3.5 billion valuation before a single product reflects investor confidence in LeCun’s vision, or perhaps in his name. The risk is equally clear: JEPA remains largely unproven at scale, and the commercial applications are years away.
NVIDIA’s Cosmos is a bet on the Physicist’s Road merging with the Cinematographer’s Road. Simulation provides structure, learned video models fill the gaps, and the combination produces synthetic training data at a scale no real-world data collection could match. The two million downloads suggest the approach has traction, particularly in autonomous driving and industrial robotics. NVIDIA’s advantage is infrastructure: it sells the GPUs that every other approach requires.
World Labs and Google DeepMind’s Genie 3 are bets on the Cinematographer’s Road reaching interactive simulation. If video generation models can learn enough physics to produce explorable, consistent 3D worlds, then the road from entertainment to robotics training becomes surprisingly short. The commercial tailwind from gaming, film, and virtual reality funds the research. The risk is that visual fidelity without physical accuracy produces worlds that look right but behave wrong.
Physical Intelligence represents the Robot’s Road, betting that world models must be built from action upward, not from observation downward. Its foundation models learn from physical interaction, and the company argues that no amount of video watching can substitute for the constraints that real physics imposes. The limitation is data: robot interaction data remains orders of magnitude scarcer than internet video.
Each bet has a logic. Each logic has a gap. And without shared benchmarks, there is no way to know which gap will close first.
The Toddler, Revisited
This series opened with a fourteen-month-old pushing a cup off a kitchen table. She was not surprised when it fell. She did not need to see the cup fall to know it would fall. Somewhere in her developing brain, a model of the world predicted the outcome before it happened.
Six parts later, we can say with more precision what makes her world model so remarkable, and why it remains so far ahead of anything we have built.
Her model does not choose between pixels and abstractions. It operates at every level simultaneously: the raw sensory flood of color and shape, the compressed spatial understanding of where the cup is relative to the edge, the abstract causal knowledge that unsupported objects fall. She moves between levels without effort, without even knowing she is doing it.
Her model was not trained on a billion hours of video. It was built from a few months of reaching, grasping, dropping, and watching, all grounded in action. She did not learn physics by observing it. She learned physics by intervening in it. Every time she pushed an object and watched what happened, she was running an experiment and updating her model with the result.
Her model does not hallucinate. When she predicts that the cup will fall, the prediction is constrained by a physical understanding that no amount of statistical pattern matching in pixel space has yet achieved. She does not push a full cup the same way she pushes an empty one. She does not predict that the cup will hover because she once saw a cup-shaped balloon float. Her model separates the general from the specific, the causal from the coincidental, the physical from the visual.
And her model is always being updated. Every new experience refines it. Every surprise, every failed prediction, every object that behaves differently than expected provides a training signal. She is not a static system running inference. She is a learning system, running in the world, with the world as her training set.
The five roads this series has traced each capture a piece of what she does effortlessly. The Dreamer’s Road captures imagination: predicting futures to plan actions. The Physicist’s Road captures the structure of known physics. The Cinematographer’s Road captures the richness of visual experience. The Robot’s Road captures the necessity of action. The Architecture Debate captures the question of the level at which understanding should be represented.
No single road has built what she has. The question is whether their convergence might.
What We Know, What We Don’t
The honest assessment is this.
We know that world models work. Dreamer agents learn to play games by imagining. Video models learn implicit physics from pixels. Robots plan actions using latent predictions. The concept that Craik proposed in 1943 and that seemed perpetually out of reach for decades is now producing real, if limited, results across every road.
We know the ingredients. Transformers, internet video, action conditioning, multi-level prediction. These are no longer speculative. They are the shared foundation of every serious effort.
We do not know how to evaluate progress. The absence of shared benchmarks means we cannot distinguish genuine advances from impressive demos. This is not a minor gap. It is the gap that determines whether the field develops as science or as marketing.
We do not know which architecture will win, or whether the question even makes sense. The pixel-versus-representation debate may resolve as a spectrum, not a binary. The toddler does not choose.
We do not know the timeline. LeCun says human-level AI built on world models will take years of fundamental research. The venture capital ecosystem has priced in something faster. One of them is wrong. History suggests it is the money.
And we do not know whether the convergence will produce a universal world model or whether “universal” is the wrong goal. Biology did not produce one general-purpose prediction system. It produced a hierarchy of specialized systems, from the spinal reflexes that operate in milliseconds to the prefrontal cortex that plans over months, all integrated into something that functions, from the outside, as a single coherent understanding of the world.
The roads are converging. The destination is becoming visible. But the last mile may be the longest, because it requires not just better models but a better understanding of what understanding itself means.
This concludes the “Roads to a Universal World Model” series. For earlier parts: Part 0, “The Map”; Part 1, “The Dreamer’s Road”; Part 2, “The Physicist’s Road”; Part 3, “The Cinematographer’s Road”; Part 4, “The Robot’s Road”; Part 5, “The Architecture Debate.”