Roads to a Universal World Model, Part 4: The Robot’s Road
The robotics-native path: learning through touch and action
“The real robot data is the human fuel. It’s worse than fossil fuel. You’re burning human fuel.” — Jim Fan, NVIDIA
In late 2022, Google's robotics team published RT-1, a model that learned to perform over 700 manipulation tasks from roughly 130,000 teleoperated demonstrations. The demonstrations had been collected using 13 robots over 17 months in an office kitchen environment. Seventeen months. Thirteen robots. One kitchen.
For context: GPT-3 was trained on roughly 300 billion tokens of text scraped from the internet. Sora was reportedly trained on enormous volumes of video. The entire written and filmed record of human civilization was available as training data. But robot data, the continuous joint-angle signals and force measurements that describe how a physical body interacts with the physical world, does not exist on the internet. Every trajectory must be collected by a human teleoperating a physical robot, one demonstration at a time.
This is the defining constraint of the Robot’s Road. The physicist can simulate worlds from equations. The cinematographer can learn from the planet’s video archives. The robot has almost nothing. And what little it has took months of painstaking labor to collect.
From Kitchen to Knowledge
The breakthrough idea was to borrow. Not robot data, which remained scarce, but knowledge.
In July 2023, Google DeepMind released RT-2, the first vision-language-action model. The recipe was simple in concept and radical in implication: take a vision-language model already trained on billions of images and text from the web, and fine-tune it on robot demonstration data. Represent robot actions as text tokens. Treat motor commands as just another language.
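The actions-as-tokens recipe can be sketched as a simple uniform discretization. This is a hedged illustration: the bin count of 256 matches the RT-2 paper, but the normalized action range and the function names here are assumptions, not Google's code.

```python
import numpy as np

# Sketch of RT-2-style action tokenization: each dimension of a continuous
# action vector is discretized into one of N_BINS uniform bins, and each bin
# index is emitted as a token from the model's vocabulary.
N_BINS = 256            # RT-2 discretizes each action dimension into 256 bins
LOW, HIGH = -1.0, 1.0   # assumed normalized action range

def action_to_tokens(action):
    """Map a continuous action vector to discrete token ids."""
    clipped = np.clip(action, LOW, HIGH)
    bins = np.round((clipped - LOW) / (HIGH - LOW) * (N_BINS - 1))
    return bins.astype(int).tolist()

def tokens_to_action(tokens):
    """Invert the discretization, up to quantization error."""
    bins = np.asarray(tokens, dtype=float)
    return LOW + bins / (N_BINS - 1) * (HIGH - LOW)

# Round-tripping a 7-dim action (e.g. 6-DoF pose delta plus gripper) loses
# at most one bin width per dimension.
action = np.array([0.12, -0.53, 0.99, 0.0, 0.3, -0.8, 1.0])
recovered = tokens_to_action(action_to_tokens(action))
```

Each bin index maps to a reserved word in the vocabulary, so motor commands can share the same autoregressive decoder as text. The quantization error is the price of treating motor commands as language.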
The results were striking. RT-2 could interpret commands it had never seen in robot training data. Asked to pick up an object that could serve as an improvised hammer, it chose a rock. Asked to identify which drink would help a tired person, it selected an energy drink. These were not behaviors learned from robot demonstrations. They were inherited from the vision-language model’s knowledge of the world: knowledge about what rocks are, what hammers do, what tiredness means. On novel, unseen tasks, RT-2’s success rate nearly doubled compared to its predecessor, from 32% to 62%.
The principle was powerful. The web contains vast knowledge about objects, their properties, their relationships, their uses. A model pre-trained on this knowledge and then fine-tuned on robot data could transfer that semantic understanding into physical action. The robot did not need to learn from scratch that a Coca-Cola can is graspable. It inherited that knowledge from the internet.
The Generalist Dream
If RT-2 proved the principle, Physical Intelligence’s π0 pushed it toward a vision.
In October 2024, a team led by Sergey Levine and Chelsea Finn, both veterans of the effort to make robots learn from data rather than from hand-coded rules, unveiled π0: a generalist robot policy trained on over 10,000 hours of demonstration data from seven different robot platforms and 68 distinct tasks. The model could fold laundry from a hamper, assemble cardboard boxes, clear a table, and stack eggs. It could do these things on robots it had never controlled during training, with objects it had never seen, in configurations that differed from its demonstrations.
This was not a laboratory curiosity. Physical Intelligence, founded in 2023 and valued at $2.4 billion by early 2025, was making an explicit bet: that generalist robot policies would do for physical manipulation what large language models had done for text. Just as GPT could write code, summarize documents, and answer questions because it had been trained on enough diverse text, π0 aimed to fold shirts, load dishwashers, and bag groceries because it had been trained on enough diverse robot experience.
The architecture reflected this ambition. π0 was built on a pre-trained vision-language model, inheriting its semantic knowledge. But instead of generating text tokens for actions, it used flow matching to produce smooth, continuous motor commands at 50 Hz. The distinction mattered. Robot actions are not discrete symbols like words. They are continuous trajectories through joint space, and representing them as language tokens was a lossy approximation. Flow matching treated actions as what they are: continuous signals that must be executed in real time.
By April 2025, the team had released π0.5, which could control a mobile manipulator to clean up kitchens and bedrooms in homes it had never seen. The tasks took 10 to 15 minutes. The robot made mistakes. It sometimes opened and closed the same drawer repeatedly, or failed to grasp an unfamiliar handle. But it was operating in genuinely novel environments, figuring out what needed to be done and doing it, without any environment-specific training. Researchers at the University of Pennsylvania who tested π0 independently reported that even at a 20 to 50 percent success rate on simple tasks out of the box, the result marked a major leap: for the first time, you could download someone else’s robot controller, load it onto your own hardware, and watch it do something useful.
The Misalignment
Then, in December 2025, NVIDIA’s Jim Fan published a critique that struck at the foundation of the entire VLA approach.
Fan’s argument was structural. VLA models graft an action module onto a pre-trained vision-language model. But vision-language models are optimized for benchmarks like visual question answering. This creates two problems. First, most of the parameters in these models serve language and knowledge, not physics. The billions of weights that allow a VLM to recognize a Coca-Cola can and describe its properties are not the same weights needed to predict what happens when you tip it. Second, the visual encoders in these models are actively trained to discard low-level details, because high-level understanding is all that question answering requires. But for a robot performing dexterous manipulation, low-level details are everything: the exact position of a fingertip, the angle of approach, the texture of a surface.
The implication was sharp. If the pre-training objective is misaligned with the needs of robotic control, then scaling up the VLM backbone will not proportionally improve the robot’s physical capabilities. More parameters will encode more knowledge about the world’s semantic structure. They will not encode more knowledge about its physical dynamics. Fan argued that video world models, which learn temporal dynamics and physical regularities from pixels, are a more natural pre-training objective for robots than language models trained to answer questions about images.
This critique illuminates a tension at the heart of the Robot’s Road. The VLA approach works because it transfers knowledge. RT-2 succeeded precisely because it knew what rocks and hammers and energy drinks were. π0 succeeded because it understood what “clean the kitchen” meant. But knowledge about what things are is different from knowledge about how things behave. Recognizing a glass and predicting what happens when you flick it are different computational problems, and a model optimized for the first may be poorly suited for the second.
The Hard Problem of Contact
The frontier where this distinction matters most is contact.
Every other road to world models can avoid or approximate contact dynamics. The physicist’s simulator can model contact, but only by writing increasingly complex equations that still fail to capture deformable materials, granular media, or the subtle interplay of friction and inertia that governs everyday manipulation. The cinematographer’s video model can generate convincing images of hands grasping cups, but the moment of contact, where force transfers, objects deform, and physics becomes three-dimensional rather than visual, is precisely where video models break down.
The robot cannot avoid contact. Contact is the point. A robot that picks up a cup must model the friction between its fingers and the ceramic, the weight distribution that determines whether the cup tips, the force required to lift without crushing. A robot that folds laundry must model the way fabric drapes, bunches, slides, and catches. These are not edge cases. They are the central problem.
This is why robotics imposes the hardest test on world models. A video model can fake gravity by generating pixels that look like falling. A robot cannot fake gravity. A simulator can approximate friction with a coefficient. A robot that applies the wrong amount of force drops the cup or shatters it. The consequences of prediction error in robotics are not visual artifacts. They are broken objects, failed tasks, and damaged hardware.
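A back-of-the-envelope Coulomb friction model makes the force window concrete. The numbers and the two-finger pinch model are purely illustrative, not drawn from any system discussed here.

```python
# For a vertical two-finger pinch, friction at both contacts must support the
# object's weight: 2 * mu * F_normal >= m * g, so the minimum squeeze per
# finger is m * g / (2 * mu).
G = 9.81  # gravitational acceleration, m/s^2

def min_grip_force(mass_kg, mu):
    """Minimum normal force per finger to hold mass_kg in a two-finger pinch."""
    return mass_kg * G / (2.0 * mu)

# A 0.3 kg ceramic cup with fingertip friction mu = 0.4 needs about 3.7 N per
# finger; squeeze much less and the cup slips, much more and it may crack.
cup_force = min_grip_force(0.3, 0.4)
```

The window between slipping and crushing is narrow, and it shifts with the friction coefficient, which the robot can estimate but never directly observe.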
The result is a severe selection pressure. Only world models that genuinely capture the physics of interaction survive contact with reality. This makes the Robot’s Road slower than the others, less glamorous, less amenable to breathtaking demos. But it also makes it the most honest test of whether a world model actually works.
Prediction Conditioned on Action
What fundamentally separates the Robot’s Road from the others is not data scarcity, or contact dynamics, or the VLA debate. It is a conceptual distinction that the other roads can defer but this one cannot.
A video model predicts what happens next. A robot’s world model must predict what happens next if I do this.
The difference seems small. It is not. Unconditional prediction, forecasting the next frame of a video, can succeed by learning the statistics of natural scenes. The sun continues to set. The ball continues to fall. The pedestrian continues to walk. But action-conditioned prediction requires modeling the consequences of intervention. What happens to the cup if I push it? What happens to the fabric if I pull here instead of there? What happens to the tower of blocks if I remove this one?
This is the distinction between a movie and a model. A movie shows you what the world does. A model lets you ask what the world would do under different actions. Planning, the ability to evaluate multiple possible actions and choose the best one, requires action-conditioned prediction. And planning is what makes a robot useful.
Hafner’s Dreamer, back on the Dreamer’s Road, understood this from the start: the whole point of learning a world model was to plan inside it. But Dreamer operated in simplified environments, game worlds and simulated tasks, where the dynamics were clean and the state spaces were manageable. The Robot’s Road demands action-conditioned prediction in the full complexity of the real world: cluttered kitchens, novel objects, deformable materials, partial occlusion, and forces you can feel but not see.
This is why, despite Fan’s critique, the VLA approach has traction. In the absence of a world model that can predict the consequences of actions in rich physical environments, the field has settled for an approximation: train a policy that maps observations directly to actions, borrowing semantic knowledge from language models to fill the gaps. The world model is implicit in the policy, baked into its weights, never explicitly queried or planned against. It works, partially, for the tasks and environments covered by the training data. Whether it will scale to the open world remains the central question.
The Road Ahead
The Robot’s Road has achieved things that seemed impossible five years ago. A generalist policy can fold laundry, clean an unfamiliar kitchen, and generalize across robot platforms. But the field’s honest practitioners acknowledge how far the road still stretches.
Success rates hover between 20 and 60 percent on novel tasks. Every robotics demo is cherry-picked from dozens of attempts. There are no agreed-upon benchmarks, no standard evaluation protocols. Hardware reliability limits software iteration speed: a robot that breaks its gripper every few hours cannot collect the data needed to improve. The data problem has not been solved, only managed.
And the deepest question remains open. The VLA approach transfers semantic knowledge effectively: a robot that knows what a kitchen is can navigate one. But semantic knowledge is not physical knowledge. Knowing that cups are fragile does not tell the robot how much force will break one. The Robot’s Road needs world models that combine the semantic understanding of language models with the physical understanding that, so far, only comes from interaction.
The next part of the series steps back from the individual roads to ask the question they all converge on. If we need a world model that understands physics, responds to actions, predicts at multiple time horizons, and generalizes across environments, what should that model actually look like? Should it predict pixels, or something more abstract? Should it be a single monolithic network, or a modular architecture? Should it learn everything from data, or encode known physics as structure?
These are not engineering details. They are the architecture of understanding itself.
Next: Part 5, “The Architecture Debate,” examines the deepest question in world modeling: should a universal world model predict pixels or predict in a more abstract space? This is where Plato’s cave resurfaces as a live technical argument.


