Roads to a Universal World Model, Part 3: The Cinematographer’s Road
The generative video path: learning physics from pixels
“What I cannot create, I do not understand.” — Richard Feynman
On February 15, 2024, OpenAI published a technical report with a provocative title: “Video generation models as world simulators.”
The report introduced Sora, a model that could generate up to sixty seconds of high-fidelity video from a text prompt. The videos were striking: a woman walking through a snowy Tokyo street, reflections shimmering in puddles; a drone shot sweeping over Big Sur cliffs as waves crashed below. But what caught researchers’ attention was not the visual quality. It was what the model appeared to have learned without being told.
Objects persisted behind occluders. Shadows fell in the correct direction. When a camera panned, the three-dimensional structure of the scene was preserved. Liquids poured with something resembling fluid dynamics. A Minecraft world generated from a text prompt obeyed the game’s physics while simultaneously rendering its visual style. None of this had been programmed. No one had written equations for how shadows should fall or how liquids should pour. The model had been trained on video, and these behaviors had emerged from scale.
OpenAI’s claim was audacious: scaling video generation models might be a path toward building general-purpose simulators of the physical world. The physicist builds a simulator from equations. The cinematographer, apparently, could learn one from footage.
The Commercial Tailwind
The Cinematographer’s Road is paved not by robot labs but by the entire video economy.
This matters more than it might seem. Technologies built for one industry have repeatedly transformed robotics. Depth cameras developed for the Xbox Kinect ended up in warehouse robots. Inertial measurement units designed for smartphones found their way into drones. Large language models trained to predict text turned out to be useful for controlling robot arms. The pattern is consistent: when a technology matures under commercial pressure from a mass-market industry, robotics eventually inherits the results.
Video generation fits this pattern. The commercial pressure to produce better AI-generated video is immense and accelerating, driven by social media, film production, advertising, and entertainment. Sora competes with Runway, Pika, Luma, Google’s Veo, Kuaishou’s Kling, and ByteDance’s tools for a market projected to be worth billions. Each company pours engineering talent and compute into making video models more temporally consistent, more physically plausible, more controllable. The quality of generated video improves quarter by quarter, propelled by budgets that no robotics lab could justify.
The consequence for world models is indirect but powerful. Every dollar spent making a TikTok generation tool more realistic also makes its implicit physics engine more accurate. The Cinematographer’s Road is bankrolled by Hollywood and social media, not by the robotics community. This gives it a scaling advantage that the other roads lack.
What the Pixels Learned
The most surprising discovery of the video generation era is that neural networks trained to predict frames of video learn something that looks like physics.
This was not the original goal. Video generation models were built to produce compelling visual content. But to predict what comes next in a video, a model must implicitly represent how the world works. If a ball is thrown, the model must predict a parabolic arc. If a person walks behind a pillar, the model must predict their reappearance on the other side. If a cup is pushed off a table, the model must predict that it falls. These are not explicit rules encoded in the network. They are statistical regularities extracted from millions of hours of video depicting a world governed by physics.
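The idea can be seen in miniature with a toy sketch (purely illustrative, and nothing like the architecture of an actual video model): a one-dimensional "video" of a bright pixel drifting across the frame, and a linear predictor fit by least squares. The predictor is never told the dynamics, yet it recovers them from the frames alone.

```python
import numpy as np

# A toy "video": a single bright pixel drifting right across a 1-D frame,
# wrapping at the edge. The true dynamics (a shift) are never written down.
N = 16
frames = np.stack([np.roll(np.eye(N)[0], t) for t in range(N)])

X = frames[:-1]   # frame_t
Y = frames[1:]    # frame_{t+1}

# Fit a linear next-frame predictor by least squares: Y ~ X @ W.
W, *_ = np.linalg.lstsq(X, Y, rcond=None)

# The learned W has absorbed the shift operator from pixels alone:
# feeding in any observed frame yields the correct next frame.
print(np.allclose(frames[0] @ W, frames[1], atol=1e-6))
```

The point is not the linear algebra but the principle: minimizing next-frame prediction error forces the model to internalize whatever regularity generates the frames, whether that is a pixel shift or, at vastly greater scale, the broad strokes of Newtonian mechanics.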
The effect is cumulative. Early video generation models, circa 2022, produced short clips where objects morphed, physics broke, and consistency lasted only a few frames. By the time Sora appeared in early 2024, models could sustain coherent three-dimensional scenes for up to a minute, with camera movements that preserved spatial structure, lighting that responded to scene geometry, and object interactions that obeyed at least the broad strokes of Newtonian mechanics.
NVIDIA senior researcher Jim Fan, reacting to Sora’s release, argued that it was less a creative tool than a data-driven physics engine. The framing was deliberate. If video models could learn to simulate physics from observation alone, without equations, without explicit physical laws, then the entire simulation infrastructure of the Physicist’s Road might be supplemented or even replaced by a learned model trained on enough footage of the real world.
What the Pixels Did Not Learn
The celebration was premature.
OpenAI’s own technical report acknowledged that Sora could not accurately model many basic physical interactions. Glass did not shatter correctly. A person could bite a cookie and leave no bite mark. Objects sometimes appeared from nowhere or duplicated spontaneously. After about twenty seconds, consistency began to erode. After a minute, the generated world often bore only a loose resemblance to its opening frame.
These failures are not random. Video generation models fail in predictable ways that reveal the limits of learning physics from pixels alone.
The first failure mode is object impermanence. Objects that leave the frame or pass behind an occluder may fail to return, or may return altered. The model can generate convincing occlusion in the short term, but over longer horizons, its memory of what went behind the pillar degrades. The ball that rolled behind the couch may reappear as a different ball, or not reappear at all.
The second is spontaneous generation. New objects appear in scenes without cause. An extra chair materializes in a room. A second person appears in a shot that began with one. This reflects the model’s statistical nature: when the current frame is ambiguous about what should be present, the model samples from its distribution and sometimes draws an object that was not there before.
The third is physical inconsistency under interaction. The model can generate a convincing cup sitting on a table. It can generate a convincing hand reaching for a cup. But the moment the hand contacts the cup, the physics of the interaction breaks down: the transfer of force, the tipping, the liquid sloshing. Contact dynamics, the hard problem of the Physicist’s Road, is also the hard problem of the Cinematographer’s Road. The difference is that the physicist knows the equations are missing. The cinematographer’s model does not know what it does not know.
The fourth, and most revealing, is the inability to model failure. Prior to Sora 2, if a basketball player in a generated video missed a shot, the ball would sometimes teleport to the hoop anyway. The model had learned the statistical outcome, not the physical process. Videos of basketball highlights outnumber videos of missed shots. So the model learned that shots go in.
OpenAI acknowledged this in their September 2025 release of Sora 2, noting that the newer model could generate missed shots that rebounded realistically off the backboard. It was, they wrote, an important capability for any useful world simulator: the ability to model failure, not just success.
From Watching to Playing
The most consequential shift on the Cinematographer’s Road has been the move from passive generation to interactive worlds.
Google DeepMind’s Genie line traces this arc. The original Genie, announced in early 2024, was a foundation world model trained on internet videos that could generate simple playable 2D environments from a single image. Genie 2, revealed in December 2024, could generate photorealistic 3D worlds from an image prompt, complete with simulated physics, NPC behavior, and first-person navigation. By August 2025, Genie 3 could generate diverse interactive environments from text prompts alone, at 720p resolution, 24 frames per second, with consistency maintained for minutes rather than seconds.
In January 2026, Google released Project Genie as an experimental prototype available to subscribers. Users could type a text prompt, a cat exploring a living room, a car driving across a rocky moonscape, and navigate the resulting world in real time. Testers immediately pushed the boundaries, generating crude but functional recreations of Super Mario 64 and Breath of the Wild.
The quality was rough. Input lag was significant. Each world lasted only sixty seconds. But the conceptual leap was profound. A neural network, trained on video, was generating a navigable three-dimensional world that responded to user actions in real time. It was not rendering from a scene graph. It was not solving physics equations. It was predicting, frame by frame, what the world should look like next given what the user just did. Every frame was an act of imagination grounded in statistical regularity.
This is where the Cinematographer’s Road begins to converge with the Dreamer’s Road. Hafner’s dreamers learned world models to plan actions inside them. Genie’s worlds are generated, not planned in, but the underlying capability is similar: predict the next state of the world conditioned on an action. The question is whether this prediction is deep enough to be useful for real-world robots, or whether it remains a sophisticated visual trick.
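The difference between passive generation and an interactive world can be stated in one line: the predictor is conditioned on an action as well as on the current state. A minimal sketch (the environment, the "left"/"right" actions, and the per-action linear operators are all illustrative inventions, not anyone's actual system):

```python
import numpy as np

N = 8
def step(frame, action):
    """Toy ground-truth dynamics: the action shifts the pixel."""
    return np.roll(frame, 1 if action == "right" else -1)

# Fit one linear next-frame operator per action from (frame, action,
# next_frame) transitions -- the simplest action-conditioned predictor.
ops = {}
for a in ("left", "right"):
    X = np.eye(N)                          # every one-hot frame once
    Y = np.stack([step(x, a) for x in X])
    ops[a] = np.linalg.lstsq(X, Y, rcond=None)[0]

# "What will happen next if I do this": the same frame yields
# different predicted futures under different actions.
f = np.eye(N)[3]
print(np.argmax(f @ ops["right"]), np.argmax(f @ ops["left"]))  # 4 2
```

Genie's frame-by-frame generation is, in spirit, this same conditional prediction scaled up: next state given current state and the user's input.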
Interpolation or Understanding?
The central question of the Cinematographer’s Road is whether predicting pixels constitutes understanding physics.
The optimistic case is strong. Video models have learned regularities that no one taught them: three-dimensional consistency, object permanence, basic dynamics. They improve with scale. The commercial pressure to make them better is enormous and will not stop. And pixels are a universal interface. Any physical phenomenon that can be observed can, in principle, be captured on video, which means a sufficiently powerful video model might learn an arbitrarily complete model of physical reality.
The pessimistic case is equally strong. Video models learn from the surface of things. They observe what happens but not why. A video model that generates realistic pouring does not represent the viscosity of water, the surface tension at the lip of a glass, or the gravitational acceleration pulling the stream downward. It has learned that “pouring looks like this.” The representation may be deep enough to interpolate convincingly within its training distribution. But when asked to extrapolate, to predict what happens in a novel situation it has not seen, the model may fail in ways a physicist’s equations would not.
The technical distinction is between interpolation and extrapolation. Video models are superb interpolators. They have seen millions of examples of cups on tables, people walking through doors, balls bouncing on floors. Within these familiar scenarios, they predict the next frame with impressive accuracy. But the real world constantly presents novel situations. An object with an unfamiliar shape. A material with unusual properties. A collision that happens to combine forces in a way the model has never observed. In these moments, interpolation breaks down, and the model's lack of explicit physical knowledge becomes visible.
The physicist who writes down Newton’s laws can predict the trajectory of any object under any gravitational field, including ones never observed. The cinematographer’s model can only predict trajectories that resemble ones it has seen in training data. This is not a difference in degree. It is a difference in kind. Equations generalize. Statistics interpolate.
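The contrast can be made concrete with a toy numerical sketch (the scenario and numbers are illustrative): fit a pure interpolator to the first two seconds of a ball's flight, then ask both the interpolator and the equation about a time neither saw during "training."

```python
import numpy as np

G = 9.81   # gravitational acceleration, m/s^2
V0 = 30.0  # initial upward velocity, m/s

def height(t):
    """The physicist's model: closed-form ballistic height."""
    return V0 * t - 0.5 * G * t**2

# "Training data": trajectory observations from t = 0..2 s only.
t_train = np.linspace(0.0, 2.0, 50)
y_train = height(t_train)

def learned_model(t):
    """A pure interpolator. np.interp is exact between observed
    samples but clamps to the edge value outside their range."""
    return np.interp(t, t_train, y_train)

# Inside the observed range, interpolation matches physics closely.
print(abs(learned_model(1.0) - height(1.0)))  # near zero

# Outside it, the interpolator keeps reporting the last observed
# height, while the equation predicts the ball falling back down.
print(learned_model(6.0), height(6.0))  # ~40.4 m vs ~3.4 m
```

Real video models are far more expressive than `np.interp`, which is exactly why the boundary of their effective interpolation is hard to locate, but the structural asymmetry is the same: the equation carries its validity beyond the data; the fit does not.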
Or do they? The honest answer is that we do not yet know where the boundary lies. As video models scale, the range of their effective interpolation expands. What looked like extrapolation at one scale turns out to be interpolation at a larger one. The question of whether there is a ceiling, a point beyond which statistical learning from pixels cannot go, is one of the open questions of the field. No one has proven such a ceiling exists. No one has proven it does not.
The Road Ahead
The Cinematographer’s Road has moved faster than anyone expected. In three years, video generation went from producing jittery five-second clips to rendering interactive 3D worlds navigable in real time. The commercial tailwind shows no sign of weakening. And the implicit physics learned by these models, flawed as it is, represents a form of world knowledge that no one had to write down.
But the road has not yet reached its destination. Video models hallucinate objects, violate physics under contact, and struggle with long-horizon consistency. They learn what the world looks like, which is not the same as learning how the world works. For a filmmaker, the distinction may not matter. For a robot that must grasp a cup without breaking it, the distinction is everything.
The next road starts from the robot’s end of the problem. Not from equations, not from pixels, but from touch and action. A world model that cannot be acted upon is, for a robot, merely a movie. What turns a movie into a model is the ability to condition predictions on interventions: not “what will happen next,” but “what will happen next if I do this.”
Next: Part 4, “The Robot’s Road,” follows the hardest path to world models, where real physics, real-time feedback, and real consequences for errors impose constraints that no simulator or video generator can avoid.


