Roads to a Universal World Model, Part 5: The Architecture Debate
What should a world model even look like?
"How could they see anything but the shadows if they were never allowed to move their heads?" — Plato, Republic, Book VII
In the summer of 2022, while the rest of the AI world was still absorbing the implications of large language models, Yann LeCun published a document that read less like a research paper and more like a manifesto. Titled “A Path Towards Autonomous Machine Intelligence,” the 62-page position paper laid out an architecture for machines that could learn, reason, and plan like animals. At its center was a claim that put LeCun at odds with nearly every major AI lab: generative models are the wrong path to understanding.
The argument was precise. Models that predict pixels, tokens, or any raw sensory output waste enormous capacity on details that do not matter. The exact pattern of light reflecting off a cup of coffee tells you almost nothing about whether the cup will tip if you bump the table. A model that devotes its parameters to reconstructing those pixel-level details is, in LeCun’s framing, solving the wrong problem. It is predicting the shadows on the cave wall instead of learning the shapes of the objects casting them.
The alternative he proposed was the Joint Embedding Predictive Architecture: JEPA. Instead of generating outputs, JEPA would predict in an abstract representation space. Two encoders would transform inputs into compact embeddings, and a predictor would forecast future embeddings from current ones. The architecture would learn to represent the predictable structure of the world while ignoring the unpredictable noise. No pixel reconstruction. No token generation. Prediction at the level of meaning, not the level of sensation.
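The core idea, predict embeddings and compute the loss in embedding space rather than pixel space, fits in a few lines. The sketch below is purely illustrative: the "encoders" and "predictor" are fixed random projections standing in for the learned networks, and all names and dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for JEPA's two encoders and its predictor.
# In a real JEPA these are learned networks; fixed random projections
# are used here only to illustrate the shapes and where the loss lives.
W_context = rng.standard_normal((8, 64))   # encoder for the observed input
W_target = rng.standard_normal((8, 64))    # encoder for the input to predict
W_pred = rng.standard_normal((8, 8))       # predictor, embedding -> embedding

def jepa_loss(x_context, x_target):
    """The loss compares embeddings, never raw inputs."""
    z_context = W_context @ x_context      # compact embedding of what was seen
    z_target = W_target @ x_target         # compact embedding of what comes next
    z_hat = W_pred @ z_context             # prediction in representation space
    return float(np.mean((z_hat - z_target) ** 2))

x = rng.standard_normal(64)                # e.g. the visible part of a scene
y = rng.standard_normal(64)                # e.g. the masked or future part
loss = jepa_loss(x, y)
```

Note what is absent: nothing ever maps an embedding back to the 64-dimensional input. The error signal lives entirely in the 8-dimensional representation space, which is exactly the "no pixel reconstruction" property described above.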
It was an elegant idea. It was also, in 2022, almost entirely theoretical.
Shadows and Objects
The philosophical intuition behind JEPA maps onto an argument that is 2,400 years old. In Plato’s allegory of the cave, prisoners chained to a wall see only shadows of objects projected by firelight. They mistake the shadows for reality. The philosopher who escapes the cave and sees the actual objects gains a deeper understanding, one that lets him predict not just the shadows’ shapes but why they have those shapes.
LeCun’s argument against generative models is, at its core, a cave argument. A video generation model trained to predict pixels is learning to predict shadows. It can become extraordinarily good at this. It can learn that shadows grow longer in the evening and shorter at noon. It can learn that when one shadow vanishes, others shift. But it has no representation of the object casting the shadow, no model of the light source, no understanding of the geometry that connects them. If you change the angle of the light, the shadow-predictor breaks. The object-knower adapts.
This is a powerful metaphor. It is also the kind of argument that can be true in principle and wrong in practice. The history of AI is littered with architectures that were theoretically elegant but lost to brute-force approaches that simply scaled better. The question is not whether LeCun’s intuition is correct. The question is whether reality cares.
Nature’s Proof
There is, however, one piece of evidence that is difficult to dismiss. Nature built a world model, and it looks nothing like a language model.
Roughly a third of the human cortex is devoted to processing visual information. Language, by contrast, occupies comparatively compact neural territory: Broca’s area and Wernicke’s area together constitute a small fraction of total cortical volume. The brain’s world model is overwhelmingly visual and spatial, not linguistic.
The primate evidence makes this even starker. Great apes possess physical intelligence that exceeds anything current robots can achieve. Orangutans use tools, build shelters, and solve mechanical puzzles with a patience and ingenuity that researchers find remarkable. In captivity, they have been observed washing clothing, using hammers, and operating simple vehicles. They accomplish all of this with language capacity that, if translated into AI terms, would be roughly equivalent to the earliest, most primitive language models. Their world models are rich, grounded, and physical. Language plays almost no role.
This is nature’s existence proof that a robust world model can be built primarily from visual and physical experience, with minimal linguistic scaffolding. It does not prove that JEPA is the right architecture. But it does suggest that the current AI paradigm, which routes most physical understanding through language-heavy models, may have the proportions backwards.
From Theory to Test
For three years after LeCun’s position paper, JEPA remained more vision than validation. The first implementation, I-JEPA, released in 2023, showed that image models trained to predict in representation space could learn surprisingly good visual features without any pixel reconstruction. But images are static. The real test was video, which is where dynamics, physics, and prediction over time all live.
V-JEPA, released in early 2024, extended the architecture to video. The model learned by predicting masked regions of video in latent space, filling in what it expected to see based on context, but doing so at the level of abstract features rather than pixels. The results were striking. V-JEPA was the first video model that excelled at “frozen evaluations,” meaning the core encoder could be trained once and then applied to new tasks by adding only a lightweight classifier on top, without retraining the entire model. It was efficient. It was general. But it could not act.
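The "frozen evaluation" protocol can itself be sketched. Below, a fixed random projection stands in for the pretrained V-JEPA encoder, and an ordinary least-squares readout stands in for the lightweight classifier; the data and dimensions are invented. The point is structural: only the small probe is fit to the new task, while the encoder's weights never change.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for a pretrained, frozen encoder: a fixed projection.
# In practice this would be the trained video encoder with updates disabled.
W_frozen = rng.standard_normal((16, 128))

def encode(clip):
    return np.tanh(W_frozen @ clip)        # frozen features; never retrained

# A new downstream task: 200 toy "clips" with binary labels.
clips = rng.standard_normal((200, 128))
labels = (clips[:, 0] > 0).astype(float)

# Only the lightweight probe is trained: a least-squares linear readout
# on top of the frozen features.
features = np.array([encode(c) for c in clips])
probe, *_ = np.linalg.lstsq(features, labels, rcond=None)

preds = (features @ probe) > 0.5
accuracy = float((preds == labels.astype(bool)).mean())
```

Evaluating on a second task would reuse `encode` unchanged and fit only another small probe, which is why this protocol is so much cheaper than retraining the whole model.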
V-JEPA 2, published in June 2025, closed the gap. The architecture remained the same: self-supervised learning from video, predicting in representation space, no pixel generation. But after pretraining on over one million hours of internet video, the team added a second stage. They trained an action-conditioned predictor, V-JEPA 2-AC, using just 62 hours of unlabeled robot data from the Droid dataset. The result was a latent world model that could plan: given a current image and a goal image, it searched for a sequence of actions that would minimize the distance between the predicted future state and the goal, all in representation space. No pixel generation. No reward function. No task-specific training.
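The planning loop described above can be sketched compactly. Everything here is a toy stand-in: a fixed linear map replaces the learned action-conditioned predictor, and a naive random-shooting search replaces the sampling-based optimizer the actual system uses. Only the shape of the loop, scoring candidate action sequences by the distance between the predicted final embedding and the goal embedding, matches the description in the text.

```python
import numpy as np

rng = np.random.default_rng(2)

DIM, HORIZON, CANDIDATES = 8, 5, 256

# Toy latent dynamics: next_state = A @ state + B @ action.
# In the real system this transition is a learned, action-conditioned
# predictor operating on video embeddings.
A = np.eye(DIM) * 0.9
B = rng.standard_normal((DIM, 2)) * 0.1

def rollout(z0, actions):
    """Predict the final embedding after applying a sequence of actions."""
    z = z0
    for a in actions:
        z = A @ z + B @ a
    return z

def plan(z_current, z_goal):
    """Random-shooting planner: sample action sequences, keep the one whose
    predicted final embedding lands closest to the goal embedding."""
    best_cost, best_seq = np.inf, None
    for _ in range(CANDIDATES):
        candidate = rng.standard_normal((HORIZON, 2))
        cost = np.linalg.norm(rollout(z_current, candidate) - z_goal)
        if cost < best_cost:
            best_cost, best_seq = cost, candidate
    return best_seq, float(best_cost)

z_now = rng.standard_normal(DIM)    # embedding of the current camera image
z_goal = rng.standard_normal(DIM)   # embedding of the goal image
actions, cost = plan(z_now, z_goal)
```

Note that the planner never produces an image: both the search and the cost live entirely in representation space, which is what "no pixel generation, no reward function" means in practice.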
They deployed it zero-shot on robot arms in two different labs, neither of which had been seen during training, and it could pick up and place objects using only a single uncalibrated camera. Planning took 16 seconds per action. For comparison, a baseline built on NVIDIA’s Cosmos video generation model took four minutes.
The numbers were modest. The tasks were simple. But the proof of concept was clear: a model that never generated a single pixel could plan physical actions in the real world.
The Pragmatist’s Rebuttal
The case for JEPA is intellectually compelling. It is also, at the moment, incomplete.
Video generation models can do things that JEPA models cannot yet match. They can simulate complex environments at high fidelity. They can generate diverse scenarios for testing robot policies. They can produce the kind of rich, detailed rollouts that visuomotor policies need for closed-loop evaluation: if a robot’s policy takes images as input, the world model evaluating that policy must produce images as output. A latent-space model that predicts abstract embeddings cannot directly feed a policy that expects pixels, unless the policy itself is redesigned for latent-space inputs. This creates a practical pull toward pixel prediction even when the theoretical arguments favor abstraction.
There is also the question of momentum. As the Cinematographer’s Road showed, video generation develops under massive commercial pressure from entertainment, advertising, and social media. Every dollar spent improving tools for TikTok and Hollywood also improves the implicit physics engines inside video models. JEPA has no comparable commercial tailwind. Its development depends on research labs with long time horizons, primarily Meta’s FAIR, betting on a paradigm that may take years to mature.
The pragmatist’s argument is simple: video models work now. They scale with data and compute in ways we understand. They are imperfect, but they improve predictably. JEPA is a better theory that may or may not become a better practice.
The Space Between
The deepest question in this debate is not which architecture wins. It is what kind of understanding machines need.
Consider two ways a model might represent the fact that a ball rolling off a table will fall to the floor. A video model represents this as a pattern in pixel space: in training data, objects near table edges tend to appear lower in subsequent frames. The representation is implicit, distributed across millions of parameters, and entangled with irrelevant details like the color of the table and the lighting in the room. A latent model might represent the same fact as a compact relationship between embeddings: the “edge state” embedding maps to the “falling state” embedding via a learned transition, independent of visual details.
The video model’s representation is brittle but rich. Change the lighting and the prediction still works, because the model has seen many lightings. Replace the plastic ball with a lead one of the same size and color, and the prediction fails, because the pixels are nearly identical but the physics are completely different: the lead ball is far harder to set in motion, carries far more momentum, and dents the floor on impact. The information that distinguishes the two, mass, is not in the pixels. The latent model’s representation is potentially more robust but currently less proven. If it has learned the right abstractions, it might generalize where pixels cannot. If it has not, it fails just as badly.
This is the core uncertainty. Nobody knows, yet, whether prediction in representation space actually learns deeper physical structure than prediction in pixel space, or whether it simply learns the same correlations in a more compact form. V-JEPA 2’s results are encouraging but early. The model handles tabletop pick-and-place. It does not fold laundry, navigate kitchens, or manipulate deformable objects. The gap between current results and the theoretical promise remains wide.
What the Debate Reveals
The architecture debate matters because it determines what kind of world model is even possible. If pixels are the right level of prediction, then the path forward is clear: scale video models, improve their physics, solve their hallucination problems, and eventually they will converge on something that functions as a universal world model. Commercial incentives align with this path. The engineering challenges are formidable but understood.
If representation space is the right level, the path is less clear but potentially more powerful. A model that predicts at the level of abstract physical relationships could, in principle, plan over longer time horizons, transfer across different sensory modalities, and generalize to novel physical situations that no training video has captured. This is the promise LeCun articulated. It is also the promise that remains largely unfulfilled.
The most likely outcome, as with most grand debates in AI, is that the answer is both. Different timescales and different purposes may demand different levels of prediction. Short-horizon control, where a robot needs to know exactly what the next camera frame will look like, may favor pixel-level models. Long-horizon planning, where an agent needs to reason about whether a sequence of actions will achieve a goal minutes or hours in the future, may favor abstract representations. The architecture that dominates five years from now may not be a pure pixel predictor or a pure JEPA, but something that operates at multiple levels simultaneously: concrete at the bottom, abstract at the top, with learned interfaces between them.
This is, it turns out, roughly how the brain works. The visual cortex processes raw sensory data with extraordinary fidelity. Higher cortical areas compress that information into increasingly abstract representations. Planning and decision-making operate at the abstract level. Execution translates back down to the concrete level of motor commands. The brain does not choose between pixels and abstractions. It uses both, at different stages of the same process.
The architecture debate is not settled. But its shape is becoming clear. The question is no longer whether to predict in pixel space or representation space. The question is how to build systems that can do both, and know when to use which.
Next: Part 6, “The Convergence,” examines where all five roads meet and what a universal world model might actually look like when it arrives.


