Language Is Poison, Part 3: Everyone Is Heading to the Same Place
Five companies, five definitions, one term. The convergence on “world model” masks a deeper disagreement about what intelligence requires.
This deep dive series explores a seven-hour interview with Saining Xie hosted by Xiaojun Zhang in February 2026, published in Chinese. All quotes are translated unless otherwise noted.
“World model is not a technique,” Saining Xie says. “It is not an algorithm. It is a purpose.”
He pauses, then adds: “Everyone, whether doing LLM or video diffusion or Gaussian Splatting, is on the road to world models. In a year or two, all these arguments will seem absurd, because we are all heading in the same direction, just thinking about the problem from different angles.”
This sounds like a concession. It is not. It is a framing device. If world model is a destination rather than a route, then the question is not who is working on world models. Everyone is. The question is who has the right map.
Xie has a specific map in mind. Understanding it requires seeing how it differs from four other maps that claim the same destination.
Five Definitions
Ask five world-model companies what a world model is, and you will get five different answers. This is not because the field is confused. It is because each answer reflects a different bet about what intelligence requires.
The Rendering Bet. Video generation companies, from OpenAI’s Sora to Google DeepMind’s Genie 3 to Runway and Luma, position themselves as world model builders. Their systems learn to generate visually coherent video that appears to obey physics: objects fall, liquids pour, light reflects. Some add interactivity, letting users control cameras or characters. Xie acknowledges this as progress beyond language models. “Video generation does carry real information about the world,” he says. A model that assigns higher likelihood to a smoothly running human than to one with hallucinated limbs has learned something about physical reality. But Xie draws a clear line: “These companies are still primarily focused on building a world simulator. Their goal is rendering video that looks good enough, with some consistency and control. This is the interface, not the core.”
The Spatial Bet. World Labs, founded by Fei-Fei Li, takes a different approach: generating explorable 3D environments with strong spatial consistency. Xie describes this as “more like a frontend, an asset interface.” He notes that Autodesk invested $200 million in World Labs, which makes sense: a company that sells 3D design tools would value a system that generates 3D assets. The representation is explicit 3D geometry, not hidden in neural network parameters. This gives guarantees that generative approaches cannot: you can be one hundred percent certain that the spatial structure is consistent, because it is explicitly represented. But it is a representation designed for a specific use case, not a general-purpose understanding of physical reality.
The Language Bet. The largest AI labs (OpenAI, Google, Anthropic, and their Chinese counterparts) treat language models as implicit world models. The reasoning: a model trained on the entire internet’s text has absorbed vast amounts of knowledge about how the world works. It knows that cups are fragile, that fire burns, that gravity pulls things down. This knowledge can be extended through multimodal training: add vision encoders, add audio, add tool use. Xie’s verdict on this approach has been the subject of Parts 1 and 2 of this series: language models are a necessary component of intelligence but a fundamentally flawed foundation for world understanding. “LLM will not serve as the foundation of the entire world model,” he says. It will become an interface: a communication layer that lets humans interact with the system, but not the system’s core.
The Symbolic Bet. Moonlake, co-founded by Chris Manning, Ian Goodfellow, and Fan-Yun Sun, recently published a positioning paper arguing that symbolic abstractions (code and language) are the most efficient representations for world models. Their reasoning: humans do not process the world in raw pixels. They use cognitive tools, above all language and mathematics, to create efficient abstractions that capture causal relationships. A world model should do the same. Interactive media like games provide both the data and the commercial incentive to build such systems. Moonlake agrees with Xie that pixel-level prediction wastes capacity on irrelevant details. But where Xie says language is poison, Moonlake says language is the most efficient compression of causal structure ever invented.
The Representation Bet. AMI Labs, where Xie serves as chief science officer alongside LeCun as executive chairman, bets on learned representations that are none of the above. Not pixels. Not language. Not explicit 3D geometry. Not symbolic code. A latent space learned from sensory data, optimized for prediction and action, in a format not designed for human consumption.
Xie frames this with a metaphor from baking. “You can think of what we are trying to build as the bottom layer of the cake. The base.” Language, video generation, and 3D assets are upper layers that can be decoded from this base. Once you have a strong enough representation, decoding it into language is straightforward. Decoding it into pixels for video is straightforward. Decoding it into actions for a robot is straightforward. But if the base is wrong, no amount of work on the upper layers will fix it.
The Bandwidth Argument
The deepest technical argument for the representation bet, one Xie develops at length in the interview, concerns bandwidth.
The human sensory system takes in information at roughly a billion bits per second. Vision alone accounts for most of this. Sound, touch, proprioception add more. This is the raw input: the unfiltered signal from the physical world.
Human behavioral output operates at roughly ten bits per second. Speech, the fastest channel, runs at about 150 words per minute in normal conversation, each word carrying only a few bits of information. Motor output, the movements we make, is similarly low-bandwidth.
Between input and output, the brain performs a compression of roughly eight orders of magnitude: a billion bits in, ten bits out. Whatever the brain is doing, it is not modeling every pixel. It is not preserving the full resolution of its sensory input. It is filtering, abstracting, discarding, and retaining only what matters for the next decision.
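Taking the round numbers above at face value, the compression ratio is a one-line calculation (an illustrative sketch, not a measurement):

```python
import math

# Round numbers from the discussion above; order-of-magnitude only.
sensory_input_bps = 1e9     # ~1 billion bits/s of raw sensory input
behavioral_output_bps = 10  # ~10 bits/s of speech and motor output

ratio = sensory_input_bps / behavioral_output_bps
orders_of_magnitude = math.log10(ratio)

print(f"compression ratio: {ratio:.0e}")                  # 1e+08
print(f"orders of magnitude: {orders_of_magnitude:.0f}")  # 8
```

Different estimates of either rate move the exponent by an order or two in either direction, but the asymmetry survives any reasonable choice of figures.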
This, Xie argues, is what a world model actually does. It is a compression engine that takes high-bandwidth sensory input and produces low-bandwidth action decisions. The quality of the compression determines the quality of the intelligence. A model that retains the right information and discards the right noise will make good decisions. A model that retains noise and discards signal will make bad ones.
Language models skip this problem entirely. They start and end in the low-bandwidth channel. Text in, text out. They never confront the challenge of compressing a billion bits per second into a decision. Video models confront a partial version of the challenge: they take in pixels and produce pixels, which is high-bandwidth to high-bandwidth. But they are not compressing toward decisions. They are reconstructing the input at full resolution, which is the opposite of what the brain does.
The representation bet says: learn to compress. Not into words. Not into pixels. Into whatever internal format best preserves the information needed for prediction and action, while discarding everything else.
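A toy sketch makes the difference concrete. Everything here is illustrative: the “encoder” is hard-coded to keep the task-relevant dimensions, whereas a real system (a JEPA-style model, say) would have to learn that abstraction. The point is where the prediction loss lives.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy observation: 4 task-relevant dims plus 60 dims of unpredictable
# detail (a stand-in for textures, lighting, and other visual noise).
def observe(state):
    return np.concatenate([state, rng.normal(size=60)])

# A stand-in "encoder" that keeps only the relevant dims. In a real
# system this abstraction is learned, not hard-coded as it is here.
def encode(obs):
    return obs[:4]

state_t = rng.normal(size=4)
state_t1 = state_t + 0.1     # assume simple known dynamics: constant drift

obs_t1 = observe(state_t1)   # the actual next observation

# Pixel-space prediction: must also guess the 60 unpredictable dims.
pixel_pred = observe(state_t + 0.1)
pixel_loss = np.mean((pixel_pred - obs_t1) ** 2)

# Latent-space prediction: predict only the abstraction of the next frame.
latent_pred = encode(observe(state_t)) + 0.1
latent_loss = np.mean((latent_pred - encode(obs_t1)) ** 2)

print(f"pixel-space prediction error:  {pixel_loss:.3f}")   # dominated by noise
print(f"latent-space prediction error: {latent_loss:.3f}")
```

The toy numbers do not matter; the structure of the loss does. In pixel space the error is dominated by detail that cannot be predicted even in principle, while in the latent space only the predictable structure remains.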
Downloading Humanity
If the representation bet is correct, the next question is immediate: where does the training data come from?
Language models had an answer ready-made: the internet. Billions of pages of text, uploaded by humanity over decades, free for the taking. The data was there. The architecture was ready. Scale followed.
World models face a harder version of this problem. Xie frames it as a generational shift: “The previous era was downloading the internet. The next era is downloading humanity.”
What he means: a four-month-old infant has already processed more visual data than the thirty trillion tokens used to train the largest language models. The information is there, in the world, streaming through every pair of eyes at a billion bits per second. But collecting it is a different problem from collecting text. You cannot scrape visual experience from a server.
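The infant comparison is easy to sanity-check. Every figure below is an assumed round number (waking hours, bits per token, the billion-bits-per-second rate), not data from the interview:

```python
# Back-of-envelope comparison; all figures are assumed round numbers.
WAKING_SECONDS_PER_DAY = 8 * 3600  # assume ~8 waking hours per day
DAYS = 120                         # roughly four months
VISUAL_BITS_PER_SECOND = 1e9       # round number for raw visual input

infant_bits = VISUAL_BITS_PER_SECOND * WAKING_SECONDS_PER_DAY * DAYS

TOKENS = 30e12                     # ~thirty trillion training tokens
BITS_PER_TOKEN = 16                # assume a token is about two bytes of text

llm_bits = TOKENS * BITS_PER_TOKEN

print(f"infant visual input: {infant_bits:.1e} bits")  # 3.5e+15
print(f"LLM training text:   {llm_bits:.1e} bits")     # 4.8e+14
print(f"ratio: {infant_bits / llm_bits:.0f}x")
```

Even with conservative waking hours, four months of vision lands roughly an order of magnitude above thirty trillion tokens of text, which is the shape of the claim.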
Xie’s first-order answer is video. YouTube alone receives more data in thirty minutes of uploads than the entire training set of the largest language model. The data exists. The question is access. And here, Xie is candid about the difficulty: “YouTube prohibits scraping. You download a few videos and your IP gets banned.” The tension between platforms guarding their data and AI labs needing it is escalating.
But video is only the first step. Xie emphasizes that video is passive observation. A world model that understands the consequences of actions needs action-conditioned data: not just what happened, but what happened because someone did something. This data is scarcer by orders of magnitude. Games and interactive media are one source, which is partly why Moonlake is starting there. Robot interaction is another, but robot data is expensive and slow to collect.
Xie is honest about the uncertainty: “I don’t know for certain whether this path will succeed. I have sufficient confidence, but if you ask me whether it will one hundred percent succeed, I cannot say that. The reason is still the data.”
A Different Scaling Law
Part 2 of this series discussed Xie’s argument that scaling laws “have water in them,” a Chinese idiom meaning the numbers are inflated. In the landscape context, this claim takes on more specific implications.
Xie predicts that visual world models will have a scaling law of their own, but one that looks completely different from the language-model curve. The slope will be different. The ratio of parameters to data will be different. And crucially, the models may not need to be very large.
The reasoning: a language model must memorize vast amounts of factual knowledge to be useful. It needs to recall specific names, dates, events, and relationships. This requires parameters. A world model does not need to memorize the specific appearance of every cup it has ever seen. It needs to understand what happens when cups are pushed, dropped, filled, and emptied. The knowledge is structural, not encyclopedic.
“It does not need the highest level of human intelligence,” Xie says. “It does not need to store all this knowledge. It needs very good understanding ability to filter information, because in the end, what truly matters is the decision itself.”
This is why the bandwidth argument matters for scaling. The brain runs on twenty watts and compresses a billion bits per second into ten bits per second of behavior. If the compression is the intelligence, then the scaling law for world models may favor better compression over larger models. A smaller model that learns the right abstractions could outperform a larger model that retains the wrong details.
This remains a prediction, not a result. Language models have empirical scaling laws backed by years of data. World models have intuitions backed by neuroscience analogies. The gap between the two is where the risk lives.
Understanding and Generation Are One
The AI field has been debating whether understanding helps generation or generation helps understanding. Xie dismisses the debate itself.
“Neither question matters,” he says. “Understanding and generation are one thing. They both need a real world model as their foundation.”
This is the sharpest version of the representation bet. In Xie’s framing, the current landscape of “unified models” and “omni models” that try to combine understanding and generation by stacking all data modalities into a single system is solving the wrong problem. The question is not how to combine understanding and generation. The question is what foundation both should sit on.
If the foundation is language, you get a system that understands language descriptions of the world and generates more language descriptions. If the foundation is pixels, you get a system that understands visual patterns and generates more visual patterns. If the foundation is a learned representation that captures the causal structure of physical reality, you get a system where both understanding and generation are natural byproducts of the same underlying model.
“Once you have a good world model that can do prediction, planning, and reasoning,” Xie says, “the upper-layer decoding is very, very simple.”
The Map and the Territory
Five bets. Five maps. One territory.
The rendering bet says: learn from pixels, generate pixels, and physics understanding will emerge as a byproduct of visual fidelity. The evidence for this is Sora, Genie 3, and the implicit physics that video models demonstrably learn. The evidence against it is that these models still hallucinate objects, violate physics under contact, and cannot plan.
The spatial bet says: represent the world in explicit 3D, and spatial intelligence follows from spatial structure. The evidence for this is the guaranteed consistency of geometric representations. The evidence against it is that explicit 3D is a specific tool for specific use cases, not a general-purpose understanding.
The language bet says: language encodes enough world knowledge that scale will eventually close the gap. The evidence for this is the extraordinary commercial success of language models and their ability to absorb new modalities. The evidence against it has been the subject of this entire series.
The symbolic bet says: human cognitive tools, language and code, are the most efficient abstractions for causal reasoning. The evidence for this is that humans use these tools to build civilization. The evidence against it is Xie’s core argument: these tools are optimized for communication, not for physical understanding, and building on them imports all their limitations.
The representation bet says: learn a new abstraction from sensory data, one that is not constrained by any human-designed format. The evidence for this is the brain’s existence proof and early results from JEPA and similar architectures. The evidence against it is that the approach is the least proven, the most data-hungry, and the furthest from commercial application.
Xie says these arguments will seem absurd in a year or two, as all roads converge. He may be right about the convergence. But convergence on a destination does not resolve disagreement about the route. And in AI, the route determines what you build, what data you need, what products are possible, and how much money it costs to get there.
The five maps agree on where to go. They disagree on almost everything else. That disagreement is the most consequential debate in AI right now, and it will be settled not by arguments but by results that no one has yet produced.
Next: Part 4, the final installment, follows Xie from wanting to be a film director to co-founding a billion-dollar bet against the industry he helped build.