Language Is Poison, Part 1: World Model, Not Word Model

AMI Labs’ chief science officer believes the AI industry’s most successful technology is also its most seductive trap. His company has just raised $1.03 billion to prove it.

Mar 20, 2026

This deep-dive series explores a seven-hour interview with Saining Xie, hosted by Xiaojun Zhang in February 2026. To better convey the depth and ideas of the conversation, I have reorganized the narrative, added background context, and clarified some of the more technical points.

In March 2026, a Paris-based startup called AMI Labs announced a $1.03 billion seed round at a $3.5 billion pre-money valuation. It was the largest seed round in European history. The company had roughly two dozen employees and zero products. Its backers included NVIDIA, Samsung, Toyota Ventures, and Bezos Expeditions.

What was the bet? Not a better language model. Not a more realistic video generator. AMI Labs was built on a claim that the entire AI industry has the architecture of intelligence backwards. Language, the foundation of every frontier AI system from GPT to Gemini to Claude, is not the path to understanding the physical world. It is, in the words of AMI’s chief science officer, a poison.

Saining Xie does not look like someone who would use a word like “poison” to describe the most successful technology paradigm in the history of artificial intelligence. Born in 1990, trained at Shanghai Jiao Tong University and UC San Diego, faculty at NYU, former research scientist at Google DeepMind and Meta FAIR, co-author of Diffusion Transformers, the architecture powering Sora and most video diffusion models. Nearly 100,000 citations. The kind of CV that usually produces diplomatic understatement, not provocation.

But in a seven-hour marathon interview hosted by Xiaojun Zhang in a Brooklyn apartment after a February blizzard, Xie laid out a case against the language-centric AI paradigm that was precise, technical, and unsparing.

“Language is a poison,” he said. “Or perhaps an opium.”

This is not a metaphor he tossed off casually. It is the compressed version of an argument that runs through the entire conversation, touching representation learning, scaling laws, the Bitter Lesson, and what a world model actually needs to be. The argument deserves unpacking, because if Xie is right, the implications extend far beyond one startup’s research agenda.

The Communication Tool Mistake

Xie’s argument begins with a distinction that sounds obvious but, he contends, the entire field has failed to internalize.

Language is a communication tool. It is not a thinking tool. It is not a decision-making tool. It was designed, through thousands of years of cultural evolution, to transmit intent between humans. That purpose required trade-offs. To be efficient for communication, language discards vast amounts of information about the physical world.

Xie illustrates this with a simple example. “A cup fell on the ground and broke.” This sentence is perfect for communication. It conveys the relevant outcome. But it discards the dynamics: how the cup broke, which physical laws governed the fracture pattern, what the underlying mechanics were. We do not care about those things when we are talking to each other. A world model must care about exactly those things.

The problem, in Xie’s framing, is that the AI field has mistaken a tool optimized for human communication for a tool capable of modeling the world. A large language model trained on text absorbs a compressed, lossy, communication-optimized description of reality. It can tell you that cups are fragile. It cannot tell you how much force will break one.

This is where the opium metaphor enters. “Adding language always makes you feel happier,” Xie observes. Plug a language backbone into any system, and the benchmarks improve. The system can answer questions, follow instructions, describe what it sees. “But it is a crutch. If you keep using the crutch, you can never train your leg muscles.” The system gets better at language-mediated tasks while its capacity for direct physical understanding atrophies, or never develops at all.

The Tokenization Problem

The critique moves from philosophy to engineering when Xie describes what happens when continuous visual signals are processed through a language model architecture.

Consider a person turning their head five or ten degrees. This generates hundreds of frames of visual information at high frequency. The human visual system processes this as a continuous, coherent stream, updating a persistent spatial model in real time.

A language-model approach handles this differently. Each frame is tokenized: broken into a grid of patches, each patch converted into a discrete token. The tokens from all frames are concatenated into a long sequence. A 256-token frame over 128 frames produces a sequence of 32,768 tokens. This sequence is fed into a transformer that has no built-in notion of spatial structure. Every token can attend to every other token regardless of spatial proximity, and the model must learn all spatial relationships from scratch.

Xie argues that this process is fundamentally wrong for visual and physical data. The world has a global state: a persistent, coherent, three-dimensional structure that exists continuously regardless of where you are looking. Tokenization serializes this global state into a flat sequence of fragments. The transformer has no prior knowledge that spatially adjacent fragments are more related than distant ones. It must discover all spatial structure through learning, spending computation on every pairwise relationship in a sequence that grows quadratically. Most of the information in a visual stream is redundant from frame to frame. The meaningful signal is sparse: a change in depth, a new object entering the field, a force being applied. Lacking any built-in spatial structure is not a neutral architectural choice. It is an inductive bias that ignores what we already know about the structure of physical reality.

“The modeling technique of language models cannot solve the problem of understanding continuous spatial signals,” Xie states. “This is not a matter of scale. It is structural.”

The Pollution

If this were only a technical disagreement about architectures, it would be interesting but contained. What makes Xie’s argument broader is his claim about how the language-centric paradigm distorts the entire research ecosystem.

The mechanism, he argues, is economic. There is a value chain that runs from narrative to capital to talent to research direction. At the top sits a set of interlocking stories: AGI is near, scaling laws will hold, the Bitter Lesson says to scale. These narratives attract investment. Investment funds labs. Labs hire researchers. Researchers, working within institutions built around language models, route their work through language architectures, because that is where the infrastructure, the compute, and the benchmarks are.

The result is that vision research is being forced onto language backbones, not because this is the right architecture for visual understanding, but because this is where the resources flow. Xie has a specific concern: when a vision encoder’s features enter a language model, they are forced to align with token space. The spatial structure, the continuous dynamics, the physical relationships, all get compressed into a format designed for discrete sequential processing. The vision system does not get augmented by language. It gets distorted by it.

“I am deeply worried about the pollution of vision by language,” Xie says. “And it is already happening.”

He notes that researchers building large language models have themselves raised the concern from the opposite direction: adding vision might lower their models’ reasoning performance. Xie’s response is characteristically blunt: “If you don’t add vision, you will certainly be unintelligent. But the real question is how you define intelligence.” He invokes Moravec’s Paradox: what is easy for machines is hard for humans, what is easy for humans is hard for machines. The current paradigm optimizes for the machine-easy side of this divide, treating text reasoning as the pinnacle of intelligence while physical interaction, the thing that every toddler masters and no robot can reliably do, remains unsolved.

Three Levels of the Bitter Lesson

Xie’s most technically developed argument concerns the Bitter Lesson itself. Richard Sutton’s 2019 essay argued that general methods leveraging computation always eventually defeat methods that exploit human knowledge. The AI field has treated this as gospel: scale is all you need, and the scaling should be applied to language models.

Xie disagrees. His counter-argument has a specific structure.

Language, he observes, is entirely the product of human knowledge. It was created by human civilization over millennia, refined through cultural evolution, and uploaded to the internet. The fact that training data comes for free does not mean it comes without labels. Every sentence on the internet was written by a human being with intent, knowledge, and communicative purpose. “Suppose we had no internet,” Xie poses. “Could you still train a language model? Suppose you had no books?” The knowledge upload that makes language model training possible is itself a supervision construction process. By Sutton’s own criterion, LLM training is not self-supervised learning from raw experience. It is strongly supervised learning on humanity’s curated output.

From this, Xie builds a hierarchy. Language models operate in label space: they model the probability distribution over human-generated tokens. This is the least “Bitter Lesson” level, because the representation is entirely human-designed.

Video generation models take one step further. They model the probability of pixels conditioned on language. This requires more: the system must learn that a four-legged cat is more probable than a three-legged one, that a smooth running motion is more likely than a hallucinated limb. “You need to know something about the world to assign these likelihoods,” Xie acknowledges. Video models carry more information than language models. But pixels, too, are a human-designed interface: a regular grid of values, 8 bits per channel, structured for human viewing.

The true Bitter Lesson, in Xie’s framing, requires going further still. Beyond language, beyond pixels, to learned representations that are not designed for human consumption at all. A world model’s core should not be something you can watch on a screen. It should be something the system uses internally to predict, plan, and act. “Why do I need to show it to a person?” Xie asks. “Rendering is the interface. It is not the core.”

The Predictive Brain

What does AMI Labs propose to build instead?

Xie describes it as a “predictive brain.” Not a world simulator that renders video. Not a 3D asset generator. Not a chatbot with vision. A system whose core is representation learning for physical world understanding: learning the internal structure of reality well enough to predict the consequences of actions, plan over long horizons, and reason about causes and effects.

In this architecture, language does not disappear. It becomes a module, an interface for communication with humans, not the foundation on which everything else is built. Visual generation becomes another module: a decoder from the internal representation to pixels, useful when humans need to see what the system is thinking, but not the system’s primary mode of operation. Action planning becomes a third decoder: from representation to motor commands.

“Representation is the most important part of a world model,” Xie says. “It is not the whole thing. But it is the most important part.” Once you have a strong enough representation, you can decode it into language, into pixels, into actions. The foundation is the representation. Everything else is interface.

This replaces the current paradigm at its root. Today, language is the foundation and everything else is built on top of it. Xie argues the foundation should be a learned representation of the physical world, and language, pixels, and actions are all interfaces decoded from it.

The Landscape

Xie is careful to acknowledge that everyone in the field is heading toward world models. He is also wary of the term becoming a buzzword. The one thing he does like about it, he says, is something Berkeley professor Jitendra Malik once told him: “The only reason I like ‘world model’ is that it tells people I’m building a World Model, not a Word Model.” Word model, Xie adds, is just another name for LLM.

Still, he sees the convergence as real. “World model is not a technique,” he says. “It is not an algorithm. It is a purpose. Everyone, whether doing LLM or video diffusion or Gaussian Splatting, is on the road to world models. In a year or two, all these arguments will seem absurd, because we are all heading in the same direction.”

But the paths, right now, are very different. Video generation companies like OpenAI, Google DeepMind, and Runway are building world simulators that render increasingly realistic scenes. World Labs, founded by Fei-Fei Li, is building spatial 3D representations, what Xie calls “a frontend, an asset interface.” The LLM labs treat language as an implicit world model, routing all understanding through text. AMI is betting on learned representations that are none of these: not pixels, not language, not 3D assets, but an abstract latent space optimized for prediction and action.

There is also another path worth noting. Moonlake, a startup founded by Fan-Yun Sun and Sharon Lee, backed by AIX Ventures where Chris Manning is General Partner, with Ian Goodfellow as angel investor and advisor, recently argued that symbolic abstractions, code and language, are the most efficient representations for world models. Their reasoning: abstractions focus computational capacity on the aspects of the world that matter for decision-making, rather than wasting it on pixel-level details.

Xie’s response to this path is implicit but clear. If language is poison, then building your world model on language is drinking more of it. Moonlake and AMI agree that pixels are inefficient. They disagree about what should replace them. Moonlake says: human-designed symbols. AMI says: learned representations that owe nothing to human cognitive tools.

The debate, stripped to its essentials, has five positions. Predict in pixel space. Predict in explicit 3D. Predict through language. Predict in symbol space. Predict in learned representation space. Each has funded companies behind it. Each has credentialed researchers making the case. None has yet proven the others wrong.

The Stakes

A billion dollars is not an argument. It is a price tag on a conviction.

The conviction behind AMI Labs is that the language-centric paradigm, for all its commercial success, is structurally incapable of producing systems that understand the physical world. Not because language models are bad at what they do. They are extraordinary at what they do. But what they do is model the distribution of human communication, which is a lossy, purpose-built compression of reality, not reality itself.

If Xie is right, the consequences extend beyond architecture choices. The trillions of dollars flowing into language model infrastructure, the talent concentrated around language-first research, the benchmarks that measure intelligence by the standards of text, all of this is optimizing for the wrong target. Not wrong in the sense of useless. Wrong in the sense of insufficient. Language models will keep getting better at language tasks. But the gap between language intelligence and physical intelligence will not close by making language models larger. It will close by building something different.

Whether AMI’s “something different” works remains entirely unproven. JEPA, the architecture underlying LeCun’s vision, has produced early results but nothing approaching the commercial impact of language models. The $1.03 billion buys time and talent, not validation. The validation will come, if it comes, from systems that can predict, plan, and act in the physical world better than anything the language paradigm can produce.

Until then, what we have is a bet. The most expensive bet in European AI history, placed by researchers who helped build the systems they are now arguing against, on the proposition that the field’s most successful technology is also its most seductive trap.

Language is a poison, Xie says. Not because it is useless. Because it is so useful that it prevents you from building what you actually need.

Next: Part 2 explores what the Bitter Lesson actually says, and why Xie believes Silicon Valley has been reading it wrong.

Markus Schatzl

May 25

My intuition is this can be just about refining the degree of resolution, but probably never overcomes the limits of discrete space.

Carbon & Silicon

May 24

Great read - and exactly what LeCun has been saying for over a decade …he was laughed at for a while there. I’m a big fan. Love the article, looking forward to the series.

2 more comments...

Robonaissance

Discussion about this post

Ready for more?