Language Is Poison, Part 2: The Bitter Lesson, Revisited
The essay that became Silicon Valley’s gospel may not say what the industry thinks it says. And the technology it supposedly validates may be a counterexample.
This deep dive series explores a seven-hour interview with Saining Xie hosted by Xiaojun Zhang in February 2026, published in Chinese. All quotes are translated unless otherwise noted.
In March 2019, Rich Sutton, one of the founders of reinforcement learning, published a short essay on his personal website. Titled “The Bitter Lesson,” it ran to barely a thousand words. It made one argument: that the history of AI research teaches a single, painful lesson. Methods that leverage computation, through search and learning, always eventually beat methods that leverage human knowledge, through clever hand-designed features and rules. Researchers who encode their domain expertise into systems feel the satisfaction of elegant engineering, but in the long run, brute-force approaches that simply scale with compute win.
Sutton drew on examples from chess, Go, speech recognition, and computer vision. In each case, systems built on human expertise were eventually surpassed by systems that learned from data at scale. The lesson was bitter because it told researchers that their hard-won insights mattered less than they thought. Scale and general methods were what counted.
The essay became, in Saining Xie’s words, “a kind of scripture” for Silicon Valley. It provided the intellectual foundation for the scaling paradigm: build bigger models, train on more data, and the capabilities will come. Combined with the empirical observation of scaling laws, which showed predictable improvements in language model performance as compute increased, the Bitter Lesson became the justification for pouring hundreds of billions of dollars into GPU clusters and data centers. Scale is all you need. The lesson says so.
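The scaling laws referenced here are the empirical power-law fits first reported for language models by Kaplan et al. (2020): held-out loss falls smoothly as model size N, dataset size D, or training compute C grows. A sketch of their general form, with the critical scales and exponents treated as fitted constants:

```latex
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) \approx \left(\tfrac{C_c}{C}\right)^{\alpha_C}
```

The fitted exponents are small, roughly in the 0.05 to 0.1 range in the original study, which is why progress looks so predictable: each order of magnitude of additional compute buys a modest, reliable drop in loss.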
Xie thinks this reading is wrong. Not slightly wrong. Structurally wrong. “I completely do not see Large Language Models as a demonstration of the Bitter Lesson,” he says. “In a sense, LLMs are anti-Bitter Lesson.”
The Hidden Supervision
The argument turns on what counts as human knowledge.
Sutton’s essay is clear: the bitter lesson is that we should minimize the injection of human knowledge into AI systems and maximize the use of search and learning. Hand-designed features lose. Learned representations win. The less human structure you impose, the better.
Now consider language. What is it? Xie’s answer: language is the product of thousands of years of human civilization, an intricate system evolved through social and individual processes to encode, compress, and communicate knowledge about the world. Every sentence ever written is a human artifact. Every book, every article, every forum post represents a human being processing their experience and encoding it in a symbolic system designed for communication.
The internet, Xie observes, is the largest knowledge upload in history. Humanity spent millennia producing language, then spent a few decades uploading all of it to servers. For language model researchers, this trove arrives free of charge. No annotation is needed. No labels are required. The data is just there.
But free is not the same as unlabeled. Xie poses a thought experiment: “Suppose we had no internet. Could you still train a language model? Suppose you had no books.” The answer is no. The knowledge that makes language model training possible is not a natural phenomenon like sunlight or gravity. It was constructed by human beings, sentence by sentence, over centuries. The upload process, writing, is itself a supervision construction process.
This leads to a reclassification. In standard machine learning terminology, supervised learning means training on data that has been labeled by humans. Self-supervised learning means training on unlabeled data by predicting parts of the input from other parts. Language model training looks like self-supervision: predict the next token from the preceding tokens, no external labels required. But Xie argues that the tokens themselves are labels. They are the output of human cognition, compressed into symbolic form. The entire dataset is one vast human annotation of reality, produced over millennia at staggering cost, then delivered to researchers for free.
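The reclassification is easy to see mechanically. A minimal sketch (toy whitespace tokenization, not a real subword tokenizer): next-token training simply pairs each human-written token with the human-written tokens that precede it, so every "label" is itself a piece of human output.

```python
# Toy illustration: next-token prediction turns raw text into
# (context, target) pairs with no annotation step -- yet every target
# is a token a human chose to write. The labels are the data, shifted by one.
def next_token_pairs(text: str):
    tokens = text.split()  # toy tokenizer; real models use learned subwords
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

pairs = next_token_pairs("the dropped cup will shatter")
for context, target in pairs:
    print(context, "->", target)
```

Nothing in this pipeline requires an annotator, which is why it is called self-supervised. But every target on the right-hand side was placed there by a human making a communicative choice, which is Xie's point.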
“Many people doing NLP actually agree with this view,” Xie notes. “Language model training is not a self-supervised learning process. It is a strongly supervised learning process.”
If this reclassification holds, the implications for the Bitter Lesson are significant. Sutton’s lesson says to minimize human knowledge and maximize learning. Language models maximize human knowledge. Their training data is nothing but human knowledge. The fact that no one had to manually label each training example does not change the nature of what was learned from. It changes only the cost.
The x-Space and the y-Space
Xie develops this point with a formal analogy.
In machine learning, you have an input space x and a target space y. A model learns a mapping from x to y. The y-space is where supervision lives: it defines what the model should produce. The x-space is where the data lives: it represents the raw world the model observes.
Language models, Xie argues, operate entirely in the y-space. They model the distribution of human-generated tokens. Tokens are the output of human cognition: they represent what humans chose to say, which is a tiny, heavily filtered subset of what exists. A language model learns the patterns of this filtered output. It becomes expert at predicting what a human would say next, given what other humans have said before. This is powerful, useful, and commercially valuable. But it does not model the world. It models the record that humans made of the world.
The x-space, the raw sensory data from which humans built their understanding, barely enters the picture. A language model has never seen a cup fall. It has read thousands of descriptions of cups falling. It knows the word “shatter” follows “dropped” with high probability. It does not know the dynamics of fracture, the relationship between impact velocity and fragmentation pattern, or why a ceramic cup breaks differently from a plastic one. That information was discarded in the y-space compression.
“This is not enough to represent the whole world,” Xie says. “There are many things you cannot describe or characterize through language.”
What Vision Cares About
If language models operate in the y-space of human supervision, what does the x-space look like? Xie’s answer comes through the lens of vision.
Computer vision has never had a scaling law in the way language has. Xie describes a period of genuine despair among vision researchers: “We were desperate. Vision just never had a Scaling Law.” Video diffusion models show some scaling behavior, he acknowledges. More data and more compute do improve video quality. But the improvement does not follow the smooth, predictable curves that language models exhibit.
Xie now believes this is not a failure of vision research. It is a feature of the problem. “I increasingly believe that vision does not need a Scaling Law,” he says. “Because what vision cares about and what language cares about are completely different.”
The reasoning is precise. Language model scaling works because the supervision signal, text, scales linearly with data. More text means more tokens means more training signal. The signal is dense: every token is a prediction target, and every token was placed by a human being making a communicative choice. This density is what makes scaling laws smooth.
Vision does not have this property. A frame of video is not a sequence of human-placed tokens. It is a high-dimensional continuous signal generated by the physics of light interacting with matter. The “supervision” in the visual signal is not human-curated. It is a byproduct of physical law. Some of it is informative for understanding physics. Much of it is redundant: slight variations in lighting, texture, perspective that carry no new information about how the world works.
Scaling more video data does not proportionally increase the amount of useful physical information. You get more pixels, but not proportionally more physics. The relationship between data volume and capability is not smooth. It is lumpy, dependent on whether the new data happens to contain novel physical phenomena or merely more instances of phenomena the model has already seen.
This is why, Xie suggests, the scaling narrative “has water in it.” The phrase, borrowed from Chinese business language, means something like “inflated” or “contains hot air.” The scaling laws observed for language models reflect the specific properties of language data: dense, human-curated, and structured for communication. Extrapolating these scaling laws to other modalities, and treating them as universal laws of AI progress, assumes that all data has these properties. It does not.
Sutton Versus Sutton
Xie’s most unexpected move is to invoke Sutton against himself. Not the Sutton of the Bitter Lesson, but the Sutton of Dyna.
In his 1991 Dyna paper, Sutton proposed that intelligent agents should learn a model of their environment and use it to simulate possible futures. Pure reinforcement learning, Sutton argued, is “primitive”: a model-free algorithm that learns only through direct interaction with the environment. A better system would have a world model, a predictive function that could forecast the consequences of actions without needing to execute them. Planning, the ability to search through possible futures using the world model, would give the agent capabilities that trial-and-error learning alone could never match.
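Dyna's loop can be sketched in a few lines. Below is a minimal tabular Dyna-Q sketch, the textbook descendant of the 1991 architecture, on a toy deterministic chain; the environment, chain length, and hyperparameters are illustrative assumptions, not anything from the essay or interview. Each real transition feeds two things: a direct value update, and a learned world model that is then replayed for extra planning updates.

```python
import random

def dyna_q(n_states=5, episodes=30, planning_steps=10,
           alpha=0.5, gamma=0.95, epsilon=0.1, seed=0):
    """Tabular Dyna-Q on a deterministic chain: start at state 0,
    reward 1.0 only for reaching the rightmost state."""
    rng = random.Random(seed)
    actions = (+1, -1)                        # step right / step left
    Q = {(s, a): 0.0 for s in range(n_states) for a in actions}
    model = {}                                # learned model: (s, a) -> (r, s')

    def env_step(s, a):
        s2 = min(max(s + a, 0), n_states - 1)
        return (1.0 if s2 == n_states - 1 else 0.0), s2

    def update(s, a, r, s2):
        best_next = max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

    for _ in range(episodes):
        s = 0
        while s != n_states - 1:
            a = (rng.choice(actions) if rng.random() < epsilon
                 else max(actions, key=lambda b: Q[(s, b)]))
            r, s2 = env_step(s, a)
            update(s, a, r, s2)               # direct RL from real experience
            model[(s, a)] = (r, s2)           # world-model learning
            for _ in range(planning_steps):   # planning on simulated experience
                (ps, pa), (pr, ps2) = rng.choice(list(model.items()))
                update(ps, pa, pr, ps2)
            s = s2
    return Q

Q = dyna_q()
policy = {s: max((+1, -1), key=lambda a: Q[(s, a)]) for s in range(4)}
print(policy)  # the greedy policy should step right, toward the reward
```

The planning loop is what distinguishes Dyna from plain Q-learning: the agent squeezes many value updates out of each real transition by rehearsing against its own model, which is exactly the efficiency argument Sutton makes against "primitive" model-free learning.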
Xie sees a tension between Dyna-Sutton and Bitter-Lesson-Sutton. The Bitter Lesson says: minimize human knowledge, maximize search and learning. Dyna says: build a model, use it to plan, and you will learn more efficiently than an agent that relies on raw experience alone. The first Sutton distrusts structure. The second Sutton says structure, in the form of a world model, is essential.
Xie resolves this tension by arguing that the Bitter Lesson, properly understood, does not say “no structure.” It says “no human-designed structure.” A world model that is learned from data, rather than hand-coded by engineers, is consistent with the Bitter Lesson. A world model built on human language is not.
“Language is a human product, an extremely clever product, with elegant design,” Xie says. “But it is all human knowledge. Not just some of it. All of it.” A language model does not learn its own representations. It inherits humanity’s representations. A vision model that learns to predict in an abstract latent space, without passing through language or pixels, would be closer to what the Bitter Lesson actually calls for: general learning methods applied to raw experience, with minimal human structure imposed.
The Hierarchy
Xie assembles these arguments into a hierarchy that reframes the entire AI landscape in terms of the Bitter Lesson.
At the first level, language models. They model P(y): the probability distribution over human-generated symbols. All structure is inherited from human language. Least Bitter Lesson.
At the second level, video generation models. They model P(x|y): the probability of pixel data conditioned on language. This requires more than language knowledge. To assign correct likelihoods to visual scenes, the model must learn that four-legged cats are more probable than three-legged ones, that a smooth running gait is more likely than one with hallucinated extra limbs. This carries real information about the world. More Bitter Lesson than language models, but the representation, pixels, is still human-designed: a regular grid, 8 bits per channel, structured for human viewing. And the conditioning on language preserves the y-space dependency.
At the third level, learned representation models. They model neither words nor pixels. They learn latent representations optimized for prediction and action, in formats that are not designed for human consumption. The pixel grid disappears. The token vocabulary disappears. What remains is whatever structure the learning algorithm discovers in the raw data. “Why do I need to show it to a person?” Xie asks. “Rendering is the interface. It is not the core.” This is, in Xie’s framing, the true Bitter Lesson: a system that learns its own representations from sensory experience, with minimal human structure imposed.
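The three levels can be written compactly in the x/y notation from above, with z denoting a learned latent; the level-3 objective is one illustrative form (an encoder f, a predictor g, a target encoder with frozen-copy parameters, and a distance d), not any specific system's loss:

```latex
\text{Level 1 (language models):}\quad
p_\theta(y) = \prod_{t} p_\theta(y_t \mid y_{<t})
\qquad \text{(tokens only; no } x\text{)}

\text{Level 2 (video generation):}\quad
p_\theta(x \mid y)
\qquad \text{(pixels, conditioned on language)}

\text{Level 3 (learned representations):}\quad
z_t = f_\theta(x_t), \qquad
\mathcal{L} = d\big(g_\theta(z_t),\, f_{\bar{\theta}}(x_{t+1})\big)
\qquad \text{(prediction in latent space)}
```

Only at level 3 does human-designed structure drop out of both the input representation and the target: the model predicts its own encodings of raw experience rather than human symbols or human-formatted pixels.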
The hierarchy clarifies what Xie means when he says LLMs are “anti-Bitter Lesson.” He is not saying language models are bad. He is saying they sit at the wrong end of a spectrum that Sutton himself defined. The lesson says to reduce human knowledge. Language models are built entirely on human knowledge. The lesson says to let learning find its own representations. Language models learn entirely within a representation that humanity designed: language itself, with all its structure, conventions, and built-in trade-offs.
The Cost of Being Right
Even if Xie’s reading of the Bitter Lesson is correct, it does not guarantee that his preferred approach will work in practice. The history of AI contains plenty of theories that were elegant, philosophically consistent, and technically sound, yet lost to less elegant approaches that simply scaled better. The Bitter Lesson itself documents this pattern.
Language models, for all their theoretical limitations, have delivered extraordinary practical value. They work. They scale. They improve predictably. These are not small virtues. The entire infrastructure of modern AI, from chips to clouds to applications, has been built around them.
AMI’s bet is that this infrastructure is solving the wrong problem. That the scaling laws driving investment are specific to language data and will not transfer to physical understanding. That the Bitter Lesson, read carefully, actually supports their position, not Silicon Valley’s.
Whether they are right will be determined not by arguments but by results. JEPA-based systems that demonstrably understand physics better than language models would vindicate the position. JEPA-based systems that remain research curiosities while language models continue to absorb new modalities and capabilities would refute it.
For now, what Xie has provided is something valuable regardless of the outcome: a precise, technically grounded counter-reading of the essay that an entire industry has treated as settled truth. The Bitter Lesson may be right. But Silicon Valley may be reading it wrong.
Next: Part 3 examines why “world model” means something different to every company claiming to build one, and what the landscape looks like when you map the disagreements.