World Models
Learning How The World Works
This is Chapter 13 of A Brief History of Artificial Intelligence.
In the spring of 2022, Yann LeCun stood before an audience at Meta’s AI research summit and made a claim that would have seemed strange just a few years earlier. The most celebrated language models in the world, he argued—GPT-3 and its successors, the systems capturing headlines and imaginations—were missing something fundamental.
“They don’t understand the world,” LeCun said. “They understand language about the world. That’s not the same thing.”
LeCun had spent four decades in AI research. He had pioneered convolutional neural networks in the 1980s, endured the winters when neural networks were dismissed, and emerged as one of deep learning’s founding figures—a Turing Award winner, the chief AI scientist at Meta. When he spoke, people listened. And what he was saying contradicted the prevailing narrative.
The narrative went like this: language models trained on enough text would learn everything they needed to know. Language, after all, describes the world. If you could predict language perfectly, you would implicitly understand everything language talks about—physics, psychology, causality, common sense. Scale the models big enough, train them on enough text, and understanding would emerge.
LeCun disagreed. Language, he argued, was a thin slice of reality. A child learns that objects fall when dropped not by reading about gravity but by dropping things. A baby understands that faces exist before they learn the word “face.” The world has structure—physical, causal, temporal—that exists independent of how we talk about it. And current AI systems weren’t learning that structure. They were learning to manipulate symbols that referred to it.
What AI needed, LeCun believed, was world models—systems that learned how the world actually worked, not just how people talked about how the world worked.
The Gap Between Words and Worlds
The distinction LeCun was drawing had deep roots in AI research.
Language models learn by prediction. Given a sequence of words, they predict what comes next. This is remarkably powerful—as earlier chapters have shown, prediction drives compression, and compression drives understanding. But the understanding is grounded in language, not in the world that language describes.
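To make the mechanism concrete, here is a toy sketch of next-word prediction: a bigram model built from raw counts. It is nothing like a modern neural language model in scale or architecture, and the tiny corpus is invented, but the training signal is the same: the text itself supplies the answer to "what comes next."

```python
# A toy next-word predictor: count which word follows which in a corpus.
# Real language models use neural networks over vast corpora, but the
# supervision is the same -- the text says what comes next.
from collections import Counter, defaultdict

corpus = "the ball falls . the ball falls . the cup falls . the ball bounces .".split()

# Count how often each word follows each other word.
following = defaultdict(Counter)
for current_word, next_word in zip(corpus, corpus[1:]):
    following[current_word][next_word] += 1

def predict_next(word):
    """Return the most likely next word and its estimated probability."""
    counts = following[word]
    total = sum(counts.values())
    best, n = counts.most_common(1)[0]
    return best, n / total

print(predict_next("ball"))   # ('falls', 0.66...), learned from the text alone
```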
Consider what happens when you ask a language model about physics. “If I drop a ball from a tall building, what happens?” The model will correctly say the ball falls. It has seen countless sentences describing falling objects. But does it understand why the ball falls? Does it have a model of gravity, of mass, of acceleration? Or does it simply know that sentences about dropped objects are typically followed by sentences about falling?
The question isn’t academic. It matters for what these systems can and cannot do.
A language model can describe how to ride a bicycle—”maintain balance by subtle steering adjustments, lean into turns, pedal smoothly.” But it has never felt the wobble of an unbalanced bike, never experienced the shift of weight that precedes a fall, never learned through thousands of small corrections what “balance” actually means. Its knowledge is linguistic, not embodied. It can generate text about bicycles without understanding bicycles.
This shows up in systematic failures. Language models struggle with physical reasoning—predicting what happens when objects collide, understanding how liquids flow, reasoning about mechanical systems. They struggle with counterfactuals—”what would have happened if X had been different?”—because counterfactual reasoning requires a model of how causes lead to effects, not just knowledge of what typically follows what. They struggle with novel situations that don’t resemble their training text.
The failures point to what’s missing: a model of the world itself.
What World Models Would Look Like
The idea of world models isn’t new. It traces back to the earliest days of AI.
In 1959, a researcher named Arthur Samuel described a program that learned to play checkers by playing thousands of games against itself. The program didn’t just learn which moves were good—it developed an internal representation of the game that let it evaluate positions it had never seen before. It had built a model of the checkers world.
In the 1980s, robotics researchers developed the concept of “internal models”—mental representations that let robots predict the consequences of their actions before taking them. A robot reaching for a cup doesn’t just blindly extend its arm. It simulates the reach internally, predicts where the arm will end up, and adjusts the plan if necessary. The internal model lets it think before it acts.
What would this look like for modern AI? LeCun has proposed a framework he calls JEPA—Joint Embedding Predictive Architecture. The idea is to learn representations of the world that support prediction not in the space of raw inputs (pixels, words) but in a more abstract space of meaning.
Imagine a system watching a video of a ball rolling toward a wall. A traditional model might try to predict the next frame of video—every pixel. But most of those pixels are irrelevant to what matters: the ball will bounce off the wall. A world model would learn to represent the scene at a higher level—ball, wall, trajectory, collision—and predict what happens in that abstract space. It would learn physics, not pixels.
The vision is compelling. Instead of predicting tokens, predict states of the world. Instead of learning linguistic patterns, learn causal structure. Instead of generating text about reality, build a simulation of reality itself.
Video and the Visual World
One path toward world models runs through video.
David Ha, a researcher who worked at Google Brain and later co-founded a company called Sakana AI, has been exploring this direction since the mid-2010s. Ha had an unusual background for an AI researcher—he’d worked as a derivatives trader in Tokyo before pivoting to machine learning, bringing a quantitative intuition about prediction and uncertainty. His insight was simple but profound: video is a window into the physical world. A model that can predict video—that can watch a scene and anticipate what happens next—must learn something about how the world works.
In 2018, Ha and Jürgen Schmidhuber published a paper called “World Models” that demonstrated the approach. They trained a neural network to play simple video games—racing games, obstacle avoidance games—not by learning good actions directly, but by first learning to predict what would happen in the game world. The network developed a compressed representation of the game—a world model—and then learned to act within that model.
The architecture had three parts. First, a vision system that compressed each frame into a small representation. Second, a model that predicted how that representation would change over time—the world model proper. Third, a controller that learned to take actions based on the current state and the predicted future. The system could imagine scenarios it had never seen, simulate the consequences of actions before taking them, even dream up entirely new game levels that followed the game’s internal logic.
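A minimal sketch of that three-part decomposition might look like the following. This is not Ha and Schmidhuber's code (their system used a variational autoencoder for vision and a recurrent mixture-density network for dynamics); the module sizes, the toy 64x64 observation, and the names here are all illustrative.

```python
# A minimal sketch of a vision/dynamics/controller world model.
# Sizes and architectures are placeholders, not the published system.
import torch
import torch.nn as nn

LATENT, ACTION = 32, 3

class Vision(nn.Module):
    """V: compress a raw frame into a small latent vector."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Flatten(), nn.Linear(64 * 64 * 3, 256), nn.ReLU(),
            nn.Linear(256, LATENT),
        )
    def forward(self, frame):
        return self.net(frame)

class Dynamics(nn.Module):
    """M: predict the next latent state from the current one and an action."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT + ACTION, 256), nn.ReLU(),
            nn.Linear(256, LATENT),
        )
    def forward(self, z, action):
        return self.net(torch.cat([z, action], dim=-1))

class Controller(nn.Module):
    """C: choose an action from the current latent state."""
    def __init__(self):
        super().__init__()
        self.net = nn.Linear(LATENT, ACTION)
    def forward(self, z):
        return torch.tanh(self.net(z))

def imagine(vision, dynamics, controller, first_frame, steps=10):
    """Roll the policy forward entirely inside the learned model:
    no further frames from the real game are needed."""
    z = vision(first_frame)
    trajectory = []
    for _ in range(steps):
        a = controller(z)
        z = dynamics(z, a)
        trajectory.append(z)
    return trajectory

frame = torch.rand(1, 3, 64, 64)           # one fake RGB observation
rollout = imagine(Vision(), Dynamics(), Controller(), frame)
print(len(rollout), rollout[0].shape)       # 10 imagined latent states
```

The `imagine` function is the point: once the dynamics model is trained, the controller can be evaluated and improved inside the model itself, which is the "training in imagination" described next.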
The results were striking in a specific way. The system could learn to play games almost entirely “in its imagination”—training within its learned world model rather than in the actual game environment. This was more than a trick. It suggested that if you had an accurate enough model of the world, you could reason about it without constantly interacting with it. You could think before you acted.
Scaling this to the real world is vastly harder. Real videos are infinitely more complex than simple games. Objects occlude each other, lighting changes moment to moment, cameras move, scenes contain countless details that matter in subtle ways. A ball rolling behind a couch doesn’t disappear—a good world model should track it, predict where it will emerge. Liquids don’t just flow; they splash, drip, pool in ways that depend on surface tension, viscosity, container shape. People don’t just move; they act with intention, respond to each other, have goals that shape their behavior.
But the vision remains: learn to predict video, and you learn how the visual world works.
Recent systems like OpenAI’s Sora, which generates video from text descriptions, hint at what’s possible. Sora can create videos that show plausible physics—objects fall realistically, liquids flow naturally, people move with human grace. Ask it for a video of a dog running on a beach, and you get waves that break properly, sand that kicks up under paws, a tail that wags with appropriate physics.
Critics point out that Sora also makes mistakes—sometimes limbs merge, sometimes objects pass through each other, sometimes physics breaks in subtle ways. But the mistakes may be less important than the successes. The system has learned enough about the world to generate videos that mostly make sense physically. It has learned something about how things move, how gravity works, how objects interact. The question is whether this is true understanding or sophisticated pattern matching—and whether, for practical purposes, the distinction matters.
Causality and Counterfactuals
World models aren’t just about physics. They’re about causality—understanding not just what happens, but why it happens and what would happen if things were different.
Judea Pearl, the computer scientist and philosopher who won the Turing Award for his work on probabilistic and causal reasoning, has argued that current AI systems are fundamentally limited because they can’t reason causally. They can find correlations in data—that certain symptoms predict certain diseases, that certain words follow certain other words—but they can’t distinguish correlation from causation. They can’t answer “why” questions or reason about interventions.
Consider a simple example. Data might show that people who carry lighters are more likely to develop lung cancer. A purely correlational system might conclude that lighters cause cancer. A causal reasoner would understand that carrying lighters doesn’t cause cancer—smoking causes both the lighter-carrying and the cancer. Smoking is the confounder, the hidden common cause; the lighter is merely a marker, not a cause.
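A small simulation makes the point concrete. The numbers below are invented for illustration: smoking drives both lighter-carrying and cancer, so a purely correlational view sees lighter-carriers at elevated risk even though the lighter itself does nothing.

```python
# Toy simulation of the lighter/smoking example (all rates are made up).
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

smoker = rng.random(n) < 0.3                             # 30% of people smoke
lighter = rng.random(n) < np.where(smoker, 0.8, 0.1)     # smokers carry lighters
cancer = rng.random(n) < np.where(smoker, 0.15, 0.01)    # smoking raises risk

# Naive correlation: lighter-carriers look far more at risk.
print(cancer[lighter].mean(), cancer[~lighter].mean())

# Condition on the true cause and the 'lighter effect' disappears:
# among smokers (and among non-smokers) the lighter makes no difference.
for s in (True, False):
    group = smoker == s
    print(s, cancer[group & lighter].mean(), cancer[group & ~lighter].mean())
```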
Causal reasoning matters because the world is causal. Actions have consequences. Interventions change outcomes. Planning requires predicting what will happen if we do something, not just what typically happens when certain conditions occur. A truly intelligent system needs to reason about causality, not just correlation.
World models that incorporate causal structure would be more powerful than current systems in several ways. They could answer counterfactual questions: “What would have happened if I had taken a different action?” They could plan more effectively: “What action will lead to the outcome I want?” They could learn from less data by understanding the underlying causal structure rather than just memorizing surface patterns.
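Continuing the same toy model (restated so the snippet runs on its own), the difference between observing and intervening can be computed directly: forcing everyone to carry a lighter, the intervention do(lighter = true), leaves the cancer rate at its baseline, while simply observing who carries a lighter makes lighters look dangerous. A planning system needs the first quantity, not the second.

```python
# Observation vs. intervention in the same made-up causal world.
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
smoker = rng.random(n) < 0.3
cancer = rng.random(n) < np.where(smoker, 0.15, 0.01)

# Observational world: lighters end up mostly in smokers' pockets.
lighter = rng.random(n) < np.where(smoker, 0.8, 0.1)
print("risk seen among lighter-carriers:", cancer[lighter].mean())

# Interventional world: do(lighter = True), hand everyone a lighter.
# The lighter sits downstream of the real cause, so nothing changes.
lighter_forced = np.ones(n, dtype=bool)
print("risk after the intervention:     ", cancer[lighter_forced].mean())
```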
Building such systems remains a frontier challenge. Current language models learn correlations from text—”when X, usually Y”—but text rarely contains explicit causal information. The world contains causal structure, but that structure isn’t written down. Learning it may require different kinds of data, different architectures, different training approaches than current systems use.
Self-Supervised Learning and the Future
How do you learn a world model? You can’t easily label the “correct” model of the world—unlike image classification, where you can annotate pictures with their contents, world models aren’t something humans can straightforwardly supervise. No one can hand-label the laws of physics that govern a video, the causal structure underlying a scene, the intuitive psychology of people walking down a street.
The answer may be self-supervised learning—learning from the structure of data itself, without explicit labels.
Language models are self-supervised. They learn by predicting masked or future words, using the text itself as training signal. No human needs to label each prediction as correct or incorrect; the text provides its own supervision. This is what enabled training on billions of words of internet text without any manual annotation—a scale of learning that would be impossible if humans had to label each example.
The same principle could apply to world models. Learn by predicting what happens next in videos. Learn by predicting the consequences of actions in simulations. Learn by predicting how scenes change over time—how a pushed object moves, how a dropped ball falls, how a walking person continues their path. The world itself provides the supervision. The question is whether prediction of sensory data is enough to learn the underlying structure.
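Here is a tiny example of the idea, using explicit state (the height and velocity of a falling ball) as a stand-in for video. The training pairs are harvested from the sequence itself, with no human labels, and a simple least-squares predictor recovers the dynamics well enough to roll the future forward. A real system would have to extract such state from pixels, which is the hard part.

```python
# Self-supervision from a physical sequence: each state is the training
# target for the state before it. No annotation anywhere.
import numpy as np

g, dt = 9.8, 0.05
states = [np.array([100.0, 0.0])]                # [height, velocity] of a ball
for _ in range(80):
    y, v = states[-1]
    states.append(np.array([y + v * dt, v - g * dt]))
states = np.stack(states)

# The sequence supervises itself: pair each state with its successor.
X = np.hstack([states[:-1], np.ones((len(states) - 1, 1))])   # add a bias term
Y = states[1:]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)        # linear next-state predictor

# Roll the learned model forward from the first state only.
s = states[0]
for _ in range(80):
    s = np.append(s, 1.0) @ W
print("predicted final state:", s)
print("true final state:     ", states[-1])
```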
LeCun’s JEPA architecture is designed for exactly this kind of self-supervised learning from video and other sensory data. The key insight is that predicting raw pixels is both too hard and too easy. Too hard because most pixels are irrelevant noise—the exact texture of a wall doesn’t matter for understanding that a ball will bounce off it. Too easy because you can predict many pixels by just copying the previous frame—a simple strategy that doesn’t require understanding.
JEPA instead learns to predict in representation space. The system learns compressed representations of scenes, then predicts how those representations will change. It learns to predict what matters, not what’s merely present. A ball rolling toward a wall gets represented in terms of trajectory, velocity, impending collision—the abstract features that determine what happens next. The prediction happens in this abstract space, forcing the system to learn structure rather than just copying pixels.
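In code, the core idea might be sketched as follows. This is a schematic in the spirit of JEPA, not LeCun's actual architecture: published systems such as I-JEPA and V-JEPA add masking strategies, momentum-updated target encoders, and regularization against representational collapse, all omitted here, and the shapes and modules are illustrative.

```python
# Schematic of prediction in representation space: encode two frames,
# predict the second frame's embedding from the first, and measure the
# error in embedding space rather than in pixels.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED = 128

encoder = nn.Sequential(              # maps a frame to an abstract representation
    nn.Flatten(), nn.Linear(3 * 64 * 64, 512), nn.ReLU(), nn.Linear(512, EMBED)
)
predictor = nn.Sequential(            # predicts the next representation
    nn.Linear(EMBED, 512), nn.ReLU(), nn.Linear(512, EMBED)
)

def jepa_style_loss(frame_t, frame_next):
    """Predict the *representation* of the next frame, not its pixels."""
    z_t = encoder(frame_t)
    with torch.no_grad():             # treat the target representation as fixed;
        z_next = encoder(frame_next)  # real systems use EMA target encoders or
                                      # variance regularization to avoid collapse
    return F.mse_loss(predictor(z_t), z_next)

# One illustrative training step on fake video frame pairs.
frames = torch.rand(8, 2, 3, 64, 64)  # batch of (frame_t, frame_t+1) pairs
params = list(encoder.parameters()) + list(predictor.parameters())
optimizer = torch.optim.Adam(params, lr=1e-4)

optimizer.zero_grad()
loss = jepa_style_loss(frames[:, 0], frames[:, 1])
loss.backward()
optimizer.step()
print(float(loss))
```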
The vision is ambitious. A system that watches enough video of the world might learn intuitive physics—how objects move, how they interact, how forces work. It might learn intuitive psychology—how people behave, what they’re likely to do, what their gestures and expressions reveal about their intentions. It might learn intuitive biology—how animals move differently than inanimate objects, how plants sway in wind, how living things are distinguished from mere matter. Not by reading about these things, but by observing them.
Such a system would have what current AI lacks: grounded understanding. Its knowledge wouldn’t come from language about the world but from the world itself. It could reason about physics because it had learned physics from observation, not from descriptions. It could predict consequences because it had seen how causes lead to effects, not just read about causality. The gap between words and worlds would close.
The Path Ahead
World models remain more vision than reality. The systems that exist today—language models, image generators, video predictors—are impressive but limited. They learn patterns from data without necessarily learning the deep structure underlying those patterns. They predict what comes next without understanding why.
But the direction seems clear. Language models showed that prediction is powerful—that learning to predict text creates systems that can answer questions, write code, and engage in conversation. World models extend this insight: learning to predict the world might create systems that understand the world.
The challenges are substantial. Video is harder to predict than text. Causal structure is harder to learn than correlations. Grounded understanding may require embodiment—learning from interaction with the world, not just observation of it. The path from current systems to genuine world models may be long.
Yet the potential is profound. A system with an accurate world model could simulate consequences before acting, plan effectively in novel situations, reason about counterfactuals, and understand causality. It would have something current AI systems lack: a grasp of how reality works.
LeCun, who has been right about many things and wrong about a few, believes this is where AI must go. Language models, for all their achievements, are a detour—useful, impressive, but ultimately limited by their foundation in words rather than worlds. The next revolution, if it comes, will be systems that model reality itself.
Predicting tokens taught systems language. Predicting the world might teach them understanding.
Notes and Further Reading
On LeCun and World Models
Yann LeCun has articulated his vision for world models in various talks, papers, and social media posts since 2022. His JEPA (Joint Embedding Predictive Architecture) proposal is detailed in technical papers available on arXiv. The core argument—that language models learn about language, not about the world—has sparked significant debate within the AI research community.
On Ha and Schmidhuber’s World Models
The 2018 paper “World Models” by David Ha and Jürgen Schmidhuber demonstrated learning world models for simple video games. The work showed how systems could learn to imagine scenarios and plan within learned models. Ha’s subsequent work has continued exploring these directions.
On Causality and Pearl
Judea Pearl’s “The Book of Why” (2018) provides an accessible introduction to causal reasoning and its importance for AI. Pearl argues that current machine learning is limited to pattern recognition and cannot perform genuine causal inference. The debate about whether and how AI systems can learn causal structure remains active.
On Video Generation and Prediction
OpenAI’s Sora and similar video generation systems represent one path toward learning about the physical world. While these systems can generate plausible videos, questions remain about whether they truly understand the physics they depict or merely reproduce statistical patterns from training data.
On Self-Supervised Learning
Self-supervised learning has transformed natural language processing and is increasingly influential in computer vision. The key insight—that data can provide its own supervision—may be essential for learning world models, since explicit supervision of “world understanding” is difficult to provide.