The LeCun Bet
A billion dollars on a single question: does the order in which a machine learns determine how well it understands?
In November 2025, Yann LeCun posted a short message on LinkedIn confirming what had been rumored for weeks. After twelve years at Meta, five as founding director of FAIR and seven as chief AI scientist, he was leaving to start his own company. The post was gracious. He thanked Zuckerberg, Bosworth, Cox, Schroepfer. He announced that Meta would partner with his new venture. He described his goal in a single sentence: systems that understand the physical world, have persistent memory, can reason, and can plan complex action sequences.
What the post did not say was more revealing than what it did. It did not say the words “large language model.” It did not mention chatbots, or assistants, or any of the products consuming the vast majority of Meta’s AI budget. And it did not mention Alexandr Wang, the twenty-eight-year-old former CEO of Scale AI whom Zuckerberg had installed as Meta’s new chief AI officer six months earlier, to whom LeCun now reported.
The backstory, as it emerged in subsequent reporting, was a story of institutional divergence. Zuckerberg had poured $14 billion into Scale AI and hired Wang to lead a new Superintelligence Labs focused on LLMs. The talent recruited into the new division, LeCun later told the Financial Times, was “completely LLM-pilled.” When Meta’s Llama 4 model launched to widespread criticism, LeCun blamed the pressure to ship safe, proven approaches rather than implement the new ideas FAIR had developed. FAIR itself had been losing researchers and compute for years. Former employees described it as dying a slow death.
LeCun’s departure was not a retirement. It was a declaration. He was taking his forty-year conviction about how intelligence works and betting everything on it.
The Forty-Year Thread
To understand what LeCun is building at AMI Labs, you have to understand what he has always believed. The through-line is representation.
In the late 1980s, a young LeCun developed convolutional neural networks at Bell Labs, architectures that learned to represent visual patterns in hierarchical layers of increasing abstraction. Edges become textures become objects. The key insight was not the specific architecture. It was the principle: if you get the representation right, the downstream task becomes easy. A good representation compresses away the irrelevant and preserves the essential.
This principle has guided every major turn in his career. At NYU, his lab pushed self-supervised learning methods years before they became fashionable. At FAIR, he championed open research and representation-first approaches while the rest of the industry raced to scale language models. His public skepticism of LLMs, which grew louder from 2019 onward, was never about whether they were useful. It was about whether they could ever produce genuine understanding. LLMs predict the next token. They are trained to be linguistically plausible. LeCun argued that this training signal, no matter how much data you pour into it, cannot produce a system that understands why a ball falls or how a door opens. It can only produce a system that knows what words typically follow other words about balls falling and doors opening.
In June 2022, he wrote it all down. “A Path Towards Autonomous Machine Intelligence” was not a typical research paper. It was a position paper, a blueprint for the kind of AI he believed the field should be building instead. At its center was an architecture called JEPA: Joint Embedding Predictive Architecture.
What JEPA Actually Is
The easiest way to misunderstand JEPA is to compare it too casually to what already exists. It operates in latent space. So does Dreamer. It is trained on massive video data. So are multimodal LLMs. It can be adapted to robot tasks. So can many foundation models. If the distinctions are only architectural details, then AMI Labs is just another foundation model company with a different training objective. The billion-dollar question is whether the distinctions are fundamental.
They come in three layers.
Layer one: what the model is asked to predict. A generative model, whether it generates text tokens or video pixels, is asked to reconstruct the input. An LLM predicts the next word. A video generation model predicts the next frame. Both must allocate compute to every detail of their prediction, including details that are inherently unpredictable: the exact rustle of a leaf, the precise pattern of light on water, the specific fold of a crumpled cloth. LeCun’s argument is that this is not just inefficient. It is conceptually wrong. A system that spends capacity predicting irrelevant visual noise is learning the wrong thing about the world.
JEPA does not predict inputs. It predicts representations. The model masks a region of space-time in a video, encodes the visible context into a representation, and then predicts the representation of the masked region. The encoder’s information bottleneck collapses irrelevant variation by design. Two scenes with different lighting but the same causal structure produce the same representation. The model is trained to capture what is invariant across surface variation, which, LeCun argues, is precisely what “understanding” means.
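To make the objective concrete, here is a toy sketch of a JEPA-style training step in plain NumPy. Everything here is invented for illustration: the real V-JEPA uses ViT encoders, an EMA-updated target network, and space-time masking, not linear maps and mean pooling. The one faithful detail is the shape of the loss, which is computed between representations, never between pixels.

```python
# Toy sketch of a JEPA-style objective (illustrative only).
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_REP = 16, 4                       # patch dim, representation dim
W_ctx = rng.normal(size=(D_IN, D_REP))    # context encoder (trainable)
W_tgt = W_ctx.copy()                      # target encoder (EMA copy, frozen here)
W_pred = rng.normal(size=(D_REP, D_REP))  # predictor (trainable)

def jepa_loss(patches, mask):
    """Predict the representation of masked patches from visible context."""
    ctx = patches[~mask] @ W_ctx                 # encode visible patches
    ctx_summary = ctx.mean(axis=0)               # pool context into one vector
    pred = ctx_summary @ W_pred                  # predicted latent of masked region
    target = (patches[mask] @ W_tgt).mean(axis=0)  # actual latent (no gradient)
    return float(np.mean((pred - target) ** 2))  # loss lives in latent space

patches = rng.normal(size=(8, D_IN))             # 8 patches of a video clip
mask = np.array([False] * 6 + [True] * 2)        # hide the last two patches
loss = jepa_loss(patches, mask)                  # scalar latent-space error
```

Note what is absent: there is no decoder back to pixels, so unpredictable surface detail never enters the objective at all.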
Layer two: how the model relates to action. This is where the distinction from Dreamer matters most, and where it is easiest to get wrong.
Dreamer, the world model architecture developed by Danijar Hafner, also operates in latent space rather than pixel space. It also learns a compressed representation of the environment. But Dreamer couples representation learning, action conditioning, and reward from the very first training step. The model learns to represent whatever helps the agent score higher. It is an action-conditioned simulator: you feed it a state and a proposed action, it returns the predicted next state. The representation exists to serve the policy.
JEPA decouples representation from action. In the first phase, the model learns to represent the world through self-supervised prediction on video. No actions. No rewards. No tasks. Just the structure of physical reality as revealed by visual observation. Only after this representation is established does a second phase introduce action conditioning: a separate module learns to predict what happens when you act, using the frozen representation as its foundation.
LeCun’s full blueprint, described in the 2022 paper, does include action-conditioned planning. The complete architecture has a world model module that takes a state representation and an action and predicts the next state representation. So the mature system can answer “if I do X, what happens next,” just like Dreamer. The difference is when action enters the learning process. Dreamer says: understanding emerges from trying to act. JEPA says: understanding must be established first. Then action can use it.
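The two-phase recipe can be sketched in a few lines. All names, shapes, and the greedy one-step planner below are hypothetical; the sketch only shows the ordering the blueprint insists on: an encoder learned from passive video and then frozen, an action-conditioned predictor bolted on afterward, and planning that happens entirely in latent space.

```python
# Conceptual sketch of "representation first, action second" (not from any codebase).
import numpy as np

rng = np.random.default_rng(1)
D_OBS, D_REP, D_ACT = 32, 8, 3

# Phase 1 (already done): encoder learned from passive video, now frozen.
W_enc = rng.normal(size=(D_OBS, D_REP))
def encode(obs):
    return obs @ W_enc                 # frozen representation of an observation

# Phase 2: a separate predictor trained on a little robot data.
W_dyn = rng.normal(size=(D_REP + D_ACT, D_REP)) * 0.1
def predict_next(state_rep, action):
    """World model step: (state representation, action) -> next representation."""
    return np.concatenate([state_rep, action]) @ W_dyn

# Planning = search over candidate actions entirely in latent space.
def plan_greedy(obs, goal_obs, candidate_actions):
    s, g = encode(obs), encode(goal_obs)
    dists = [np.linalg.norm(predict_next(s, a) - g) for a in candidate_actions]
    return candidate_actions[int(np.argmin(dists))]  # action that gets closest

obs, goal = rng.normal(size=D_OBS), rng.normal(size=D_OBS)
actions = [rng.normal(size=D_ACT) for _ in range(16)]
best = plan_greedy(obs, goal, actions)
```

In Dreamer's coupled setup, by contrast, the equivalent of `W_enc` would be trained jointly with the predictor and a reward head from step one, so the representation is shaped by the task from the start.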
Layer three: hierarchical planning. LLMs generate token by token, sequentially. To evaluate a plan, you must roll out the entire sequence step by step. LeCun’s blueprint describes a hierarchical JEPA, H-JEPA, that makes predictions at multiple levels of abstraction simultaneously. High-level representations predict coarse future states over long time horizons. Low-level representations predict fine-grained states over short horizons. This enables planning at multiple scales without the cost of sequential rollout.
This third layer is the most ambitious and the least proven. It exists only in the blueprint. No implementation of H-JEPA has been published. It is the architectural equivalent of a load-bearing wall that has been drawn on the plans but not yet built.
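Purely to make the cost argument concrete, here is a toy step-count comparison under an invented stride. Since no H-JEPA implementation exists, this illustrates the blueprint's arithmetic and nothing more: a flat model must roll forward once per timestep, while a hierarchical one could cover most of the horizon in coarse jumps and refine only the near term.

```python
# Toy illustration of why multi-scale prediction would cut planning cost.
# Entirely conceptual: the stride and the "refine only the first segment"
# policy are invented assumptions, not a published H-JEPA design.
def flat_rollout_steps(horizon):
    return horizon                        # one model call per timestep

def hierarchical_rollout_steps(horizon, stride):
    coarse = horizon // stride            # long-horizon calls at the coarse scale
    fine = stride                         # fine-grained calls for the near term
    return coarse + fine

H = 256
flat = flat_rollout_steps(H)              # 256 sequential calls
hier = hierarchical_rollout_steps(H, 16)  # 16 coarse + 16 fine = 32 calls
```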
The Evidence So Far
V-JEPA 2, published in June 2025 while LeCun was still at Meta, is the most complete test of the first two layers.
Phase one: the model was pre-trained on over one million hours of internet video and one million images through self-supervised learning. No action labels, no reward signals. The result was a ViT-g encoder with roughly one billion parameters that achieved state-of-the-art performance on motion understanding benchmarks and human action anticipation tasks.
Phase two: the frozen encoder was paired with a new action-conditioned predictor, V-JEPA 2-AC, trained on just 62 hours of robot manipulation data from the open-source DROID dataset. Sixty-two hours. For comparison, Physical Intelligence’s pi-zero was trained on over 10,000 hours.

The results on robot tasks were striking. V-JEPA 2-AC was deployed zero-shot on Franka Panda arms in two labs that appeared nowhere in the training data, using an uncalibrated monocular camera. No task-specific training. No reward signal. No data collected from the deployment environment. The robot picked up a cup with 80% success. The baseline comparison, Octo, an open-source generalist robot policy, managed 15% on the same task. The efficiency comparison with NVIDIA’s Cosmos was equally stark: V-JEPA 2-AC planned each action in 16 seconds. Cosmos required four minutes.
These numbers support the first half of LeCun’s thesis: a representation learned through self-supervised video prediction can transfer to robot tasks with remarkably little robot-specific data. The decoupled approach, learning representation first and adding action second, appears to produce useful physical understanding from orders of magnitude less robot data than the Robot’s Road requires.
But the numbers come with caveats. The tasks were foundational: reaching, grasping, pick-and-place. Not the kind of sustained, multi-step manipulation that Physical Intelligence demonstrated with espresso machines and laundry folding. The comparison with Octo was fair but limited, since Octo is not the strongest baseline available. And the 62-hour claim, while impressive, describes only the action-conditioning phase. The representation was pre-trained on a million hours of video, a resource investment comparable to any frontier model.
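To make the deployment recipe concrete, here is a hypothetical sketch of how a frozen encoder plus an action-conditioned predictor can be turned into a zero-shot controller. The dynamics function, shapes, and the simple random-shooting loop below are stand-ins; the real system uses a learned predictor and a sampling-based optimizer, and the details differ. The structure is the point: encode a goal, roll candidate action sequences forward in latent space, execute the first action of the best sequence, then replan.

```python
# Sketch of goal-directed model-predictive control in latent space.
# All components here are invented stand-ins for illustration.
import numpy as np

rng = np.random.default_rng(2)
D_REP, D_ACT, HORIZON, N_SAMPLES = 8, 4, 5, 64

W_dyn = rng.normal(size=(D_REP + D_ACT, D_REP)) * 0.2
def step(s, a):
    """Stand-in latent dynamics: next representation from (state, action)."""
    return np.tanh(np.concatenate([s, a]) @ W_dyn)

def plan(s0, goal, n=N_SAMPLES, horizon=HORIZON):
    """Random-shooting MPC: sample action sequences, keep the best first action."""
    best_cost, best_first = np.inf, None
    for _ in range(n):
        seq = rng.normal(size=(horizon, D_ACT))
        s = s0
        for a in seq:
            s = step(s, a)                   # roll forward in latent space
        cost = np.linalg.norm(s - goal)      # distance to the goal's embedding
        if cost < best_cost:
            best_cost, best_first = cost, seq[0]
    return best_first                        # execute this, observe, replan

s0, goal = rng.normal(size=D_REP), rng.normal(size=D_REP)
first_action = plan(s0, goal)
```

Nothing in this loop needs a reward function or deployment-environment data, which is why the approach can, in principle, run zero-shot in a lab the model has never seen.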
The Billion-Dollar Bet
In March 2026, four months after leaving Meta, LeCun’s AMI Labs closed a $1.03 billion seed round at a $3.5 billion pre-money valuation. The round nearly doubled the 500 million euros initially sought. It was co-led by Cathay Innovation, Greycroft, Hiro Capital, HV Capital, and Bezos Expeditions, with NVIDIA and Temasek among the strategic backers.
NVIDIA’s participation is worth noting. The company that invested billions in the Physicist’s Road, building simulation infrastructure through Omniverse and Cosmos, is simultaneously betting on the Architect’s Road. This is not contradiction. It is portfolio hedging by the company best positioned to understand that no single approach to world models has won.
The team LeCun assembled reflects the bet’s character. Alexandre LeBrun, the CEO, came to world models from a different direction: as CEO of Nabla, a healthcare AI company, he encountered LLM hallucinations in clinical settings where errors could be lethal. His conclusion paralleled LeCun’s: generative models that optimize for plausibility rather than truth are structurally unsuited for high-stakes reasoning. Saining Xie, the chief science officer, led the development of ConvNeXt at Meta’s FAIR and co-created DiT. His expertise is in representation architectures, not reinforcement learning or robotics. Michael Rabbat, VP of world models, comes from Meta FAIR’s optimization and distributed learning group.
This team composition is itself a signal. The people LeCun chose to build AMI Labs are representation specialists. The question of who will build the action-conditioned planning layer, the part of the blueprint that connects understanding to doing, remains open.
LeBrun was candid about the timeline: AMI Labs starts with fundamental research, not products. It could take years for world models to go from theory to commercial applications. He also offered a prediction with a sharper edge: “world models” will become the next buzzword, and within six months every company will claim the label to raise funding.
The Hardest Questions
LeCun’s thesis is clear, internally consistent, and supported by early evidence. That does not mean it is right. The honest assessment requires confronting five specific challenges.
First, the controlled experiment that does not exist. V-JEPA 2-AC outperformed Octo and Cosmos on robot manipulation. But no published study compares it head-to-head with a generative multimodal model of equivalent scale on the same tasks with the same amount of robot fine-tuning data. The claim that joint embedding representations are qualitatively better for physical reasoning than generative representations has strong theoretical motivation and suggestive empirical results. It does not yet have a definitive experiment.
Second, the blueprint gap. V-JEPA 2 validates phase one: self-supervised representation learning from video produces useful physical understanding. V-JEPA 2-AC validates phase two: action conditioning can be added on top. But these are simple, flat architectures. The hierarchical JEPA that LeCun’s 2022 paper describes, the architecture that would enable multi-scale planning and abstract reasoning, has not been built. This is not a matter of engineering polish. It is the hardest part of the design. Every other tradition in the world model landscape is also struggling with exactly this problem: bridging perception and planning. LeCun’s blueprint says the bridge should be built on top of good representation. The representation is now in hand. The bridge is not.
Third, the question of “good enough.” LeCun’s bet requires that JEPA representations are not just different from generative representations, but qualitatively better for physical reasoning. If generative models trained on video produce representations that are good enough for robot control, even if theoretically suboptimal, then the coupling order does not matter in practice. AMI Labs would have a different training objective but no structural advantage. The bar is not “does JEPA work?” It is “does JEPA work so much better that the difference justifies building a separate company around it?”
Fourth, competitor velocity. Physical Intelligence is accumulating real robot deployment data every month. Google DeepMind is iterating across every road in the world model landscape. These organizations are building experience and datasets that compound over time. If JEPA’s representation advantage takes three years to manifest in a deployed system, the Robot’s Road may have accumulated enough deployment data and engineering expertise that the advantage no longer matters. The clock is running.
Fifth, the “just a foundation model” question. Strip away the architectural vocabulary, and V-JEPA 2’s workflow is: pre-train on massive video data, then fine-tune on a small amount of task-specific data. This is the same workflow that defines every foundation model in the industry. The training objective is different: joint embedding rather than generative. The question is whether this different training objective, by itself, constitutes a sufficient moat. If other labs can achieve comparable physical reasoning with generative pre-training and clever fine-tuning, then AMI Labs is competing on training recipe, not paradigm.
What Would Tell Us the Answer
The honest framing is not “is LeCun right?” That question is currently unanswerable. The useful framing is: what evidence would resolve it, and when should we expect that evidence?
If, within two years, JEPA representations transfer to complex robot manipulation tasks with significantly less robot-specific data than generative alternatives of comparable scale, LeCun is right. The coupling order matters. Understanding first, action second produces better physical intelligence per unit of robot data.
If JEPA representations perform comparably to generative alternatives on the same tasks given the same fine-tuning budget, the coupling order does not matter in practice. AMI Labs may still build a successful company, but not because of a paradigm difference.
If the hierarchical planning layer cannot be made to work, if the bridge from representation to multi-step action proves as hard for JEPA as for everyone else, then the blueprint fails at its most critical joint, regardless of how good the representation is.
LeCun has been proven right before when the consensus said he was wrong. Convolutional networks were dismissed for over a decade before they became the foundation of modern computer vision. His early advocacy for self-supervised learning preceded its current dominance by years. The pattern of his career is to be early, to be stubborn, and eventually to be vindicated. But the pattern is not a guarantee. Being right twice does not mean being right a third time.
What is certain is the scale of the bet. A billion dollars. Four global hubs. A team of representation specialists. And a single thesis: the order in which a machine learns determines how deeply it can understand. If that thesis holds, every other approach to world models is taking the long way around. If it does not, AMI Labs is a very expensive foundation model company with an unusual training objective.
The clock started in March 2026. The answer should arrive before the money runs out.
This article is part of Robonaissance's coverage of the world model research frontier. For the full landscape across all five research traditions in this field, and the intellectual history and technical foundations behind each, see the six-part series Roads to a Universal World Model.