The Robot’s “ChatGPT Moment”
How large language models gave robots common sense. The RT-2 and VLA breakthroughs at Google DeepMind.
Chapter 6 of A Brief History of Embodied Intelligence.
“I spilled my drink. Can you help?” — The instruction that changed everything
In the spring of 2023, in a kitchen at Google’s robotics lab in Mountain View, a robot faced a simple test.
The kitchen looked ordinary: counters, a sink, drawers, a small table cluttered with snacks. The robot was a mobile manipulator, essentially an arm mounted on a wheeled base, cameras where eyes might be. It had been trained to pick up objects and move them around. Nothing remarkable.
What was remarkable was the instruction.
“I spilled my drink. Can you help?”
The robot paused. Then it rolled to a counter, picked up a sponge, and brought it to the person who had spoken.
No one had trained the robot on spills. No one had written a rule connecting “spilled drink” to “sponge.” The robot had figured it out, not through trial and error, not through explicit programming, but through something that looked, for the first time, like understanding.
“It’s using common sense,” said Karol Hausman, a research scientist who had helped build the system. “It knows what humans know about spills and cleaning. Not because we taught it, but because it learned from language.”
The robot was called RT-2, and it represented something the field had been chasing for decades: a machine that could understand what you meant, not just what you said.
The Missing Ingredient
The last chapter ended with an unresolved problem: reinforcement learning could teach robots how to achieve goals, but it couldn’t tell them which goals were worth achieving. A robot could learn to set a table through millions of simulated attempts, but only if someone first specified exactly what “set a table” meant: which side the fork goes on, whether to fold the napkin, how close the glass should be to the plate.
Real human instructions aren’t like that. We say “clean up a bit” and expect common sense to fill in the gaps. We say “hand me that thing” and expect the listener to know which thing we mean from context. We say “I spilled my drink” and expect help without specifying what kind of help.
This gap, between what humans say and what they mean, had defeated robotics for decades. Traditional approaches tried to close it through more complete specifications: longer instruction manuals, more detailed rules, better planning systems. These efforts helped, but they couldn’t scale. The real world has too many situations, too many edge cases, too many things that “everyone knows” but no one writes down.
What robots needed was a way to access that unwritten knowledge. They needed common sense.
And then, in late 2022, the world discovered that large language models had been learning common sense all along.
The Strange Discovery
ChatGPT’s release in November 2022 surprised even its creators.
The model had been trained to predict the next word in a sequence. Nothing more. Feed it text, have it guess what comes next, adjust its parameters when it guesses wrong. Repeat trillions of times across a vast swath of the text on the internet. The training objective was almost absurdly simple.
But something strange happened along the way. To predict text well, the model had to learn patterns that went far beyond grammar and syntax. It had to learn that spills require cleaning, that birthdays involve cake, that “I’m exhausted” often precedes “I need coffee.” It had to learn the implicit structure of human knowledge, the common sense that permeates everything we write but is rarely stated explicitly.
No one had trained GPT on a dataset of “things everyone knows.” It had absorbed common sense as a byproduct of learning to predict language.
For roboticists, this was a revelation. The knowledge they had been trying to hand-code for decades (an understanding of objects, situations, and human intentions) already existed, encoded in the weights of language models trained on internet text. The question was whether that knowledge could be extracted and connected to physical action.
The Old Way
For decades, robot intelligence was assembled from separate pieces.
A vision system would look at the scene and produce labels: “red apple at coordinates (234, 156), coffee mug at (312, 203).” A language system would parse the instruction “hand me something to eat” and try to match it against the labels. A planning system would calculate a path for the arm to reach the identified object.
Each component was engineered separately. The vision system didn’t know what the robot was trying to do. The language system couldn’t see. They communicated through labels and coordinates, a game of telephone where meaning was lost at every handoff.
This architecture was brittle. If the vision system mislabeled an object, everything downstream failed. If the scene contained details the vision system didn’t think to report (the material of an object, whether it was damaged, how full a container was), that information was lost before reasoning even began.
And crucially, common sense couldn’t connect to perception. A language model might know that decorative plastic fruit isn’t edible, but if the vision system only reported “apple at coordinates (234, 156)” without mentioning it was plastic, that knowledge was useless. The modules couldn’t share understanding, only labels.
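To make the handoff problem concrete, here is a minimal sketch of such a modular pipeline. Every name and behavior below is hypothetical, invented purely for illustration; the point is the structure, in which each stage sees only the thin labels the previous stage chose to emit.

```python
# A minimal sketch of the old modular pipeline (all module names and
# behaviors are hypothetical). Each stage communicates only through thin
# labels, so any detail the vision stage omits is gone for good.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Detection:
    label: str   # "apple" -- but nothing about material, damage, or fullness
    x: int
    y: int

def vision_module(image) -> List[Detection]:
    # Stand-in for a real detector: returns canned labels and coordinates.
    return [Detection("apple", 234, 156), Detection("mug", 312, 203)]

def language_module(instruction: str, scene: List[Detection]) -> Optional[Detection]:
    # Keyword matching against labels is all the "reasoning" available here.
    edible = {"apple", "banana", "sandwich"}
    if "eat" in instruction:
        for det in scene:
            if det.label in edible:
                return det   # a plastic display apple would match just as well
    return None

def planner(target: Detection) -> None:
    print(f"planning grasp at ({target.x}, {target.y})")

scene = vision_module(image=None)
target = language_module("hand me something to eat", scene)
if target is not None:
    planner(target)
```

Nothing downstream can recover what the vision stage never reported; the decorative-plastic-fruit problem lives in exactly this gap.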
Even early neural network approaches struggled with this separation. Google’s RT-1, released in late 2022, trained a single model on more than a hundred thousand robot demonstrations. It could recognize new objects and follow instructions similar to those in its training data. But ask it to “bring something for a headache” and it would fail, not because it couldn’t physically pick up a bottle of aspirin, but because nothing in its robot-only training data connected headaches to medicine. That knowledge existed on the internet, in billions of documents. RT-1 had no way to access it.
What researchers needed was not better modules, but a fundamentally different architecture, one where vision, language, and action weren’t separate departments, but unified processes that could reason together.
The New Way: RT-2 and Unified Intelligence
The breakthrough came from treating everything as language.
Large language models work by converting text into tokens, discrete units that the model can process and relate to each other. The word “apple” becomes a token. The phrase “something to eat” becomes a sequence of tokens. The model learns relationships between tokens, building a vast web of associations from internet text.
RT-2 extended this principle to vision and action. Images were divided into patches, each patch becoming a token. Robot commands, “move arm left 5 centimeters, rotate wrist 10 degrees,” were converted into tokens. Suddenly, everything lived in the same representation space. To the model, there was no fundamental difference between processing the word “apple,” seeing an image patch containing an apple, and outputting the action to grasp it.
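The action side of this is straightforward to sketch. As the RT-2 paper describes, each dimension of a robot command is discretized into 256 uniform bins, and the bin indices become tokens the model can predict like any word. The value ranges and seven-dimension layout below are illustrative assumptions, not the paper’s exact configuration.

```python
# Sketch of RT-2-style action tokenization: continuous command values are
# discretized into 256 uniform bins, and the bin indices become tokens the
# language model can predict like any word. Value ranges and the 7-DoF
# layout below are illustrative assumptions.

import numpy as np

NUM_BINS = 256

def action_to_tokens(action, low, high):
    """Map continuous action values to integer bin indices (token ids)."""
    frac = (np.clip(action, low, high) - low) / (high - low)
    return np.minimum((frac * NUM_BINS).astype(int), NUM_BINS - 1)

def tokens_to_action(tokens, low, high):
    """Invert the binning, recovering each bin's center value."""
    return low + (tokens + 0.5) / NUM_BINS * (high - low)

# 7 dimensions: xyz position deltas (m), xyz rotation deltas (rad), gripper.
low  = np.array([-0.1] * 3 + [-0.5] * 3 + [0.0])
high = np.array([ 0.1] * 3 + [ 0.5] * 3 + [1.0])

# "Move arm left 5 centimeters, rotate wrist 10 degrees, close the gripper."
command = np.array([-0.05, 0.0, 0.0, 0.0, 0.0, 0.17, 1.0])

tokens = action_to_tokens(command, low, high)
print(tokens)                                # [ 64 128 128 128 128 171 255]
print(tokens_to_action(tokens, low, high))   # close to the original command
```

In the real system, those integer ids are mapped onto tokens already present in the language model’s vocabulary, so the fine-tuned model predicts actions with exactly the same machinery it uses to predict words.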
RT-2 built on PaLM-E, a massive model that had already unified vision and language by learning to process image tokens and text tokens together. Google’s researchers took this foundation and fine-tuned it on robot demonstrations, teaching it to output action tokens alongside words. The result was what researchers called a Vision-Language-Action model, or VLA: a single system that sees, understands language, and acts.
The unified representation space enabled something remarkable. When the model encountered the word “apple,” it activated representations learned from internet text: fruit, red, edible, round. When it processed image patches containing an apple, visual representations activated nearby. And when it needed to grasp that apple, the action tokens were generated based on both. Knowledge flowed seamlessly across modalities.
This meant RT-2 could handle situations it had never been trained on.
“Bring me something that would help me wake up.”
The language model knew, from internet text, that people wake up with coffee, energy drinks, cold water. The visual system could identify a Red Bull can on the counter. The action system could generate the tokens needed to grasp it. Three capabilities, one unified reasoning process, and the robot picked up the Red Bull and handed it over.
“Pick up the extinct animal.”
Among a set of plastic toys, the robot selected the dinosaur. It had never seen this toy in training. But the language model knew dinosaurs were extinct. The visual system could match the concept to the plastic figure. The action system could reach for it.
“Hand me the one you’d use to fix this.”
The robot looked at a loose screw on a cabinet door, then at a table with tools: screwdriver, hammer, tape, scissors. A modular system would need explicit labels: “loose screw detected, screwdriver relevant.” RT-2 could reason directly: see the problem, understand “fix,” recognize which tool matched, output the grasp. Vision and language informed each other without intermediate translation.
“Move the banana to the sum of two plus one.”
The instruction combined language understanding with numerical reasoning. RT-2 computed that two plus one equals three, then placed the banana on the spot marked “3” on the table. The boundaries between seeing, thinking, and doing had dissolved.
What Changes When Robots Understand
The demonstrations were impressive, but their significance went deeper than picking up Red Bull cans.
RT-2 represented a paradigm shift from assembling modules to unifying intelligence. Traditional robotics required specifying everything in advance: every object label, every instruction pattern, every action sequence. Each new capability was an engineering project.
RT-2 inherited knowledge directly. Want the robot to understand that “I’m cold” might mean “close the window”? That knowledge was already in the language model. Want it to know that eggs are fragile and require gentle handling? Already there. The robot inherited millennia of accumulated human knowledge, compressed into model weights, accessible through the unified space where words, images, and actions lived together.
This didn’t mean the problems were solved. RT-2 was slow: the language model took seconds to process each instruction, far too slow for dynamic tasks. It was inconsistent: the same instruction might produce different actions on different attempts. And it was confined to the lab: Google’s carefully controlled kitchens, not the chaos of real homes.
But it proved that the connection was possible. Robot bodies could be animated by minds trained on human text. The gap between language and action could be bridged.
Inside the Kitchen Labs
Google’s robotics lab occupied a building in Mountain View that looked unremarkable from the outside: anonymous tech campus architecture. Inside, it was a strange hybrid of kitchen and factory.
Real kitchens, with real counters and real appliances, sat in rows like exhibits. In each kitchen, a robot arm mounted on a mobile base performed endless repetitions of household tasks. Pick up a sponge. Open a drawer. Move a can from here to there. Cameras recorded every motion, every success, every failure.
The goal was data. Google had learned from its language models that scale matters: more data, more compute, better results. Now they were applying the same philosophy to robotics. Instead of carefully engineering a single robot to perform specific tasks, they were collecting massive datasets that could train models to perform any task.
Karol Hausman, who had joined Google Brain in 2017, had become one of the key figures in this effort. His background was in reinforcement learning, but he had grown convinced that the future lay in combining learned behaviors with the world knowledge embedded in language models.
“The robot doesn’t need to learn from scratch that sponges are for cleaning,” Hausman explained in a presentation. “Billions of humans have already written about that. We just need to connect that knowledge to the robot’s actions.”
Vincent Vanhoucke, who led the broader robotics effort, had an even longer view. He had been at Google since 2007, working on speech recognition before turning to robotics. He saw the current moment as analogous to where speech recognition had been in the early 2000s, poised for a breakthrough that would come from scale and learning rather than hand-engineering.
“Everyone asks when robots will be in homes,” Vanhoucke said. “I think they’re asking the wrong question. The question is: when will robots understand what we want? We’re getting close to answering that.”
The Limits of Language
But understanding what we want isn’t the same as doing what we need.
RT-2 demonstrated that language models could provide robots with common sense. They could not, however, provide robots with physical competence. Knowing that sponges are for cleaning is different from knowing how to wring out a sponge, how much pressure to apply when wiping a surface, how to navigate a cluttered counter without knocking things over.
The Google systems operated in controlled environments, purpose-built kitchens where the lighting was consistent, the objects were familiar, and human researchers stood nearby to catch failures. In these conditions, RT-2 achieved a success rate of about 60% on novel instructions. That was remarkable for research. It was nowhere near sufficient for deployment.
Speed was another limitation. Large language models require significant computation. Every instruction sent to RT-2 passed through billions of neural network parameters before producing an action. The delay was measured in seconds, acceptable for picking up a sponge, unacceptable for catching a falling glass or responding to a sudden obstacle.
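A back-of-envelope calculation shows why the delay lands in the seconds range rather than the milliseconds robots need for reflexes. Generating tokens one at a time is roughly memory-bandwidth bound, because every token requires streaming the full set of weights through the accelerator. All of the numbers below are assumptions chosen for illustration, not measurements from the RT-2 paper.

```python
# Back-of-envelope for why a large VLA answers in seconds, not milliseconds.
# Autoregressive decoding is roughly memory-bandwidth bound: each generated
# token streams all model weights through the accelerator. Every number
# here is an illustrative assumption, not a measured figure.

params = 55e9             # ~55B parameters (the scale of RT-2's larger variant)
bytes_per_param = 2       # 16-bit weights
bandwidth = 1e12          # ~1 TB/s of usable memory bandwidth, assumed
tokens_per_action = 8     # e.g. one token per action dimension

seconds_per_token = params * bytes_per_param / bandwidth
print(f"~{seconds_per_token * tokens_per_action:.1f} s per action")  # ~0.9 s
```

Serving the model across many chips in parallel shrinks that number (the RT-2 team reported control rates of a few actions per second using cloud TPU backends), but the basic tension remains: more parameters mean more knowledge, and more latency.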
And there were deeper questions about reliability. Language models, for all their capabilities, are inconsistent. They can give different answers to the same question, make confident errors, and fail in ways that are difficult to predict. These properties, merely annoying in a chatbot, could be dangerous in a machine with physical agency.
Google’s researchers were aware of these limitations. Their papers included careful discussions of failure cases and boundary conditions. They positioned their work as research, demonstrations of what was possible, not products ready for deployment.
But outside the lab, the implications were already spreading.
The Race Begins
Google’s robotics papers landed in a field that was ready to ignite.
By 2023, the combination of capable language models and improving robot hardware had attracted attention from across the tech industry. Tesla was building humanoid robots. Startups with names like Figure and 1X and Sanctuary were raising hundreds of millions of dollars. Chinese companies were announcing their own humanoid projects.
Each saw the same opportunity: language models had solved, or were close to solving, the problem of robot understanding. The remaining challenges (physical competence, reliable hardware, manufacturing at scale) were hard but not mysterious. The path to general-purpose robots was, for the first time, visible.
Google had advantages in this race. They had the language models: PaLM and later Gemini were among the most capable in the world. They had the data: years of robot demonstrations collected in their kitchens. They had the researchers: some of the best minds in robot learning.
But Google also had disadvantages. The company’s history with hardware products was mixed. Robotics required manufacturing capability that Google lacked. And the company’s culture, optimized for software and services, might not adapt easily to the physical world.
In April 2023, Google merged its Brain division, robotics team included, with DeepMind, the AI research lab it had acquired in 2014. The combination brought together DeepMind’s reinforcement learning expertise with the robotics team’s data and hardware experience. It also raised the stakes. DeepMind’s leadership, including Demis Hassabis, now had direct influence over Google’s robot efforts.
The merger signaled ambition. It also signaled uncertainty about the path forward. Research had shown what was possible. Turning possibility into product was another matter entirely.
From Programming to Conversation
Whatever the commercial outcome, RT-2 and its successors had changed something fundamental about how we think about robots.
For sixty years, robots had been programmed. Engineers wrote code that specified, in exhaustive detail, what the robot should do in every situation. The robot was a tool that executed instructions, no more capable of understanding those instructions than a hammer understands why it’s hitting a nail.
The new paradigm was different. You could talk to the robot. You could give it vague instructions and expect reasonable interpretations. You could assume it knew things about the world, not because someone had programmed that knowledge, but because it had inherited the knowledge of human civilization through language.
This wasn’t full artificial intelligence. The robots couldn’t reason like humans, couldn’t adapt to truly novel situations, couldn’t be trusted to operate independently for extended periods. But they had crossed a threshold. They had moved from mechanical execution to something that, in limited contexts, resembled understanding.
“The biggest change isn’t technical,” Hausman reflected. “It’s conceptual. We used to ask: how do we program robots to do X? Now we ask: how do we explain to robots what we want? That’s a completely different question.”
The answer to that question, how to explain to robots what we want, would shape the next phase of the field. And it would attract not just researchers, but entrepreneurs, investors, and eventually the largest companies in the world.
The robot’s ChatGPT moment had arrived. What came next would determine whether it mattered.
Notes & Further Reading
On RT-2: “RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control” (2023) is the key paper. It describes how vision, language, and action were unified into a single VLA model by tokenizing all three modalities. The accompanying blog posts and videos offer accessible introductions.
On PaLM-E: “PaLM-E: An Embodied Multimodal Language Model” (Driess et al., 2023) describes the foundation that RT-2 built upon, a model that learned to process image tokens and text tokens together, enabling robots to reason directly on visual scenes.
On RT-1: “RT-1: Robotics Transformer for Real-World Control at Scale” (2022) documents Google’s earlier approach, training on robot demonstrations without the benefit of large language model knowledge. Comparing RT-1 and RT-2 illustrates the difference that unified intelligence makes.
On Google’s robotics history: Coverage in IEEE Spectrum and Wired documents the evolution from the Boston Dynamics acquisition through the current learning-based approach. The story of the 100+ robot fleet is detailed in various Google AI blog posts.
On Karol Hausman and Vincent Vanhoucke: Both have given numerous talks at robotics and machine learning conferences, many available on YouTube. Hausman’s presentations on RT-2 are particularly accessible.
On large language models and common sense: The debate over whether language models truly “understand” anything is ongoing. Gary Marcus and Yann LeCun represent opposing views; their public exchanges provide useful context.
On the Google DeepMind merger: Coverage in The Information and other tech publications provides business context. DeepMind’s own robotics work, including the soccer robots from Chapter 5, is now integrated with the former Google Brain robotics team.
On the limitations of current approaches: The papers themselves are commendably honest about failure cases. For critical perspectives, work by Rodney Brooks and others provides valuable counterbalance to the optimistic narratives.


