Physical AI
When Learning Meets Reality
This is Chapter 14 of A Brief History of Artificial Intelligence.
In a warehouse in Fremont, California, in late 2024, a humanoid robot picked up a box, carried it across the room, and placed it on a shelf. The movement was unremarkable—any human worker could do the same in seconds. But for the engineers watching, it represented decades of frustrated ambition finally realized.
The robot was built by a company called Figure, one of several startups racing to create humanoid robots powered by AI. What made it different from earlier robots wasn’t the hardware—humanoid robots had existed for decades. It was the software. The robot was controlled by a neural network that had learned to move not through explicit programming but through training on millions of examples of movement. It had learned to manipulate objects the same way ChatGPT had learned to manipulate words: by finding patterns in vast amounts of data.
Intelligence had learned to read, to write, to reason, to pass professional exams. Now it was learning something more fundamental: how to move through the physical world.
This is the story of physical AI—the attempt to give artificial intelligence a body. It’s a story that begins with spectacular failures, continues through patient progress, and arrives at a moment when the digital is becoming physical. The learning machines are learning to act in reality.
The Moravec Paradox, Revisited
Hans Moravec noticed something strange in the 1980s.
Moravec, a roboticist at Carnegie Mellon, had spent years trying to build robots that could navigate and manipulate the world. The work was frustrating. Tasks that seemed trivially easy for humans—picking up a cup, walking across a room, catching a ball—turned out to be fiendishly difficult for machines. Meanwhile, tasks that seemed hard for humans—calculating complex equations, playing chess, proving theorems—were relatively easy to program.
“It is comparatively easy to make computers exhibit adult-level performance on intelligence tests or playing checkers,” Moravec wrote, “and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility.”
This became known as Moravec’s Paradox, and it haunted robotics for decades. We could build machines that beat the world’s best players at Go but couldn’t tie a shoelace. We could build systems that proved mathematical theorems but couldn’t pour a glass of water. Abstract reasoning was easy. Physical competence was impossibly hard.
The reason, Moravec suggested, was evolution. Humans had spent hundreds of millions of years evolving to move through the physical world, to manipulate objects, to perceive and react. These skills were ancient, deeply embedded in our neural architecture, refined by eons of natural selection. Abstract reasoning was newer—a recent evolutionary addition, still somewhat awkward, still requiring conscious effort. We experience reasoning as hard because it is new. We experience movement as easy because it is old.
But “easy for evolution” doesn’t mean “easy to program.” Evolution had found solutions through billions of years of trial and error, encoded in architectures we didn’t understand. Programmers had to start from scratch, trying to specify in code what biology had discovered through selection. The gap was enormous.
This is why physical AI is so significant. Machine learning offers a different path—not programming solutions but learning them. If robots could learn from data the way language models learned from text, they might finally acquire the skills that had eluded explicit programming.
Learning to Move
Sergey Levine knew the problem intimately.
Levine, a professor at UC Berkeley, had spent his career at the intersection of machine learning and robotics. His lab, in a building overlooking the San Francisco Bay, was filled with robot arms—Sawyer arms, Kuka arms, custom-built grippers—each one running learning algorithms, each one trying to figure out how to grasp objects, manipulate tools, and interact with the physical world.
The traditional approach to robot control was laborious. Engineers would manually specify how a robot should move to accomplish a task—trajectory planning, inverse kinematics, carefully tuned controllers. This worked for structured environments like factory assembly lines, where everything was precisely positioned and nothing unexpected happened. It failed for unstructured environments—kitchens, warehouses, homes—where objects could be anywhere and the robot had to adapt.
In 2016, Levine and his colleagues ran an experiment that would become legendary in robotics. They took a collection of robot arms—fourteen of them, running in parallel—and had them try to pick up objects from bins filled with random items: toys, tools, household objects, things of varying shapes and sizes. No one told the robots how to grasp. Instead, the robots tried random movements, observed what happened, and gradually learned which movements led to successful grasps.
The scale was unprecedented. Over two months, the robots attempted 800,000 grasps. Day and night, the arms moved, tried, succeeded, failed, and learned. Each attempt generated data: an image of the bin, the movement the robot tried, whether it succeeded in picking something up. Neural networks trained on this data learned to predict which grasps would work given the visual scene. The robots improved from barely functional—grasping successfully perhaps 30% of the time—to reasonably competent, succeeding more than 80% of the time.
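For readers who want a concrete picture of what learning from 800,000 attempts means in code, the sketch below shows the general pattern: a neural network looks at an image of the bin and a candidate motion and predicts whether the grasp will succeed, trained on the logged outcomes. This is a minimal illustration of the idea, not the architecture from the original paper; the class, function, and variable names are invented for clarity.

```python
# Minimal sketch of learning grasp success from trial-and-error data.
# Assumes a dataset of (image, candidate_motion, succeeded) tuples logged
# by the robots; all names here are illustrative, not from the paper.
import torch
import torch.nn as nn

class GraspSuccessPredictor(nn.Module):
    """Predicts the probability that a candidate motion yields a successful grasp."""
    def __init__(self, motion_dim: int = 7):
        super().__init__()
        # Small convolutional encoder for the image of the bin.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Combine image features with the proposed motion (e.g. a gripper pose).
        self.head = nn.Sequential(
            nn.Linear(64 + motion_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),  # logit for "grasp succeeded"
        )

    def forward(self, image, motion):
        features = self.encoder(image)
        return self.head(torch.cat([features, motion], dim=-1))

def training_step(model, optimizer, image, motion, succeeded):
    """One supervised update on a batch of logged grasp attempts."""
    logits = model(image, motion).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, succeeded)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# At execution time, the robot scores many sampled candidate motions with the
# trained model and executes the one with the highest predicted success.
```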
“The robot arm learned to pick things up the way a baby learns,” Levine explained to visitors who came to watch the robots train. “Not by being told how, but by trying and failing and trying again. The difference is we could run many robots in parallel, collecting data much faster than a human baby could. And the data was all shared—what one robot learned, all of them knew.”
The approach—learning from experience rather than following programmed rules—transformed what robots could do. Levine’s lab went on to train robots to fold laundry, manipulate tools, open doors, and even learn from watching humans demonstrate tasks. Other labs followed. Pieter Abbeel at Berkeley, Chelsea Finn at Stanford, Dieter Fox at NVIDIA and the University of Washington—a generation of researchers began applying the same deep learning techniques that had revolutionized vision and language to the problem of physical control.
The field of robot learning emerged, and with it a new possibility: robots that could acquire skills the way children did, through experience and observation rather than explicit instruction.
The Sim-to-Real Challenge
There was a catch. Real-world data was expensive.
Training a language model required text, which existed in abundance on the internet. Training an image classifier required pictures, which were also plentiful. But training a robot required physical interaction—actual robots trying actual movements in actual environments. This was slow, expensive, and dangerous. Robots broke things. Sometimes they broke themselves.
The solution was simulation. What if you could train robots in virtual environments, then transfer what they learned to the real world?
The idea was appealing. Simulations could run much faster than real time. You could run thousands of simulated robots in parallel, collecting in hours the data that would take months in reality. Robots couldn’t break things in simulation—or rather, they could break virtual things, learn from it, and reset instantly. The economics were transformative: unlimited training data from virtual worlds.
But simulation had a problem: the real world wasn’t simulated perfectly. Physics engines approximated reality but didn’t capture every detail. Virtual surfaces had different friction than real surfaces. Virtual light behaved differently than real light. A policy that worked perfectly in simulation might fail catastrophically when transferred to reality. The gap between simulation and reality was called the “sim-to-real” problem, and it bedeviled robot learning.
Researchers attacked the problem from multiple angles. One approach was to make simulations more realistic—better physics, better rendering, better modeling of the real world. Another was domain randomization: deliberately randomizing the simulation’s parameters, such as friction, lighting, and object masses, so that robots learned to handle a wide range of conditions and reality fell somewhere within the training distribution. A third approach was to fine-tune in the real world, using simulation for initial training and then adapting with real data.
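Domain randomization is simple enough to show in a few lines. The sketch below assumes a hypothetical simulator interface; the point is only the structure: every training episode samples fresh physics and rendering parameters, so the policy never sees the same simulated world twice.

```python
# Sketch of domain randomization: each episode samples different physics and
# rendering parameters so the policy cannot overfit to one simulated world.
# The simulator and policy interfaces here are hypothetical placeholders.
import random
from dataclasses import dataclass

@dataclass
class SimParams:
    friction: float         # surface friction coefficient
    object_mass: float      # kilograms
    light_intensity: float  # arbitrary rendering units
    motor_latency: float    # seconds of actuation delay

def sample_randomized_params() -> SimParams:
    """Draw parameters from wide ranges intended to bracket plausible reality."""
    return SimParams(
        friction=random.uniform(0.2, 1.2),
        object_mass=random.uniform(0.05, 2.0),
        light_intensity=random.uniform(0.3, 1.5),
        motor_latency=random.uniform(0.0, 0.05),
    )

def train(policy, make_env, num_episodes: int):
    """Train a single policy across many differently randomized simulated worlds."""
    for _ in range(num_episodes):
        env = make_env(sample_randomized_params())  # hypothetical env factory
        rollout = env.collect_episode(policy)       # hypothetical rollout API
        policy.update(rollout)                      # any RL or imitation update
```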
By the early 2020s, sim-to-real transfer was working well enough for practical applications. Robots trained primarily in simulation could navigate real environments, manipulate real objects, and perform useful tasks. The gap hadn’t closed completely—robots still sometimes failed when reality differed from simulation—but it had narrowed enough to enable progress.
The Humanoid Moment
Humanoid robots had been a dream since the beginning of robotics. The vision was compelling: a robot shaped like a human could use human tools, navigate human spaces, perform human tasks. It wouldn’t need a specialized environment; it could work alongside people in the world we had already built.
Honda’s ASIMO, introduced in 2000, could walk on two legs and climb stairs. It became a celebrity, appearing at trade shows and giving demonstrations that delighted audiences. But ASIMO was a showcase, not a product. It cost millions of dollars. It required a team of engineers to operate. Its capabilities, while impressive for the time, were narrowly scripted—it could perform choreographed routines but couldn’t adapt to unexpected situations.
Boston Dynamics pushed further. Their Atlas robot, introduced in 2013 and refined over the following decade, could do things that seemed impossible: run through forests, do backflips, perform parkour across obstacle courses. Videos of Atlas went viral—millions watched in amazement as a robot leaped and tumbled with animal grace. But Atlas was also a demonstration platform, not a commercial product. Each robot cost hundreds of thousands of dollars and required constant maintenance. Boston Dynamics was a research lab masquerading as a company.
Then Tesla announced Optimus.
In 2022, at Tesla AI Day, Elon Musk unveiled a humanoid robot that Tesla planned to manufacture at scale. The announcement was met with skepticism—the prototype that shuffled awkwardly across the stage was far less impressive than Atlas. It could barely wave. Some observers dismissed it as vaporware, another Musk announcement that would never materialize.
But Musk’s argument was about economics, not current capability. Tesla knew how to build things at scale—they had manufactured millions of cars. Tesla had AI expertise from years of work on self-driving technology. Tesla had battery expertise, motor expertise, manufacturing expertise. A humanoid robot, mass-produced the way cars were, might cost less than a car. Not a million-dollar research platform, but a practical product in the price range of a vehicle.
Suddenly, humanoid robots were a race. Figure, founded by Brett Adcock, a serial entrepreneur who had previously built an air taxi company, raised billions of dollars to build humanoid workers. The pitch was specific: robots that could work in warehouses and factories, performing tasks too unstructured for traditional automation. Agility Robotics, a startup spun out of Oregon State University that had been quietly building bipedal robots for years, accelerated its plans and began deploying robots in Amazon warehouses. Chinese companies like Unitree entered the market with surprisingly capable systems at aggressive price points.
By 2024, more humanoid robot companies existed than at any previous time in history. The race was on.
What changed? The hardware had improved—better motors, better sensors, better batteries. Manufacturing had matured. But the crucial change was AI. Neural networks could learn to control humanoid robots the same way they had learned to control simpler systems. The learning approach that had enabled robot arms to grasp objects could enable humanoid robots to walk, balance, and manipulate. The software that had been the bottleneck was becoming the enabler.
Foundation Models for Robots
The transformer revolution that had created GPT and ChatGPT came to robotics.
The idea was straightforward: if language models could learn general capabilities from internet text, could robot models learn general capabilities from robot data? Instead of training narrow policies for specific tasks—one model for grasping, another for walking, another for navigation—could you train a single model that understood physical interaction broadly?
In 2023, Google DeepMind introduced RT-2 (Robotic Transformer 2), a model that combined the capabilities of vision-language models with robot control. You could tell RT-2 what to do in plain English—“pick up the apple and put it in the bowl”—and it would do it, even if it had never seen that specific task during training. The model had learned something general about language, vision, and physical manipulation that transferred to new situations.
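The underlying pattern is easier to see in code than in prose. The sketch below shows a toy vision-language-action policy: an image and an instruction go in, discretized action tokens come out, predicted the way a language model predicts words. It is a simplified illustration of the general approach, not RT-2’s actual architecture; every name and dimension is invented.

```python
# Toy vision-language-action policy. An image and a tokenized instruction are
# fused by a small transformer, which predicts each action dimension as a token
# over discretized bins. Illustrative only; not RT-2's architecture.
import torch
import torch.nn as nn

class VisionLanguageActionPolicy(nn.Module):
    def __init__(self, vocab_size: int = 1000, action_bins: int = 256,
                 action_dims: int = 7, d_model: int = 256):
        super().__init__()
        # Tiny image encoder standing in for a pretrained vision backbone.
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 64, kernel_size=8, stride=4), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, d_model),
        )
        self.text_embedding = nn.Embedding(vocab_size, d_model)
        encoder_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=4)
        # Each action dimension (e.g. end-effector deltas, gripper) is predicted
        # as a token over discrete bins, the way a language model predicts words.
        self.action_head = nn.Linear(d_model, action_dims * action_bins)
        self.action_dims, self.action_bins = action_dims, action_bins

    def forward(self, image, instruction_tokens):
        img = self.image_encoder(image).unsqueeze(1)   # (batch, 1, d_model)
        txt = self.text_embedding(instruction_tokens)  # (batch, T, d_model)
        fused = self.transformer(torch.cat([img, txt], dim=1))
        logits = self.action_head(fused[:, 0])         # read out from image slot
        return logits.view(-1, self.action_dims, self.action_bins)
```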
Other groups followed. Stanford’s Mobile ALOHA project showed that robots could learn complex manipulation tasks by imitating human demonstrations collected through teleoperation, then generalize to new situations. Figure announced that its robots used language models for planning and control, enabling them to understand verbal instructions and reason about how to accomplish tasks.
The vision emerging was ambitious: foundation models for the physical world. Just as GPT learned general language capabilities from text, future systems might learn general physical capabilities from robot data. A single model might understand how to walk, grasp, manipulate, navigate—transferring knowledge across tasks the way language models transfer knowledge across domains.
This was still more vision than reality. Robot foundation models in 2025 were far less capable than language models—the field was perhaps where language models had been around 2018, impressive but limited. But the trajectory was clear. The techniques that had transformed AI through text were now being applied to physical interaction. The learning machines were learning to inhabit bodies.
The Data Problem
The challenge that remained was data.
Language models trained on the internet—billions of words of text, accumulated over decades. Where would physical AI get equivalent data? Robots hadn’t been recording their movements at internet scale. There was no YouTube of robot actions, no Wikipedia of physical manipulation.
Some researchers looked to simulation—generating unlimited synthetic data in virtual worlds. Others looked to videos of humans—if you could watch humans performing tasks, perhaps you could extract the knowledge needed for robots to do the same. Still others advocated for robot “fleets”—deploying thousands of robots into homes and warehouses, collecting data from their interactions with the real world.
The data bottleneck might prove temporary. As robots become more capable and more numerous, they will generate more data about physical interaction. That data will enable better robots, which will generate more data. The flywheel that accelerated language models—more data enabling better models, better models enabling more use, more use generating more data—might accelerate physical AI.
Or the bottleneck might prove persistent. Physical interaction might simply be harder to learn than language. The real world might be too diverse, too complex, too unforgiving of mistakes. Language models can hallucinate without consequence—wrong text is just wrong text. Robots that hallucinate might hurt people or destroy property. The stakes are different when AI has a body.
Closing: The Digital Becomes Physical
This chapter began in a warehouse, watching a robot pick up a box. It ends with a transformation in progress.
For years, AI was purely digital—manipulating symbols, processing information, existing in the abstract realm of computation. Language models could write about running but couldn’t run. Vision systems could recognize objects but couldn’t touch them. Intelligence lived in servers, experiencing the world only through the proxy of data.
Physical AI changes this. Intelligence is acquiring bodies—robot arms, humanoid forms, autonomous vehicles, drones. The learning machines are learning to move through the world, to manipulate objects, to act rather than just think. The boundary between digital and physical is dissolving.
The implications extend beyond robotics. Physical AI could transform manufacturing, logistics, healthcare, and domestic life. Robots that can learn to perform physical tasks might do work that humans find tedious, dangerous, or physically demanding. They might care for the elderly, assist the disabled, explore environments too hostile for humans. The possibilities are vast.
But so are the challenges. Physical systems can cause physical harm in ways that language models cannot. A misaligned robot is more dangerous than a misaligned chatbot. The alignment problems discussed in earlier chapters become more urgent when AI can act in the world, not just speak about it.
Intelligence learned to read. It learned to write. It learned to reason. Now it’s learning to move, to touch, to act in the physical world. The digital is becoming physical. The learning machines are learning to be in the world, not just to process information about it.
What that will mean—for work, for society, for what it means to be human in a world of capable machines—remains to be seen. But the transition is underway. The age of disembodied AI is ending. The age of physical AI has begun.
Notes and Further Reading
On the Moravec Paradox
Hans Moravec articulated the paradox in his book “Mind Children” (1988). The observation that sensorimotor skills are harder to automate than abstract reasoning has shaped robotics research for decades. Understanding why—the evolutionary depth of physical capabilities—illuminates both the challenge and the approach of learning-based robotics.
On Robot Learning and Levine’s Work
Sergey Levine’s research at UC Berkeley pioneered large-scale robot learning from experience. The 2016 paper on grasping with 800,000 attempts demonstrated what was possible with sufficient data. His subsequent work on robot learning from demonstration, learning from videos of humans, and learning in simulation has continued to advance the field.
On Sim-to-Real Transfer
The sim-to-real problem and its solutions are documented in papers from multiple labs. Domain randomization, progressive transfer, and realistic simulation are all active research areas. The gap between simulation and reality is narrowing but remains a practical challenge for robot deployment.
On Humanoid Robots
The current wave of humanoid robot development—Tesla Optimus, Figure, Agility Robotics, and others—represents a significant shift in the field. Economic arguments about manufacturing at scale combine with technical arguments about general-purpose robot bodies. Whether humanoid robots will become practical products remains an open question.
On Foundation Models for Robotics
Google DeepMind’s RT-2 and related projects represent early steps toward general-purpose robot foundation models. The idea of training single models that understand physical interaction broadly, rather than narrow policies for specific tasks, is an active research direction with significant potential and significant remaining challenges.


