Brains
Teaching robots through trial and error, the simulator revolution, and the beginnings of robot “imagination.”
Chapter 5 of A Brief History of Embodied Intelligence
“There was maybe a dozen people in the world who cared about this stuff. We’d meet at conferences and wonder if anyone would ever take us seriously.” — Richard Sutton
The robot hand had been failing for three days straight.
It was October 2019, and in a corner of OpenAI’s San Francisco headquarters, a five-fingered mechanical hand sat mounted to a table, a Rubik’s cube resting in its palm. Every few seconds, the fingers would twitch, attempting to rotate a face of the cube. Most attempts failed. The cube slipped, tilted, or simply didn’t move at all. A researcher named Marcin Andrychowicz watched the clumsy performance with something like paternal patience.
“It’s actually going well,” he told a visitor. “Watch.”
The hand fumbled. The cube nearly fell. Then, slowly, a face rotated into place.
What made this unremarkable manipulation remarkable was its origin. No one had programmed those finger movements. No human had demonstrated how to grip a Rubik’s cube or rotate its faces. The hand had figured it out entirely on its own, through trial and error.
But not trial and error in the physical world. That hand had never touched a Rubik’s cube before arriving in the lab. Everything it knew, every subtle adjustment of grip pressure, every recovery from a near-drop, it had learned in simulation, practicing in thousands of virtual worlds simultaneously, accumulating the equivalent of 10,000 years of experience before touching a single physical object.
The researchers called it sim-to-real transfer. It would prove to be the key that unlocked robot learning.
The Keeper of the Flame
To understand why this mattered, you have to understand how unlikely it was.
For most of the history of artificial intelligence, the idea that robots could learn through trial and error was considered a dead end. The mathematics was elegant, the results were meager, and the researchers who believed in it were increasingly marginalized.
Richard Sutton was the keeper of the flame.
Sutton had fallen in love with reinforcement learning as a graduate student at the University of Massachusetts in the late 1970s. The core idea captivated him: instead of telling a machine exactly what to do, you tell it what success looks like, and let it figure out the rest through experimentation. It was how dogs learned tricks, how children learned to walk, how evolution shaped behavior over millions of years. Sutton believed it was fundamental to intelligence itself.
This was different from supervised learning, learning from labeled examples, which would later power breakthroughs like the Berkeley grasping experiments. Those robots attempted thousands of grasps, recorded which ones succeeded, and trained a neural network to predict success from images. That worked well for single-action decisions like “will this grasp succeed?” But reinforcement learning aimed at something harder: choosing actions in sequences where early choices affect later possibilities. Playing a game requires thousands of decisions, each affecting what comes next. Solving a Rubik’s cube requires dozens of grip adjustments in sequence. For these problems, you need a system that learns what to do, not just what will happen.
His colleagues were skeptical. Reinforcement learning had struggled for years to produce results that matched its theoretical elegance. The algorithms were slow, the problems they could solve were tiny, and funding agencies had lost patience. Neural networks, a closely related approach, had been all but abandoned after Minsky and Papert’s 1969 critique of perceptrons. The AI mainstream had moved on to expert systems, symbolic reasoning, logic-based approaches. Learning from trial and error was considered naive.
Sutton didn’t care. He spent the 1980s developing the mathematical foundations of reinforcement learning. He proved theorems about convergence. He invented temporal difference learning, a way to learn from predictions rather than waiting for final outcomes. Instead of playing an entire game of chess and only then learning whether your moves were good, you could update your estimates after every single move, comparing what you expected to happen with what actually happened. The technique would later power breakthroughs from Tesauro’s TD-Gammon, a backgammon program that taught itself to play at the level of top human players, to AlphaGo. He and his colleague Andrew Barto wrote a textbook that no one assigned because no one taught the subject.
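To make the idea concrete, here is a minimal sketch of a temporal difference update in Python. The states, rewards, and constants are invented for illustration; no real system is this small.

```python
# Minimal sketch of temporal difference learning (TD(0)).
# States, rewards, and constants here are illustrative, not from any real system.

# V[s]: our current estimate of how good it is to be in state s.
V = {"opening": 0.0, "midgame": 0.0, "endgame": 0.0}
alpha = 0.1   # learning rate: how far to move toward each new estimate
gamma = 0.9   # discount factor: how much future reward matters

def td_update(state, reward, next_state):
    """Update V[state] right away, using the prediction for next_state
    instead of waiting for the game's final outcome."""
    target = reward + gamma * V[next_state]  # what we now expect
    error = target - V[state]                # the surprise
    V[state] += alpha * error

# One imagined transition: no reward yet, but the estimate still moves.
td_update("midgame", reward=0.0, next_state="endgame")
print(V["midgame"])
```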
“There was maybe a dozen people in the world who cared about this stuff,” Sutton recalled in a later interview. “We’d meet at conferences and wonder if anyone would ever take us seriously.”
The problem was practical. Reinforcement learning required enormous amounts of trial and error, millions of attempts to learn even simple tasks. In simulation, this was merely slow. In the real world, with real robots, it was impossible. Motors wore out. Objects broke. Time passed. A robot learning through physical trial and error might need years to master what a human could learn in minutes.
For three decades, Sutton kept working, kept publishing, kept believing. He was, as one colleague put it, “the most stubborn man in AI.”
Then, in 2013, vindication arrived from an unexpected direction.
The Arcade Awakening
DeepMind was a startup of twenty people in London when they cracked Atari.
The company’s founder, Demis Hassabis, had been a chess prodigy, a video game designer, and a neuroscientist before deciding to pursue artificial general intelligence. His team included some of the few researchers in the world who still believed in neural networks and reinforcement learning. In early 2013, they built a system called DQN, Deep Q-Network, and pointed it at old Atari games.
The system received only raw pixels from the screen and the game score. It had no concept of what a ball was, what a paddle was, what it meant to “lose a life.” It had to discover everything through trial and error, playing millions of games, gradually learning which actions led to higher scores.
On Breakout, DQN discovered a strategy that surprised its creators. After learning the basics, hit the ball, don’t miss, it found a way to tunnel through one side of the brick wall, trapping the ball in the space above the bricks, where it ricocheted back and forth, clearing row after row. No one had programmed this strategy. The machine had discovered it on its own.
On Space Invaders, Pong, and dozens of other games, DQN achieved superhuman performance using the same algorithm and architecture. A single system, learning from scratch, could master wildly different games.
The key innovation was combining reinforcement learning with deep neural networks. Previous systems used simple, hand-crafted representations of game states. DQN learned its own representations, teaching itself to see, in effect, at the same time it learned to act. It was everything Sutton had been working toward, finally made real.
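The heart of that combination fits in a few lines. Below is a hedged sketch of the DQN update in PyTorch, using a toy eight-dimensional state in place of raw pixels and random tensors in place of real game transitions; DeepMind’s actual system added replay buffers, convolutional networks, and other machinery.

```python
# A sketch of the DQN update rule, not DeepMind's actual code.
# Toy sizes throughout: an 8-dimensional state stands in for raw pixels.
import torch
import torch.nn as nn

n_actions = 4
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, n_actions))
target_net.load_state_dict(q_net.state_dict())  # slow-moving copy, synced periodically
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
gamma = 0.99

# A fake batch of transitions (state, action, reward, next_state, done).
states = torch.randn(32, 8)
actions = torch.randint(0, n_actions, (32, 1))
rewards = torch.randn(32)
next_states = torch.randn(32, 8)
done = torch.zeros(32)

# Q-learning target: reward plus the discounted value of the best next action,
# estimated with the target network for stability.
with torch.no_grad():
    target = rewards + gamma * (1 - done) * target_net(next_states).max(dim=1).values

predicted = q_net(states).gather(1, actions).squeeze(1)  # Q-value of the action taken
loss = nn.functional.smooth_l1_loss(predicted, target)   # DQN used a similarly robust loss

optimizer.zero_grad()
loss.backward()
optimizer.step()
```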
Google acquired DeepMind in January 2014 for over $500 million. Three years later, DeepMind’s AlphaGo defeated the world champion at Go, a game long considered beyond the reach of machines. Sutton’s stubborn faith had been rewarded beyond anyone’s imagination.
But games were one thing. The physical world was another.
The Reality Gap
Video games are perfect simulations.
When DQN plays Breakout, the ball bounces exactly as the game’s code specifies. The physics is deterministic. There are no surprises, no variations, no gap between how the game works and how the game says it works.
The physical world offers no such guarantee.
Every robot simulator makes approximations. Real motors have friction, backlash, and response delays that simulators model imperfectly. Real objects have weights and textures that vary from unit to unit. Real environments have air currents, temperature changes, lighting variations, and a thousand other factors that simulations typically ignore.
This mismatch, the reality gap, had plagued robotics for decades. The story was always the same: a researcher would train a beautiful behavior in simulation, only to watch it fail spectacularly on real hardware. The simulated robot had learned to exploit quirks of the simulator rather than developing genuinely robust skills. It was like learning to drive in a video game and expecting to handle a real car in a rainstorm.
By 2015, most robotics researchers had concluded that sim-to-real transfer was a mirage. Simulation was useful for testing ideas, but real learning required real robots. The gap couldn’t be bridged.
OpenAI decided to try anyway.
The Counterintuitive Solution
The breakthrough came from asking the wrong question correctly.
The obvious approach to the reality gap was to make simulators more accurate. Model the friction better. Capture the motor dynamics more precisely. Reduce the gap by improving the simulation.
OpenAI tried a different approach: what if, instead of making simulation more like reality, you made it deliberately more varied than reality?
The idea was called domain randomization, and it inverted the conventional wisdom. Instead of training in a single, carefully calibrated simulation, you trained in thousands of different simulations, each with randomly varied parameters. The friction coefficient might be 0.3 in one simulation and 0.7 in another. The cube might weigh 80 grams in one and 150 grams in the next. The lighting might be bright, dim, colored, flickering.
A robot trained in this chaos couldn’t rely on any specific parameter. It had to learn behaviors that worked regardless of the friction, the weight, the lighting. It had to become robust to variation itself.
And here was the key insight: reality is just another variation. If a robot can handle friction ranging from 0.2 to 0.8, it can handle whatever friction the real world happens to provide. If it can manipulate cubes weighing anywhere from 50 to 200 grams, it can handle a real cube weighing 90 grams. The robot doesn’t need to know what reality is like. It just needs to be prepared for anything.
It was like training a tennis player not on a perfect court, but on courts with random slopes, uneven surfaces, unpredictable winds. When they finally played on a normal court, it would feel easy.
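In code, the core of domain randomization is almost embarrassingly simple: sample a fresh set of physics parameters for every training episode. The sketch below reuses the ranges quoted above; the parameter names and the commented-out simulator hook are hypothetical.

```python
# Domain randomization in miniature: every episode gets its own "reality".
# Parameter names, ranges, and the commented-out training hook are illustrative.
import random

def sample_randomized_world():
    """Draw a fresh set of physics parameters for one training episode."""
    return {
        "friction": random.uniform(0.2, 0.8),        # finger-object friction
        "cube_mass_g": random.uniform(50, 200),      # grams
        "motor_delay_ms": random.uniform(0.0, 40.0), # actuation latency
        "light_level": random.uniform(0.3, 1.5),     # rendering brightness
    }

for episode in range(3):
    world = sample_randomized_world()
    print(f"episode {episode}: {world}")
    # run_episode(policy, make_simulator(**world))  # hypothetical training step

# A policy trained across thousands of these draws cannot memorize any single
# setting; it has to work across the whole range, and reality falls inside it.
```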
Ten Thousand Years in a Month
The Rubik’s cube project was deliberately chosen to be difficult, but not in the way most people assume.
The intellectual challenge of solving a Rubik’s cube is well understood. Algorithms exist that can determine the optimal sequence of moves for any scrambled cube. The hard part, for a robot, was physical: how do you actually execute those moves with a single five-fingered hand?
Human speedcubers use two hands, rapidly rotating faces and flipping the cube between grips. A single robot hand has to do everything: grip the cube, rotate one face, reposition to access another face, maintain control throughout. Each face rotation might require multiple grip adjustments. The hand has to recover smoothly when the cube starts to slip.
OpenAI wasn’t training the hand to figure out which moves to make. A conventional algorithm provided that. They were training it to execute those moves physically, the dexterous manipulation that looks trivial when humans do it but had never been achieved by a robot learning from scratch.
The hand trained in simulated environments where nearly everything varied randomly. The cube’s size fluctuated by up to 15 percent. Its mass ranged from 20 to 500 grams. The friction between fingers and cube faces varied wildly. Random forces pushed the fingers during manipulation. The lighting changed colors. The table surface tilted unexpectedly.
And the hand practiced relentlessly. Thousands of simulated hands, running in parallel on massive GPU clusters, attempted manipulation after manipulation. Every failure was data. Every success refined the policy. The simulation ran continuously for months.
The final tally was almost incomprehensible: the equivalent of 10,000 years of practice. If a human had started practicing Rubik’s cube manipulation at the founding of Rome, they would not yet have accumulated as much experience as OpenAI’s simulation generated in a few months.
When the policy transferred to the physical hand, the result was uncanny. The robot manipulated the cube with movements no one had programmed. It recovered from disturbances no one had anticipated. Researchers would poke the cube mid-solve, and the hand would adjust smoothly, finding a new grip, continuing the rotation. It succeeded about 60 percent of the time, not perfect, but genuine dexterous manipulation learned entirely from simulated experience.
The Rubik’s cube was a proof of concept. Sim-to-real transfer of complex manipulation was possible. The question now was whether the same approach could scale to entire bodies.
Robots Playing Soccer
In 2023, DeepMind released videos that looked like science fiction.
Small humanoid robots, about two feet tall, were playing soccer against each other. Real robots, on a real field, with a real ball. They ran, they kicked, they blocked shots. When they fell down, they got back up. When the ball bounced unpredictably, they adjusted.
None of this behavior had been programmed. Every skill emerged from reinforcement learning in simulation.
The robots had trained entirely in virtual worlds, accumulating the equivalent of years of experience playing against simulated opponents. They had learned to walk before they learned to run. They had learned to kick before they learned to score. They had learned to play as individuals before they learned to function as teams.
When the policies transferred to physical robots, the researchers held their breath. Walking is hard. Running is harder. Playing soccer, with its constant balance challenges, unpredictable ball dynamics, and opponent interactions, seemed almost impossible to transfer from simulation.
It worked. The robots played genuine games. They scrambled for loose balls. They blocked shots. They celebrated goals, or at least, their programmers celebrated while the robots reset for the next play.
The soccer robots weren’t going to compete with humans. Their movements were jerky. Their strategies were primitive. A reasonably athletic child could have beaten them easily. But they demonstrated something profound: complex, dynamic, whole-body control could be learned in simulation and transferred to reality. The techniques that worked for hands could work for entire bodies.
The Limits of Trial and Error
By 2023, reinforcement learning had proven itself for robotics in ways that would have astonished researchers a decade earlier. Robots could learn to manipulate objects, walk, run, and play games through simulated trial and error. The reality gap had been bridged.
But there was a limitation that no amount of simulation could address.
Consider a robot learning to help in a kitchen. Reinforcement learning requires a reward function, a mathematical definition of success. For solving a Rubik’s cube, the reward is clear: the cube is either solved or it isn’t. For helping in a kitchen, the reward is... what, exactly?
Should the robot wash every dish immediately, or wait until the sink is full? Should it store leftovers in the fridge, or throw them out? When the owner says “clean up,” does that include wiping the counters? Does it include taking out the trash? Does it include feeding the cat?
These questions can’t be answered by trial and error. They require understanding what humans want, not just the explicit instruction, but the implicit context, the social norms, the common sense that humans take for granted.
Reinforcement learning can discover how to achieve a goal once the goal is clearly specified. It cannot determine what goals are worth achieving. The robot needs a human to specify every detail of every task, and in the real world, the range of possible tasks is effectively infinite.
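The contrast is stark when you try to write the two reward functions down. A sketch, with made-up function names:

```python
# For the Rubik's cube, success is crisply defined: the reward is a one-liner.
def cube_reward(cube_state, solved_state):
    return 1.0 if cube_state == solved_state else 0.0

# For "helping in a kitchen," every term is a question, not a quantity.
# This function cannot honestly be written.
def kitchen_reward(kitchen_state):
    raise NotImplementedError(
        "Which dishes count as dirty? Do the counters count as 'clean up'? "
        "Does the cat need feeding? The task is underspecified."
    )
```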
What robots needed wasn’t just better learning algorithms. They needed something like common sense, an understanding of what humans mean, not just what they say.
The Dream of World Models
Some researchers believed the answer was to give robots imagination.
Standard reinforcement learning is model-free: the robot tries actions and observes results without building any internal understanding of how the world works. The sim-to-real successes we’ve seen, the Rubik’s cube, the soccer robots, used this approach. They learned effective behaviors through massive practice, but the robots never came to “understand” physics. They couldn’t predict what would happen in a new situation; they could only react based on patterns they had seen before. It’s like learning to play pool by just hitting balls until something goes in, without ever thinking about angles or momentum.
Model-based approaches take a different path. The robot first learns a “world model,” an internal simulation of how things work, and then uses that model to plan actions mentally before trying them physically. Instead of hitting thousands of pool balls, you imagine what would happen, then take the shot you predicted would work best.
The appeal is efficiency. A mental simulation is infinitely cheaper than a physical experiment. A robot with a good world model could reason through novel situations without needing millions of trials for each new task.
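One minimal way to see model-based control in action is “random shooting”: imagine many candidate action sequences, score each one inside the model, and execute only the best first step. In the sketch below, the dynamics model is hand-written as a stand-in; in a real system it would be learned from data.

```python
# Planning with a world model via random shooting. The "model" here is a
# hand-written stand-in for a learned one: a 1-D world with a goal at x = 10.
import random

def imagined_step(state, action):
    """Stand-in for a learned dynamics model: predict (next_state, reward)."""
    next_state = state + action
    reward = -abs(10.0 - next_state)  # closer to the goal means higher reward
    return next_state, reward

def plan(state, horizon=5, candidates=100):
    """Imagine `candidates` action sequences, return the best first action."""
    best_score, best_first_action = float("-inf"), None
    for _ in range(candidates):
        actions = [random.uniform(-1.0, 1.0) for _ in range(horizon)]
        s, score = state, 0.0
        for a in actions:             # roll the whole sequence out mentally
            s, r = imagined_step(s, a)
            score += r
        if score > best_score:
            best_score, best_first_action = score, actions[0]
    return best_first_action          # act on the first step, then replan

print(plan(state=0.0))
```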
The vision went further: what if a robot learned enough about physics, about objects, about how the world works, that it could handle genuinely new situations? Not through task-specific training, but through general understanding? A robot that knew cups hold liquid and knives are sharp and doors swing on hinges could figure out how to behave in a new kitchen without ever having trained specifically for kitchen tasks.
This was the dream of foundation models for robotics, systems that learned enough about the physical world to generalize without limit.
By 2020, the dream remained largely that. World models worked for simple domains, video games, basic physics simulations, but broke down when confronted with the complexity of the real world. Building a model that could accurately predict everything that might happen in a kitchen was, as one researcher put it, “as hard as building a perfect kitchen simulator, which is as hard as understanding everything about kitchens.”
The dream of world models would not die. It would return, years later, as a central ambition of the foundation model era. But in 2020, it was still waiting for its moment.
The Unexpected Key
Then a strange thing happened. A technology developed for a completely different purpose turned out to solve part of the problem, though not the part the world-model researchers had been chasing.
Large language models, trained on nothing but text, had somehow absorbed common sense. They knew that spills need sponges, that screwdrivers fix screws, that “I’m tired” means someone might want coffee. This was not a world model. It was not physical understanding. But it was something roboticists had been trying to hand-code for decades, and it turned out to be the key that unlocked the next phase of the field.
Notes & Further Reading
On Richard Sutton and the history of reinforcement learning: Sutton and Barto’s Reinforcement Learning: An Introduction (2nd edition, 2018) is the definitive textbook and a chronicle of the field’s development. Sutton’s “The Bitter Lesson” essay (2019) captures his hard-won perspective on what works in AI.
On DeepMind and DQN: Mnih et al.’s “Playing Atari with Deep Reinforcement Learning” (2013) launched the deep RL revolution. The documentary AlphaGo (2017) tells the story of the Go match with remarkable access.
On the Rubik’s cube project: OpenAI’s technical reports “Learning Dexterous In-Hand Manipulation” (2018) and “Solving Rubik’s Cube with a Robot Hand” (2019) provide details. The blog posts accompanying each paper are more accessible. Note that the robot was given the solution sequence by a standard algorithm. The challenge was physical manipulation, not puzzle-solving.
On the soccer robots: DeepMind’s “Learning Agile Soccer Skills for a Bipedal Robot with Deep Reinforcement Learning” (Haarnoja et al., 2023) describes the real-robot matches; the earlier “From Motor Control to Team Play in Simulated Humanoid Football” (Liu et al., 2022) covers the simulated team play that preceded them. The accompanying videos are worth watching for entertainment value alone.
On world models: David Ha and Jürgen Schmidhuber’s “World Models” (2018) introduced the concept to a broad audience. Hafner et al.’s Dreamer series (2019-2023) demonstrated that learned world models could achieve competitive performance on benchmark tasks. LeCun’s “A Path Towards Autonomous Machine Intelligence” (2022) presents an influential vision for how world models might develop.
On the limits of reinforcement learning: Russell’s Human Compatible (2019) discusses the reward specification problem and its implications. Amodei et al.’s “Concrete Problems in AI Safety” (2016) catalogs the challenges of specifying what we actually want.