The RL Spiral, Part 7: RL Meets the Physical World
RL can master board games and video games in days of training. Yet teaching a robot to fold a towel is harder than teaching it to play Go. The physical world is where RL’s deepest problems live.
This is the seventh article in The RL Spiral, an eight-part series on reinforcement learning. The previous article, The World Inside, showed why the best RL researchers are building agents that imagine before they act. This one is about what happens when those agents get bodies.
RL’s path to the physical world was long and mostly paved with failure.
For decades, robotics belonged to control engineering: hand-built physics models, hand-tuned controllers, no learning required. The reason was practical. An RL agent learning to walk needs to fall. It needs to fall many times, trying different strategies, receiving negative reward for each failure, gradually discovering what keeps it upright. In simulation, this is free. In the physical world, it is expensive. Falling wears out joints. Falling breaks sensors. Falling takes time. A robot that needs a hundred thousand falls to learn a walking policy will spend months falling and consume hardware in the process. Control engineers, armed with well-understood physics and fast optimization algorithms, could produce a walking robot in weeks of design work rather than months of trial and error.
This was not a failure of RL. It was a consequence of the sample efficiency gap, the same gap Article 3 traced to the curse of dimensionality. RL in robotics kept not working because the number of real-world trials required was too large, the cost per trial was too high, and the diversity of conditions the robot encountered was too unpredictable for the narrow policies RL produced.
What changed was simulation. GPU-accelerated physics engines made it possible to run thousands of virtual robots in parallel, each experiencing different conditions, accumulating years of simulated experience in hours of wall-clock time. The large-scale parallel training that Article 4 described for Atari games was adapted into a sim-to-real pipeline for physical robots: train in simulation with randomized parameters, then deploy to real hardware without further tuning.
In April 2024, the payoff arrived. Two papers appeared in Science Robotics within a week of each other. Both described humanoid robots that could walk, recover from shoves, and traverse uneven terrain. Both had been trained entirely in simulation and transferred to real hardware without further tuning. The first, from Google DeepMind, deployed a small humanoid called OP3 that not only walked but played one-on-one soccer, stringing together locomotion, recovery, and strategic decisions in real time. The second, from a team including researchers at UC Berkeley, trained a full-sized humanoid that walked outdoors for a full week without a single fall, handling slopes, gravel, and unexpected obstacles. Neither robot had ever touched real ground during training. Both walked successfully on their first real-world deployment.
The training process itself is worth understanding, because it reveals how RL produces behaviors that hand-designed controllers cannot. In a typical pipeline, the system creates thousands of copies of the robot in a simulated physics engine. Each copy inhabits a slightly different version of the world: different floor friction, different joint stiffness, different sensor noise, different body mass. The copies train simultaneously, each one falling and recovering and falling again, each one generating experience that feeds the same neural network. Over hours of simulated time, the network learns a policy that works not in one specific environment but across the entire distribution of environments it has experienced. The policy is robust because it was forged in diversity. A hand-designed controller encodes what one engineer knows about one set of conditions. An RL-trained policy encodes what ten thousand simulated robots learned across ten thousand variations of reality.
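The randomization step in this pipeline can be sketched in a few lines. The parameter names and ranges below are illustrative placeholders, not values from any specific paper: the point is only that each parallel copy of the robot samples its own version of physics before training, while all copies feed experience to one shared policy.

```python
import random

# Illustrative ranges only; real pipelines randomize dozens of parameters.
PARAM_RANGES = {
    "floor_friction": (0.4, 1.2),     # coefficient of friction
    "joint_stiffness": (0.8, 1.2),    # multiplier on nominal stiffness
    "body_mass": (0.9, 1.1),          # multiplier on nominal mass
    "sensor_noise_std": (0.0, 0.05),  # std of Gaussian observation noise
}

def sample_environment_params(rng: random.Random) -> dict:
    """Draw one randomized physics configuration for a simulated robot."""
    return {name: rng.uniform(lo, hi) for name, (lo, hi) in PARAM_RANGES.items()}

# Thousands of parallel copies, each inhabiting a slightly different world.
# All of them generate experience for the same shared policy network.
rng = random.Random(0)
envs = [sample_environment_params(rng) for _ in range(4096)]
```

The policy that emerges must work across the whole distribution of sampled worlds, which is exactly what makes it robust when the real world turns out to be yet another sample it has never seen.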
Article 6 described DayDreamer, where a quadruped robot learned to walk from scratch in the real world in roughly one hour by building a world model online and planning inside it. That result points to the next step beyond sim-to-real transfer: robots that continue learning after deployment, updating their internal models from real-world experience rather than relying entirely on what simulation taught them. The sim-to-real gap does not disappear. It gets corrected, in real time, by the robot’s own experience.
The advantage RL brings to robotics is not precision on known terrain. Control engineering handles that well. The advantage is adaptability to unknown terrain. The RL-trained walker that encountered gravel for the first time during its outdoor test had never been given a model of gravel. But it had been trained across thousands of randomized simulation environments with different friction values, different slopes, different ground compliance. It had learned not a single walking strategy but a repertoire of strategies, and the ability to select among them based on what the sensors reported. The gravel was new. The general pattern of adapting to unstable footing was not. This is the same representational advantage Article 3 described in the toddler: compressed experience that transfers to novel situations.
Figure’s humanoid walks with RL. Its locomotion policy was trained entirely in simulation, with no hand-crafted control laws. Thousands of virtual robots, each with randomized physical parameters, learned in parallel, and the resulting policy was transferred directly to real hardware. The walking has a quality that hand-designed controllers often lack: naturalness. The RL-trained gait includes heel strikes, toe-offs, and arm swing synchronized with leg movement. Nobody programmed the arm swing. The RL algorithm discovered it because coordinated arm movement helps with balance, and the reward signal encouraged not falling. The naturalness is a side effect of optimization, not a design goal. Ten Figure 02 robots running on the same RL neural network walked in formation, with no per-robot adjustments. The same policy, the same weights, deployed identically across a fleet. Even Boston Dynamics, which built Atlas entirely on hand-designed control, began adopting learned policies for manipulation in 2025.
In 1999, the neuroscientist Kenji Doya published a framework that mapped different forms of learning onto different brain structures. The framework was specific and grounded in anatomical evidence: the basal ganglia, a set of deep brain structures that include the regions where dopamine neurons reside, implement reward-based learning, the biological counterpart of reinforcement learning. The cerebellum, the densely folded structure at the back of the brain containing more neurons than all other brain regions combined, implements supervised learning, using error signals from the difference between predicted and actual sensory outcomes. The cerebral cortex implements unsupervised learning, extracting statistical regularities from experience.
The mapping that matters for this article is between the basal ganglia and the cerebellum. Article 2 showed that the basal ganglia run the brain’s reward prediction system, the dopamine-driven TD learning that is the mathematical core of RL. The cerebellum does something different. It does not learn what is rewarding. It learns what is accurate.
When you first learn a motor skill, the learning is slow, effortful, and reward-driven. A child learning to ride a bicycle falls, corrects, falls differently, corrects again. The process is governed by the basal ganglia: try something, observe the outcome, adjust the policy. This is reinforcement learning. It is flexible, general, and slow. The child is not just learning to ride a bicycle. The child is learning what balance feels like, what overcorrection costs, what the relationship is between handlebar angle and trajectory. The learning is exploratory. Many strategies are tried. Most fail. The few that work are reinforced.
As the skill becomes automatic, the cerebellum takes over. The cerebellum does not learn from reward. It learns from prediction error in a different sense: the difference between what the motor system predicted would happen and what actually happened. If your hand overshoots a target by two centimeters, the cerebellum registers the discrepancy and adjusts the next movement by a precisely calibrated amount. The adjustment is fast, operating within tens of milliseconds, and it operates below conscious awareness. You do not decide to correct your reach. The correction happens automatically, based on a predictive model the cerebellum has built from repeated practice. An experienced cyclist does not think about balance. The cerebellar model handles it, making hundreds of micro-corrections per second that the rider never notices.
The transition is sequential. Early learning is basal ganglia. Automatized performance is cerebellar. The RL phase explores and discovers what works. The cerebellar phase compiles that knowledge into a fast, predictive controller that executes without deliberation. Damage to the basal ganglia impairs the ability to learn new skills. Damage to the cerebellum impairs the execution of skills already learned. The two systems serve different phases of the same process.
Both structures are ancient. The basal ganglia and the cerebellum are present in all vertebrates, from fish to humans. A lamprey, one of the most primitive living vertebrates, has recognizable versions of both. The training-to-deployment pipeline, learning a new behavior through reward-driven exploration and then compiling it into fast automatic execution, has been operating for over 500 million years. The prefrontal cortex, the planning system that enables deliberate model-based reasoning, arrived much later and expanded dramatically only in primates. Evolution solved reliable motor execution long before it solved flexible planning, just as robotics solved precise low-level control through engineering long before RL could contribute high-level learning. The sequence is the same.
The first phase, reward-driven exploration, corresponds to training. The second, fast automatic execution, corresponds to deployment.
This is not a metaphor for what is happening in robotics. It is the same computational problem, solved in a different substrate. The RL training phase, where the simulated robot falls thousands of times to discover a walking strategy, corresponds to basal ganglia learning: reward-driven, exploratory, slow. The deployed controller, which executes the learned policy in real time on the physical robot, making fast corrections based on sensor feedback without re-exploring, corresponds to cerebellar control: predictive, fast, automatic. The sim-to-real transfer that every RL robotics paper describes is the silicon version of the brain’s transition from basal ganglia to cerebellum.
The practical frontier of RL in robotics is defined by three problems, each a specific manifestation of the themes this series has traced.
The first is sim-to-real transfer: the curse of dimensionality expressed as an engineering problem. Article 3 described how the number of possible states grows exponentially with the number of dimensions. A simulated robot and a real robot occupy the same state space in principle. In practice, the simulation omits dimensions: the texture of a gripper pad after six months of use, the thermal expansion of an actuator in summer, the imperceptible slope of a warehouse floor. Each omitted dimension is a place where the policy trained in simulation will encounter something it has never seen. Domain randomization, the technique of varying simulation parameters randomly during training, helps. It does not eliminate the gap. It makes the gap narrower and more predictable.
One of the most dramatic demonstrations of sim-to-real transfer remains the OpenAI Rubik’s cube result from 2019, described in Article 3: a robotic hand that solved a Rubik’s cube one-handed after the equivalent of thirteen thousand years of simulated experience. The technique worked, but the limits were visible. The hand completed a full solve only about twenty percent of the time for the hardest scrambles. The gap between simulation and reality showed up in the margin between success and failure. Every percentage point of real-world reliability requires another layer of simulated experience. The numbers work for a research demonstration. They do not yet work for a factory that needs the robot to succeed ninety-nine percent of the time.
The second problem is manipulation. Locomotion is hard, but the relevant physics is relatively constrained: rigid-body dynamics, ground contact forces, balance. Manipulation introduces a universe of complexity. A gripper closing on a cardboard box applies force that depends on the box’s moisture content, the angle of approach, the wear on the gripper’s rubber pads. A hand folding a towel encounters deformable geometry that changes shape with every grasp. A spoon stirring soup creates fluid dynamics that no current simulation models accurately. The state space of a robot hand interacting with a soft object is vastly larger than the state space of a robot walking on a floor.
This is where the sample efficiency gap is widest and most economically consequential. A warehouse robot that can sort rigid boxes is useful. A household robot that cannot handle the diversity of objects in a kitchen is not. And the diversity is the point. A kitchen contains objects made of metal, glass, plastic, ceramic, wood, fabric, and paper. Some are rigid. Some are fragile. Some are wet. Some are hot. Each material interacts differently with the gripper. Each interaction requires a different force profile, a different speed, a different approach angle. A human learns to handle all of these through a lifetime of manipulation experience, compressed into the abstract categories Article 3 described. Building a robot that can do the same, from any starting point, is the deepest practical challenge in embodied RL.
The third problem is generalization. Current RL-trained robots learn narrow skills. A walking policy does not transfer to stair climbing without retraining. A grasping policy for mugs does not transfer to bowls. The brain generalizes effortlessly because its hierarchical representations compress diverse experiences into abstract categories that transfer across tasks. “Graspable object” is not a single learned policy. It is a category that encompasses cups, tools, fruit, and door handles, and that carries expectations about grip strength, approach angle, and weight distribution that apply, with minor adjustments, to objects the person has never touched before. Building the robotic equivalent, a representation that supports zero-shot transfer to novel instances, is an equally deep open problem. It connects directly to the world model frontier Article 6 described: an agent with a good enough model of how objects behave can generalize to new objects by predicting their behavior from their appearance, without ever having touched them.
Each of these problems connects back to the series’ central theme. The brain solves them through biological mechanisms that RL is still learning to approximate.
The sim-to-real gap is the brain’s problem of transferring skills across changing conditions. When you walk from a wooden floor onto ice, your first step slips. Within two or three strides, the cerebellum has registered the prediction error between expected and actual foot contact, recalibrated its motor model, and adjusted your gait: shorter steps, lower center of gravity, more cautious weight transfer. The update happens automatically, below conscious awareness, using the same prediction-error mechanism Doya described. No retraining. No new reward signal. Just a fast internal model correcting itself from a few samples of real-world feedback.
Manipulation is the brain’s problem of modeling complex contact physics. Fold a towel. The fabric deforms the moment you touch it. Every grasp changes the geometry of what you are grasping. The forces required shift continuously as the fabric bunches, stretches, and slides. Your motor system handles this through constant prediction and correction: the cerebellum generates a forward model of what the fabric will do in response to your hand movement, compares that prediction to the actual tactile feedback arriving milliseconds later, and adjusts the next movement accordingly. The loop runs hundreds of times per second, below conscious awareness. You do not think about how to fold a towel. Your cerebellum thinks about it for you, updating its predictions in real time as the fabric does something slightly different with every fold. Even folding the same towel in the same way twice produces different contact dynamics. The physics is too complex to pre-compute. It must be modeled on the fly. This is the problem no simulation has solved: not the diversity of objects, but the richness of physical interaction with even a single object.
Generalization is the brain’s problem of abstract categorization. Your visual cortex has compressed thousands of prior cup encounters into a sparse feature set: handle, hollow interior, graspable rim, certain size range. When a cup you have never seen before appears, it activates the same feature set. The motor program associated with “cup-shaped graspable object” is retrieved automatically: approach from the handle side, close fingers around the handle, lift with a force appropriate to the expected weight. You do not re-learn “cup.” You recognize that the new object belongs to an existing category and deploy the associated skill with minor adjustments. This is the hierarchical compression Article 3 described, operating in real time across every interaction.
The solutions the brain has found are not directly transplantable to silicon. But they are evidence about what the solution space looks like, and that evidence has guided RL’s most productive directions for thirty years.
The field that was supposed to be about software has turned out to be, at its frontier, inseparably about bodies. RL was developed for abstract decision-making: games, scheduling, resource allocation. Its deepest open problems are now physical. The agent needs not just eyes and imagination, the subjects of Articles 4 and 6, but hands, feet, and a world that pushes back. The physical world is where the sample efficiency crisis is most acute, where the sim-to-real gap is widest, and where the gap between human and machine capability is most visible. A toddler navigates a cluttered room without thinking. No robot on Earth can do the same yet.
The spiral of this series comes full circle in the physical world. Article 2 showed that the brain and RL algorithms run the same learning equation. Article 5 showed that the brain has a social learning system RL lacks. Article 6 showed that the brain imagines before it acts. This article shows that the brain transitions from exploration to automaticity through distinct neural systems that the industry is independently reinventing. Each turn of the spiral reveals the brain as having solved, through evolution, a problem RL is encountering for the first time. The gap is not closing quickly. But the direction of progress is toward what biology already knows.
The final article asks where this leaves the field as a whole. Self-play, human feedback, world models, embodied learning: four capabilities, each powerful, each incomplete. What are the open questions that cut across all of them? And what does neuroscience suggest about where the answers might be found?
Next in The RL Spiral: The Open Questions.


