The RL Spiral, Part 3: The Curse Bellman Couldn’t Break
RL can master any game in days. A toddler still learns faster. The reason is seventy years old, and nobody has fixed it.
This is the third article in The RL Spiral, an eight-part series on reinforcement learning. The previous article, The Equation That Explains Your Brain, showed that the brain and RL algorithms run the same learning equation. This one asks why the brain learns so much faster.
A robot arm in a research lab is trying to learn to pick up a mug. Not a specific mug in a specific position. Any mug, in any orientation, on any surface. The learning method is reinforcement learning. The robot reaches, grasps, lifts. If the mug comes up, the reward signal is positive. If it slips or falls, the signal is negative. The robot updates its policy and tries again.
After ten thousand attempts, the robot can pick up the mug it trained on, in roughly the position it trained in, on the surface it trained on. Move the mug two inches to the left, and success rate drops. Replace the mug with a slightly different one, and the robot starts over. Change the lighting, and the visual system falters. Ten thousand grasps, and the robot has learned a narrow, brittle skill that does not transfer.
A human toddler picks up a mug on the first or second try. Not because the toddler has better motors or sharper vision. Because the toddler has already learned, from a few years of reaching for everything in sight, a set of general principles about how objects behave when you grab them. The mug is new. The skill is not.
The gap between these two learners is not about hardware. It is not about the quality of the reward signal. It is about how many experiences each learner needs before it can act competently in a new situation. Researchers call this sample efficiency, and it is the single most important unsolved problem in reinforcement learning. Every other limitation of RL, its brittleness in new environments, its failure to transfer skills, its dependence on simulation, traces back to this: the algorithm needs too much experience.
The problem has a name. A mathematician named Richard Bellman gave it one in 1957. He called it the curse of dimensionality. He was describing a mathematical fact about the relationship between the complexity of a problem and the amount of computation needed to solve it. Sixty-eight years later, that curse is still RL’s central obstacle. Understanding why it exists, why it has proven so resistant to solution, and what the brain does differently requires going back to the decade when both RL and its deepest problem were being formulated for the first time.
Richard Bellman arrived at the RAND Corporation in Santa Monica in the late 1940s, one of a group of mathematicians recruited to work on problems the military did not yet know how to formalize. RAND, originally an Air Force contract research project, was becoming the prototype of the Cold War think tank: a place where mathematicians, physicists, and economists sat in the same building, looked out at the Pacific Ocean, and tried to turn strategy into mathematics. The problems were as varied as the people. How to allocate bombers. How to schedule supply chains. How to price uncertainty. Bellman’s contribution was to formalize a class of problems that had no good name. The problems all had the same structure: a decision-maker faces a sequence of choices, each one affecting what happens next, and the goal is to make the sequence of choices that produces the best long-term outcome. Which route should a supply chain follow? How should an inventory system reorder stock? When should a missile guidance system adjust its trajectory? These are all, at their core, the same problem: sequential decision-making under uncertainty.
Bellman’s breakthrough was showing that any problem with this structure could be broken into pieces. You never need to plan the entire sequence at once. The value of where you are now equals the reward you get right now, plus the value of where you end up next. That next position is itself just its immediate reward, plus the value of the position after that. Follow this chain far enough, and you have a complete account of long-term value. Bellman wrote this insight as an equation. He called the method dynamic programming, a name he chose, according to his autobiography, partly because it sounded impressive enough to avoid attracting hostility from a Secretary of Defense who disliked the word “research.” The equation became the mathematical foundation on which all of reinforcement learning would eventually be built.
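In modern notation, that recursive decomposition is written as the Bellman optimality equation. A sketch in the standard symbols (none of which appear in Bellman's original 1957 notation): state $s$, action $a$, immediate reward $R$, discount factor $\gamma$, and transition probabilities $P$:

```latex
V^*(s) \;=\; \max_{a} \Big[\, R(s,a) \;+\; \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \,\Big]
```

The value of a state is the best achievable combination of reward now plus discounted value of wherever you land next, which is exactly the chain of reasoning described above.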
The problem was that solving the equation required listing every possible state the system could be in, every possible action it could take, and every possible outcome of each action. For simple problems, this is manageable. A thermostat has a few dozen relevant states. A tic-tac-toe board has a few thousand. But the number of states does not grow gradually as a problem gets more complex. It grows exponentially.
Consider what exponential growth means in practice. A robot arm has seven joints. Each joint can be in, say, a hundred different positions. If you discretize the arm’s configuration space at that resolution, the total number of possible arm positions is one hundred to the seventh power: one hundred trillion. That is just the arm, frozen in space, not doing anything. Now let it interact with a single mug. The mug has a position on the table, two dimensions. An orientation: it can lean forward or backward, tilt left or right, or spin. Three more. A weight. A surface friction coefficient. That adds seven more dimensions to the space. One hundred to the fourteenth power: ten octillion possible states. Add a second object on the table and the number grows by another fourteen orders of magnitude. Now add the visual input: a camera feed at modest resolution contains hundreds of thousands of pixels, each carrying color information. The full state space of “a robot arm looking at a table of objects” contains more possible states than there are atoms in the galaxy. And the robot needs to learn a good policy not by visiting every state but by visiting enough states that it can generalize to the rest. In a space this large, “enough” is an astronomically large number.
Bellman saw this clearly. In his 1957 book Dynamic Programming, he gave the problem a name: the curse of dimensionality. The growth rate is what makes it a curse, not a challenge. Ten evenly spaced points cover a line. The same density in two dimensions requires a hundred points. In three dimensions, a thousand. In ten dimensions, ten billion. In fifty dimensions, a number so large it has no physical analogue. Each new dimension multiplies the required data by another factor. The data you need grows exponentially with the number of dimensions. No computer, present or future, can overcome this by brute force. It is not an engineering limitation. It is a mathematical one.
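The arithmetic in the last two paragraphs is simple enough to check directly. A minimal sketch (the resolutions are the article's illustrative choices, not measurements of any real robot):

```python
# Points needed to cover a space at fixed resolution grow
# exponentially with the number of dimensions.

def grid_points(points_per_dim: int, dims: int) -> int:
    """Number of grid cells when each dimension is split into points_per_dim."""
    return points_per_dim ** dims

# Ten points per dimension: a line, a square, a cube, a 10-D space.
for d in (1, 2, 3, 10):
    print(d, grid_points(10, d))   # 10, 100, 1000, 10_000_000_000

# The robot arm: 7 joints at 100 positions each.
print(grid_points(100, 7))         # 10**14, one hundred trillion

# Arm plus one mug (position, orientation, weight, friction): 14 dimensions.
print(grid_points(100, 14))        # 10**28
```

Each added dimension multiplies the count by a constant factor, which is why no amount of hardware closes the gap.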
Bellman himself recognized this early and began working on approximation methods. But the operations research community largely shelved the approximation problem for two decades. The computers of the 1950s and 1960s were too slow to make approximation practical, and the problems people were trying to solve were small enough that exact methods still worked. The curse was acknowledged, then politely ignored.
In the same era, a parallel tradition was taking shape at MIT. Norbert Wiener, a mathematician with wide-ranging interests in physiology and philosophy, published Cybernetics in 1948. The book proposed treating control and communication in machines and living organisms as instances of the same problem. Wiener’s key insight was feedback: a system that monitors its own output and adjusts its behavior accordingly. A thermostat is a feedback system. So is a person reaching for a glass of water, using vision to correct the trajectory in real time. Wiener was the first to formally argue that the brain and the machine are solving the same control problem through the same fundamental mechanism. His work became the foundation of control theory. Bellman’s became the foundation of reinforcement learning. The two traditions shared an origin and a question: how should a system make decisions in a complex, uncertain world? They diverged on method. Control theory assumed you had a model of the system and could compute the right action from the model. Reinforcement learning assumed you did not have a model and had to learn from trial and error. That fork is where the curse bites hardest. If you have a model, you can use it to focus your computation on the parts of the space that matter. A self-driving car with a physics model of tire friction and road curvature can compute the right steering angle without ever having skidded off the road. If you do not have a model, you must explore, and exploration in a high-dimensional space is where the exponential cost lives. The car without a model has to try different steering angles at different speeds on different road surfaces, crashing many times, before it learns what works. The model buys you something priceless: the ability to reason about states you have never visited. Without it, you have to visit them.
Temporal difference learning, the algorithm Richard Sutton developed in 1988 and that the previous article traced in detail, was a partial answer to the curse. It let a learner approximate Bellman’s equation from experience, without needing a map of the environment. The learner could update its estimates at every step, not just at the end of an episode, and each update incrementally improved the estimate of long-term value. This was genuine progress. It turned an impossible computation into a feasible learning process.
But it did not break the curse. It only changed where the curse showed up. Instead of needing an exponential number of states to solve the equation exactly, you needed an exponential number of experiences to estimate it accurately. The learning was faster per step. The number of steps was still enormous. And in a continuous, high-dimensional space, even an algorithm that learns from every step can take an impractical amount of time to converge, because the space between any two visited states contains an essentially infinite number of unvisited ones. The algorithm has no way to know what those unvisited states are worth without visiting them or something close to them.
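To make the "learning from every step" idea concrete, here is a minimal tabular TD(0) sketch on a toy five-state random walk. The environment, names, and hyperparameters are illustrative inventions, not anything from the article: states 0 through 4, where stepping right from state 4 ends the episode with reward 1 and every other transition pays nothing.

```python
import random

N_STATES = 5
ALPHA = 0.1   # learning rate
GAMMA = 0.9   # discount factor

V = [0.0] * N_STATES  # value estimates, learned from experience alone

def step(state: int) -> tuple[int, float, bool]:
    """Random walk: move left or right; stepping past the right end pays 1."""
    nxt = state + random.choice([-1, 1])
    if nxt < 0:
        return 0, 0.0, False
    if nxt >= N_STATES:
        return state, 1.0, True   # terminal: reached the rewarding end
    return nxt, 0.0, False

random.seed(0)
for _ in range(5000):             # episodes
    s, done = 0, False
    while not done:
        s2, r, done = step(s)
        target = r if done else r + GAMMA * V[s2]
        V[s] += ALPHA * (target - V[s])   # the TD(0) update
        s = s2

print([round(v, 2) for v in V])   # values rise toward the rewarding end
```

Every transition produces an update, so nothing is wasted. But notice what the table `V` is: one entry per state. In a continuous fourteen-dimensional space there is no such table, and that is exactly where the curse reappears.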
This is why a robot arm needs ten thousand grasps to learn one mug. Not because the algorithm is bad. Because the state space of “a robot arm interacting with objects in a room” is so vast that ten thousand samples covers almost none of it. The robot has explored a tiny corner of the space. Everything outside that corner is unknown. When the mug moves two inches, the robot is in unfamiliar territory, and its policy has nothing useful to say.
The human toddler faces the same state space. The room has the same number of dimensions. The physics is the same. The difference is in how the toddler represents the problem.
A toddler reaching for a mug does not represent the problem as “the position and velocity of every joint in my arm, combined with the three-dimensional shape, position, orientation, weight, and surface friction of the object in front of me.” That would be a state space of dozens of continuous dimensions, and learning from scratch in that space would take thousands or millions of trials, just as it does for the robot.
Instead, the toddler’s brain compresses the problem. Years of prior experience have built a hierarchy of representations: objects are solid, graspable things obey certain rules, heavy things need more force, round things roll. These representations reduce a high-dimensional problem to a much lower-dimensional one. The toddler does not need to learn the physics of this specific mug. The toddler needs to classify the mug into a category, “roughly cylindrical graspable object,” and apply a general strategy that has already been learned for that category. The state space collapses from millions of possible configurations to a handful of relevant variables.
This is how the brain partially circumvents the curse. Not by solving the high-dimensional problem, but by refusing to operate in the full-dimensional space. The brain builds compressed representations that capture what matters and discard what does not. Neuroscientists studying this process have found that it operates hierarchically: raw sensory input is progressively compressed as it moves through layers of cortical processing, each layer extracting more abstract and more behaviorally relevant features. By the time information reaches the prefrontal and parietal regions responsible for decision-making, much of the original dimensionality has been stripped away. What remains is a lower-dimensional map of the situation, organized around the features that predict reward.
The compression is not fixed. The brain learns which dimensions matter for a given task and adjusts its representations accordingly. When you learn to drive, you start by attending to everything: the steering wheel, the pedals, the mirrors, the road markings, the cars around you, the dashboard instruments, the noise of the engine. The experience is overwhelming precisely because you are operating in a high-dimensional state space where nothing has been compressed yet. With practice, your brain learns which inputs predict what happens next and discards the rest. You approach an intersection. A learner driver is tracking the light, the pedestrians, the lane markings, the speedometer, the mirrors, the cars in every direction. An experienced driver sees the same intersection and reads three things: the light has been green for a while, so it may change; the oncoming car in the left-turn lane is creeping forward; the pedestrian on the right is looking at a phone, not at the road. Same scene. Dozens of dimensions compressed to three that predict what happens next. The state space has not changed. The effective dimensionality has dropped. The curse has not disappeared. It has been sidestepped.
Research in computational neuroscience has formalized this process. Studies using brain imaging show that the brain’s parietal and prefrontal regions construct abstract, low-dimensional representations of tasks, even when the sensory input is complex and high-dimensional. Subjects playing Atari games in brain scanners show neural representations that are strikingly similar to the compressed representations learned by deep RL algorithms playing the same games. The brain and the algorithm appear to converge on similar solutions to the same compression problem, even though they arrive there by different routes.
The brain also compresses across tasks, not just within them. A person who has learned to use a screwdriver, a wrench, and a pair of pliers has built a general representation of “hand tool” that captures what these objects have in common: they extend the hand’s reach or force, they require a specific grip, they transmit torque. When that person encounters a new tool they have never seen before, the general representation activates. The new tool is not a blank slate. It is already classified into a category that carries expectations about how it behaves. Most of the learning has already been done. Only the specifics need to be filled in.
The difference in speed between the brain and an RL algorithm comes down to this. The brain’s hierarchical compression is built on a lifetime of prior experience and can be deployed in a new task almost immediately. The RL algorithm must learn its compression from scratch for each new task, which is itself a high-dimensional learning problem.
This is the core of the sample efficiency gap. The robot does not just need to learn a policy. It needs to learn a representation, then learn a policy on top of that representation. Both learning problems are large. Combined, they are enormous. The brain has the first one mostly solved before the second one even begins. It arrives at every new task with a representational head start built up over a lifetime. The robot arrives with nothing.
The practical consequence of the curse is clearest in robotics, the domain where RL meets the physical world and sample efficiency stops being an academic question and becomes an economic one.
A simulation can run millions of training episodes at negligible marginal cost. Simulated physics, simulated cameras, simulated objects. The robot can fail millions of times without wearing out a single motor. This is why most RL research happens in simulation. The curse of dimensionality is still present, but its cost is measured in compute, not in time or hardware.
A physical robot cannot do this. Each real-world trial takes time, wears out hardware, and risks damage. If an RL system needs a million trials to learn a task, and each trial takes ten seconds on a real robot, that is approximately four months of continuous, round-the-clock operation for one skill. A warehouse robot that needs to handle a hundred different objects faces decades of training time. The numbers do not work.
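The back-of-the-envelope arithmetic, spelled out:

```python
# Training-time arithmetic for a single manipulation skill,
# using the figures from the paragraph above.
SECONDS_PER_TRIAL = 10
TRIALS_PER_SKILL = 1_000_000

seconds = SECONDS_PER_TRIAL * TRIALS_PER_SKILL
days = seconds / 86_400
print(f"{days:.0f} days of continuous operation per skill")   # ~116 days

# A hundred distinct objects, each needing its own skill:
years = 100 * seconds / (86_400 * 365)
print(f"{years:.0f} years for a hundred objects")             # ~32 years
```

And this assumes the robot never breaks, never needs a human to reset the scene, and never wears out a gripper, all of which are generous assumptions.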
The field’s current best answer is to train in simulation and transfer to reality. But simulation is, by definition, a simplified model of the world. Contact forces between a robot gripper and a cardboard box depend on the box’s moisture content, the angle of approach, the wear on the gripper’s rubber pads, the temperature of the room. No simulation captures all of this. The gap between simulation and reality, what researchers call the sim-to-real gap, is the curse of dimensionality expressed as an engineering problem. The dimensions that the simulation leaves out are exactly the dimensions where the real-world policy fails.
In 2019, OpenAI demonstrated a robotic hand that could solve a Rubik’s cube one-handed. The achievement required the equivalent of thirteen thousand years of simulated experience, distributed across massive computational resources, using a technique called automatic domain randomization that generated progressively harder training environments without human tuning. Even then, the hand completed a full solve only about twenty percent of the time for the hardest scrambles, and sixty percent for easier ones. The project demonstrated that sim-to-real transfer can work for a single, specific, well-defined manipulation task. It also demonstrated the cost: an extraordinary volume of simulated experience, measured in millennia, was needed to produce a skill that a human child masters in a few hours of play.
Domain randomization, the technique OpenAI used, tries to bridge the sim-to-real gap by varying the simulation randomly during training: different lighting, different textures, different physics parameters. The hope is that if the agent learns to succeed across a wide range of simulated conditions, it will be robust enough to handle the conditions it encounters in reality. The technique works, partially. It improves robustness for some tasks. It also increases training time, sometimes by an order of magnitude, because the agent must now learn a policy that works across a much wider range of conditions. The curse reasserts itself: broadening the training distribution means spreading the same amount of learning across a larger space.
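In code, the core of domain randomization is almost embarrassingly simple: resample the simulator's parameters at the start of every episode. The sketch below is illustrative only; the parameter names, ranges, and the commented-out environment calls are hypothetical, not OpenAI's actual API.

```python
import random

def randomize_domain(rng: random.Random) -> dict:
    """Draw one random configuration of the simulated environment."""
    return {
        "gravity": rng.uniform(9.0, 10.6),         # m/s^2, jittered around 9.81
        "friction": rng.uniform(0.5, 1.5),         # gripper-object friction scale
        "object_mass": rng.uniform(0.05, 0.5),     # kg
        "light_intensity": rng.uniform(0.3, 1.0),  # rendering brightness
        "camera_offset": rng.uniform(0.0, 0.02),   # meters of camera jitter
    }

rng = random.Random(42)
for episode in range(3):
    params = randomize_domain(rng)
    # env.reset(**params)  -- hypothetical: rebuild the sim with these physics
    # ...run the episode, update the policy...
    print(episode, params["friction"])
```

The simplicity is deceptive: every range you widen is another slice of the space the policy must now cover, which is why the technique buys robustness at the cost of training time.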
A child does not train in simulation. A child trains in the real world, on real objects, with real physics. And a child needs, for most manipulation tasks, a remarkably small number of trials. The difference is not mystical. It is representational. The child’s brain has already built, through years of prior experience, a set of compressed representations that reduce the effective dimensionality of each new task to something manageable. The representations transfer. The learning time drops. What looks like effortless one-shot learning is actually the final step of a long process of hierarchical compression that started in infancy.
Building systems that can do something similar is the active frontier of reinforcement learning. Representation learning, hierarchical RL, foundation models for robotics, world models that learn the structure of environments before trying to act in them: these are all, at their core, attempts to reduce the effective dimensionality of the problems RL is asked to solve. They are attempts to give RL what the brain has: a way to see the mug, not the million-dimensional state space the mug lives in. The most dramatic single step in this direction happened in 2013, when a team at a small London startup called DeepMind combined a deep neural network with an RL algorithm and pointed it at a set of Atari games. The network learned, from raw pixels alone, to extract the features that mattered. For the first time, an RL system could build its own representations. It was not the end of the curse. It was the beginning of a new approach to living with it.
Richard Bellman named the curse in 1957. He knew it was structural, not accidental, and that no increase in computing power would make it go away. Nearly seven decades later, the deepest progress toward breaking it has come not from faster computers or better algorithms in isolation, but from a different question entirely: how do you build representations that make the problem small enough to learn from? The brain answered that question hundreds of millions of years ago, through a long evolutionary process of learning to compress, abstract, and transfer. RL is still working on it.
Next in The RL Spiral: When RL Learned to See.