The Journey of RL, Part 10: When Scale Replaced Signal

Faced with the failure of reward modeling, the field’s loudest response was lateral: stop trying to fix reward, fix the data and the compute instead.

Jul 04, 2026


In March 2019, one of the founders of reinforcement learning posted a short essay to his personal website. It had no journal behind it and no peer review, and it ran to about twelve hundred words, yet it landed on the field like a verdict. The author was Richard Sutton, who three decades earlier had helped write down the reward hypothesis, the proposition that opened this whole series: that any goal worth pursuing can be cast as a reward to be maximized. If anyone had a claim to be the architect of the idea that intelligence is reward maximization, it was Sutton. And the essay, titled “The Bitter Lesson,” read almost like a confession about the limits of everything clever the field had built on that idea.

The lesson was this. Looking back over seventy years of artificial intelligence research, Sutton argued, one pattern recurs above all others: general methods that leverage computation are ultimately the most effective, and by a large margin. Again and again, researchers had poured their ingenuity into encoding human knowledge into their systems, hand-crafting the features, the rules, the domain expertise, and again and again those careful constructions had been overtaken by simpler methods that just used more computation, more search, more learning, more data. It had happened in chess, in Go, in speech recognition, in computer vision. The humbling part, the reason the lesson was bitter, was that the human insight so lovingly built in had mattered far less than anyone wanted to believe. What mattered was scale, and scale kept arriving, year after year, as computation grew cheaper.

Coming from Sutton, this was a remarkable thing to say. The man who had helped make reward the center of the field was now pointing past cleverness of every kind, including, by implication, the clever reward engineering that Parts 7 and 8 had chronicled, toward the brute and reliable power of computation. The field read it as marching orders. And the timing could not have been sharper, because just as the Bitter Lesson was being absorbed, the reward signal itself was falling into the crisis of Part 9. If clever reward design was collapsing under Goodhart’s law, and if the lesson of seventy years was that scale beats cleverness anyway, then the path forward seemed obvious. Stop trying to be clever about reward. Scale.


Robonaissance
