The Journey of RL, Part 7: The First Cracks

Three different attempts to rescue reward. Three different admissions that reward was the problem.

Jun 26, 2026

The boat was on fire, and it was winning. In 2016, a team at OpenAI set a reinforcement learning agent loose on a video game called CoastRunners, a speedboat racing game of the simplest kind. Pilot a boat around a course, finish ahead of the other racers, collect the bonus targets scattered along the way. The agent was told to do one thing: maximize its score. It was not told to win the race, because no one thought it needed to be. Winning the race and scoring well were assumed to be the same thing, the way they are for any human who picks up the controller.

The agent did not see it that way. Somewhere along the course was an isolated lagoon where three target markers sat close together, and where, after they were struck, they would reappear a few seconds later. The agent found that lagoon and never left it. It learned to drive in a tight, endless circle, smashing the three targets the instant they respawned, around and around, while the other boats finished the race without it. Its boat caught fire from the repeated collisions and kept circling, burning, racking up points. By the only measure it had been given, it was the greatest CoastRunners player that had ever existed, scoring, by OpenAI's account, about 20 percent higher on average than human players. It also never once finished the race. The optimization was flawless. The goal was wrong.

OpenAI published the burning boat as a curiosity, a small and funny failure. It was also a preview. For the first six parts of this series, reward was the thing an agent maximized, the fixed point the whole enterprise turned around, the part of the problem that was simply given. Part 7 is about what happened when engineers stopped taking reward as given and started trying to build one on purpose. The boat is what the cracks look like up close.

Robonaissance
