The RL Spiral, Part 5: The Self-Play Paradox
Self-play produced the strongest Go player in history. Human feedback produced ChatGPT. RL’s two greatest successes pull in opposite directions.
This is the fifth article in The RL Spiral, an eight-part series on reinforcement learning. The previous article, When RL Learned to See, traced how deep learning gave RL the ability to build its own representations. This one is about what happened when RL stopped needing human data entirely, and why it then needed humans more than ever.
On March 10, 2016, in a conference room at the Four Seasons Hotel in Seoul, a Go program made a move that no human would have played.
It was move 37 of the second game. AlphaGo, built by DeepMind, was playing against Lee Sedol, winner of eighteen world titles and widely considered the strongest Go player alive. The board was still open, the positions still developing. AlphaGo placed a stone on the fifth line: a shoulder hit that no professional would have played in that position. The commentators, both experienced professionals, paused. One of them laughed, uncertainly. The other said it looked like a mistake.
Lee Sedol stood up from the table and left the room.
When he returned, he played on. He lost. AlphaGo’s move 37 turned out not to be a mistake. It was a strategic play of a kind the commentators had never seen, a move that sacrificed conventional position for long-term influence in a way that contradicted centuries of accumulated Go wisdom. AlphaGo’s own analysis estimated that a human player would make that move roughly one time in ten thousand. The move was not in any database of professional games. It was new. AlphaGo had invented it through self-play: millions of games against copies of itself, each game producing slightly better judgment, each generation of the system serving as both student and teacher.
AlphaGo won the match four games to one. Over 200 million people watched. But the result that mattered most for the field of reinforcement learning came a year and a half later, when DeepMind published a successor system called AlphaGo Zero.
AlphaGo, the version that defeated Lee Sedol, had been trained in two stages. First, it studied a database of 160,000 human games to learn how professionals play. Then it played against itself to improve beyond that human baseline. AlphaGo Zero skipped the first stage entirely. It started from nothing: random play, no human games, no human knowledge. Only the rules of Go and a reward signal for winning.
In three days, AlphaGo Zero played 4.9 million games against itself and surpassed the version that had beaten Lee Sedol, winning 100 games to zero. Given more training time, it surpassed every previous version, becoming what was plausibly the strongest Go player in history. All from self-play. No human data. No human guidance. Just rules, reward, and compute.
This was a proof of concept for something researchers had long theorized but never demonstrated at this scale: that an RL system, given a well-defined game with clear rules and a clear win condition, could discover strategies beyond human understanding through the simple mechanism of playing against itself. Self-play had produced superhuman intelligence. Not intelligence in general, but intelligence in a domain where the goal was unambiguous and the rules were fixed.
The question was whether those conditions would hold outside the game board.
The Closed World
Self-play as a training method did not begin with AlphaGo. Its origins trace to a researcher named Gerald Tesauro at IBM’s Thomas J. Watson Research Center in Yorktown Heights, New York.
In the early 1990s, Tesauro was experimenting with neural networks and temporal difference learning, the algorithm Rich Sutton had described in 1988, the equation at the center of Article 2. He applied both to backgammon. His approach was simple in concept: a neural network would evaluate board positions, and the system would improve by playing against itself, using the TD prediction error to update its evaluations after every game. No human teacher. No database of expert play. Just a network, a game, and the learning signal that comes from discovering whether your predictions were right.
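The shape of that loop is small enough to sketch. What follows is a toy illustration rather than Tesauro’s system: a made-up race-to-five game stands in for backgammon, a lookup table of position values stands in for the neural network, and every name in it is invented for the example. But the learning rule is the one described above. After each move, the value of the previous position is nudged toward the value of the position that followed it, and both sides choose their moves with the same improving evaluator.

```python
# A toy sketch of TD learning from self-play, in the spirit of TD-Gammon.
# The game: on your turn, take 1 point for sure ("safe") or 3 points with
# probability one half ("gamble"); the first player to reach 5 points wins.
import random
from collections import defaultdict

V = defaultdict(lambda: 0.5)      # estimated P(player 0 wins) for each position
ALPHA, EPSILON, TARGET = 0.1, 0.1, 5
ACTIONS = ("safe", "gamble")

def terminal(state):
    return state[0] >= TARGET or state[1] >= TARGET

def value(state):
    """Position value from player 0's point of view; finished games are known."""
    if state[0] >= TARGET:
        return 1.0
    if state[1] >= TARGET:
        return 0.0
    return V[state]

def outcomes(state, action):
    """Possible next positions and their probabilities for the chosen action."""
    p0, p1, mover = state
    gains = [(1, 1.0)] if action == "safe" else [(3, 0.5), (0, 0.5)]
    return [(((p0 + g, p1, 1) if mover == 0 else (p0, p1 + g, 0)), pr)
            for g, pr in gains]

def choose(state):
    """Both sides pick moves with the same value table: that is the self-play."""
    if random.random() < EPSILON:
        return random.choice(ACTIONS)
    def expected(action):
        return sum(pr * value(s) for s, pr in outcomes(state, action))
    pick = max if state[2] == 0 else min   # player 1 wants player 0's value low
    return pick(ACTIONS, key=expected)

def play_one_game():
    state = (0, 0, 0)                      # (player 0 score, player 1 score, mover)
    while not terminal(state):
        results = outcomes(state, choose(state))
        next_state = random.choices([s for s, _ in results],
                                    [pr for _, pr in results])[0]
        # TD learning: move the prediction toward the prediction one move later.
        V[state] += ALPHA * (value(next_state) - V[state])
        state = next_state

for _ in range(20_000):
    play_one_game()

print(round(V[(0, 0, 0)], 2))   # learned estimate of the first mover's winning chances
```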
The result surprised everyone, including Tesauro. Starting from random play, the system taught itself backgammon at a level that no previous computer program had reached. After 300,000 self-play games, TD-Gammon version 1.0 played well enough to compete with world-class human players. After 1.5 million games, version 2.1 was nearly indistinguishable from the best humans alive at the game. In a 1998 match against the reigning world champion, it lost by a margin of just eight points over a hundred games.
More remarkable than the level of play was what the system discovered along the way. Backgammon has opening strategies that had been refined by human experts over decades. One standard play, called “slotting,” involved moving a single checker to an advanced position early in the game, accepting some risk for a chance at a strong formation. TD-Gammon evaluated this differently. It preferred a more conservative play that human experts had dismissed. Tournament players, skeptical at first, began experimenting with TD-Gammon’s recommendation. Within a few years, the conventional wisdom had reversed. The machine’s opening play, learned entirely from self-play, replaced the human standard.
TD-Gammon was an early demonstration of a principle that AlphaGo would confirm at vastly larger scale: in a game with fixed rules and a clear outcome, self-play can produce expertise that surpasses human knowledge. The mechanism is straightforward. When a system plays against itself, it faces an opponent at exactly its own level. If the opponent is too weak, the games teach nothing. If the opponent is too strong, the system cannot learn. Self-play keeps the two matched: as the player improves, so does the opponent. Each generation pushes the next.
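AlphaGo Zero turned this into an explicit generational loop: the current best player generates games by playing itself, a new player is trained on those games, and the newcomer replaces the champion only if it wins a clear majority of evaluation games against it. The sketch below is a toy version of that structure, not DeepMind’s algorithm: the game of Nim stands in for Go, a table of move scores stands in for the network and its search, and all the names are made up for the example.

```python
# A toy sketch of the generational self-play loop. Nim: take 1-3 counters from
# a pile of 10; whoever takes the last counter wins.
import random
from collections import defaultdict

PILE, MOVES = 10, (1, 2, 3)

def make_policy():
    # score[pile][move]: a rough preference for taking `move` counters at `pile`
    return defaultdict(lambda: {m: 1.0 for m in MOVES})

def pick(policy, pile, explore=0.2):
    legal = [m for m in MOVES if m <= pile]
    if random.random() < explore:
        return random.choice(legal)
    return max(legal, key=lambda m: policy[pile][m])

def play(first, second, explore=0.2):
    """Play one game; return the winning seat (0 or 1) and the moves played."""
    players, pile, turn, history = (first, second), PILE, 0, []
    while True:
        move = pick(players[turn], pile, explore)
        history.append((turn, pile, move))
        pile -= move
        if pile == 0:
            return turn, history            # taking the last counter wins
        turn = 1 - turn

best = make_policy()
for generation in range(30):
    challenger = make_policy()
    # 1. The current best player generates games by playing against itself.
    for _ in range(2000):
        winner, history = play(best, best)
        # 2. The challenger learns from those games: winning moves are reinforced.
        for seat, pile, move in history:
            challenger[pile][move] += 1.0 if seat == winner else -0.5
    # 3. Gatekeeping: the challenger is promoted only if it beats the current
    #    best in evaluation games (here, more than 55 percent of 200).
    wins = 0
    for g in range(200):
        order = (challenger, best) if g % 2 == 0 else (best, challenger)
        winner, _ = play(*order, explore=0.05)
        wins += (winner == 0) == (g % 2 == 0)
    if wins > 110:
        best = challenger

# Perfect play always hands the opponent a multiple of four; a well-trained
# policy tends toward that (for example, taking 2 when the pile is 10).
print({pile: pick(best, pile, explore=0.0) for pile in range(1, PILE + 1)})
```

The gatekeeping step is what keeps the curriculum honest: the student only becomes the teacher once it has demonstrably surpassed it.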
AlphaZero, published in late 2017, generalized this to three games simultaneously. Using the same algorithm and architecture, with no game-specific tuning, AlphaZero mastered chess, Go, and shogi, the Japanese relative of chess, all from scratch through self-play alone. In chess, it defeated Stockfish, the strongest conventional chess engine, playing a style that grandmasters described as creative and, at times, alien. It would sacrifice material for positional advantages in ways no program or human had systematically explored. In Go, it surpassed AlphaGo Zero. In shogi, it defeated the best existing program. Three games, one algorithm, no human knowledge. The principle was general.
But the principle came with a condition that was easy to overlook in the excitement. Every one of these successes shared three properties.
First, the rules were fixed and known. Go has a rulebook. Chess has a rulebook. Backgammon has a rulebook. The system could simulate the game perfectly because the game was perfectly defined.
Second, the outcome was unambiguous. A game of Go ends in a win or a loss. There is no debate about what counts as winning. The reward signal is clean.
Third, the system could generate unlimited experience at negligible cost. A game of Go can be simulated in milliseconds. AlphaGo Zero played 4.9 million games in three days. No physical world, no real stakes, no limit on how many times you can try.
These three properties define what researchers sometimes call a closed world. Self-play thrives in closed worlds. The question is what happens when any one of those properties breaks down.
The Open World
In 2017, researchers at OpenAI and DeepMind published a paper titled “Deep Reinforcement Learning from Human Preferences.” The paper described a method for training RL systems not on a fixed reward function but on human judgments about which behaviors looked better. Show a person two video clips of a simulated robot. Ask which one is doing a better job. Record the preference. Train a reward model on those preferences. Use RL to optimize the agent’s behavior according to the reward model.
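The reward-modeling step at the center of that recipe is compact enough to sketch. The version below is a minimal illustration, not the paper’s implementation: the human is simulated, the behaviors are random feature vectors, and a linear model stands in for the learned reward. But the loss is the standard logistic comparison used to fit a reward function to pairwise preferences.

```python
# A toy sketch of fitting a reward model to pairwise human preferences.
import numpy as np

rng = np.random.default_rng(0)
DIM, N = 8, 2000
true_w = rng.normal(size=DIM)          # hidden "human values" behind the labels

# Each pair compares two behaviors, represented here as feature vectors; the
# simulated human prefers whichever scores higher on the hidden values.
A = rng.normal(size=(N, DIM))
B = rng.normal(size=(N, DIM))
labels = (A @ true_w > B @ true_w).astype(float)   # 1.0 means A was preferred

# Fit a linear reward model so that P(A preferred) = sigmoid(reward(A) - reward(B)).
w = np.zeros(DIM)
diff = A - B
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(diff @ w)))   # predicted preference probabilities
    grad = diff.T @ (p - labels) / N        # gradient of the cross-entropy loss
    w -= 0.5 * grad

# The learned model now scores behaviors it has never seen; an RL algorithm
# would optimize the agent against these scores instead of a hand-written reward.
A_test, B_test = rng.normal(size=(500, DIM)), rng.normal(size=(500, DIM))
agree = np.mean((A_test @ w > B_test @ w) == (A_test @ true_w > B_test @ true_w))
print(f"reward model matches the simulated human on {agree:.0%} of held-out pairs")
```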
The method was called RLHF: reinforcement learning from human feedback. It worked, in a modest way, for teaching simulated robots to do backflips and other physical tasks where the desired behavior was clear to a human viewer but hard to specify as a mathematical function.
Five years later, RLHF transformed the AI industry. Applied to large language models, it was the technique that turned GPT from an autocomplete engine into ChatGPT: a system that follows instructions, declines harmful requests, explains its reasoning, and holds conversations. The first article in this series described RLHF in detail. Here, the question is different. Why was it necessary?
If self-play can produce superhuman Go from nothing but rules and compute, why can it not produce a helpful, honest assistant the same way?
The answer is that a conversation is not a game of Go. None of the three closed-world properties hold.
The rules are not fixed and known. In Go, the rules define what counts as a legal move, how stones are captured, and when the game ends. These rules are complete and unambiguous. A conversation has no equivalent. There is no fixed set of legal moves. A response can be any length, any tone, any topic. The space of possible actions is essentially unbounded, and what constitutes a reasonable action shifts with every exchange. No one can write down the rules of conversation in a form a machine can simulate, because the rules themselves change depending on context.
The outcome is not unambiguous. What counts as a “good” response depends on who is asking, what they need, what they already know, the social context, the cultural context, the stakes of the situation. A response that is perfect for one person is useless for another. Helpfulness is subjective. Honesty is contextual. Harmlessness is contested. There is no equivalent of “you won” or “you lost.” Two thoughtful people can disagree about whether a response was appropriate, and both can be right, because they are applying different values.
The system cannot generate unlimited experience at negligible cost. Self-play in Go works because the game can be perfectly simulated. Language does not have a simulator. A language model playing against itself generates text, but there is no ground truth against which to measure that text. A Go position can be objectively evaluated by playing the game to completion. A conversation cannot. If two copies of a language model debate whether a response is helpful, neither one has access to the truth of the matter. They are exchanging opinions, not computing outcomes.
This is the self-play paradox. The technique that produced the most powerful single achievement in RL history, superhuman Go without human knowledge, is precisely the technique that cannot, by itself, produce the most widely deployed RL application in history, a language model aligned to human values. The power of self-play comes from the fact that the system can be its own judge. The limitation of self-play is that for open-ended tasks, there is no judge without a human.
The paradox is not a failure of self-play. It is a feature of the problem. Self-play works when the goal can be defined formally: maximize the score, win the game, follow the rules to a measurable outcome. It fails when the goal is a human judgment that cannot be formalized without losing what makes it meaningful. “Be helpful” is not a reward function. It is a value.
What Humans Know That Algorithms Don’t
That divide between verifiable goals and human judgment has roots in the brain.
The brain does not learn only from its own experience. It also learns from other people. This seems obvious, but its implications for RL are not.
A child learning to use a knife does not learn purely by trial and error. The child watches a parent. The child notices how the parent holds the handle, how the blade angles against the bread, where the other hand goes to stay safe. The child imitates. The parent corrects. The correction is not a reward signal in the RL sense. It is a transmitted understanding of what “correct” means, delivered through gesture, through language, through demonstration, through the accumulated knowledge of how knives work and why fingers matter.
This capacity, learning from others, is not a minor add-on to the brain’s learning systems. It is a distinct cognitive faculty that emerged through its own evolutionary trajectory. The circuits the brain uses for understanding other people’s actions, intentions, and mental states are among the most distinctive features of human cognition.
In the early 1990s, a team led by Giacomo Rizzolatti at the University of Parma discovered neurons in the premotor cortex of macaque monkeys that fired both when the monkey performed an action and when the monkey watched another individual perform the same action. These cells, which the team called mirror neurons, blurred the line between acting and observing. When you watch someone reach for a cup, some of the same neurons fire that would fire if you were reaching yourself. The brain, in a sense, rehearses other people’s actions internally.
The discovery was controversial and remains debated in its details. But the broader finding, that the brain has dedicated circuitry for understanding the actions and goals of others, has been confirmed through multiple lines of evidence. Humans are, from infancy, exquisitely attuned to other people’s behavior. By twelve months, children follow an adult’s gaze to the object of the adult’s attention. By eighteen months, they infer what an adult intends, even when the adult fails: a child who watches an adult try and fail to pull apart a toy will pull the toy apart herself, reproducing the intended action rather than the observed failure.
This capacity enables something no self-play system can replicate: the transmission of goals that have never been formalized.
A parent does not write a reward function for “use a knife safely.” The parent demonstrates, corrects, and explains. The child does not optimize a numerical signal. The child builds an internal model of what the parent wants and why, using circuits that evolved specifically for this purpose. The knowledge of what “correct” means is transmitted socially, through observation and interaction, not discovered through individual trial and error.
This is relevant to RL because it maps directly onto the problem RLHF was built to solve. A language model cannot discover “be helpful” through self-play, because helpfulness is a human judgment. The model needs access to human preferences, just as the child needs access to the parent’s demonstrations. RLHF is, in this light, a crude approximation of social learning: the human evaluator plays the role of the demonstrating parent, and the preference comparison plays the role of the corrective gesture. The mechanism is vastly simpler than what the brain does. But the functional role is the same. It provides the system with a signal about human values that the system cannot generate on its own.
The brain’s social learning machinery has features that current RLHF lacks entirely. A child learning from a parent does not just record preferences. The child builds a model of the parent’s mind: what the parent knows, what the parent wants, what the parent would approve of in a new situation. Developmental psychologists call this theory of mind. By age four, most children can reason about what another person believes, even when that belief is false. This is not imitation. It is inference about an internal mental state that cannot be directly observed. The child is modeling the teacher, not just recording the teacher’s outputs.
RLHF records outputs. A human evaluator says “this response is better than that one.” The reward model learns to predict which responses humans will prefer. But it has no model of why the human preferred it, no representation of the evaluator’s goals or knowledge or values, no ability to predict what the evaluator would prefer in a genuinely novel situation. It is social learning with the social understanding stripped out. It works, within limits, because the statistical patterns in human preferences are regular enough to approximate. But it breaks down at the edges, precisely where the implicit values become complex, contextual, or contested.
The brain’s solution to the alignment problem, if we can call it that, is not reward optimization. It is social cognition: the ability to build models of other minds and use those models to infer goals that have never been stated explicitly. Evolution invested heavily in this capacity. The human prefrontal cortex, the brain region most associated with understanding other people’s mental states, expanded dramatically over the last two million years. The investment suggests the problem is hard, and that the solution is not a minor refinement of reward learning but a distinct cognitive architecture layered on top of it.
The Hybrid Frontier
The field has not resolved the paradox. It is navigating it.
The most direct attempt to reduce dependence on human feedback is a technique called RLAIF: reinforcement learning from AI feedback. Instead of asking a human which response is better, you ask another AI. The AI evaluator judges responses according to a set of written principles, a “constitution,” and generates preference data that can train a reward model the same way human data would. Anthropic published this approach in 2022 under the name Constitutional AI. The system writes a response, critiques its own response against the principles, revises, and repeats. In the reinforcement learning phase, an AI evaluator replaces the human annotator.
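The structure of the loop is easy to lay out, even if the substance lives inside the model. The sketch below is a schematic, not Anthropic’s implementation: the generate function is a hypothetical stand-in for a language model call, the two principles are invented, and the parsing is deliberately crude. What it shows is the shape of the two phases: self-critique and revision against written principles, then AI-generated preference labels in place of human ones.

```python
# A structural sketch of the Constitutional AI loop. `generate` is a stand-in
# for a language model call; it returns placeholder text so the control flow
# runs end to end, but in practice every call here would be a model generation.
CONSTITUTION = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest about its own uncertainty.",
]

def generate(prompt: str) -> str:
    return f"<model output for: {prompt[:40]}...>"   # stand-in, not a real model

def critique_and_revise(prompt: str) -> str:
    """Supervised phase: the model critiques its own draft against the principles."""
    draft = generate(prompt)
    for principle in CONSTITUTION:
        critique = generate(f"Critique this response against the principle "
                            f"'{principle}':\n{draft}")
        draft = generate(f"Rewrite the response to address this critique:\n"
                         f"{critique}\n\nOriginal response:\n{draft}")
    return draft

def ai_preference(prompt: str, response_a: str, response_b: str) -> str:
    """RL phase: an AI evaluator, not a human, labels which response better
    follows the constitution; these labels train the reward model."""
    verdict = generate(f"Given the principles {CONSTITUTION}, which response to "
                       f"'{prompt}' is better?\nA: {response_a}\nB: {response_b}")
    return "A" if "A:" in verdict else "B"           # crude stand-in for parsing

prompt = "Explain how to pick a lock."
revised, raw = critique_and_revise(prompt), generate(prompt)
print(ai_preference(prompt, revised, raw))   # preference data with no human in the loop
```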
Constitutional AI reduces the need for human feedback. It does not eliminate it. The constitution itself is written by humans. The principles are human values, translated into language, and the AI evaluator is judging against those principles using a model that was itself trained, in part, on human preferences. The technique shifts the locus of human input from individual preference labels to higher-level principles. The humans are still in the loop. They are just operating at a higher level of abstraction.
A more ambitious approach is debate: two AI systems argue for opposing positions, and a human judge evaluates the arguments. The theory, developed by researchers at OpenAI in 2018, is that it is easier for a human to judge an argument than to generate one. If the AI systems are incentivized to expose flaws in each other’s reasoning, the human can make a reliable judgment even when the topic exceeds the human’s expertise. Debate is self-play applied to persuasion. The two systems play against each other, and the “winner” is determined by the human judge’s verdict. It imports the scalability of self-play into the evaluation of open-ended tasks.
Debate is an active research area. It has not been deployed at scale. The theoretical promise is clear: if it works, it would allow human oversight of AI systems that are more capable than the humans overseeing them. The practical challenges are substantial. The AI debaters might collude, producing arguments that sound good to the judge but are misleading. The judge might be systematically fooled by persuasive but incorrect reasoning. The format might favor rhetorical skill over truthfulness. These are the same problems that plague human institutions built on adversarial argument, from courtrooms to academic peer review. Importing self-play into evaluation does not eliminate the need for human judgment. It restructures it.
Meanwhile, a different strand of research is pushing self-play back into language model training, not for alignment but for reasoning. Multiple research groups in 2025 demonstrated that language models can improve their problem-solving abilities by playing structured games against themselves, with one copy generating problems and another solving them. The approach works for tasks with verifiable answers: mathematics, coding, logic puzzles. These are closed-world problems embedded in an open-world system. The math has a right answer. The code either runs or it does not. Self-play improves the reasoning. But the moment the task requires judgment rather than verification, the technique hits the same wall. You can self-play your way to better arithmetic. You cannot self-play your way to better values.
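What makes this work is that the reward comes from a verifier rather than a judge. The sketch below makes the distinction concrete; the candidate solutions are hard-coded stand-ins for what a solver model would generate, and the function and test names are invented for the example. The point is the reward: a candidate program either passes the tests or it does not, and that binary outcome is a signal no human has to provide.

```python
# A minimal sketch of verification as a reward signal for self-play on code.
def verify(candidate_source: str, tests: list[tuple[tuple, int]]) -> float:
    """Binary reward: 1.0 if the candidate passes every test case, else 0.0."""
    namespace = {}
    try:
        exec(candidate_source, namespace)     # the code either runs or it does not
        fn = namespace["solution"]
        return float(all(fn(*args) == expected for args, expected in tests))
    except Exception:
        return 0.0

# A problem the "proposer" copy might pose, with test cases that define success.
tests = [((2, 3), 5), ((10, -4), 6), ((0, 0), 0)]

# Two attempts the "solver" copy might generate.
attempt_good = "def solution(a, b):\n    return a + b"
attempt_bad  = "def solution(a, b):\n    return a - b"

print(verify(attempt_good, tests))   # 1.0: positive reward, no human judgment needed
print(verify(attempt_bad, tests))    # 0.0: negative signal, again fully automatic

# There is no equivalent verify() for "was this response helpful?", which is
# where the same trick stops working.
```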
The emerging picture is that the most capable AI systems will use both methods, but for different things. Self-play and its variants will handle tasks where success can be verified: formal reasoning, game-playing, coding, mathematical proof. Human feedback, in some form, will remain necessary for tasks where success is defined by human values: helpfulness, honesty, safety, appropriateness, fairness. The boundary between these two categories is the central question of AI alignment. Where does verification end and judgment begin? For a mathematical proof, the answer is clear. For a medical recommendation, it is not. For a policy proposal, even less so. For a casual conversation, the entire notion of “correct” is dissolved in context.
The deepest lesson of the self-play paradox is about the nature of goals. In a closed game, the goal is part of the rules. Win. Maximize the score. The agent can discover how to achieve the goal without anyone explaining what the goal means. In the open world, the goal is not part of the rules. “Be helpful” is not a mathematical function. It is a human judgment, shaped by culture, context, and values that shift over time. No amount of self-play can discover a goal that lives outside the system playing the game.
This is why the most celebrated achievement of self-play, AlphaGo’s move 37, and the most important application of human feedback, RLHF in language models, are not in competition. They are solving different problems. One discovers optimal strategies within a defined goal. The other discovers what the goal should be. The first can be done by a system playing against itself. The second requires access to the species that has the goals.
The brain, as usual, does both. It learns from its own experience, through the reward prediction system traced in Article 2. And it learns from other minds, through a social cognitive system that evolution built for exactly the purpose that RLHF approximates: acquiring goals and values that cannot be discovered alone. The two systems are not redundant. Each does something the other cannot. The RL agent that plays against itself gets stronger. The RL agent that listens to humans gets aligned. The challenge for the field is building systems that do both.
The next article turns to a question those systems will eventually need to answer. Self-play learns strategies. Human feedback learns values. But neither one learns how the world works. For that, the agent needs something else: an internal model of the environment, a representation of what will happen next. The question of how to build that model, and what happens when you get it right, leads to the current frontier of the field.
Next in The RL Spiral: The World Inside.


