The RL Spiral, Part 5: The Self-Play Paradox
Self-play produced the strongest Go player in history. Human feedback produced ChatGPT. RL’s two greatest successes pull in opposite directions.
This is the fifth article in The RL Spiral, an eight-part series on reinforcement learning. The previous article, “When RL Learned to See,” traced how deep learning gave RL the ability to build its own representations. This one is about what happened when RL stopped needing human data entirely, and why it then needed humans more than ever.



