Self-play produced the strongest Go player in history. Human feedback produced ChatGPT. RL’s two greatest successes pull in opposite directions.
The RL Spiral, Part 5: The Self-Play Paradox