Scaling Laws and Emergence
When Bigger Becomes Different
This is Chapter 9 of A Brief History of Artificial Intelligence.
In the summer of 2020, a researcher at OpenAI typed a prompt into GPT-3 in a format the model had never been explicitly trained on: a few examples of English sentences paired with their French translations, followed by a new English sentence. The model—175 billion parameters, trained on half a trillion words—completed the pattern. It translated the sentence into French.
This shouldn’t have worked so well. GPT-3 was trained to predict the next word, not to translate languages from in-context examples. No one had fine-tuned it for translation. No one had expected it to work this well or this generally. The researchers tried other tasks: give the model a few examples of questions and answers, and it would answer new questions. Give it examples of code and descriptions, and it would write code from descriptions. Give it examples of anything, and it would figure out the pattern.
They called it “few-shot learning.” It had emerged—appeared spontaneously—as a consequence of scale.
Jared Kaplan, one of the researchers who had mapped the scaling laws that predicted GPT-3’s raw performance, later recalled his reaction: “We knew the loss would be lower. We knew it would be better at predicting text. But we didn’t know it would be able to do all these things.” The model had developed capabilities that weren’t in its training objective, weren’t anticipated by its designers, and couldn’t be explained by simply extrapolating from smaller models.
This chapter is about a discovery that transformed artificial intelligence: that scale—sheer size, data, and computation—doesn’t just make systems incrementally better. It makes them qualitatively different. Capabilities appear as if from nowhere when invisible thresholds are crossed. And no one fully understands why.
The Bitter Lesson
Richard Sutton has spent more than four decades thinking about how machines learn. He co-wrote the standard textbook on reinforcement learning. He supervised David Silver, the PhD student who went on to lead DeepMind’s AlphaGo team. In 2019, he sat down and wrote a short essay that would become one of the most cited pieces of informal writing in AI.
He called it “The Bitter Lesson.”
The lesson, as Sutton saw it, was simple: AI researchers keep betting on human knowledge, and they keep losing. Across seventy years of the field’s history, the pattern repeated. Researchers would design clever features, encode expert knowledge, build systems that captured human understanding of a problem. And then, eventually, someone would come along with a dumber approach—less knowledge, more computation—and win.
Go was the paradigm case. For decades, researchers built Go programs by encoding what masters knew: opening patterns, shape recognition, strategic principles about territory and influence. These expert systems played respectably but plateaued far below professional level. Then came AlphaZero, which knew nothing about Go except the rules. It learned entirely through self-play—millions of games against itself, pure computation without human knowledge. Within days it had surpassed every human player who had ever lived. Human expertise lost to raw computation and scale.
“The biggest lesson that can be read from 70 years of AI research,” Sutton wrote, “is that general methods that leverage computation are ultimately the most effective, and by a large margin.”
The lesson was bitter because it contradicted researchers’ deepest intuitions. They wanted to believe that understanding mattered, that human insight was essential, that the path to artificial intelligence ran through genuinely grasping how intelligence worked. But again and again, the systems that won were the ones that scaled.
What Sutton couldn’t have known in 2019 was how far the lesson would extend. Language models were about to prove his point beyond anything he had imagined.
Finding the Laws
In 2019, a team at OpenAI set out to answer a practical question: if you’re going to train a very large language model, how should you allocate your resources? How much should you spend on a bigger model versus more training data? Is there an optimal balance?
The project was led by Jared Kaplan, a physicist by training who had come to AI through an unusual path. He had worked on string theory, studying the mathematical structure of the universe. Now he was studying the mathematical structure of neural networks. The physicist’s instinct—to look for underlying laws—would prove prescient.
Kaplan’s team trained dozens of language models, varying their size from thousands to billions of parameters. They trained them on different amounts of data. They measured how well each model predicted text—its “loss”—and plotted the results.
What they found startled them. The relationship between scale and performance wasn’t noisy or unpredictable. It followed precise mathematical laws. Plot loss against model size on a log-log graph, and you get a straight line. Double the parameters, loss drops by a predictable amount. Double them again, same proportional improvement. The relationship held across many orders of magnitude.
“We expected some relationship,” Kaplan later said. “We didn’t expect it to be so clean.”
The cleanness was what mattered. Complex systems usually don’t behave so simply. Train a model twice as big, and you might expect anything—better performance, worse performance, instability, overfitting. But language models obeyed power laws, as if following some deep mathematical structure that no one had anticipated.
The implications were immediate. If you could predict performance from scale, you could plan. You could calculate in advance whether a training run would be worth the investment. You could optimize the allocation of resources—how much to spend on parameters versus data versus compute. The guesswork went out of AI scaling.
The paper, “Scaling Laws for Neural Language Models,” appeared in January 2020. Within months, it had become required reading at every major AI lab. The age of scaling by faith was over. Now they had equations.
The Numbers
The scaling laws themselves are simple to state, even if their implications are profound.
Loss—how wrong the model’s predictions are—scales with the number of parameters (N), the amount of training data (D), and the compute budget (C) according to power laws. In rough terms:
For each tenfold increase in parameters, loss drops by about 17 percent. For each tenfold increase in data, about 20 percent. For each tenfold increase in compute, about 11 percent.
These percentages might sound modest. They’re not. They compound. A model with a thousand times more parameters and a thousand times more data would have substantially lower loss—and the improvement is predictable before you train it.
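To make the compounding concrete, here is a minimal sketch in Python of a pure power law of the form loss ∝ scale^(−α). The exponents are illustrative values chosen to roughly match the percentages quoted above; the constants fitted in the actual paper differ slightly, and the joint dependence on parameters and data together is more complicated than treating each factor alone.

```python
# A pure power law: the fraction of the original loss that remains after
# scaling one factor (parameters or data) up by some multiple.
def remaining_loss(scale_factor: float, alpha: float) -> float:
    return scale_factor ** (-alpha)

# Illustrative exponents, chosen to match the rough percentages in the text:
ALPHA_PARAMS = 0.08  # ~17% loss reduction per tenfold increase in parameters
ALPHA_DATA = 0.10    # ~20% loss reduction per tenfold increase in data

for factor in (10, 100, 1_000):
    params_left = remaining_loss(factor, ALPHA_PARAMS)
    data_left = remaining_loss(factor, ALPHA_DATA)
    print(f"{factor:>5}x params -> {params_left:.2f} of original loss; "
          f"{factor:>5}x data -> {data_left:.2f}")
```

Run it and the compounding is visible: a thousandfold increase in parameters alone leaves roughly 57 percent of the original loss, and a thousandfold increase in data leaves roughly half, all predictable before any training begins.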
The laws also revealed trade-offs. Given a fixed compute budget, you could choose: spend it on a bigger model trained on less data, or a smaller model trained on more data. The optimal balance, the OpenAI team found, depended on your goals and constraints.
Two years later, researchers at DeepMind refined the picture. Their Chinchilla paper showed that previous large models, including GPT-3, had been trained suboptimally—too many parameters, not enough data. The optimal strategy, they argued, was to scale model size and training data roughly equally. A 70-billion parameter model trained on more data could match a 175-billion parameter model trained on less.
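As a rough illustration of what scaling “roughly equally” means in practice, the sketch below splits a compute budget using two widely cited approximations rather than the paper’s fitted formulas: training compute of about 6 floating-point operations per parameter per token, and a compute-optimal ratio of roughly twenty training tokens per parameter. The numbers are back-of-the-envelope, not the paper’s own.

```python
def compute_optimal_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training compute budget between model size (N) and data (D),
    assuming C ~= 6 * N * D and a fixed compute-optimal tokens-per-parameter ratio."""
    # With C = 6 * N * D and D = tokens_per_param * N:
    #   C = 6 * tokens_per_param * N**2  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = (compute_flops / (6 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Roughly a Chinchilla-scale budget of about 6e23 FLOPs:
n, d = compute_optimal_allocation(6e23)
print(f"~{n / 1e9:.0f}B parameters trained on ~{d / 1e12:.1f}T tokens")
```

The output lands near 70 billion parameters and 1.4 trillion tokens, which is roughly the recipe Chinchilla itself followed.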
This wasn’t abstract theory. It was actionable engineering. Every major lab adjusted their training recipes in response. The Chinchilla scaling laws became industry standard.
When Capabilities Appear
Scaling laws predict loss. But loss is just a number—how well the model predicts text. What researchers actually cared about was what the model could do.
And here, the scaling laws broke down. Not because they were wrong, but because capabilities didn’t follow smooth curves. They emerged.
Consider arithmetic. Ask a small language model to add two three-digit numbers, and it fails. The model is predicting tokens, not computing sums, and it hasn’t seen enough examples to internalize the patterns. Ask a medium-sized model, and it still fails. The capability seems absent.
Then you cross some threshold—nobody knows exactly where—and suddenly the model can do arithmetic. Not perfectly, but reliably enough to be useful. The capability appears as if switched on.
The researchers called this phenomenon “emergence.” A capability is emergent if it’s absent or barely present in smaller models and then appears relatively suddenly in larger ones. The transition isn’t gradual. The model can’t do it, can’t do it, can’t do it—then it can.
GPT-3 exhibited dozens of emergent capabilities:
Few-shot learning was the most dramatic. Show the model a few examples of any task—translation, summarization, question answering, code generation—and it infers the pattern. This meta-learning ability wasn’t in the training objective. It appeared. A concrete example of such a prompt appears below.
Instruction following emerged somewhere between GPT-2 and GPT-3. Tell GPT-2 to write a poem about the ocean, and it produces ocean-related text but ignores the format request. GPT-3 follows instructions with surprising fidelity.
Multi-step reasoning showed similar discontinuities. Small models answer simple questions but fail when problems require chaining several logical steps. Larger models exhibit something that looks like reasoning—imperfect but functional.
Code generation appeared suddenly. Below some scale, models produce syntactically broken nonsense. Above it, they write working programs.
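Few-shot prompting, mentioned at the top of this list, is easiest to see with a concrete example. The prompt below is invented for illustration; only the format matters. The model is never told what the task is. It is simply asked to continue the text.

```python
# An illustrative few-shot prompt: worked examples, then an unfinished case.
# The sentences are made up; a sufficiently large model trained only on
# next-token prediction tends to complete the last line with a translation.
prompt = """\
English: The weather is nice today.
French: Il fait beau aujourd'hui.

English: Where is the train station?
French: Où est la gare ?

English: I would like a cup of coffee.
French:"""
```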
Why does emergence happen? Honestly, we don’t fully know. One hypothesis: certain capabilities require a minimum amount of “representational capacity”—enough parameters to encode the necessary patterns—and simply don’t work below that threshold. Another: some skills depend on other skills, creating cascades where everything unlocks at once. A third: emergence might partly be an artifact of measurement—capabilities improve smoothly but only become detectable when they cross evaluation thresholds.
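The third hypothesis, that emergence is partly an artifact of measurement, can be illustrated with a toy calculation. The numbers below are invented: if a model’s per-digit accuracy on arithmetic improves smoothly as it scales, the probability of producing an entire eight-digit answer exactly right still climbs steeply over a narrow range, and an all-or-nothing benchmark records that climb as a sudden jump.

```python
# Invented per-digit accuracies for a sequence of progressively larger models.
per_digit_accuracy = [0.50, 0.65, 0.80, 0.90, 0.96, 0.99]
answer_length = 8  # every digit must be right for an "exact match"

for p in per_digit_accuracy:
    exact_match = p ** answer_length  # all digits correct, assuming independence
    print(f"per-digit accuracy {p:.2f} -> exact-match accuracy {exact_match:.3f}")
```

The smooth underlying improvement, from 50 to 99 percent per digit, shows up on the exact-match metric as a leap from well under 1 percent to over 90 percent.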
The honest answer is that emergence remains poorly understood. We can observe it. We can catalog examples. We cannot yet predict when it will happen or what capabilities will appear.
The Gold Rush
The scaling laws changed the economics of AI research. If capability scales with compute, and compute scales with money, then the richest organizations would build the most capable systems.
The math was stark. GPT-3’s training cost several million dollars. GPT-4’s reportedly exceeded one hundred million. Models now in development may cost billions. The compute required for frontier AI has been doubling every six to ten months—faster than Moore’s Law ever moved chip performance.
This created a new kind of AI research, one that resembles particle physics more than computer science. Training runs become events, planned for months, consuming entire data centers for weeks. The largest experiments are too expensive to repeat; they must work the first time. The organizations capable of competing can be counted on two hands.
Sam Altman, OpenAI’s CEO, has been explicit about the strategy. “The scaling hypothesis,” he said, “is that we can continue to make systems more capable by making them bigger.” OpenAI’s structure—capped profit, massive fundraising, partnership with Microsoft—was designed for this race. So were the structures of competitors: Google’s DeepMind, Anthropic, Meta’s AI lab, Elon Musk’s xAI.
Some researchers watched this concentration with unease. The craft of AI—small teams, clever algorithms, breakthrough insights—was giving way to industrial production. If capability is just a function of investment, what happens to the researchers who can’t afford the investment? What happens to academic AI?
Others argued this was simply how science works when it gets serious. Particle physics requires accelerators. Genomics requires sequencing centers. Why should AI be different?
The debate continues. But the direction is clear: AI development has become big science, with all that implies for competition, concentration, and control.
Where Scaling Breaks
But scaling has limits. Several have already become visible.
Data scarcity is perhaps the most fundamental. Large language models consume text voraciously. GPT-4 was reportedly trained on trillions of tokens. But the internet, though vast, is finite. By some estimates, frontier models have already ingested most of the high-quality text available online. What happens when you run out?
One answer is synthetic data—text generated by AI systems themselves. Train on AI output to create more AI output. This works, to a degree, but carries risks. Train too much on synthetic data and model quality can degrade, a phenomenon researchers call “model collapse.” The diversity and groundedness of human-generated text may be irreplaceable.
Compute costs impose another ceiling. Training frontier models requires thousands of specialized chips running for months. The electricity consumption is enormous; while it runs, a single training job can draw as much power as a small city. The environmental implications are significant and growing.
Diminishing returns are a mathematical inevitability. A power law means each doubling of scale trims loss by the same small fraction, and a fixed fraction of an ever-shrinking loss is an ever-shrinking gain. Going from terrible to mediocre is easy; going from excellent to superb is hard. At some point, the marginal gains no longer justify the marginal costs.
Algorithmic efficiency offers a different path. Researchers continue to find ways to get more capability from less compute—better architectures, more efficient training, smarter data selection. Each algorithmic improvement is equivalent to some amount of additional scaling. If algorithmic progress accelerates, brute-force scaling might be superseded.
Where are the limits? The honest answer is that no one knows. Optimists argue that data constraints can be solved with synthetic generation and multimodal learning; that compute costs will fall with better hardware; that we’re nowhere near the ceiling. Pessimists argue that scaling has already begun to hit walls, that the returns are diminishing faster than the costs are falling, that something fundamentally different will be needed.
Both sides can point to evidence. Neither can prove their case. We’ll find the limits by reaching them.
The Paradox at the Heart
The scaling laws create a strange situation. We can predict performance with remarkable accuracy. Tell me your parameter count, training tokens, and compute budget, and I can estimate your model’s loss before you train it. The equations work.
But we cannot predict capabilities. We don’t know what a larger model will be able to do. Emergence means abilities appear unpredictably, without warning, at thresholds we can’t anticipate.
This creates profound challenges. If you’re trying to assess whether a model is safe to deploy, you can’t just measure its size. A model slightly larger than a safe predecessor might have qualitatively different capabilities—including potentially dangerous ones. The transition isn’t smooth. There’s no gentle slope to evaluate. There are cliffs.
It also means we’ve built systems we don’t fully understand. Language models are trained on simple objectives—predict the next token—yet they develop capabilities far beyond that objective. They learn to reason, to code, to follow instructions, to learn from examples. None of this was directly trained. It emerged from scale.
We predicted the performance. We didn’t predict the capabilities. We still don’t understand why they appeared.
Closing: The Shape of Progress
This chapter has traced a discovery that reshaped artificial intelligence: scale produces capability, and capability emerges unpredictably from scale.
The scaling laws gave the field a roadmap. Invest in compute, collect more data, build bigger models, and performance improves predictably. The equations worked. The gold rush followed.
But the laws also revealed how much we don’t understand. Why does few-shot learning emerge at a certain scale? Why does arithmetic suddenly work? What other capabilities are waiting at thresholds we haven’t crossed? The systems we’ve built have surprised their creators at every step.
Scale wasn’t just making systems better at what they already did. It was making them capable of things they couldn’t do before. Intelligence—or something that functions like it—was emerging from the sheer accumulation of parameters and data.
But even exponential growth eventually slows. The data runs short. The costs compound. The returns diminish. The question isn’t whether scaling will hit limits—it will. The question is what will still be missing when it does, and whether we’ll understand enough to find another path.
Notes and Further Reading
On Scaling Laws
Jared Kaplan’s “Scaling Laws for Neural Language Models” (OpenAI, 2020) established the field’s foundation. The paper is technical but readable, and the key graphs—those remarkably straight lines on log-log plots—are worth examining directly. DeepMind’s “Training Compute-Optimal Large Language Models” (Hoffmann et al., 2022), the Chinchilla paper, refined the laws and changed industry practice. For accessible discussion, Kaplan has given several talks explaining the intuitions behind the mathematics.
On Emergence
“Emergent Abilities of Large Language Models” by Wei et al. (Google, 2022) catalogs dozens of emergent capabilities and attempts to characterize when they appear. It’s the most systematic treatment available. For a contrarian view, “Are Emergent Abilities of Large Language Models a Mirage?” (Schaeffer et al., 2023) argues that some emergence may be an artifact of how we measure rather than a sharp transition in actual capabilities. The debate remains unresolved.
On The Bitter Lesson
Richard Sutton’s essay is short—under two thousand words—and freely available on his website. It’s one of the most influential pieces of informal writing in AI and provides essential intellectual context for the scaling revolution. Whether you agree with Sutton’s conclusions or not, understanding his argument is necessary for understanding why the field moved the way it did.
On Limits and the Future
The question of how far scaling can go generates constant debate. For the optimist’s case, Ilya Sutskever’s talks (as OpenAI’s former chief scientist) articulate why scaling might continue to produce breakthroughs. For skepticism, Yann LeCun has repeatedly argued that pure scaling of language models won’t reach general intelligence and that fundamentally different approaches are needed. The disagreement among leading researchers is itself informative: if the experts don’t agree, perhaps we genuinely don’t know.
On Costs and Concentration
Estimates of training costs for frontier models are notoriously imprecise—companies don’t disclose them—but journalistic investigations have attempted reconstructions. For environmental concerns, Strubell et al.’s “Energy and Policy Considerations for Deep Learning in NLP” (2019) raised early alarms, though subsequent work has complicated the picture. The concentration of AI capability in a handful of organizations is analyzed in various policy papers; Stanford’s AI Index provides useful annual data on compute concentration.


