The Transformer Revolution
How Attention Became “All You Need”. What transformers really changed. Why context became computation.
This is Chapter 7 of A Brief History of Artificial Intelligence.
In June 2017, eight researchers at Google Brain published a paper with a title so audacious it bordered on arrogance: “Attention Is All You Need.”
Ashish Vaswani, the lead author, was a soft-spoken researcher who had joined Google two years earlier. His team was working on machine translation—teaching computers to convert text from one language to another. The dominant approach used recurrent neural networks that processed sentences word by word, maintaining a hidden state that carried context forward. It worked, after a fashion. But it was slow, hard to train, and struggled with long sentences.
Vaswani’s team proposed something radical: throw out recurrence entirely. No sequential processing. No hidden state being passed forward. Just attention—a mechanism that let every word in a sentence look at every other word simultaneously and decide what mattered.
The reviewers at the Neural Information Processing Systems conference were skeptical. One asked whether the authors had actually run the experiments they claimed, or if this was just a theoretical proposal. The results seemed too good. Translation quality that beat the best recurrent models, trained in a fraction of the time.
But the experiments were real. And within eighteen months, transformers—as the architecture came to be called—would replace recurrent networks in virtually every major natural language processing system: Google Translate, question answering, and eventually models like BERT and GPT that would reshape how we interact with AI.
The revolution happened fast. But it built on a foundation that took years to establish. The story of how attention became “all you need” is a story about recognizing what matters.
The Sequential Trap
In the early 2010s, recurrent neural networks seemed like the natural way to process language. Language is sequential. You read sentences word by word, left to right, building up understanding as you go. A neural network should do the same.
Ilya Sutskever, a former student of Geoffrey Hinton who had moved to Google Brain, helped pioneer sequence-to-sequence models in 2014. The idea was elegant: one recurrent network reads a sentence, encoding it into a fixed-size vector. Another recurrent network reads that vector and generates a translation. For the first time, neural networks could translate entire sentences without hand-crafted linguistic rules.
It worked. Google deployed recurrent networks for translation in 2016. The quality improvements were dramatic—sentences that were once gibberish became coherent. Users noticed. This was the first time most people encountered neural networks doing something unmistakably useful.
But the researchers knew there were problems.
The bottleneck was brutal. Every sentence, no matter how long or complex, had to be compressed into a single vector. A five-word sentence and a fifty-word sentence used the same representation size. The network had to cram all the meaning, all the context, all the relationships into that fixed-size bottleneck.
Sequential processing was slow. Modern GPUs could perform thousands of calculations simultaneously. But recurrent networks forced sequential computation. To process word 50, you had to first process words 1 through 49. No parallelization. No speedup. Just plodding, word by word.
Long-range dependencies degraded. Information from early in a sentence had to propagate through dozens of steps to influence later words. At each step, a little information was lost. By the time you reached the end of a paragraph, the beginning was a faint echo.
Yoshua Bengio, one of the pioneers of neural networks, had wrestled with these problems for years. In a 2014 paper, he and his student Dzmitry Bahdanau proposed a solution: attention.
Learning to Look Back
Bahdanau was working on machine translation at the University of Montreal. He noticed something about how the recurrent models failed. When translating long sentences, they would start strong, then drift into nonsense halfway through. The fixed-size bottleneck couldn’t hold enough information.
His insight was simple. When generating a translation, why compress the entire source sentence into one vector? Why not let the network look back at the source words and decide which ones matter for each target word?
Here’s what he proposed: When translating “The cat sat on the mat” to French, as you generate “chat” (cat), let the network look at all source words and attend strongly to “cat.” When generating “tapis” (mat), attend to “mat.” When generating “sur” (on), attend to “on.”
This attention wasn’t programmed. The network learned it. Train on millions of translated sentences, and the network discovers which source words predict each target word. The attention weights—how strongly to attend to each source word—emerge from training.
The results surprised everyone. Translation quality jumped. Long sentences that had been incomprehensible became accurate. The attention weights revealed what the network had learned—grammatical relationships, semantic connections, the structure of language itself.
Bengio saw immediately what this meant. “We don’t need to compress everything,” he told his research group. “The network can access what it needs directly.”
But attention in 2015 was still just a tool, an add-on to recurrent networks. The sequential processing remained. Words were still processed one at a time. Attention helped, but it didn’t solve the fundamental problem.
That would require a more radical move.
The Audacious Proposal
Vaswani joined Google in 2015. He was assigned to work on translation with Jakob Uszkoreit, a researcher who had been thinking about attention mechanisms. They kept asking the same question: if attention is so powerful, why do we still need recurrence?
The standard answer was: you need recurrence to process sequences. How else would the network know what order words appeared in? How would it build up understanding incrementally?
But Vaswani and Uszkoreit wondered if those were the right questions. Maybe you didn’t need sequential processing. Maybe you could process all words simultaneously, using attention to capture relationships. Maybe position could be encoded another way.
They started experimenting in late 2016. The initial results were strange—the networks learned, but slowly, and the quality was inconsistent. Noam Shazeer, a senior researcher at Google, joined the project. He suggested a crucial modification: use multiple attention heads, each learning different types of relationships.
Llion Jones contributed positional encodings—a clever way to inject information about word order without sequential processing. Aidan Gomez, an intern from the University of Toronto, helped implement and debug the architecture. Łukasz Kaiser and Illia Polosukhin worked on training strategies that made the system stable.
By early 2017, they had something working. And it worked shockingly well.
The architecture—they called it the Transformer—processed entire sentences in parallel. Every word attended to every other word simultaneously. No sequential bottleneck. No information degradation. Just attention, deciding what mattered.
They trained it on English-to-German translation, the standard benchmark of the day, where quality is measured by BLEU score (a common automatic metric, higher being better). The best recurrent models with attention reached about 26.4. The Transformer hit 28.4.
That’s not a small improvement. That’s a leap.
They submitted the paper to NIPS (now NeurIPS) with that audacious title: “Attention Is All You Need.” The reviews were mixed. Some reviewers were excited. Others questioned whether this could possibly work as well as claimed.
At the conference in December 2017, the paper presentation drew a packed room. The questions were sharp, skeptical. Had they really eliminated recurrence? Was this just a trick that worked for translation but wouldn’t generalize?
Vaswani stood at the podium and showed the results. Translation quality. Training time. Attention weights visualizations showing what the network learned. The skepticism began to crack.
Within six months, every major research lab was experimenting with transformers. Within a year, they were the new standard. The revolution had begun.
What Transformers Actually Do
To understand why transformers work, you need to understand self-attention.
Why “self” attention? In early attention mechanisms (like Bahdanau’s 2015 translation work), attention operated between two different sequences—a source sequence and a target sequence. The translator attended from target words back to source words. Two different things attending to each other.
Self-attention is different. The sequence attends to itself. Every word in a sentence attends to every other word in the same sentence. “Self” because it’s self-referential—the input attends to the input.
Think about reading this sentence: “The cat sat on the mat because it was tired.”
What does “it” refer to? Your brain doesn’t process this word by word, building up a hidden state. You look at the whole sentence. You see “cat” and “it” and recognize the connection. You attend from “it” to “cat” to resolve the reference. You attend from “tired” to “cat” to understand what is tired. You attend from “because” to “sat” to understand the causal relationship.
Self-attention lets neural networks do the same thing.
Here’s how it works. For each word in the sentence, the network computes three things: a query, a key, and a value.
The names come from database retrieval, and the analogy is precise:
The query is a question: “What information do I need to understand this word?” The key is an index: “Here’s what information I have to offer.” The value is the actual information to be retrieved.
Why separate key and value? This is subtle but crucial.
Imagine a library. Books have titles (keys) and contents (values). When you search, you match your query against titles to find relevant books. But what you retrieve is the contents, not the titles.
Key and value serve different functions:
The key determines relevance: “Should I attend to this word?”
The value provides information: “What should I learn from this word?”
Consider “The bank by the river.” The word “river” serves two roles:
As a key, it signals: “I provide geographical/physical context.” This makes “bank” query it: “Do you help me understand what kind of bank this is?”
As a value, it provides: “Water, nature, physical location.” This is what “bank” actually retrieves to understand it’s a riverbank, not a financial institution.
If key and value were the same, you couldn’t separate “what makes me relevant?” from “what information do I provide?” You need both: matching for relevance and content for retrieval.
Mathematically, this means having separate learned transformations. For each word, the network learns:
How to construct queries: “Given this word, what should I ask for?”
How to construct keys: “Given this word, what do I offer as relevant?”
How to construct values: “Given this word, what information do I provide?”
These are three different learned linear transformations of the word’s representation. They’re not the same because they serve different purposes.
For the word “it” in our example:
Its query asks: “What am I referring to? I need noun information, probably animate, probably singular.”
It compares this query to all keys in the sentence:
“cat” key: “I’m a noun, animate, singular” → Strong match
“mat” key: “I’m a noun, inanimate, singular” → Weaker match
“sat” key: “I’m a verb, action” → Weak match
“tired” key: “I’m an adjective, state” → Weak match
Based on these matches, “it” attends most strongly to “cat.”
But what does “it” retrieve? Not the key (“I’m a noun, animate, singular”) but the value: the full semantic representation of “cat”—including that it’s an animal, has fur, can get tired, etc. This is the information “it” needs to inherit.
The attention weights—how strongly to attend to each word—come from comparing queries to keys. Mathematically: take the query for “it,” take the keys for all words, compute how similar they are (dot product), convert similarities to weights (softmax), and use these weights to take a weighted average of the values.
The result: “it” builds its representation by pulling information from other words, weighted by relevance. It attends strongly to “cat,” so it gets most of its semantic content from “cat’s” value. It attends weakly to other words, so it gets little from them.
This happens for every word simultaneously. “Cat” attends to other words to build its representation. “Tired” attends to other words to build its. Every word queries every other word’s key, then retrieves from their values based on relevance.
And crucially, this all happens in parallel. All attention calculations occur simultaneously. No sequential processing. No waiting for word 1 before processing word 2. Pure parallel computation.
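As a concrete illustration, here is a minimal sketch of that computation in Python with NumPy. The variable names, dimensions, and random weights are illustrative rather than taken from any real model; the only detail added to the description above is the division by the square root of the key dimension, a scaling the original paper applies to keep the dot products well behaved.

```python
import numpy as np

def softmax(scores, axis=-1):
    scores = scores - scores.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    exp = np.exp(scores)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """X: (seq_len, d_model) word representations -> refined representations."""
    Q = X @ W_q                               # queries: "what am I looking for?"
    K = X @ W_k                               # keys:    "what do I offer as relevant?"
    V = X @ W_v                               # values:  "what information do I provide?"
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # every word scored against every other word
    weights = softmax(scores)                 # one row of attention weights per word
    return weights @ V                        # weighted average of the values

rng = np.random.default_rng(0)
seq_len, d_model = 10, 64                     # ten tokens, as in the example sentence
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) * 0.1 for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)  # (10, 64): every word updated in one parallel step
```

Notice that nothing in this sketch knows about word order; in the real architecture, positional encodings are added to the input representations so that attention can tell “dog bites man” from “man bites dog.”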
Multi-head attention runs this entire process multiple times in parallel with different learned query/key/value transformations. Why? Because words need different types of information simultaneously.
One head might ask: “What’s the subject of this verb?”
Another might ask: “What earlier element does this refer back to?”
Another might ask: “What other elements play a similar role, even if they’re far away?”
Another might ask: “What happened before that makes this step necessary?”
These roles aren’t designed in advance; each head learns to ask different questions during training—different query/key/value transformations—capturing different types of relationships. Then their outputs are combined to form a rich representation incorporating multiple relationship types.
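A compact sketch of the multi-head variant, again with illustrative NumPy: each head gets its own learned query, key, and value projections, and a final output matrix mixes the heads back together. The toy configuration below (eight heads over 64 dimensions) is chosen for brevity; the original paper used eight heads over 512-dimensional representations.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, heads, W_o):
    """heads: one (W_q, W_k, W_v) triple per head; W_o mixes the heads back together."""
    outputs = []
    for W_q, W_k, W_v in heads:                        # each head has its own projections...
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
        outputs.append(weights @ V)                    # ...and so can learn a different relationship type
    return np.concatenate(outputs, axis=-1) @ W_o      # concatenate all heads, mix into one representation

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 10, 64, 8
d_head = d_model // n_heads
X = rng.normal(size=(seq_len, d_model))
heads = [tuple(rng.normal(size=(d_model, d_head)) * 0.1 for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(n_heads * d_head, d_model)) * 0.1
print(multi_head_attention(X, heads, W_o).shape)       # (10, 64)
```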
Stacking layers creates hierarchies of relationships. Early layers learn local patterns—adjacent words, basic grammatical relationships. Middle layers learn phrase-level and sentence-level structure. Late layers learn document-level organization and complex semantic relationships.
Each layer takes the representations from the previous layer and refines them through another round of self-attention. The first layer might help “bank” figure out it’s a riverbank. The second layer might help the whole phrase “bank by the river” figure out how it relates to the rest of the sentence. The third layer might connect this to broader discourse context.
The internal geometry: Through training, words develop geometric relationships in the high-dimensional representation space. Related words (by grammar, by meaning, by usage) occupy nearby regions. Attention weights reflect these geometric relationships—words attend strongly to nearby neighbors in representation space.
This geometric structure is what makes Transformers work. The representation space organizes itself to reflect linguistic structure. Attention weights navigate this geometry, pulling in relevant information from anywhere in the sequence.
The beauty of the mechanism: It’s conceptually simple—queries match keys, retrieve values—but powerful enough to capture arbitrary relationships in sequences. And it’s fully parallel, exploiting modern GPU capabilities.
No recurrence. No sequential bottleneck. Just every element attending to every other element simultaneously, constructing context dynamically through learned relevance.
The Context Problem Solved
The transformation wasn’t just technical. It was conceptual.
In recurrent networks, context was something you carried—a hidden state summarizing everything seen so far, passed forward to each new word. This made context fragile. It degraded. It had to compress everything into a fixed-size vector.
In transformers, context is something you construct. Each word builds its representation by attending to other words. The context for “cat” in “the cat sat on the mat” is different from its context in “the cat in the hat,” not because different information flowed forward, but because different attention patterns select different relevant information.
This means context is flexible. The same word develops different representations based on what it attends to. “Bank” in “river bank” attends to “river” and “shore.” “Bank” in “savings bank” attends to “money” and “account.” The representations diverge naturally.
It means context is unlimited. There’s no bottleneck. Each word can access any information in the sequence. The context for understanding a word at position 100 can include words at positions 1, 37, and 94 if those are relevant. Nothing needs to be compressed or carried forward.
And it means context is computed efficiently. All attention calculations happen in parallel. GPUs excel at parallel matrix operations. Transformers play to their strengths.
Vaswani’s co-author Jakob Uszkoreit later reflected: “We weren’t trying to invent a new architecture. We were trying to remove bottlenecks. It turned out that when you remove the bottlenecks, what’s left is surprisingly simple.”
BERT and GPT: Two Paths Forward
The Transformer architecture was general—it didn’t specify what task to train on. This generality led to two influential but fundamentally different approaches.
Jacob Devlin at Google Research wanted systems that could understand language. Answer questions. Classify documents. Extract information. His key insight: to understand language, you need context from both directions. When you read “The bank by the river,” you use words before and after “bank” to know it means riverbank, not financial institution.
Devlin built BERT (Bidirectional Encoder Representations from Transformers) to read bidirectionally. Every word could attend to every other word—past, present, and future.
Training: Randomly mask 15% of words in sentences, predict the masked words using full context.
Example: “The [MASK] sat on the mat” → predict “cat” using all surrounding words.
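A toy sketch of that masking step, using a hypothetical mask_tokens helper: hide roughly 15% of the words and record which ones the model must recover from the surrounding context. (BERT's actual recipe has extra wrinkles—some selected tokens are replaced with random words or left unchanged rather than masked—omitted here.)

```python
import random

def mask_tokens(tokens, mask_rate=0.15, mask_token="[MASK]"):
    """Return a masked copy of the sentence plus the positions the model must fill in."""
    masked, targets = [], {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_rate:
            masked.append(mask_token)
            targets[i] = tok               # the model is trained to recover this word
        else:
            masked.append(tok)
    return masked, targets

masked, targets = mask_tokens("the cat sat on the mat".split())
print(masked)    # e.g. ['the', '[MASK]', 'sat', 'on', 'the', 'mat']
print(targets)   # e.g. {1: 'cat'} -- predicted using words on BOTH sides of the blank
```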
Meanwhile, OpenAI wanted systems that could generate text. Alec Radford and colleagues built GPT (Generative Pre-trained Transformer) to generate word by word, left to right. Each word could only attend to previous words—you can’t peek at the future when generating.
Training: Predict the next word, always, using only previous words.
Example: Given “The cat sat on the,” predict “mat.”
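The same objective as a sketch: every position in a sentence becomes a training example whose target is simply the next word, and inside the model a causal mask keeps each position from peeking ahead.

```python
import numpy as np

tokens = "the cat sat on the mat".split()

# Every position becomes a training pair: (words visible so far) -> (next word to predict).
for i in range(1, len(tokens)):
    print(tokens[:i], "->", tokens[i])
# ['the'] -> cat
# ['the', 'cat'] -> sat
# ... up to ['the', 'cat', 'sat', 'on', 'the'] -> mat

# Inside the Transformer, left-to-right generation is enforced with a causal mask
# over the attention scores: position i may attend only to positions <= i.
print(np.tril(np.ones((4, 4), dtype=int)))   # 1 = visible, 0 = hidden (the future)
```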
This difference—bidirectional understanding vs. left-to-right generation—determined what each model could do naturally.
What Each Architecture Could Do
BERT excelled at understanding when you had the full text:
Task: “The animal didn’t cross the street because it was too tired.” Question: “What does ‘it’ refer to?”
BERT could use “tired” (after “it”) and “animal” (before “it”) simultaneously to resolve the reference. Bidirectional attention made understanding trivial.
But BERT couldn’t generate. Ask it to complete “The cat sat on the ___” and it had no mechanism for sequential generation. It was trained to fill in blanks when the full sentence was already there.
GPT excelled at generation:
Prompt: “Once upon a time, there was a village at the edge of a dark forest. One day, a stranger arrived and”
GPT would generate: “knocked on the mayor’s door. The mayor, suspicious of outsiders, hesitated before answering...” continuing naturally because generation was exactly what it was trained for.
But GPT couldn’t use future context. When processing “The bank by the river,” it hadn’t seen “by the river” yet. It had to decide what “bank” meant using only leftward context.
The Conventional Wisdom
In 2018-2019, the split seemed natural:
Use BERT for: Question answering, classification, information extraction—tasks where you have full text and need to understand it.
Use GPT for: Writing, dialogue, completion—tasks requiring sequential generation.
Different tools for different jobs. Understanding needs bidirectional context. Generation needs left-to-right structure.
The Unexpected Turn
Then GPT scaled. GPT-2 (2019) had 1.5 billion parameters, trained on 40GB of text. GPT-3 (2020) had 175 billion, trained on 570GB of text.
And something unexpected happened: GPT-3 became good at understanding tasks too.
It could answer questions, classify text, extract information—not by being fine-tuned on these tasks, but by treating them as text generation. The key was prompting: format the task as text that needs completing, and let the model generate the answer.
Question answering becomes generation:
Instead of a special QA architecture, you just format it as text to be continued:
Context: The animal didn't cross the street because it was too tired.
Question: Why didn't the animal cross the street?
Answer:
GPT reads this as text—just like any other text it’s seen during training—and generates the natural continuation: “Because it was too tired.”
The model isn’t doing “question answering” as a special task. It’s doing what it always does: predict what text comes next. The answer is simply the most likely continuation after “Answer:”.
Classification becomes generation:
Review: "This movie was terrible. Bad acting, worse plot."
Sentiment: negative
Review: "This movie was amazing! I loved every minute."
Sentiment: positive
Review: "This movie was boring and went nowhere."
Sentiment:
GPT generates the next word: “negative”.
Again, no special classification mechanism. The model has seen enough examples in the prompt to recognize the pattern. It generates the word that would naturally continue this pattern. Classification is just generating the right label.
Translation becomes generation:
English: Hello, how are you?
French: Bonjour, comment allez-vous?
English: The cat sat on the mat.
French:
GPT generates: “Le chat s’est assis sur le tapis.”
Even though GPT wasn’t specifically trained as a translation system, it has seen enough multilingual text during training to learn translation patterns. Given the prompt format, it generates the French translation as the natural text continuation.
This was profound. A model trained purely on “predict the next word” could handle tasks designed for specialized understanding models. You could frame almost any task as text generation. Give examples in the prompt (few-shot learning), and GPT would learn the pattern and generate the appropriate response.
The model wasn’t given task-specific training or special architectures for QA, classification, or translation. It just generated text. But generating the right text, in the right format, solved these tasks.
By late 2022, the generative approach dominated. ChatGPT, built on GPT-3.5, reached millions of users within weeks of launch. GPT-4 followed in 2023 and set new benchmarks. The field chose the architecture that could both understand and generate, even if generation was the training objective.
The lesson: Generation at scale learns understanding as a byproduct. To generate coherent text, you must understand it. To continue a story, you must understand narrative. To complete a sentence sensibly, you must understand grammar and meaning.
BERT’s insight—that bidirectional context helps understanding—remained true. But GPT’s approach proved more general. Understanding became a special case of generation.
Two architectures from the same foundation. Both showed that Transformers, trained at scale, could learn language in ways that seemed impossible just years before. But only one architecture could do everything.
Why Scale Suddenly Mattered
Something strange happened as Transformers got bigger. Capabilities didn’t just improve incrementally. They emerged suddenly.
GPT-2 could barely do arithmetic. GPT-3 could solve simple math problems. GPT-4 could reason through complex scenarios. Each increase in scale brought qualitative shifts, not just quantitative improvements.
This wasn’t supposed to happen. The standard theory of learning was that capabilities scaled smoothly. More data and parameters should give gradually better performance, with diminishing returns.
But Transformers seemed to have discontinuities. Certain capabilities appeared suddenly at certain scales. Multi-step reasoning. Understanding analogies. Following complex instructions. These abilities weren’t present in smaller models, then appeared in larger ones.
Researchers called these “emergent capabilities.” The term was contentious—some argued nothing truly “emerged,” that these abilities were latent in smaller models but too weak to observe. Others insisted something qualitatively new appeared at scale.
The debate continues. But the empirical pattern is clear: bigger Transformers unlock capabilities that smaller ones don’t have.
Scaling laws quantified this. In 2020, researchers at OpenAI published analysis showing that Transformer performance followed predictable power laws with respect to model size, dataset size, and computational budget. Double the parameters, get predictable improvement. Double the training data, get predictable improvement.
This was remarkable. It suggested a recipe: want better performance? Scale up. Have compute budget? Make the model bigger, train it longer, use more data. The relationship was smooth and predictable, at least within the ranges studied.
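As a sketch of what such a law looks like—with placeholder constants, not the paper's fitted values—loss falls as a small power of parameter count, which is why each improvement demands a large multiplicative increase in scale:

```python
# Placeholder constants chosen for illustration; the point is the shape of the
# relationship (a power law with a small exponent), not the specific numbers.
def predicted_loss(n_params, n_critical=1e13, alpha=0.08):
    """Loss falls as a power of model size: double the parameters, get a predictable drop."""
    return (n_critical / n_params) ** alpha

for n in [1e8, 1e9, 1e10, 1e11]:
    print(f"{n:.0e} parameters -> predicted loss {predicted_loss(n):.2f}")
```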
Jared Kaplan, one of the authors of the scaling laws paper, was careful about extrapolating. “We see these trends holding over multiple orders of magnitude,” he said in a 2020 talk. “But we don’t know if they continue indefinitely. There could be a wall we haven’t hit yet.”
Still, scaling became a research strategy. Companies invested billions in larger training runs. GPT-3’s training cost was estimated at $4-12 million. Subsequent models cost tens or even hundreds of millions more.
This created a divide. Large companies could train huge models. Academic researchers could not. The capital requirements became a barrier to entry. The architecture itself was democratic—anyone could implement a Transformer. But only a few organizations could train one at the scale where new capabilities emerged.
Beyond Language
Though Transformers emerged from language research, the architecture proved general.
Vision Transformers treated images as sequences of patches. Instead of pixels processed by convolutional layers, an image became a grid of 16×16-pixel patches, each treated like a word and fed through self-attention. By 2021, Vision Transformers matched or exceeded convolutional networks on image classification.
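The patching step itself is simple enough to sketch; the sizes below follow the common 224×224-pixel input with 16×16 patches, and the learned patch-embedding and positional components of a real Vision Transformer are omitted.

```python
import numpy as np

image = np.random.rand(224, 224, 3)                    # height x width x color channels
P = 16                                                 # patch size
patches = image.reshape(224 // P, P, 224 // P, P, 3)   # tile the image into 16x16 blocks
patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, P * P * 3)

print(patches.shape)   # (196, 768): a "sentence" of 196 tokens, each a flattened patch,
                       # which a standard Transformer then processes with self-attention
```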
This was surprising. Convolutional networks were designed for images, exploiting spatial structure. Transformers weren’t. They were general-purpose sequence processors. That they worked as well for vision suggested the architecture was more fundamental than anyone expected.
Multimodal models combined text and images. CLIP, released by OpenAI in 2021, learned to connect images and text descriptions by training on 400 million image-text pairs from the internet. It could classify images based on text descriptions it had never seen during training. DALL-E used similar ideas to generate images from text prompts.
Code models applied Transformers to programming languages. GitHub Copilot, powered by OpenAI’s Codex model, could generate entire functions from natural language descriptions. Programmers found it eerily capable—suggesting completions that demonstrated genuine understanding of code structure.
Scientific applications followed. AlphaFold 2, which solved the protein folding problem, used Transformer-like attention mechanisms to relate amino acid sequences to 3D structure. Drug discovery, materials science, theorem proving—all found uses for attention-based architectures.
The pattern was consistent. Wherever you had structured data with complex dependencies—language, vision, code, molecules—Transformers provided a powerful framework. The self-attention mechanism was generic enough to capture relationships across domains.
This generality hinted at something deeper. Perhaps attention wasn’t just a useful architectural choice. Perhaps it reflected something fundamental about how intelligence processes information—by selectively focusing on what matters, constructing context dynamically, accessing information directly rather than sequentially.
What We Built But Don’t Fully Understand
The Transformer’s success created a peculiar situation. We built systems that worked far better than theory predicted. We could train them, deploy them, use them. But we couldn’t fully explain why they worked as well as they did.
Why does self-attention work so well? Mathematically, it’s just weighted averages of value vectors, where weights come from comparing queries and keys. Nothing in the formulation obviously explains why this should be so powerful.
Why does scale unlock capabilities? The scaling laws are empirical observations, not derivations from first principles. We can measure the relationship between size and performance, but we don’t have a theory explaining why certain capabilities emerge at certain scales.
What are these models actually learning? The representations in a trained Transformer are distributed across millions of parameters. We can visualize attention weights, probe representations, analyze behavior. But the internal structure remains partially opaque.
This bothered some researchers. “We’re building increasingly powerful systems without understanding them,” observed one researcher at a 2021 workshop. “That seems like a path to unexpected behavior.”
Others were pragmatic. “Engineering has always preceded theory,” countered another. “Steam engines worked before thermodynamics. Airplanes flew before we fully understood aerodynamics. We’re in a similar position with Transformers.”
The debate continues. Theory is catching up—papers analyzing Transformer behavior, explaining emergent capabilities, connecting architecture to performance. But practice still leads.
What’s clear is this: Transformers work. They work at scales we didn’t anticipate. They unlock capabilities we didn’t predict. Whether we fully understand them or not, they’ve become the foundation of modern AI.
The Revolution’s Speed
Looking back, what’s striking is how fast this happened.
2017: “Attention Is All You Need” published.
2018: BERT and GPT prove Transformers work for understanding and generation.
2019: GPT-2 shows surprising capabilities. Companies begin major investments.
2020: GPT-3 achieves human-like text generation. Scaling laws are published.
2021: Vision Transformers match CNNs. Multimodal models connect text and images.
2022-2023: ChatGPT brings Transformers to millions of users. Claude, Gemini, and other systems follow.
Five years from publication to ubiquity. Five years from a skeptical conference presentation to systems that millions of people use daily.
Recurrent networks had dominated for a decade. Transformers replaced them almost completely in half that time. The shift was decisive, comprehensive, and fast.
Vaswani, reflecting in 2023, was modest about the achievement. “We were trying to solve a problem—make translation faster and better. The architecture worked beyond what we expected. But credit goes to everyone who built on it. We showed attention worked. Others showed it could scale. That’s how science progresses.”
True enough. But there’s a reason the paper has been cited tens of thousands of times and is routinely ranked among the most influential in the field’s history. The insight was simple, the execution clean, and the impact transformative.
Attention was all you needed. At least for sequences. At least at scale. At least for the capabilities we’ve explored so far.
The doors it opened remain wide. And we’re still discovering what lies beyond them.
Notes & Further Reading
The Transformer:
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). “Attention Is All You Need.” NeurIPS. The paper is surprisingly readable. The title captured everyone’s attention, but the elegant simplicity of the architecture is what made it last.
Attention Origins:
Bahdanau, D., Cho, K., & Bengio, Y. (2015). “Neural Machine Translation by Jointly Learning to Align and Translate.” ICLR. The first attention mechanism, showing that recurrent networks didn’t need to compress everything into a bottleneck. Attention could selectively access source information.
BERT:
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). “BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding.” Google’s approach to pre-training on general language, then fine-tuning on specific tasks. Became the standard paradigm for language understanding.
GPT Series:
Radford, A., et al. (2018). “Improving Language Understanding by Generative Pre-Training.” (GPT)
Radford, A., et al. (2019). “Language Models are Unsupervised Multitask Learners.” (GPT-2)
Brown, T., et al. (2020). “Language Models are Few-Shot Learners.” (GPT-3)
Each paper showed capabilities emerging at larger scales. GPT-3’s few-shot learning—giving examples in the prompt rather than fine-tuning—changed how people thought about deploying language models.
Scaling Laws:
Kaplan, J., et al. (2020). “Scaling Laws for Neural Language Models.” OpenAI’s analysis showing predictable relationships between model size, data size, compute budget, and performance. Controversial but influential. Suggested scale was a viable research direction.
Vision Transformers:
Dosovitskiy, A., et al. (2020). “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale.” Showed that pure Transformer architectures could match CNNs for vision. Suggested the architecture’s principles were domain-general.
On Understanding Transformers:
Why Transformers work so well remains partially mysterious. Some relevant recent work:
Elhage, N., et al. (2021). “A Mathematical Framework for Transformer Circuits.” (Anthropic)
Olsson, C., et al. (2022). “In-context Learning and Induction Heads.” (Anthropic)
But theory still lags practice. We can describe what Transformers do more easily than explain why they work.
The Human Element:
The Transformer paper lists eight authors. Each contributed crucial pieces: Vaswani’s leadership, Shazeer’s multi-head attention, Jones’s positional encodings, Gomez’s implementation, Uszkoreit’s insights on removing recurrence. Good research is collaborative. The paper’s elegance came from multiple minds working together.
On Emergent Capabilities:
Wei, J., et al. (2022). “Emergent Abilities of Large Language Models.” Documents capabilities that appear suddenly at certain scales. Controversial—some argue these abilities are continuous but hard to measure at smaller scales. Empirically, the discontinuities are striking.