Intelligence as Compression
The Prediction Engine
This is Chapter 11 of A Brief History of Artificial Intelligence.
In the summer of 1948, Claude Shannon was trying to win a bet.
Shannon worked at Bell Labs in Murray Hill, New Jersey—the research campus where the transistor had been invented the year before, where scientists wandered between buildings thinking about everything from stellar physics to telephone switching. Shannon’s office was a chaos of gadgets: mechanical mice, juggling machines, a calculator that worked in Roman numerals. His colleagues called him a genius and a prankster in equal measure.
The bet was simple. Shannon claimed he could predict what letter would come next in any English text with surprising accuracy. His colleague, Warren Weaver, was skeptical. So they ran an experiment.
Weaver would show Shannon a piece of text, letter by letter. After each letter, Shannon would guess the next one. “T”—Shannon guessed “H”. Correct. “TH”—Shannon guessed “E”. Correct. “THE”—Shannon guessed a space. Correct. “THE “—Shannon guessed “C” or “M” or “A”. After the space, many letters were plausible.
By the end of the experiment, Shannon had guessed correctly about 75% of the time. English, it turned out, was highly predictable—which meant it was highly redundant. Its true information content was far below the roughly five bits per letter a code would need if every letter were equally likely; Shannon later estimated it at closer to one bit per letter. Most of what we write is predictable from what came before.
Shannon had created the field of information theory with “A Mathematical Theory of Communication,” published in 1948; the guessing experiments themselves appeared a few years later in a follow-up paper, “Prediction and Entropy of Printed English.” The key insight was deceptively simple: information is surprise. If you can predict what comes next, the next symbol carries little information. If you can’t predict it, it carries a lot. Prediction and information are two sides of the same coin.
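The insight fits in a line of arithmetic: the information carried by a symbol is the negative logarithm of its probability, so near-certain symbols carry almost nothing and unlikely ones carry a lot. A minimal sketch (the probabilities are made up for illustration):

import math

def surprisal_bits(p):
    """Information carried by an event of probability p, in bits."""
    return -math.log2(p)

print(surprisal_bits(0.90))   # ~0.15 bits: an "e" after "th" is barely news
print(surprisal_bits(0.02))   # ~5.6 bits: an unlikely letter after a space tells you much more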
Seventy years later, this insight would explain why systems trained to predict the next word could become so capable. Prediction, it turned out, was not just about guessing. It was about understanding.
The Compression Connection
Shannon’s insight linked prediction to compression.
If you can predict what’s coming, you don’t need to transmit it—the receiver can reconstruct it themselves. Every predictable pattern is an opportunity to save bits. A compression algorithm is, in effect, a prediction machine: it finds patterns that let it transmit messages in fewer bits than the original.
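To make that concrete: an arithmetic coder driven by a predictive model spends about -log2 p(next symbol) bits on each symbol, so the total file size is just the model’s accumulated surprise, and a better predictor means a smaller file. A minimal sketch with a smoothed character-bigram model (a toy, not a serious compressor):

import math
from collections import Counter, defaultdict

def ideal_code_length(train_text, test_text):
    """Bits an arithmetic coder would need for test_text under a character-bigram model."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(train_text, train_text[1:]):
        counts[prev][nxt] += 1
    alphabet = set(train_text) | set(test_text)
    bits = 0.0
    for prev, nxt in zip(test_text, test_text[1:]):
        seen = counts[prev]
        p = (seen[nxt] + 1) / (sum(seen.values()) + len(alphabet))  # add-one smoothing
        bits += -math.log2(p)                                       # the coder's cost for this symbol
    return bits

train = "the cat sat on the mat. the dog sat on the log. " * 50
test = "the cat sat on the log. "
print(ideal_code_length(train, test) / (len(test) - 1), "bits per character")

Raw 8-bit text costs eight bits per character; the bigram model’s predictions bring that well down, and a better model would bring it down further.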
This meant that compression and understanding were related. To compress text well, you had to understand its structure—the grammar of language, the logic of arguments, the patterns of thought. A system that compressed English text would have to learn something about English. A system that compressed Wikipedia would have to learn something about the world.
In 2006, a mathematician named Marcus Hutter decided to test this idea.
Hutter was an unusual figure in AI research. Trained in physics, he had become obsessed with the mathematical foundations of intelligence. While others built practical systems, Hutter worked on theory—impossibly abstract frameworks that described optimal learning agents requiring infinite computational resources. His colleagues respected his brilliance and questioned his relevance.
Then Hutter did something concrete. He announced a prize: €50,000 for anyone who could compress a specific file—100 million characters from Wikipedia—smaller than the previous record.
The file was arbitrary. The prize money came from Hutter’s own pocket. But he believed he was measuring something fundamental.
“If you can compress Wikipedia,” Hutter explained, “you must understand it. You must understand language, grammar, facts, relationships between concepts. Compression is a proxy for intelligence. Maybe the best proxy we have.”
The Hutter Prize attracted a small community of obsessives. Year after year, contestants submitted new compression algorithms, each one squeezing the file a bit smaller. Each improvement came from finding new patterns—regularities in text that previous algorithms had missed. A better compressor was literally a system that understood the text better.
The winners weren’t building AI in any conventional sense. They were building prediction machines. But the insight underlying the prize would prove prophetic: prediction was the path to capability.
Kolmogorov’s Question
Back in the 1960s, while Shannon’s information theory was transforming communications, a Soviet mathematician named Andrey Kolmogorov was asking a related question from a different angle.
Kolmogorov was already a legend. Working from Moscow State University, in an office filled with books and papers and the quiet intensity of deep thought, he had made foundational contributions to probability theory, topology, and the mathematics of turbulence. He trained generations of mathematicians who revered him as a teacher and feared his exacting standards. When Kolmogorov decided a problem was interesting, his students knew to pay attention.
In 1963, Kolmogorov turned his attention to a question that seemed almost philosophical: what is the shortest possible description of a piece of data?
Consider the number π. Its digits go on forever—3.14159265358979...—seemingly random, without visible pattern. If you looked at a billion digits of π, you would find no obvious structure, no repetition, no rule you could state simply. And yet there’s a sense in which π is profoundly simple. You can write a short program that computes π to any desired precision. The program might be only a few hundred characters long—a recipe using the Gregory-Leibniz series, or Machin’s formula, or any of dozens of algorithms. This program is a compressed representation of π—a short description that captures infinite data.
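One such program, using Machin’s formula with fixed-point integer arithmetic, fits comfortably in a dozen lines (a minimal sketch; any of the classical series would do):

def pi_digits(n):
    """Return pi to n decimal places as a string, via Machin's formula."""
    guard = 10                       # extra digits to absorb rounding error
    scale = 10 ** (n + guard)

    def arctan_inv(x):
        """arctan(1/x) * scale, computed with the Taylor series in integer arithmetic."""
        power = scale // x
        total = power
        k = 1
        while power:
            power //= x * x
            term = power // (2 * k + 1)
            total += term if k % 2 == 0 else -term
            k += 1
        return total

    # Machin's formula: pi/4 = 4*arctan(1/5) - arctan(1/239)
    pi_scaled = 4 * (4 * arctan_inv(5) - arctan_inv(239))
    digits = str(pi_scaled // 10 ** guard)
    return digits[0] + "." + digits[1:]

print(pi_digits(50))   # 3.14159265358979...

A few hundred bytes of source code stand in for an unbounded stream of digits; that ratio is the sense in which π is simple.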
Kolmogorov defined the complexity of data as the length of the shortest program that produces it. A billion digits of π have low Kolmogorov complexity, because they can be generated by a short program. The apparent randomness is an illusion; there is deep structure beneath. A billion truly random digits have high Kolmogorov complexity, because there’s no pattern to exploit. No short program can generate them. The shortest description is the digits themselves.
This distinction mattered enormously for understanding intelligence. The world is full of apparent complexity that’s actually simple—patterns disguised as randomness. The orbits of planets look complex but follow Newton’s laws. The shapes of snowflakes look random but emerge from crystal physics. The behavior of markets looks chaotic but follows supply and demand. Intelligence, in Kolmogorov’s view, is the ability to see through apparent complexity to underlying simplicity. To find the short program. To compress.
Kolmogorov worked in the Soviet Union, developing his ideas within a different intellectual tradition than Shannon’s. Shannon had arrived at information theory through engineering—the practical problem of transmitting messages efficiently. Kolmogorov approached from pure mathematics—the abstract question of what it means to describe something. Shannon gave us a probabilistic measure of information; Kolmogorov gave us an algorithmic measure of complexity. The two frameworks were complementary, and they would eventually converge in modern learning theory and the compression view of intelligence.
Kolmogorov’s work remained abstract for decades—beautiful mathematics with no practical application. Then large language models made it concrete. A language model trained to predict text is searching for patterns—regularities that let it compress descriptions of what comes next. The better it compresses, the better it predicts. The better it predicts, the more it must have found the underlying structure. Kolmogorov’s theoretical framework had become a recipe for artificial intelligence.
The Prediction Engine
When Ilya Sutskever trained language models at Google and then OpenAI, he believed prediction was more powerful than it seemed.
Sutskever had studied under Geoffrey Hinton, one of the founders of deep learning. He was known for conviction that bordered on faith—a certainty that neural networks would lead to artificial general intelligence, even when the evidence was thin. In 2015, he co-founded OpenAI with Sam Altman, driven by this belief.
“Prediction is compression,” Sutskever explained in talks during the 2010s. “Compression is understanding. If you can predict the next word better than anyone else, you must understand the text better than anyone else. You must understand what’s being talked about, how ideas connect, why one thing follows another.”
Many researchers were skeptical. Prediction seemed too simple. Surely understanding required more than guessing what word came next. Surely it required grounding in the world, interaction with reality, some form of embodied experience.
But the systems that emerged from next-word prediction confounded expectations. GPT-2 could write coherent paragraphs. GPT-3 could answer questions, write code, translate languages. GPT-4 could pass professional exams. All of them were doing the same thing: predicting text. All of them were, in effect, compressing language into patterns that enabled prediction.
Consider what’s required to predict well. Given “The capital of France is,” predict “Paris.” To do this reliably, the model must have learned facts about geography. Given “If it’s raining, you should bring an,” predict “umbrella.” The model must have learned causal relationships about weather and behavior. Given “The patient presents with fever, cough, and chest pain, suggesting,” predict “pneumonia.” The model must have learned patterns of medical reasoning.
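Under the hood, every one of those examples is scored the same way during training: the model assigns a probability to each candidate next token, and its loss is the surprise of the token that actually follows. A minimal sketch, measured in bits (the probabilities are invented, not taken from any real model):

import math

# Hypothetical distribution a model might assign after "The capital of France is"
next_token_probs = {"Paris": 0.92, "Lyon": 0.03, "located": 0.02, "the": 0.01}

actual_next = "Paris"
loss_bits = -math.log2(next_token_probs[actual_next])   # cross-entropy loss for this position
print(f"{loss_bits:.2f} bits")                          # ~0.12 bits: the fact has been learned

# A model that has not learned the fact pays a much larger penalty here,
# so training pushes it toward whatever knowledge reduces the surprise.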
Every piece of knowledge that helps predict text becomes part of the model. Every pattern in human thought that appears in human writing becomes something the model can capture. Prediction is a universal interface to understanding—not because prediction is understanding, but because understanding improves prediction, and so the drive to predict creates the drive to understand.
Schmidhuber’s Curiosity
Jürgen Schmidhuber had been making this argument since the 1990s, long before anyone would listen.
Schmidhuber worked at IDSIA in Switzerland, a small AI lab nestled in the hills above Lugano. He had helped invent long short-term memory (LSTM), a neural network architecture that could learn long-range patterns—a breakthrough that later enabled much of modern sequence modeling. He was brilliant, controversial, and prone to claiming credit in ways that irritated his colleagues.
But his theory of curiosity was genuinely original.
“The brain is a compression machine,” Schmidhuber argued in papers dating back to 1991. “We are driven to find patterns, to discover regularities, to compress our experience into simpler representations. This drive is curiosity. This drive is intelligence.”
In Schmidhuber’s view, the pleasure of learning was literally the pleasure of compression. When you have an “aha” moment—suddenly understanding something that confused you—you’ve found a simpler description of your experience. The flash of insight is the feeling of compression succeeding.
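Schmidhuber’s formal version of this is small enough to sketch: the intrinsic reward at each step is the improvement in how well the agent’s model compresses (predicts) its experience, so a sudden drop in prediction error is, in effect, the size of the “aha.” The numbers below are invented for illustration:

# Prediction error of an agent's world model, in bits per observation,
# as it studies something confusing and then "gets it".
errors = [4.0, 3.9, 3.8, 2.1, 2.0, 2.0]   # hypothetical trace: the jump from 3.8 to 2.1 is the insight

# Compression progress: the improvement between consecutive steps.
# In Schmidhuber's theory, this difference is the curiosity reward.
rewards = [round(before - after, 2) for before, after in zip(errors, errors[1:])]
print(rewards)   # [0.1, 0.1, 1.7, 0.1, 0.0] -- the large value marks the "aha" moment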
This applied to art, to science, to humor. A beautiful theory is one that compresses many observations into few principles—E = mc² captures vast phenomena in five characters. A good joke works by setting up expectations and then compressing them into an unexpected connection. A great painting captures visual experience in essential forms. What we call beauty, Schmidhuber argued, is often compressibility—the recognition of deep pattern.
The theory explained why prediction tasks trained general capabilities. A system driven to predict well would discover patterns. Discovering patterns is compression. Compression is understanding. The humble task of guessing the next word, pursued relentlessly at scale, would lead to something like intelligence.
Why Scale Unlocks Capability
The connection between compression and capability explains why scale matters so much—and why the scaling laws that Kaplan discovered in 2020 take the form they do.
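The form in question is a power law: held-out loss falls smoothly as a power of parameter count (and, in separate laws, of data and compute). A minimal sketch of the parameter law, with constants close to those reported by Kaplan and colleagues but treated here as illustrative:

def kaplan_loss(n_params, n_c=8.8e13, alpha=0.076):
    """Approximate loss as a function of model size: L(N) = (N_c / N) ** alpha.
    The constants are roughly those reported for the parameter-count law; treat them as illustrative."""
    return (n_c / n_params) ** alpha

for n in (1e6, 1e9, 1e12):
    print(f"{n:.0e} parameters -> predicted loss {kaplan_loss(n):.2f}")

Because the exponent is small, each fixed fractional improvement in loss requires a constant multiplicative increase in model size.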
A small model can find simple patterns—the regularities that appear constantly in text. “The” often follows “in.” Questions end with question marks. Subjects precede verbs. These patterns are easy to find because they’re everywhere. A small model learns them quickly.
A large model can find complex patterns—the subtle regularities that appear rarely but meaningfully. The pattern of legal reasoning. The structure of mathematical proof. The way a doctor’s notes lead to a diagnosis. These patterns appear less frequently, require more data to identify, demand more parameters to represent. Only a model with sufficient capacity can capture them.
More parameters mean more capacity to represent intricate regularities. More training data means more examples of those regularities. More compute means more thorough search for the patterns that compress best. Scale enables deeper compression. Deeper compression enables better prediction. Better prediction enables broader capability.
When GPT-3 suddenly demonstrated few-shot learning—solving tasks from just a few examples—it had found patterns too subtle for smaller models to detect. The pattern “here are examples of a task, now apply the pattern to a new case” was present in the training data, scattered across millions of documents. But only a model with sufficient capacity could recognize it, represent it, and apply it. The capability emerged because scale enabled deeper compression.
The same logic explains emergence more generally. A pattern might be present in language but rare—appearing only occasionally across billions of words. A small model can’t afford to represent rare patterns; it has limited capacity and must focus on common ones. A large model can represent both common and rare patterns. At some threshold of scale, the rare pattern becomes representable, and a new capability appears. The capability was always latent in the data. Scale made it visible.
This isn’t magic. It’s the mathematics of compression and approximation, the same mathematics that Shannon and Kolmogorov discovered decades ago. More capacity means more patterns. More patterns mean better prediction. Better prediction means broader capability. The scaling laws are manifestations of this deeper truth—precise mathematical expressions of how compression improves with scale.
The Limits of Compression
Kolmogorov proved something important: most data is incompressible.
For any compression algorithm, there exist sequences that cannot be compressed at all. The reason is a simple counting argument: there are far fewer short descriptions than there are long strings, so most strings have no description shorter than themselves. Random noise, by definition, has no patterns to exploit. The shortest description of random data is the data itself.
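Any off-the-shelf compressor makes the point immediately; zlib here is only a crude stand-in for the theoretical ideal:

import os
import zlib

structured = b"the quick brown fox jumps over the lazy dog. " * 2000   # highly patterned text
noise = os.urandom(len(structured))                                    # no pattern to exploit

print(len(structured), "bytes originally")
print(len(zlib.compress(structured, 9)), "bytes after compressing the text")
print(len(zlib.compress(noise, 9)), "bytes after compressing the noise")   # barely shrinks, if at all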
This suggests a limit on what prediction can achieve. Intelligence—compression—works when data has structure. For genuinely random data, there’s nothing to learn. For arbitrary tasks with no pattern, no amount of scale will help.
But here’s the key insight: the world is not random. Language has grammar. Images have objects. Physics has laws. Human behavior has patterns. The data AI systems learn from is highly structured—compressible in principle. The question isn’t whether pattern exists, but whether the model can find it.
Current language models have found enough pattern to pass bar exams and write essays. They haven’t found enough pattern to consistently produce novel mathematical proofs or make scientific discoveries. The frontier keeps moving. More scale finds more pattern. But somewhere there may be limits—places where the structure runs out, where no more compression is possible.
Whether we’ll hit those limits, and what happens when we do, remains to be seen.
Closing: The Shortest Description
This chapter began with Shannon winning a bet about predicting letters, and ends with a hypothesis about intelligence.
Shannon showed that prediction and compression are the same thing mathematically—to predict is to compress, to compress is to have found patterns. Kolmogorov showed that complexity is description length—understanding is finding short programs for complex data. Hutter showed that compression could be a benchmark for intelligence—his prize rewards systems that understand Wikipedia better. Schmidhuber showed that the drive to compress might explain curiosity itself—the pleasure of learning is the pleasure of finding patterns.
Large language models are compression engines. Trained to predict words, they learn to compress language—to find the regularities that allow efficient prediction. The better they compress, the more capable they become. This isn’t a coincidence. It’s the mechanism.
Prediction seemed like a humble task—just guessing what word comes next. But guessing well requires understanding. Understanding is compression. Compression is finding the short program that generates the data. The systems that predict best are the systems that have found the deepest structure in human language and thought.
Intelligence, it turned out, was compression. To understand the world was to predict it. To predict it was to compress it. And the machines that compressed best were starting to look remarkably intelligent.
Notes and Further Reading
On Shannon and Information Theory
Shannon’s “A Mathematical Theory of Communication” (1948) created a field. For the human story, Jimmy Soni and Rob Goodman’s “A Mind at Play” captures Shannon’s playful genius—the juggling, the unicycle, the maze-solving mechanical mouse. The letter-guessing experiments, published in 1951 as “Prediction and Entropy of Printed English,” revealed something profound about redundancy in language that would later prove central to understanding why language models work.
On Kolmogorov Complexity
Kolmogorov developed his ideas independently of Shannon. The standard reference is Li and Vitányi’s “An Introduction to Kolmogorov Complexity and Its Applications,” though it’s mathematically demanding. The key intuition—that complexity is description length, that understanding is compression—is accessible without the formalism.
On the Hutter Prize
Marcus Hutter’s prize remains active, with the reward increasing as records fall. The prize website documents the winners and their techniques. Hutter’s book “Universal Artificial Intelligence” provides the theoretical framework, describing an optimal agent that compresses maximally—though the agent is uncomputable and thus purely theoretical.
On Schmidhuber and Compression-Based Curiosity
Schmidhuber’s papers on intrinsic motivation and curiosity date back to 1991. His broader work on recurrent networks and LSTM laid foundations for modern sequence modeling. For his theory of compression as the basis of intelligence and aesthetics, his talks and popular writings are more accessible than the technical papers.
On Sutskever and Prediction as Understanding
Ilya Sutskever’s views on prediction have been articulated in various talks and interviews, particularly during his time as Chief Scientist at OpenAI. The claim that prediction implies understanding remains contested—critics argue that prediction can succeed through pattern matching without genuine comprehension. The debate mirrors the broader question of whether language models truly understand.


