Intelligence Is Compression, Part 1: One Principle
One Berkeley lab thinks deep learning already has a unified theory
“Just as the constant increase of entropy is the basic law of the universe, so it is the basic law of life to be ever more highly structured and to struggle against entropy.” – Václav Havel
This is Part 1 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma's group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major architectures of modern AI.
In 2023, while the rest of the AI world was racing to build ever-larger language models, a professor named Yi Ma published a paper with an unusually provocative subtitle: “Compression Is All There Is?”
The paper introduced CRATE, a transformer-like architecture derived entirely from mathematics. Not discovered through trial and error, not found by neural architecture search, not stumbled upon by scaling up and seeing what works. Derived. From a single objective function that says: the goal of a neural network is to compress its inputs into a structured, compact representation.
The question mark in the title was doing a lot of work. Ma was not claiming that compression explains everything about intelligence. He was asking whether it might, and then showing the math to back the question up.
That paper was a preview. The full argument arrived in 2025 as an open-source textbook, now in its second edition, written with Sam Buchanan, Druv Pai, and Peng Wang. Its subtitle is even bolder: “A Mathematical Theory of Memory.” Across nine chapters, the book attempts something that most AI researchers have given up on: a unified theoretical framework that explains why the major architectures of modern deep learning work. Not just transformers. Diffusion models, autoencoders, DINO, CLIP. All of them.
The claim is that these architectures, developed independently by different teams for different purposes, are all approximate solutions to the same underlying problem. And that problem is compression.
Most of your pixels are wasted
The book opens with a deceptively simple observation. The world we live in is structured. Not perfectly predictable, but far from random. Objects have edges. Surfaces have textures. Light follows physical laws. Gravity pulls things down. Sentences follow grammar. Faces have two eyes.
This matters because it means the data we observe, images, sounds, text, sensor readings, contains far less information than its raw size would suggest. A photograph might have millions of pixels, each capable of taking any color value. But the set of photographs that actually look like something occupies a vanishingly thin slice of all possible pixel combinations. The rest is noise: random static that no camera would ever produce and no eye would ever see.
The fundamental task of any intelligent system is to find that thin slice. To learn which patterns in the data are real, reflecting the structure of the world, and which are accidental, reflecting noise, sampling quirks, or irrelevant variation. In mathematical terms: to discover that high-dimensional data actually lives on a low-dimensional surface.
This is “learning a representation.” Not memorizing data, but compressing it. Finding the compact encoding that preserves what matters and discards what doesn’t.
Here is a way to make this concrete. Imagine you are shown ten thousand photographs of cats. Each image is a grid of millions of pixels. In principle, that grid could display anything: random noise, a Jackson Pollock painting, a spreadsheet. But cat photos share enormous amounts of structure. They have fur textures, ear shapes, eye placements, body proportions that vary in predictable ways. A good representation of “cat photographs” would not store every pixel of every photo. It would capture the underlying dimensions of variation: how ears can be pointy or rounded, how fur can be striped or solid, how the body can be curled or stretched. A few dozen such dimensions might be enough to reconstruct any cat photo with reasonable fidelity, even though the raw pixel data lives in a space with millions of dimensions.
The gap between the raw data’s dimensionality and the representation’s dimensionality is the compression. And the book’s claim is that learning this compression is not just one thing neural networks do. It is the thing.
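The cat example can be made numerical. Here is a minimal sketch, with all names and dimensions invented for illustration: points in a high-dimensional "pixel space" that are secretly generated from a handful of hidden factors, and principal component analysis (the simplest compression method) revealing how few dimensions actually matter.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Cat photos": 1000 points in a 784-dimensional pixel space,
# secretly generated from only 5 underlying factors of variation.
n, ambient_dim, latent_dim = 1000, 784, 5
factors = rng.normal(size=(n, latent_dim))           # pointy ears, stripes, ...
mixing = rng.normal(size=(latent_dim, ambient_dim))  # how factors paint pixels
data = factors @ mixing + 0.01 * rng.normal(size=(n, ambient_dim))  # small noise

# PCA: the singular values of the centered data reveal how many
# dimensions carry almost all of the variance.
centered = data - data.mean(axis=0)
singular_values = np.linalg.svd(centered, compute_uv=False)

energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)
k = int(np.searchsorted(energy, 0.99)) + 1
print(f"99% of the variance lives in {k} of {ambient_dim} dimensions")
```

Storing the 5 recovered coordinates per point instead of 784 raw values is the compression; the remaining 779 dimensions hold almost nothing but noise.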
Shannon, Wiener, and the math that was missing
The intuition that intelligence involves compression is not new. Horace Barlow proposed in 1961 that the goal of sensory processing is to reduce redundancy, to find an efficient code for the information carried by neural signals. Claude Shannon’s information theory, formalized in 1948, was already about the limits of compression. Norbert Wiener’s cybernetics program in the 1940s identified the core mechanism of all intelligence as closed-loop feedback: observe the world, build a model, check your predictions, correct your errors.
What is new is the claim that this intuition can be made mathematically precise, and that when you do so, the resulting theory generates the architectures of modern deep learning as its natural solutions.
Consider what this means. The transformer architecture, the backbone of GPT, Claude, and virtually every large language model, was discovered empirically by a Google team in 2017. They tried various configurations. Attention heads, feed-forward layers, skip connections, layer normalization. It worked spectacularly well. But no one could explain, from first principles, why those specific components were the right ones.
Ma’s framework offers an answer. Start from the compression objective, which asks that representations be compact within each class and spread apart between classes, and optimize it iteratively. Each iteration step naturally produces two operations. The first looks like multi-head self-attention: it compresses the representation. The second looks like a feed-forward MLP: it sparsifies the representation. Stack these iterations into layers, and what emerges is a transformer. Not because someone designed it that way, but because the math demands it.
The architecture that Vaswani and colleagues found by experiment, Ma’s group reached by derivation. The match is not exact: the derived version uses subspace projections instead of softmax attention, and soft thresholding instead of ReLU. But the structural correspondence is close enough to demand explanation.
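The two-step shape of that derivation can be sketched in a toy form. This is emphatically not the actual CRATE update rule, just an illustration of its structure: a compression step that nudges tokens toward learned subspace projections (the attention-like operation), followed by soft thresholding in place of ReLU (the sparsifying, MLP-like operation). Every name and parameter here is invented for the sketch.

```python
import numpy as np

def soft_threshold(x, lam):
    """Proximal operator of the L1 norm: shrinks values toward zero."""
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def toy_crate_layer(Z, subspaces, step=0.5, lam=0.1):
    """One iteration of the two-step pattern (toy version, not CRATE itself).

    Step 1 (attention-like): project tokens onto learned subspaces and
    move them toward the best-matching projection -- compression.
    Step 2 (MLP-like): sparsify the result with soft thresholding.
    """
    # Compression: pick the subspace most aligned with the tokens (a crude
    # stand-in for head selection) and move the tokens toward it.
    projections = [Z @ U @ U.T for U in subspaces]
    best = max(projections, key=lambda P: np.sum(P * Z))
    Z = (1 - step) * Z + step * best
    # Sparsification: soft thresholding instead of ReLU.
    return soft_threshold(Z, lam)

rng = np.random.default_rng(1)
d, k = 16, 4
U1, _ = np.linalg.qr(rng.normal(size=(d, k)))  # orthonormal subspace bases
U2, _ = np.linalg.qr(rng.normal(size=(d, k)))
Z = rng.normal(size=(8, d))                    # 8 "tokens"

Z_out = toy_crate_layer(Z, [U1, U2])
print("nonzeros before:", np.count_nonzero(Z), "after:", np.count_nonzero(Z_out))
```

Stacking such layers is the sense in which, in the book's account, depth is iterative optimization of one objective rather than an arbitrary design choice.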
Diffusion, CLIP, DINO: one principle
The transformer derivation is the most dramatic result, but it is not the only one. The same compression principle applies across the full landscape of modern deep learning.
Diffusion models, the technology behind image generators like Stable Diffusion and DALL-E, work by adding noise to images until they become pure static, then learning to reverse the process step by step. This denoising process is a form of compression: each step pushes the data back toward its low-dimensional structure. The network can only remove the noise if it has learned where the real data lives.
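A toy example makes the point that denoising presupposes knowing where the data lives. Suppose the "data manifold" is the unit circle in the plane, a one-dimensional structure in two dimensions. Then an idealized one-step denoiser simply projects noisy points back onto the circle. A trained diffusion model has to approximate such a projection from samples alone; the setup here is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "data manifold": points on the unit circle (a 1-D structure in 2-D).
angles = rng.uniform(0, 2 * np.pi, size=500)
clean = np.stack([np.cos(angles), np.sin(angles)], axis=1)

# Forward process: add Gaussian noise, pushing points off the circle.
noisy = clean + 0.3 * rng.normal(size=clean.shape)

# Idealized denoiser for this manifold: project back onto the circle.
# It only works because we know the low-dimensional structure; a diffusion
# model must learn an approximation of this map from data.
denoised = noisy / np.linalg.norm(noisy, axis=1, keepdims=True)

err_noisy = np.mean(np.linalg.norm(noisy - clean, axis=1))
err_denoised = np.mean(np.linalg.norm(denoised - clean, axis=1))
print(f"mean error: noisy {err_noisy:.3f} -> denoised {err_denoised:.3f}")
```

The projection removes the off-manifold (radial) component of the noise and keeps only the along-manifold (angular) component, which is exactly the sense in which each denoising step is a compression step.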
Contrastive learning methods like DINO, which learn useful image features without any labels, turn out to be approximate solutions to the same compression objective: make representations of similar images compact, and representations of different images spread apart.
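That objective is easy to state numerically. The toy score below, an illustration rather than DINO's actual loss (which is a self-distillation objective), rewards representations that are compact within a class and spread apart between classes.

```python
import numpy as np

def contrastive_score(Z, labels):
    """Average cosine similarity of same-class pairs minus that of
    different-class pairs. Higher = compact within, spread between.
    (Toy score for illustration, not the DINO loss.)"""
    Z = Z / np.linalg.norm(Z, axis=1, keepdims=True)  # unit-normalize
    sim = Z @ Z.T
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)   # exclude each point paired with itself
    diff = ~same
    np.fill_diagonal(diff, False)
    return sim[same].mean() - sim[diff].mean()

rng = np.random.default_rng(3)
labels = np.repeat([0, 1], 50)

# Unstructured features: no class structure, score near zero.
random_feats = rng.normal(size=(100, 8))

# Structured features: each class clustered around its own direction.
centers = rng.normal(size=(2, 8))
clustered = centers[labels] + 0.1 * rng.normal(size=(100, 8))

print("random:   ", round(contrastive_score(random_feats, labels), 3))
print("clustered:", round(contrastive_score(clustered, labels), 3))
```

A network trained to maximize such a score is pushed toward exactly the clustered geometry, which is why these label-free methods end up solving the same compression problem.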
Even CLIP, OpenAI’s system that links images to text descriptions, can be understood through the lens of mutual information and compression: find the representation that preserves what images and their descriptions share while discarding what they don’t.
Each of these methods was developed independently, by different research teams, for different purposes. The book’s argument is that they all converge on the same underlying principle, and that this convergence is not coincidence. It is evidence that the principle is right.
Yi Ma’s long game
To understand why this book exists, you need to understand Yi Ma’s career. He received his PhD from Berkeley in 2000, spent a decade at the University of Illinois building a research program around sparse representation and high-dimensional geometry, then led the Visual Computing group at Microsoft Research Asia from 2009 to 2014. He returned to Berkeley as a professor in 2018, then moved to the University of Hong Kong in 2023, where he now directs the School of Computing and Data Science while maintaining a visiting appointment at Berkeley.
Throughout this entire career, Ma has been working on the same fundamental question: how do you find low-dimensional structure in high-dimensional data? His early work on robust face recognition, which represented faces as sparse combinations of training images, was among the most cited papers in computer vision. His 2022 textbook with John Wright, High-Dimensional Data Analysis with Low-Dimensional Models, laid the mathematical groundwork. The new book is the capstone: applying that mathematical framework to deep learning.
Ma’s position in the field is distinctive. He is a mathematician first, deeply committed to understanding why things work rather than simply demonstrating that they do. At a moment when the dominant culture in AI rewards scaling, engineering, and benchmark performance, Ma’s lab is doing something unfashionable: insisting that theory should lead practice, not follow it.
The results suggest the unfashionable approach might be right. The CRATE architecture, derived entirely from the compression principle, achieves competitive performance on image classification while being fully interpretable. Every layer has a known mathematical purpose. You can look inside a trained CRATE and understand what each component is doing. It is a white box in a field of black boxes. On language modeling, the results are reasonable but a gap remains, one the authors acknowledge as an active research frontier.
What compression does not explain
The authors are honest about what they do not explain. The final chapter proposes a taxonomy of three levels of intelligence. The first is pattern recognition: learning data distributions from observations. The second is autonomous self-correction: closing the feedback loop so the system updates itself. The third is scientific reasoning: forming hypotheses, conducting deductive proofs, discovering new laws. The compression framework covers the first level thoroughly. It offers a starting point for the second. It barely touches the third.
This honesty is part of what makes the argument credible. The book does not claim to have solved intelligence. It claims to have a mathematical theory of one essential component of intelligence, the part that learns structured representations from data, and to have shown that this theory already explains a surprising amount of what modern AI systems actually do. The open question is whether the same principle, extended or combined with other mechanisms, can reach the higher levels. These questions, the book proposes, can be studied scientifically rather than left as philosophy. That proposal itself is worth something.
What this series will cover
The book spans nine chapters. This series will not walk through them one by one. Instead, it follows the ideas along a narrative arc, from the simplest cases to the open frontier.
The next article starts at the beginning: why high-dimensional data has low-dimensional structure, and how the simplest algorithms, principal component analysis and sparse coding, exploit that structure. From there, we will trace how the compression principle generates the major families of modern deep learning: denoising and diffusion models, lossy coding and contrastive learning, white-box transformers, and self-consistent representations. We will see the theory applied to real-world data: images, 3D objects, human motion, natural language. And we will end with what the compression thesis says about the forms of intelligence it does not explain: autonomous learning, biologically plausible computation, and scientific reasoning.
The subtitle of the book, “A Mathematical Theory of Memory,” reveals the ambition. Memory here does not mean remembering facts. It means the structured, compressed encoding that an intelligent system builds of its world. The book’s claim is that we now have the mathematics to describe how that encoding is created, what form it takes, and why it works.
Whether the claim holds up is what the next six articles will examine. But the provocation is worth taking seriously. In a field that often moves by intuition and brute force, here is an argument that the theory was there all along, waiting for someone to write it down.
There is something appealing about the timing. The 1940s gave us the foundational ideas: Shannon’s information theory, Wiener’s cybernetics, von Neumann’s computing architecture. The decades that followed turned these ideas into engineering disciplines. And then the deep learning revolution arrived in the 2010s, and for a while it seemed like engineering had outrun theory entirely. Networks worked, but nobody could say why. Architecture choices were made by instinct, validated by benchmarks, justified by results.
Ma’s book is an attempt to close that gap. To show that the engineering successes of the past decade are not mysterious, but are in fact the natural consequences of a principle that Shannon and Wiener would have recognized. The tools are modern. The mathematics is new. But the underlying insight, that intelligence is the struggle against entropy, is as old as the Havel quote that opens the book.
The struggle continues. This series follows it.