Intelligence Is Compression, Part 7: What We Still Don’t Understand
The three frontiers the framework does not reach
“The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” – Proposal for the Dartmouth Summer Research Project on Artificial Intelligence, 1955
This is Part 7 (final) in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major architectures of modern AI.
Six articles in, here is what we know. The compression thesis explains a great deal. It explains why PCA works: find the subspace that captures the most variance. It explains why sparse coding produces features that resemble the brain’s visual cortex: optimize for the most efficient code. It explains why diffusion models generate realistic images: each denoising step is a compression step, and generation runs the diffusion process backward. It explains why contrastive learning organizes data into useful clusters: maximize the information gained by separating classes. And it derives the transformer architecture from first principles by unrolling the optimization of the rate reduction objective.
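The PCA case is the easiest to see concretely. A minimal sketch (illustrative, not from the book): project data onto its top principal components and check how little the compressed code loses. The synthetic data and dimensions below are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data that genuinely lives near a 2-D subspace of R^10.
latent = rng.normal(size=(500, 2))
mixing = rng.normal(size=(2, 10))
X = latent @ mixing + 0.01 * rng.normal(size=(500, 10))
X = X - X.mean(axis=0)  # PCA assumes centered data

# Principal directions via SVD of the centered data matrix.
U, S, Vt = np.linalg.svd(X, full_matrices=False)

k = 2
Z = X @ Vt[:k].T    # compress: 10 numbers per sample -> k numbers
X_hat = Z @ Vt[:k]  # decompress: reconstruct from the k-dim code

# The top-k subspace captures nearly all the variance, so the
# relative reconstruction error of the 5x-compressed code is tiny.
rel_error = np.linalg.norm(X - X_hat) / np.linalg.norm(X)
print(rel_error)
```

The point of the toy: when the data has low-dimensional structure, the compressed representation preserves almost everything, which is exactly the sense in which “find the subspace that captures the most variance” is a compression objective.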
These are not small results. Together, they constitute the most comprehensive mathematical framework for understanding modern deep learning that exists. The argument has earned the right to make a strong claim: the major architectures of deep learning are approximate solutions to the same underlying optimization problem, and that problem is compression.
But the final stretch of the compression thesis is its most honest, and perhaps its most important. It draws a clear line between what the framework explains and what it does not. The compression thesis covers what you might call inductive intelligence: learning patterns from data, building representations, making predictions based on those representations. This is the intelligence of perception, of pattern recognition, of generalization from examples. It is the intelligence that modern AI systems possess.
It is not the whole of intelligence. Not even close.
Three open frontiers
Three frontiers lie beyond the compression framework’s current reach, and each requires something different.
The first is autonomous intelligence. Every system discussed in this book is trained offline on a fixed dataset. The data is collected, the model is trained, and the model is deployed. If the world changes after training, the model does not adapt unless a human intervenes to retrain it. Real intelligence, the kind that animals possess, works differently. An animal acts in the world, observes the consequences, updates its internal model, and acts again. The loop is closed. The learning is continuous.
The taxonomy frames this as the transition from open-loop models to closed-loop systems. An open-loop model memorizes a static distribution. A closed-loop system continuously corrects its distribution based on error feedback. The difference is not incremental. It is structural. The formulation is candid: “open-ended models are for a closed world, however large; closed-loop systems are for an open world, however small.” No amount of data makes an open-loop model truly general. What makes intelligence general is the ability to correct itself.
This connects directly to the self-consistent representations from Part 6, and it points toward a synthesis of the compression framework with reinforcement learning. The rate reduction principle tells you what to learn. The closed-loop architecture tells you how to keep learning. The combination is a complete memory system, and it does not yet exist in artificial form.
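The open-loop/closed-loop distinction can be made concrete with a toy contrast of my own construction (the drifting scalar world and learning rate are assumptions, not the book’s formulation): both models are trained offline on the same distribution, then the world drifts, and only the model that keeps consuming error feedback tracks the change.

```python
import numpy as np

rng = np.random.default_rng(1)

# Offline training phase: both models fit the pre-drift distribution.
train = rng.normal(loc=0.0, scale=0.1, size=1000)
open_loop_estimate = train.mean()     # frozen at deployment
closed_loop_estimate = train.mean()   # keeps updating at deployment

# Deployment: the world's mean drifts from 0.0 to 3.0.
errors_open, errors_closed = [], []
for _ in range(500):
    x = rng.normal(loc=3.0, scale=0.1)
    errors_open.append(abs(x - open_loop_estimate))
    # Closed loop: observe the prediction error, correct the internal model.
    err = x - closed_loop_estimate
    closed_loop_estimate += 0.05 * err
    errors_closed.append(abs(err))

print(np.mean(errors_open[-100:]))    # stuck near the size of the drift
print(np.mean(errors_closed[-100:]))  # small: the loop has corrected itself
```

No amount of additional pre-drift training data would have helped the open-loop model here; only the feedback loop does.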
The second frontier is natural intelligence. Every system discussed so far is trained with backpropagation: compute the loss at the output, propagate gradients backward through every layer, update every parameter simultaneously. This works. It is also biologically impossible. Biological brains do not have a mechanism for propagating precise gradient signals backward through millions of neurons. Learning in the brain is local: each neuron updates based on the signals it directly receives, not on a global error signal computed at the output.
The question is whether the compression principle can be realized through local, biologically plausible learning rules. The compression thesis sketches an answer. The cortex consists of tens of thousands of cortical columns, each with similar structure, operating largely in parallel with sparse interconnections. The hypothesis is that this architecture is a massively distributed system of closed-loop autoencoders, each extracting information from its inputs and maximizing information gain in its outputs. If this hypothesis is correct, the compression principle is not just a mathematical abstraction. It is a description of what the cortex actually does, implemented not through backpropagation but through local feedback within each module.
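One well-known existence proof that local rules can implement compression is Oja’s rule: a Hebbian update in which a single neuron, using only its own input and output, converges to the data’s first principal component. This sketch is my illustration of that classical result; the book’s hypothesis about cortical columns is broader than this.

```python
import numpy as np

rng = np.random.default_rng(2)

# Data with one dominant direction of variance in R^2.
principal = np.array([3.0, 1.0]) / np.sqrt(10.0)
X = (rng.normal(size=(5000, 1)) * 2.0) @ principal[None, :]
X += 0.1 * rng.normal(size=(5000, 2))

w = rng.normal(size=2)
w /= np.linalg.norm(w)
lr = 0.01
for x in X:
    y = w @ x                  # the neuron's output
    w += lr * y * (x - y * w)  # Oja's rule: only local quantities appear

# Up to sign, w aligns with the leading principal component:
# the same subspace PCA finds, learned without any backward pass.
alignment = abs(w @ principal) / np.linalg.norm(w)
print(alignment)  # close to 1.0
```

The update uses only the presynaptic activity `x`, the postsynaptic output `y`, and the weight itself; no global error signal is propagated backward, which is the property the biological-plausibility argument turns on.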
This would represent an evolution in the computational complexity of intelligence, from Kolmogorov’s incomputable complexity through Shannon’s entropy, Turing’s computability, and modern scalability, toward something the book calls “natural”: intelligence that is not just tractable or scalable but realizable under the physical constraints of biology. The gap between current AI and natural intelligence is not just about scale. It is about architecture and optimization.
The third frontier is scientific intelligence. This is the most speculative territory, and the most provocative. The compression thesis proposes a taxonomy of intelligence with four levels: phylogenetic, meaning evolution through DNA; ontogenetic, meaning individual learning through experience; societal, meaning collective knowledge through language; and scientific, meaning proactive discovery through hypothesis and deduction. The compression framework covers the first two levels and arguably parts of the third. It does not cover the fourth.
Scientific intelligence is qualitatively different. It requires not just learning patterns from data but forming abstract concepts, constructing hypotheses, and verifying them through logical deduction or experimentation. A pointed question follows: can state-of-the-art language models truly understand the notion of numbers, or have they merely memorized vast quantities of examples? Can they distinguish mathematical induction from Bayesian inference, or do they treat all reasoning as pattern matching over training distributions?
These are not philosophical questions. They are empirical ones that should be answered by rigorous tests. The compression thesis suggests three such tests, named after three intellectual pioneers. The Wiener Test: can a system self-correct and autonomously create new empirical knowledge, or does it only update when externally supervised? The Turing Test, in a more rigorous formulation than the original: can a system create and understand abstract concepts, or does it merely memorize statistical patterns? And the Popper Test: can a system proactively generate new hypotheses and verify or falsify them through deduction or experimentation?
These tests are ordered by difficulty. Current AI systems might pass a well-designed Wiener Test with modest extensions. The Turing Test, as defined here, remains far out of reach. The Popper Test is further still. The contribution is not to answer these questions but to state them precisely, in mathematical terms that could, in principle, be studied with the same rigor that the rest of the compression thesis applies to representation learning.
The struggle against entropy
Part 1 of this series opened with a quote from Václav Havel: “Just as the constant increase of entropy is the basic law of the universe, so it is the basic law of life to be ever more highly structured and to struggle against entropy.” The argument of this series is that this struggle has a mathematical description. Intelligence, at least the inductive kind, is the process of compressing high-dimensional, noisy observations into low-dimensional, structured representations. Every major architecture of modern deep learning, from sparse coding to diffusion models to transformers, is a different implementation of this process.
Is compression sufficient for all forms of intelligence? The honest answer is that nobody knows. But the question can now be studied scientifically rather than debated philosophically. The compression framework provides a precise language for describing what current AI can and cannot do. It provides a clear boundary between the explained and the unexplained. And it provides a direction for extending the theory: close the loop, make the learning local, and find the mechanisms that turn inductive inference into deductive reasoning.
The framework is incomplete. It says nothing about consciousness, nothing about intention, nothing about the spark that turns pattern recognition into understanding. These are the hardest questions, and they remain open.
But something has been accomplished. For the first time, we have a unified mathematical theory that explains why the major families of deep learning work, that derives their architectures from a single principle, and that draws a precise map of what lies within the theory’s reach and what lies beyond it. The map is valuable even where it shows blank space, because now we know exactly where the frontier is.
The 1940s gave us the foundational ideas: Shannon’s information theory, Wiener’s cybernetics, von Neumann’s computing architecture. The decades that followed turned those ideas into engineering. The deep learning revolution of the 2010s seemed to outrun theory entirely, producing systems that worked but could not be explained. This book is an attempt to close the gap, to show that the engineering successes were not mysterious but were natural consequences of a principle that Shannon and Wiener would have recognized.
The tools are modern. The mathematics is new. But the underlying insight is as old as the Havel quote: intelligence is the struggle against entropy. The struggle continues.