Intelligence Is Compression, Part 6: When Theory Meets the Real World
One framework, tested on images, motion, language, and 3D worlds
“The best theory is inspired by practice, and the best practice is inspired by theory.” – Donald Knuth
This is Part 6 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major architectures of modern AI.
Theory is only as good as its predictions. The previous five articles built a mathematical framework piece by piece: from PCA and sparse coding, through denoising and diffusion, to lossy compression and the rate reduction principle, and finally to the white-box transformer architecture CRATE. The question that remains is the one that matters most: does any of this work on real data?
The answer spans nearly 250 pages of the textbook and takes the theoretical framework out of the realm of Gaussian mixtures and toy examples. The tests involve images, 3D objects, human motion, and language. The results are not always state-of-the-art. The authors are honest about that. But they are competitive, and they come with something that state-of-the-art methods usually lack: a principled explanation of why they work.
How do you know you’re right?
The rate reduction principle tells you what a good representation looks like, and the unrolled optimization framework tells you how to build a network that finds one. But how do you know the representation you found is actually correct? How do you know it captures the full structure of the data, rather than just a convenient subset?
The first test is consistency. A representation is consistent if you can decode it back to something close to the original data. In other words: if you compress an image into a compact code, you should be able to reconstruct the image from that code with reasonable fidelity. This is the standard autoencoding criterion, familiar from variational autoencoders (VAEs) and related methods.
Self-consistency goes a step further. A representation is self-consistent if the following loop is stable: encode the data to get a representation, decode the representation to get a reconstruction, then encode the reconstruction again. If the second encoding matches the first, the representation is self-consistent. If it does not, something was lost in the round trip, and the representation is incomplete.
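The loop is simple enough to sketch directly. The toy example below uses a hypothetical linear encoder with a pseudo-inverse decoder standing in for trained networks; the point is only to make the criterion concrete: encode, decode, re-encode, and compare the two codes.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear encoder (a stand-in for a trained network):
# maps 32-dim data to an 8-dim code.
W = rng.standard_normal((8, 32)) / np.sqrt(32)

def encode(x):
    return W @ x

def decode(z):
    # Pseudo-inverse decoder: the best linear reconstruction from the code.
    return np.linalg.pinv(W) @ z

x = rng.standard_normal(32)
z1 = encode(x)          # first encoding
x_hat = decode(z1)      # reconstruction
z2 = encode(x_hat)      # re-encode the reconstruction

# Self-consistency: the round trip should leave the code unchanged.
print(np.allclose(z1, z2))  # → True for this linear pair
```

For this linear pair self-consistency holds exactly (a property of the pseudo-inverse); for trained nonlinear networks the criterion becomes a loss term that penalizes the distance between the first and second encodings.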
This idea, the closed-loop transcription framework, provides a natural quality check for learned representations. It also connects representation learning to distribution learning: a self-consistent encoder-decoder pair implicitly defines a generative model of the data. You can sample from it by generating random codes and decoding them. The quality of the generated samples tells you how well the representation captures the true distribution.
The variational autoencoder has its own origin story, and it is one of the cleaner examples of an idea arriving at the right moment. In December 2013, Diederik Kingma, a PhD student at the University of Amsterdam working under Max Welling, posted a paper to arXiv called “Auto-Encoding Variational Bayes.” Kingma, who goes by the Frisian nickname Durk, had found an elegant trick: if you reparameterize the random sampling in a latent variable model so that the randomness is injected as an external input, the entire system becomes differentiable and trainable by standard gradient descent. The trick was simple enough to fit in a few lines of algebra and powerful enough to launch a field. Within two years, Kingma had joined the founding team of OpenAI. His PhD, completed in 2017, earned a cum laude distinction, the first awarded by Amsterdam’s computer science department in thirty years.

The VAE became one of the foundational architectures of generative modeling, and the compression framework now shows why it works: its training objective, balancing reconstruction quality against the regularity of the learned code, is an approximation to the self-consistency criterion that the compression framework demands. The encoder pushes data toward a structured representation. The decoder pulls it back. The balance between compression and reconstruction is the same balance the rate reduction principle describes: discard what does not matter, keep what does.
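The reparameterization trick itself fits in a few lines. A minimal numpy sketch, with illustrative encoder outputs standing in for a real network:

```python
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, log_var, rng):
    """Sample z ~ N(mu, sigma^2) with the randomness as an external input.

    Because eps is drawn independently of mu and log_var, the sample is a
    deterministic, differentiable function of the encoder outputs, so
    gradients can flow through mu and sigma during backpropagation.
    """
    sigma = np.exp(0.5 * log_var)
    eps = rng.standard_normal(mu.shape)  # external noise source
    return mu + sigma * eps

# Illustrative encoder outputs; in a real VAE these come from a network.
mu = np.array([0.5, -1.0])
log_var = np.array([0.0, 0.0])  # sigma = 1

z = reparameterize(mu, log_var, rng)
print(z.shape)  # (2,)
```

Without the trick, the sampling step `z ~ N(mu, sigma^2)` is a stochastic node that gradient descent cannot pass through; with it, the stochasticity is pushed into `eps` and the rest of the computation is an ordinary differentiable graph.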
Generation as constrained search
The next step follows naturally: once you have a good representation, how do you generate new data? The answer is inference, treated as constrained optimization. You specify what you want (a text prompt, a partial image, a set of 3D constraints), express those requirements as constraints on the representation space, and solve for the code that satisfies them. The decoder then maps that code to data.
This is a different framing from the one most practitioners use. In standard diffusion-model practice, generation involves sampling from a learned noise-to-data mapping, guided by a classifier or a text embedding. The compression thesis reframes this as optimization: find the representation that best satisfies your constraints while remaining consistent with the learned data distribution. The difference is partly conceptual, but it matters for interpretability. When a generation fails, the optimization framework tells you why: either the constraints are incompatible, or the representation does not cover that region of data space.
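A minimal sketch of this framing, with a hypothetical linear decoder and a simple quadratic prior standing in for a trained model: the constraint (here, matching an observed fragment of the output) becomes a penalty on the code, and generation is gradient descent in representation space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear decoder standing in for a trained network
# that maps an 8-dim code to a 16-dim data vector.
D = rng.standard_normal((16, 8))

def decode(z):
    return D @ z

# Constraint: the first 4 entries of the output must match an
# observed fragment; the remaining 12 are unconstrained.
observed = np.zeros(16)
observed[:4] = rng.standard_normal(4)
mask = np.zeros(16)
mask[:4] = 1.0

# Generation as constrained optimization: minimize constraint violation
# plus a small prior term keeping the code in a plausible region.
z, lam, lr = np.zeros(8), 0.01, 0.01
for _ in range(2000):
    r = mask * (decode(z) - observed)  # constraint violation
    z -= lr * (D.T @ r + lam * z)      # gradient of 0.5|r|^2 + 0.5*lam|z|^2

err = np.linalg.norm(mask * (decode(z) - observed))
print(err < 0.5)  # constraints approximately satisfied
```

The failure modes the article describes show up directly in this picture: if the masked equations are inconsistent, the residual `err` cannot be driven to zero (incompatible constraints); if no code decodes near the target, the prior term and the constraint term fight each other (the representation does not cover that region).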
Images, motion, language, 3D worlds
The full spectrum of applications spans images, 3D objects, human motion, and language. The results fall into several categories.
For unsupervised image representation, training a network to maximize rate reduction on image datasets produces features that rival those learned by DINO and other self-supervised methods. This matters because DINO was developed through extensive empirical tuning at Meta, while the rate-reduction-based approach derives from a single mathematical principle. The features have similar quality but come with a theoretical guarantee: they are approximately optimal with respect to the coding rate objective.
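The quantity being maximized can be computed directly. The sketch below uses a simplified version of the coding-rate formula from the rate reduction literature, R(Z) = ½ log det(I + d/(nε²) ZZᵀ), on toy data; in actual training this objective is maximized over the network’s output features.

```python
import numpy as np

def coding_rate(Z, eps=0.5):
    """R(Z): bits needed to encode the columns of Z up to distortion eps.
    Z has shape (d, n): n samples in d dimensions."""
    d, n = Z.shape
    _, logdet = np.linalg.slogdet(np.eye(d) + (d / (n * eps**2)) * Z @ Z.T)
    return 0.5 * logdet

def rate_reduction(Z, labels, eps=0.5):
    """Delta R: rate of the whole set minus the average rate per class.
    Large Delta R means compact classes occupying separated subspaces."""
    n = Z.shape[1]
    r_classes = 0.0
    for c in np.unique(labels):
        Zc = Z[:, labels == c]
        r_classes += (Zc.shape[1] / n) * coding_rate(Zc, eps)
    return coding_rate(Z, eps) - r_classes

rng = np.random.default_rng(0)
# Two classes lying on orthogonal directions: the union costs more bits
# to encode than the classes do separately, so Delta R is positive.
Z = np.concatenate([np.outer([1, 0], rng.standard_normal(50)),
                    np.outer([0, 1], rng.standard_normal(50))], axis=1)
labels = np.array([0] * 50 + [1] * 50)
print(rate_reduction(Z, labels) > 0)  # → True
```

Maximizing ΔR pushes exactly toward the structure described here: each class compressed onto its own low-dimensional piece, with the pieces spread apart.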
For supervised classification, CRATE achieves competitive accuracy on ImageNet and CIFAR-10 compared to standard Vision Transformers at similar model sizes. The accuracy is not always the highest reported number in the literature, but the gap is small, and the model is fully interpretable. Each layer has a known mathematical purpose, and the learned representations visibly converge toward the predicted structure: within-class features become more compact and between-class features become more orthogonal as you go deeper into the network.
For text-image binding, the rate reduction framework applies to the CLIP paradigm. CLIP, developed by OpenAI, learns to map images and text captions into a shared representation space where semantically matching pairs are close together and non-matching pairs are far apart. CLIP’s contrastive objective turns out to be an approximation to a mutual-information-based rate reduction objective. When you train a CLIP-style model using the principled rate reduction loss instead of the standard contrastive loss, the resulting representations are competitive and, in some settings, more structured.
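For reference, here is a minimal numpy sketch of the standard symmetric contrastive (InfoNCE-style) objective that CLIP trains with, the loss the rate reduction variant would replace. The embeddings are toy stand-ins for real encoder outputs.

```python
import numpy as np

def log_softmax(x, axis):
    m = x.max(axis=axis, keepdims=True)
    return x - m - np.log(np.exp(x - m).sum(axis=axis, keepdims=True))

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of each matrix is a matched pair."""
    # Normalize so dot products are cosine similarities.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (batch, batch) similarity matrix
    # Matching pairs sit on the diagonal: pull them together, push the
    # rest apart, averaged over both directions (image->text, text->image).
    loss_i2t = -np.mean(np.diag(log_softmax(logits, axis=1)))
    loss_t2i = -np.mean(np.diag(log_softmax(logits, axis=0)))
    return 0.5 * (loss_i2t + loss_t2i)

# Perfectly aligned pairs give near-zero loss.
emb = np.eye(4)
loss = clip_contrastive_loss(emb, emb)
print(loss < 0.01)  # → True
```

The connection claimed in the text is that this pairwise pull-together/push-apart objective approximates a rate reduction objective over the joint image-text representation: matched pairs are compressed together, mismatched pairs are kept in separated regions of the shared space.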
For language, a causal variant of CRATE generates text one token at a time, the same way GPT does, but with each layer derived from the compression objective rather than designed by hand. The results are reasonable but represent the framework’s weakest domain. On standard language modeling benchmarks, the gap between CRATE and heavily engineered models like GPT-2 remains larger than in vision. The authors are candid about this: language appears to require something beyond what the current compression framework captures cleanly, whether that is the sheer scale of distributional structure in text, the role of discrete semantics, or both. The gap is the most honest measure of where the theory still has room to grow.
The most ambitious demonstrations involve 3D data and physical reasoning. The framework shows how to learn 3D representations from 2D images by combining the compression framework with geometric constraints. Given multiple views of a scene, the system learns a representation that is consistent across viewpoints, effectively recovering the 3D structure of the scene from 2D observations. This connects directly to the emerging field of spatial intelligence and world models: the idea that intelligent systems need an internal model of the physical world to reason about objects, navigation, and interaction. The claim is that a 3D world model is, at its core, a compressed representation of visual experience that is consistent across viewpoints and predictive across time.
Human motion is another domain where the framework is applied. Learning to represent the space of human body movements (walking, reaching, dancing) requires capturing both the kinematics (the geometry of joint angles) and the dynamics (the smoothness and physical plausibility of trajectories). The rate reduction framework, applied to motion capture data, produces representations that support generation of novel, realistic human motions.
What the numbers say
The results do not beat every benchmark. In several comparisons, the principled methods trail the best empirically tuned systems by a few percentage points. The authors are transparent about this. The goal is not to set records, but to show that a single mathematical framework, applied consistently, produces competitive results across a wide range of data types and tasks, and that the results come with interpretability and theoretical guarantees that black-box methods cannot offer.
There is also a practical argument worth stating explicitly. Systems designed from first principles tend to improve more predictably. When you understand why a system works, you know where to invest effort to make it better: improve the dictionary learning at layer five, increase the precision parameter, add more subspace heads. When a black-box system fails, you are left with ablation studies and hyperparameter sweeps. The implicit bet is that the principled approach will overtake the empirical one as the theory matures and the engineering catches up. The history of other engineering disciplines, from bridge design to aerospace to signal processing, suggests this bet tends to pay off over time.
The breadth of the results is itself an argument. The same framework that learns unsupervised image features also learns text-image binding, 3D world models, human motion representations, and language models. No single component was designed for any specific domain. The rate reduction principle, the unrolled optimization framework, and the self-consistency criterion apply uniformly across all of them. This generality is the strongest evidence for the compression thesis: not that it works in any one domain, but that it works across all of them, because the underlying principle, compress data by exploiting its low-dimensional structure, is universal.
The final article in this series steps back from results and asks the hardest question: what does the compression thesis not explain? The framework handles perception and pattern recognition with mathematical rigor. But what about reasoning, autonomy, and scientific discovery? The most speculative part of the compression thesis is also its most honest. It draws a map of what we understand about intelligence and marks the vast territories we do not.