Intelligence Is Compression, Part 5: Building the White Box
Deriving the transformer from a compression objective
“What I cannot create, I do not understand.” – Richard Feynman
This is Part 5 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma's group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major architectures of modern AI.
Start from the rate reduction objective introduced in Part 4: a good representation should make different categories compact and spread them apart. Optimize that objective iteratively, one gradient step at a time. Unroll each step into a layer of a neural network. What emerges has multi-head self-attention, feed-forward MLP blocks, skip connections, and layer normalization. The transformer architecture, discovered empirically by a Google team in 2017 through careful engineering, can be derived mathematically from a compression principle.
That is the most dramatic claim in the compression thesis. The architecture that Vaswani and colleagues found by experiment, Ma's group reached by derivation. The convergence is not exact: CRATE, the derived architecture, uses subspace projections where the standard transformer uses softmax attention, and soft thresholding where the standard transformer uses ReLU. But the structural correspondence (attention for compression, MLP for sparsification, skip connections for incremental updates) is striking enough to demand explanation. Either the match is coincidence, or the transformer is close to the natural architecture for learning compressed representations. The compression thesis bets on the latter.
Algorithms that become architectures
We have seen this pattern before. In Part 2, the ISTA algorithm for sparse coding, when unrolled, produces a sequence of layers that each apply a linear transform followed by a nonlinear threshold, which looks remarkably like a residual network. Even earlier, the power iteration algorithm for PCA showed that repeated application of a simple operation converges onto structure. The full rate reduction objective generalizes this idea, and the result is far more surprising than the earlier cases.
The logic is as follows. You have an objective function, rate reduction, that measures how good your representation is. You want to find the representation that maximizes it. One approach is gradient ascent: start with the raw data, compute the gradient of the objective with respect to the representation, take a step in the direction of the gradient, and repeat. Each step improves the representation incrementally. If you run enough steps, you converge to a local optimum.
Now, treat each gradient step as one layer of a network. The input to layer one is the raw data. The output of layer one is the data after one step of gradient ascent on the rate reduction objective. The output of layer two is the result after a second step. And so on. The composition of all layers, from input to final output, is a deep network that transforms raw data into an optimized representation.
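The unrolling idea can be made concrete with a toy sketch. The snippet below ascends only the expansion term of a rate-style objective, R(Z) = ½ logdet(I + αZZᵀ), and treats each ascent step as one "layer"; the dimensions, step size, and α are illustrative choices, not CRATE's.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, alpha, step = 8, 32, 0.5, 0.1
Z = rng.normal(size=(d, n)) * 0.1      # raw "data" = input to layer 1

def rate(Z):
    # R(Z) = 1/2 logdet(I + alpha * Z Z^T), the objective being maximized
    return 0.5 * np.linalg.slogdet(np.eye(d) + alpha * Z @ Z.T)[1]

def layer(Z):
    # one gradient-ascent step: grad R = alpha * (I + alpha Z Z^T)^{-1} Z
    grad = alpha * np.linalg.solve(np.eye(d) + alpha * Z @ Z.T, Z)
    return Z + step * grad             # skip connection: Z plus an increment

rates = [rate(Z)]
for _ in range(10):                    # ten "layers"
    Z = layer(Z)
    rates.append(rate(Z))

# the objective rises layer by layer, as unrolled optimization predicts
assert all(b > a for a, b in zip(rates, rates[1:]))
```

The skip connection falls out of the algebra: each layer returns the previous representation plus a small increment, exactly the residual form the text describes.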
The key insight is that the gradient of the rate reduction objective has a specific mathematical form. Computing it requires two operations at each step. The first operation compresses the representation by pushing tokens from the same class toward a shared subspace. This operation, when derived from the math, takes the form of a multi-head self-attention block. The second operation sparsifies the representation, expressing each token as a sparse combination of learned dictionary elements. This operation takes the form of a feed-forward MLP with a thresholding nonlinearity, essentially the ISTA step from sparse coding.
Each layer of the resulting network performs one compression step (attention) followed by one sparsification step (MLP). The skip connections arise naturally because each gradient step adds an increment to the current representation rather than replacing it. Layer normalization appears because the rate reduction objective requires the features to be properly scaled. Every component of the transformer architecture has a mathematical explanation.
CRATE: the derived architecture
The architecture has a name, CRATE (Coding RAte reduction TransformEr), and its lead author is Yaodong Yu, a PhD student at Berkeley co-advised by Michael Jordan and Yi Ma. Yu arrived at Berkeley from Nanjing University by way of a master's at the University of Virginia, and his research trajectory reflects the unusual position of Ma's lab: he works simultaneously on the mathematical foundations of deep learning and on practical questions of robustness and privacy. The CRATE paper, published at NeurIPS 2023, is the centerpiece of his dissertation. What makes the result distinctive is not just that CRATE performs well, but that every component of its architecture can be traced to a specific term in the optimization objective. It is a white-box transformer: every layer has a known mathematical purpose, every parameter has a known role, and the entire forward pass can be interpreted as iterative optimization of a well-defined objective.
The two core blocks of CRATE are MSSA (multi-head subspace self-attention) and ISTA (the sparse coding step). MSSA compresses the token representations by projecting them toward a mixture of learned subspaces, one per “head.” Each head handles one subspace of the mixture, and the multi-head structure arises because the data distribution is modeled as a mixture of low-dimensional subspaces. This provides a principled answer to a question that has puzzled practitioners: what do different attention heads do? In CRATE, each head is responsible for compressing the tokens toward one component of the learned mixture model.
The ISTA block then takes the compressed tokens and sparsifies them against a learned dictionary. This is the same iterative shrinkage-thresholding algorithm from Part 2, applied at each layer of the network. The nonlinearity, soft thresholding, is a principled alternative to the ReLU activation used in standard transformers, and it arises directly from the sparse coding objective.
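The sparsification step can be sketched in a few lines. This is textbook ISTA for a sparse coding problem with a fixed random dictionary (in CRATE the dictionary is learned, and the layer applies the step to tokens rather than solving the problem to convergence); the sizes, λ, and step size are illustrative.

```python
import numpy as np

def soft_threshold(x, lam):
    # proximal operator of the L1 penalty: shrink toward zero, clip to exact zero
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

rng = np.random.default_rng(2)
d, m = 16, 32
D = rng.normal(size=(d, m)) / np.sqrt(d)   # dictionary (learned in CRATE)
z = rng.normal(size=d)                     # a token to sparsify

# ISTA for min_x 1/2 ||z - D x||^2 + lam ||x||_1:
# a gradient step on reconstruction error, then a sparsifying shrinkage
lam, eta = 0.5, 0.1
x = np.zeros(m)
for _ in range(20):
    x = soft_threshold(x - eta * D.T @ (D @ x - z), eta * lam)

# soft thresholding produces entries that are exactly zero, where ReLU
# merely zeroes negative values; that is the principled difference
assert np.any(x == 0.0) and np.any(x != 0.0)
```

The soft threshold is what makes the resulting code genuinely sparse rather than merely non-negative.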
The forward pass of CRATE thus has a clean interpretation: each layer incrementally pushes the representation toward a more compressed and more sparse form. The compression step (attention) makes each class more coherent. The sparsification step (MLP) makes each token more parsimonious. Layer by layer, the representation converges toward the optimal configuration: a union of incoherent, low-dimensional subspaces.
Does it actually work?
A derived architecture is only interesting if it performs well in practice. CRATE achieves classification accuracy on standard benchmarks, including ImageNet, competitive with the standard Vision Transformer at similar parameter counts. It shows consistent improvement with scale: larger CRATE models perform better, just as larger ViTs do.
But performance parity is not the point. The point is interpretability. Because every layer of CRATE has a known mathematical role, you can look inside a trained model and verify that it is doing what the theory predicts. The coding rate of the representation decreases monotonically across layers, confirming that each layer is indeed compressing. The sparsity of the representation increases across layers, confirming that each layer is sparsifying. The attention maps of CRATE are clean and interpretable, with each head focusing on a coherent subset of tokens that share subspace structure.
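The coding-rate diagnostic described above is easy to state as a toy check: if a layer compresses, the coding rate R(Z) = ½ logdet(I + d/(nε²) ZZᵀ) should drop. The "layer" below is a hand-made compression (pulling tokens partway onto a low-dimensional subspace), and ε and the sizes are illustrative, not CRATE's.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, eps = 16, 128, 0.5
Z = rng.normal(size=(d, n))

def coding_rate(Z):
    # R(Z) = 1/2 logdet(I + d/(n eps^2) * Z Z^T): bits needed to code the
    # tokens up to precision eps; lower means more compressed
    d, n = Z.shape
    scale = d / (n * eps**2)
    return 0.5 * np.linalg.slogdet(np.eye(d) + scale * Z @ Z.T)[1]

# a stand-in for one layer of compression: pull every token halfway
# onto a 4-dimensional subspace
U = np.linalg.qr(rng.normal(size=(d, 4)))[0]
Z_compressed = 0.5 * Z + 0.5 * (U @ (U.T @ Z))

assert coding_rate(Z_compressed) < coding_rate(Z)   # the diagnostic fires
```

In a trained CRATE model this quantity is measured at every layer, and the theory predicts, and the experiments confirm, a monotone decrease.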
This is the white box. Not a black box that happens to perform well, but a transparent system whose behavior can be predicted, verified, and understood from its design principles. In a field where the dominant systems are opaque, where even their creators cannot fully explain why they work, CRATE is a proof of concept that transparency and performance can coexist.
The contrast with standard practice is stark. When a standard ViT fails on a task, debugging is archaeological: you probe hidden layers, visualize attention patterns, and guess at what went wrong. When CRATE fails, the theory tells you where to look. If the coding rate is not decreasing at a particular layer, that layer is not compressing effectively. If the sparsity is not increasing, the dictionary is poorly learned. The mathematical framework turns debugging from art into engineering.
Why the transformer might not be an accident
The deepest implication is not about CRATE itself but about the standard transformer. If a principled derivation from a compression objective produces an architecture that structurally resembles the empirically discovered transformer, then the transformer might work not despite the lack of theory, but because it is an approximate solution to the right optimization problem. The convergence from opposite directions, one empirical and one mathematical, suggests that the transformer is not just one good architecture among many. It is close to the natural architecture for the task of learning compressed representations.
This does not mean the standard transformer is optimal. CRATE differs from the standard transformer in specific ways: its attention heads project onto learned subspaces rather than computing softmax-weighted averages, its nonlinearity is soft thresholding rather than ReLU, and its parameters are initialized from the coding-rate structure. These differences represent potential improvements suggested by theory, and some of them, like the interpretability gains, are significant.
Linear-time attention and causal CRATE
The framework goes further. It derives several variants of CRATE for different settings: a linear-time attention variant called ToST (Token Statistics Transformer) that replaces quadratic attention with a statistics-based compression step, reducing computational cost from quadratic to linear in the sequence length; and a causal variant for sequential data like text, where each token can only attend to preceding tokens. The causal variant is particularly important because it connects the compression framework to language models that generate text one token at a time, showing that the same unrolling principle applies when the data has a natural ordering.
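The linear-time trick can be sketched schematically (this is the flavor of the idea, not the actual ToST operator): instead of comparing all n² token pairs, summarize the tokens with a single second-moment statistic and let each token interact with that summary. The projection W and the update rule below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(4)
d, n = 16, 1024
Z = rng.normal(size=(d, n))
W = rng.normal(size=(d, d)) / np.sqrt(d)   # one head's projection (learned)

# pairwise attention touches all n^2 token pairs: O(n^2 d) work;
# the statistics route below is O(n d^2), linear in sequence length
P = W @ Z                                  # project every token:   O(n d^2)
S = (P @ P.T) / n                          # d x d second moment:   O(n d^2)
Z_out = Z - 0.1 * (W.T @ (S @ P))          # update tokens via the summary

assert Z_out.shape == (d, n)               # cost grew linearly in n
```

The point of the sketch is the cost structure: the only object the tokens share is a d×d statistic, so the sequence length never enters quadratically.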
There is also a conceptual separation between what the forward pass does and what backpropagation does. In CRATE, the forward pass performs optimization: each layer transforms the representation toward better rate reduction. Backpropagation performs learning: it updates the subspace bases and dictionaries at each layer to better model the distribution of the input data. This separation makes the roles of inference and learning explicit, rather than leaving them tangled inside a black-box training loop.
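The separation can be caricatured in a few lines. In this hedged sketch (an illustrative objective and a power-iteration-flavored parameter update, not CRATE's actual training), the forward pass optimizes the representation Z with the subspace parameter U held fixed, while "learning" adjusts U to better fit the data.

```python
import numpy as np

rng = np.random.default_rng(5)
d, n, p = 8, 64, 2
X = rng.normal(size=(d, n))
U = np.linalg.qr(rng.normal(size=(d, p)))[0]   # layer parameter (a subspace)

def forward(X, U, layers=5, step=0.5):
    # inference = optimization: each layer pulls Z toward the subspace
    Z = X
    for _ in range(layers):
        Z = Z + step * (U @ (U.T @ Z) - Z)
    return Z

def learn(X, U, lr=0.1):
    # learning = updating the parameter U toward the data's dominant
    # directions (backprop plays this role in CRATE)
    G = (X @ X.T) @ U
    return np.linalg.qr(U + lr * G)[0]

Z = forward(X, U)   # what a trained net's forward pass computes
U = learn(X, U)     # what training adjusts between forward passes
```

Inference moves the representation; learning moves the model of the data. In a black-box network both happen inside one opaque loop, which is exactly the tangle the white-box framing undoes.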
The trajectory from Part 2 to here has followed a single thread. PCA gave us a two-layer linear network derived from a denoising objective. Sparse coding gave us a deeper nonlinear network derived from a sparsity objective. The rate reduction principle gave us a unified objective function that combines compression and discrimination. And now, unrolled optimization gives us a transformer derived from iteratively maximizing that objective. Each step followed from the previous one by relaxing an assumption or generalizing an approach. The architecture was not designed. It was discovered, the same way twice, once by engineers through years of trial and error, and once by mathematicians through a derivation from first principles. The convergence from opposite directions is perhaps the most compelling argument that the underlying principle is correct.
The next article asks what happens when this framework meets real data: images, 3D objects, human motion, and language. Theory is only as good as its predictions, and the ultimate test is whether the compression thesis survives contact with the messy, high-dimensional, nonlinear world.


