Intelligence Is Compression, Part 2: What Does a Neural Network Remember?
PCA, sparse coding, and the origins of deep architecture
“The art of doing mathematics consists in finding that special case which contains all the germs of generality.” – David Hilbert
This is Part 2 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major architectures of modern AI.
Here is a thought experiment. Turn on a television with no signal. What you see is static: millions of pixels, each flickering to a random color, independent of its neighbors. In theory, that static could resolve into a photograph of a cat. It could produce the Mona Lisa. It could display any image at all. But if you watched for a thousand years, it never would.
This is not a metaphor. It is a mathematical fact, and it is the starting point for everything in this book. The set of images that look like something, photographs of faces, landscapes, handwritten digits, anything recognizable, occupies a vanishingly small fraction of all possible pixel combinations. The rest is noise.
The question that opens representation learning is: why? And the answer is: because the world has structure. Objects have edges. Surfaces have textures. Light obeys physical laws. Faces have two eyes. These regularities constrain what real images can look like, and they constrain it so severely that the “real” images live on a thin surface inside a vastly larger space. The challenge of intelligence, stated mathematically, is to find that surface.
784 dimensions, 20 that matter
The technical term for this gap between apparent complexity and actual complexity is dimensionality. A photograph of 784 pixels lives, formally, in a 784-dimensional space: each pixel is a coordinate. But if all those photographs are of handwritten digits, the meaningful variation is far smaller. How a 3 curves, how wide the loop of a 0 is, how steeply a 7 crosses: a few dozen such factors account for nearly all the differences between one digit and another. The data is nominally 784-dimensional but effectively 20-dimensional. The other 764 dimensions are noise.
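The gap is easy to see numerically. Here is a small sketch using synthetic data as a stand-in for digit images: 20 latent factors mapped into 784 coordinates plus a little pixel noise (the dimensions and noise level here are illustrative choices, not the book's):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 2000, 784, 20          # samples, raw dimension, intrinsic dimension

# Synthetic stand-in for digit images: 20 latent factors of variation
# mapped into 784 pixel coordinates, plus small pixel-level noise.
latents = rng.normal(size=(n, k))
mixing = rng.normal(size=(k, d))
X = latents @ mixing + 0.05 * rng.normal(size=(n, d))

# The singular values of the centered data expose the intrinsic dimension:
# almost all the variance concentrates in the first k directions.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
var = s**2 / np.sum(s**2)
print(f"variance captured by top {k} components: {var[:k].sum():.4f}")
```

The data is nominally 784-dimensional, but 20 directions account for essentially all of its variance.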
This is not unique to images. It is a universal property of natural data. Text, audio, video, sensor readings, all of them have vastly lower intrinsic dimensionality than their raw encoding suggests. The reason is always the same: the physical world that generates the data is governed by a small number of factors compared to the number of measurements we take. A camera has millions of pixels. A face has two eyes, a nose, and a mouth that move in a few degrees of freedom.
The gap between the raw dimensionality and the intrinsic dimensionality is what makes learning possible at all. If the data really did fill the full space, if every pixel combination were equally likely, no algorithm could generalize from a finite number of examples. Learning works precisely because data is compressible.
Pearson’s century-old trick
The oldest and simplest approach to finding that low-dimensional structure is principal component analysis, or PCA. The idea, first formalized by Karl Pearson in 1901 and developed further by Harold Hotelling in the 1930s, is disarmingly simple: find the directions along which the data varies the most. Those directions are the “principal components,” and they define a subspace, a flat sheet, onto which you can project your data while losing the least information.
Consider a concrete example from the book. Take a collection of images of handwritten digits, each stored as a grid of 784 pixels. In principle, those 784 pixel values could take any combination. But the digits share enormous structure: the curve of a 3, the crossbar of a 7, the loop of a 0. PCA discovers that roughly 20 directions of variation capture almost everything about these images. The remaining 764 dimensions are mostly noise.
What PCA gives you is an encoder-decoder pair. The encoder projects each image down from 784 dimensions to 20. The decoder reconstructs it back. The projection is lossy: you lose some detail. But the detail you lose is precisely the noise, the random pixel-level variation that carries no information about which digit you are looking at. The reconstruction is a denoised version of the original.
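The encoder-decoder view can be sketched in a few lines of numpy. Here clean signals live on a known 20-dimensional subspace, we observe them with pixel noise, and the PCA projection recovers something closer to the clean signal than the noisy input it was given (the setup is a contrived illustration, not the book's experiment):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 1000, 784, 20

# Clean signals on a 20-dimensional subspace, observed with pixel noise.
basis = np.linalg.qr(rng.normal(size=(d, k)))[0]      # orthonormal d x k basis
clean = rng.normal(size=(n, k)) @ basis.T
noisy = clean + 0.1 * rng.normal(size=(n, d))

# PCA encoder-decoder: project onto the top-k principal directions.
mean = noisy.mean(axis=0)
U, s, Vt = np.linalg.svd(noisy - mean, full_matrices=False)
encode = lambda x: (x - mean) @ Vt[:k].T              # 784 -> 20
decode = lambda z: z @ Vt[:k] + mean                  # 20 -> 784

recon = decode(encode(noisy))
print("distance to noisy input: ", np.mean((recon - noisy) ** 2))
print("distance to clean signal:", np.mean((recon - clean) ** 2))
```

The reconstruction ends up nearer to the clean signal than to the noisy observation: the detail the projection throws away is, almost entirely, the noise.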
This is the book’s first key insight, one it returns to again and again: the simplest form of learning a representation is learning to denoise. Finding the subspace is the same as finding the denoiser. They are the same operation, viewed from two angles.
PCA even has a natural algorithm, called power iteration, that computes principal components by repeatedly multiplying a random vector by the data’s covariance matrix. Each multiplication amplifies the signal and suppresses the noise. The vector converges, step by step, onto the direction of maximum variance. Something striking about this process: it can be interpreted as denoising from pure noise. You start with random static and iterate your way toward structure.
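Power iteration is short enough to show in full. In this toy setup (dimensions and noise levels are my own choices), one direction carries most of the variance, and repeated multiplication by the covariance matrix pulls a random starting vector onto it:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 50

# Data with one dominant direction of variance plus isotropic noise.
direction = np.zeros(d)
direction[0] = 1.0
X = 3.0 * rng.normal(size=(5000, 1)) @ direction[None, :] \
    + 0.1 * rng.normal(size=(5000, d))
C = np.cov(X.T)

# Power iteration: each multiplication by the covariance matrix amplifies
# the high-variance direction and suppresses everything else.
v = rng.normal(size=d)
for _ in range(100):
    v = C @ v
    v /= np.linalg.norm(v)

print("alignment with true direction:", abs(v @ direction))  # close to 1
```

The starting vector is pure random static; each step is a small act of denoising, and the iteration converges to the principal direction (up to sign, hence the absolute value).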
When flatness fails
PCA works beautifully when data lives on or near a flat surface, a linear subspace. But most real data does not cooperate. Faces, for example, span a low-dimensional surface in pixel space, but that surface is curved and folded in complicated ways. A face turned 30 degrees is close to a face turned 31 degrees, but both are far from a face turned 180 degrees. The surface traces a path through pixel space that no flat plane can follow.
Harold Hotelling himself understood this limitation. Upon hearing George Dantzig present linear programming for the first time, he reportedly objected: “We all know the world is nonlinear.”
So what do you do when a single flat subspace is not enough? You use many of them. Instead of modeling your data as living on one plane, you model it as living on a mixture of planes: one subspace for leftward-facing images, another for frontal views, another for rightward faces. Each local neighborhood of the data is approximately linear, even if the global structure is not.
This is where the mathematics gets interesting. A mixture of subspaces can be rewritten as a single overcomplete dictionary: a large collection of “atoms,” or basis vectors, from which any data point can be reconstructed using only a few. The key constraint is sparsity. For any given input, most of the atoms are silent. Only a handful activate, and they combine to reconstruct that particular input.
This is the idea of sparse coding, and its intellectual history is one of the most satisfying stories in computational neuroscience.
The brain’s dictionary
In 1996, two researchers at UC Berkeley, Bruno Olshausen and David Field, published a paper in Nature that posed a simple question: what happens if you train a mathematical system to represent natural images using as few active components as possible? They took patches of natural photographs, trees, buildings, shadows, textures, and optimized a dictionary of basis vectors such that each patch could be reconstructed from a sparse combination of dictionary elements.
What emerged from the optimization looked like the receptive fields of neurons in the primary visual cortex. Oriented edges at different angles and scales. Gabor-like filters. The atoms that the algorithm learned, purely from a mathematical objective of sparse reconstruction, matched what biologists had measured in the brains of cats and monkeys.
This was not a coincidence. Olshausen and Field had not set out to model the brain. They had set out to find the most efficient code for natural images. The brain-like features were an emergent property of the objective function. If the goal of early visual processing is to represent natural images with a compact code, then the brain’s solution and the mathematical solution converge. The visual cortex might be doing sparse coding, not because evolution was trying to implement a specific algorithm, but because sparse coding is what the objective function demands.
Overcomplete dictionary learning, where the number of atoms exceeds the data’s dimensionality, gives rise to an encoder that is no longer a simple projection. Recovering the sparse code from data requires an iterative algorithm called ISTA (iterative shrinkage-thresholding algorithm), which alternates between a linear operation and a nonlinear thresholding step. Each iteration looks remarkably like one layer of a neural network: a matrix multiplication followed by a ReLU-like activation function. Stack enough iterations and you get a deep network, one whose architecture was not designed by an engineer but derived from an optimization problem.
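A minimal numpy sketch makes the unrolling concrete. The dictionary here is random and the 3-sparse code is hand-picked (all the specific sizes and constants are illustrative assumptions); each loop iteration is exactly the layer described above, a fixed linear map followed by soft thresholding:

```python
import numpy as np

def soft_threshold(x, t):
    # The nonlinear step: shrink toward zero by t, a two-sided shifted ReLU.
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def ista(D, x, lam=0.1, n_layers=200):
    # Each iteration = one "layer": linear map, then pointwise nonlinearity.
    L = np.linalg.norm(D, 2) ** 2            # Lipschitz constant of the gradient
    W = np.eye(D.shape[1]) - D.T @ D / L     # fixed linear weights
    b = D.T @ x / L                          # input-dependent bias
    z = np.zeros(D.shape[1])
    for _ in range(n_layers):
        z = soft_threshold(W @ z + b, lam / L)
    return z

rng = np.random.default_rng(3)
D = rng.normal(size=(64, 256))
D /= np.linalg.norm(D, axis=0)               # overcomplete dictionary: 256 atoms

z_true = np.zeros(256)
z_true[[5, 80, 200]] = [1.0, -2.0, 1.5]      # only 3 atoms are active
x = D @ z_true                               # the observed signal

z = ista(D, x, lam=0.05)
print("atoms clearly active in recovered code:", np.sum(np.abs(z) > 0.5))
```

Freeze the loop at a fixed depth, treat `W` and `b` as learnable weights, and you have exactly the stack of matrix-multiply-plus-nonlinearity layers that an engineer would call a deep network.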
From iteration to architecture
This is the conceptual bridge that connects classical statistics to modern deep learning. PCA gives you a two-layer linear network: encode, decode. Sparse coding gives you a deeper, nonlinear network: each iteration of the sparse recovery algorithm adds a layer. The architecture is not arbitrary. It is the unrolled form of an optimization algorithm that is solving the right problem.
The observation extends further. The algorithm for complete dictionary learning, where the number of atoms equals the data’s dimension, called MSP (matching, stretching, and projection), can be interpreted as a deep linear network. Each layer computes an incremental rotation of the data, and the composition of all layers recovers the dictionary. No backpropagation is needed. The network is learned in a single forward pass.
For overcomplete dictionaries, the ISTA algorithm leads to a nonlinear deep network whose weights depend on the dictionary. Each layer performs one step of sparse recovery, and the output of the final layer is the sparse code. The functional form of each layer, a linear transform followed by soft thresholding, is strikingly similar to the residual network layers that power modern deep learning.
This is the pattern the book will repeat across the next several chapters: start with a mathematical objective, derive the optimization algorithm, unroll the algorithm into a network, and observe that the resulting architecture matches what practitioners discovered empirically. The theory does not just explain why architectures work. It generates them.
Structure, not data
So what does a neural network remember? Structure. Not individual data points, not raw pixel values, but the low-dimensional patterns that make the data predictable.
PCA remembers the principal directions of variation. Sparse coding remembers a dictionary of atoms and, for each input, which atoms to activate. Both are forms of compression: they reduce high-dimensional observations to low-dimensional codes that preserve what matters and discard what does not.
But there is a subtlety worth handling with care. PCA finds the globally optimal linear subspace. Sparse coding, by contrast, involves a nonconvex optimization problem, one with multiple local minima. In many cases the landscape is “benign”: all local minima are close to the global optimum, and descent methods can find them reliably. This result matters because it offers a theoretical guarantee for something practitioners already knew worked.
The progression from PCA to sparse coding to deep networks is not just a historical narrative. It is a logical one. Each step relaxes an assumption, and the resulting algorithm becomes more powerful but also more complex. The book’s claim is that this progression is not arbitrary. It is driven by a single principle: the need to compress structured data into a compact, informative code.
There is a line from Hilbert, quoted at the top of this article, about the art of mathematics consisting in finding the special case that contains the germs of generality. PCA is that special case for representation learning. It is too simple to handle real data, but every idea in it, encoding as projection, decoding as reconstruction, learning as denoising, optimizing over subspaces, carries forward into the more powerful methods that follow.
The next article follows that principle into noisier territory. If denoising is the simplest form of learning a representation, what happens when you take denoising itself as your training objective? The answer turns out to be diffusion models, the technology behind modern image generators.


