Intelligence Is Compression, Part 3: Learning by Denoising
The compression principle behind Stable Diffusion and DALL-E
“Information is the resolution of uncertainty.” – Claude Shannon, 1948
This is Part 3 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma's group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major architectures of modern AI.
Denoising is compression. That is the claim, and it is not obvious. When you corrupt an image with static and then train a network to recover the original, the network seems to be learning a cleanup routine. But the argument is that it is learning something deeper: the low-dimensional structure of the data itself. You can only remove the noise if you know where the real data lives. And knowing where the real data lives is exactly what it means to have learned a compressed representation.
This reframes the most commercially successful family of generative models. Stable Diffusion, DALL-E, and their descendants all work by reversing a noise-corruption process step by step. The standard explanation is mechanical: add noise, learn to reverse each step, then run the reversal to generate. The book’s explanation is structural: each denoising step is a compression step. Each step pushes data back toward the thin manifold, the low-dimensional surface in a high-dimensional space, where real images live. Generation is the diffusion process run backward, with compression powering each step.
Noise destroys; denoising discovers
The argument starts with a thermodynamic intuition. Take any structured data, a photograph, a sentence, a recording, and add random noise. The more noise you add, the more the data resembles random static. In information-theoretic terms, the entropy increases. Structure is destroyed. Information is lost.
This is something anyone can verify with a camera and a thought experiment. Take a sharp photograph and start adding pixel noise. At first, you see a slightly grainy version of the same image. Add more, and the details disappear: textures blur, edges soften, fine features vanish into static. Add enough, and the photograph is unrecognizable. The structure of the image, the thing that made it a picture of something rather than nothing, has been diluted into randomness.
The proof is rigorous for the simplest case: adding Gaussian noise to samples from any distribution. At time zero, you have your data, low-entropy, structured. As time increases, the noise accumulates, and the distribution spreads out until, at large enough time, everything looks approximately like a featureless Gaussian blob. The entropy strictly increases at every step. This is one direction of the diffusion process.
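The forward process is easy to simulate. The sketch below is a toy construction of my own, not the book's: it uses the standard variance-preserving form x_t = sqrt(abar_t)*x_0 + sqrt(1 - abar_t)*eps with an illustrative linear beta schedule, and tracks a crude histogram-based entropy estimate as the noise accumulates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "data": 1-D samples tightly clustered around two modes (low entropy).
x0 = np.concatenate([rng.normal(-2, 0.1, 5000), rng.normal(2, 0.1, 5000)])

# Variance-preserving forward process: x_t = sqrt(abar_t)*x0 + sqrt(1-abar_t)*eps,
# with an illustrative linear beta schedule.
betas = np.linspace(1e-4, 0.02, 1000)
abar = np.cumprod(1.0 - betas)

entropies = []
for t in [0, 250, 999]:
    eps = rng.normal(size=x0.shape)
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps
    # Crude discrete entropy estimate from a histogram: it rises with t.
    hist, edges = np.histogram(xt, bins=100, density=True)
    p = hist[hist > 0] * np.diff(edges)[0]
    entropies.append(-(p * np.log(p)).sum())

print([round(h, 2) for h in entropies])  # entropy grows as structure dissolves
```

At t = 0 the samples sit in a handful of histogram bins; by the final step they are spread across a near-Gaussian blob, and the entropy estimate rises accordingly.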
Now comes the key question: can you reverse the process? Can you start with the high-entropy noise and, step by step, push it back toward the low-entropy data?
The answer is yes, and the mechanism is the Bayes-optimal denoiser: the mathematically best possible denoiser, the one that makes the smallest average error. For any noise level, it computes the conditional expectation: given this noisy observation, what is the best guess of the original clean data? A beautiful result called Tweedie’s formula shows that the optimal denoiser takes a single step of gradient ascent on the log-density of the noisy distribution. In other words, the denoiser pushes data toward the high-probability regions, toward the modes of the distribution, toward structure.
But one step is not enough. A single application of the denoiser at high noise levels produces outputs that are blurry averages, sitting between the modes of the data rather than on them. This is the same problem as trying to minimize a complex function with a single gradient step when you start far from the optimum. You need many small steps.
One step at a time, from static to signal
The solution is incremental denoising. Instead of trying to remove all the noise at once, you remove it in stages. Start with very noisy data at time T. Apply the denoiser to get a slightly less noisy estimate at time T minus a small step. Use that estimate as input to another denoiser calibrated for a slightly lower noise level. Repeat, stepping down the noise schedule, until you arrive at time zero: the clean data.
Each step does a small amount of work. Each step pushes the estimate a little closer to the data manifold. The composition of all steps, from pure noise to clean signal, traces a path through the space of distributions that progressively resolves structure at finer and finer scales. Large-scale features, the overall layout of an image, the broad shape of a face, emerge first. Fine details, textures, edges, subtle shading, emerge last. Shannon’s phrase applies at every step: information is the resolution of uncertainty, and each denoising step resolves a little more.
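The chain can be sketched on a toy distribution whose score is known exactly: two point masses, whose noisy marginal is a two-component Gaussian mixture. The loop below is a crude "denoise with one Tweedie step, then re-noise at a lower level" sampler, an illustrative simplification of my own rather than the exact reverse process.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy data distribution: point masses at -2 and +2. Its noisy marginal at level
# sigma is the mixture 0.5*N(-2, sigma^2) + 0.5*N(+2, sigma^2), whose score is
# closed form, so no network is needed for this illustration.
def score(x, sigma):
    z = np.clip(-4.0 * x / sigma**2, -500, 500)   # clip for numerical stability
    w = 1.0 / (1.0 + np.exp(z))                   # posterior weight of the +2 mode
    mean = 2.0 * w + (-2.0) * (1.0 - w)           # E[x0 | x], the optimal denoise
    return (mean - x) / sigma**2                  # Tweedie's formula, rearranged

# Incremental sampler: denoise, then re-noise at the next lower level,
# stepping down a geometric noise schedule.
sigmas = np.geomspace(3.0, 0.01, 50)
x = sigmas[0] * rng.normal(size=2000)             # start from (almost) pure noise
for s_hi, s_lo in zip(sigmas[:-1], sigmas[1:]):
    x0_hat = x + s_hi**2 * score(x, s_hi)         # current best guess of clean data
    x = x0_hat + s_lo * rng.normal(size=x.shape)  # re-noise at the lower level

frac = np.mean(np.abs(np.abs(x) - 2.0) < 0.1)     # fraction of samples near a mode
print(frac)
```

At the high noise levels, the denoiser only decides which side of zero each sample falls on; the exact positions at the modes are resolved in the final, low-noise steps, mirroring the coarse-to-fine story above.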
This is exactly how diffusion models generate images. The “reverse process” that runs during image generation is an incremental denoising chain. The compression thesis explains why: each denoising step is a compression step. It reduces the entropy of the representation by exploiting the low-dimensional structure of the data. Generation runs the diffusion process backward, and compression is what powers each step.
The neural network that powers this chain has been trained to approximate the Bayes-optimal denoiser at each noise level. The training recipe is surprisingly simple: take a clean image, add random noise at a random intensity, and train the network to predict what the noise was. A network that can separate noise from signal at every intensity has, by definition, learned the structure of the data.
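One training step of that recipe can be sketched in a few lines. The schedule is an illustrative linear choice, and `model` is a hypothetical placeholder standing in for the network eps_theta(x_t, t); in practice it would be a U-Net or transformer trained by gradient descent on this loss.

```python
import numpy as np

rng = np.random.default_rng(2)

betas = np.linspace(1e-4, 0.02, 1000)   # illustrative linear noise schedule
abar = np.cumprod(1.0 - betas)

def model(xt, t):
    # Hypothetical placeholder for the network eps_theta(x_t, t); it predicts
    # zero noise so the loss computation below runs end to end.
    return np.zeros_like(xt)

def training_loss(x0):
    t = rng.integers(0, len(betas))                   # random noise intensity
    eps = rng.normal(size=x0.shape)                   # the noise actually added
    xt = np.sqrt(abar[t]) * x0 + np.sqrt(1.0 - abar[t]) * eps
    return np.mean((eps - model(xt, t)) ** 2)         # regress on the added noise

x0 = rng.normal(size=(64, 32, 32))                    # a batch of toy "images"
loss = training_loss(x0)
print(round(loss, 3))
```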
Two labs, same answer
The mathematical core of this mechanism is an object called the score function, and two research groups discovered it independently.
At Stanford, a PhD student named Yang Song was working with Stefano Ermon on a seemingly niche problem: at any point in the data space, which direction points toward where data is most concentrated? Song had come to Stanford from Tsinghua University with a background in mathematics and physics, and his instinct was to treat generative modeling as a problem in differential equations rather than through the adversarial networks that dominated the field at the time. His 2019 NeurIPS paper proposed learning score functions at multiple noise levels and then generating data by following the score field uphill, toward regions of high probability. It was an oral presentation, the top tier, but at the time few people outside the generative-modeling community took notice.
Meanwhile, across the Bay at Berkeley, Jonathan Ho was finishing his PhD under Pieter Abbeel. Ho’s approach was different in language but eerily similar in substance. His June 2020 paper, “Denoising Diffusion Probabilistic Models,” framed the problem as a variational inference task inspired by nonequilibrium thermodynamics: add noise in small steps, learn to reverse each step. The generated images were stunning: faces, bedrooms, churches, rendered with a clarity that rivaled GANs. It was Song who later proved, in a landmark 2021 paper, that the two approaches were discretizations of the same underlying stochastic differential equation. Score-based models and diffusion models were not competing methods. They were the same method, viewed from two angles. The compression framework absorbs both.
The score function itself is the gradient of the log-density of a distribution at a point. Think of it as a vector field spread across the entire data space, with an arrow at every point indicating which direction leads toward higher probability. Near a cluster of face images, the arrows point inward, toward the center of the cluster. In the empty space between clusters, the arrows point toward the nearest mode.
The optimal denoiser is equivalent to taking a score-based step. If you know the score function of the noisy distribution at every noise level, you can denoise. And if you can denoise incrementally, you can generate.
When does the model understand?
When does a denoising model learn the true distribution, and when does it merely memorize training examples?
The answer is precise. If you have infinitely many training samples, the learned denoiser converges to the Bayes-optimal denoiser, and the generated samples come from the true data distribution. The model generalizes. But with finite samples, the situation is more subtle. If the number of samples is too small relative to the complexity of the distribution, the denoiser may simply learn to map noisy inputs back to the nearest training example. In this case, the model memorizes.
The transition between these regimes depends on the relationship between sample count, data dimensionality, and noise level. At high noise levels, where only coarse structure matters, even a moderate number of samples may suffice for generalization. At low noise levels, where fine details must be resolved, far more samples are needed. This provides a principled explanation for a phenomenon that practitioners have observed: diffusion models often generate novel large-scale compositions while reproducing fine-grained textures from the training set.
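The memorization endpoint can be made concrete. The Bayes-optimal denoiser for the empirical distribution, point masses on the training samples, is a softmax-weighted average of those samples, and at low noise it snaps to the nearest training example. The sketch below uses made-up data, and `empirical_denoise` is my own illustrative name.

```python
import numpy as np

rng = np.random.default_rng(4)

# The Bayes-optimal denoiser for the *empirical* distribution (point masses on
# the training set) is a softmax-weighted average of the training samples.
train = rng.normal(size=(20, 2))                     # a tiny made-up "training set"

def empirical_denoise(y, sigma):
    d2 = ((y - train) ** 2).sum(axis=1)              # squared distance to each sample
    w = np.exp(-(d2 - d2.min()) / (2.0 * sigma**2))  # softmax over training points
    w /= w.sum()
    return w @ train                                 # posterior-weighted average

y = rng.normal(size=2)                               # a noisy query point
nearest = train[np.argmin(((y - train) ** 2).sum(axis=1))]

print(empirical_denoise(y, 10.0))   # high noise: a blurry average of the data
print(empirical_denoise(y, 0.02))   # low noise: snaps to the nearest sample
print(nearest)
```

A real model sits between these extremes: its learned denoiser smooths over the training points, and how much it smooths at each noise level is exactly the memorization-generalization question.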
This is directly relevant to the legal and ethical debates about generative AI. When critics accuse image generators of “copying” training data, the truth is more nuanced. The model copies at the noise levels where it has insufficient data to learn the true distribution, and it generalizes at the noise levels where it has enough. The boundary depends on the dataset size, the data’s complexity, and the resolution of the features in question.
The analysis does not resolve the debate, but it provides the first rigorous mathematical framework for discussing it. Instead of arguing about whether models “understand” or “copy” in vague terms, we can now ask precise questions: at what noise level does memorization dominate? How many samples would be needed to generalize at a given resolution? These are empirical questions with mathematical structure behind them.
From linear to nonlinear, one principle
The previous article showed that PCA, the simplest representation learning algorithm, is equivalent to a linear denoiser. The encoder projects data onto a subspace, and the projection removes the noise. The idea generalizes to nonlinear distributions. The data no longer lives on a flat plane; it lives on a curved manifold, possibly a mixture of manifolds with different dimensions. The denoiser is no longer a simple projection; it is a learned function that pushes data toward the manifold at every point in space.
But the principle is identical. In both cases, the denoiser works because the data has low-dimensional structure, and removing noise means exploiting that structure. The difference is only in the complexity of the surface and the algorithm needed to find it. PCA finds a flat subspace with a closed-form solution. A diffusion model finds a nonlinear manifold with a neural network trained on millions of examples. Both are solving the same optimization problem: minimize the entropy of the representation.
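The linear case can be checked numerically. The toy sketch below, my own construction rather than the previous article's exact setup, plants data on a two-dimensional subspace of a ten-dimensional space, adds noise, and projects onto the top principal subspace estimated from the noisy data.

```python
import numpy as np

rng = np.random.default_rng(3)

# Clean data on a 2-D subspace of 10-D space, then corrupted with noise.
basis = np.linalg.qr(rng.normal(size=(10, 2)))[0]   # orthonormal 10x2 basis
x = rng.normal(size=(500, 2)) @ basis.T             # clean points on the plane
y = x + 0.3 * rng.normal(size=x.shape)              # noisy observations

# PCA as a linear denoiser: estimate the top-2 principal subspace from the
# noisy data and project onto it, discarding the 8 noise-only directions.
_, _, Vt = np.linalg.svd(y - y.mean(axis=0), full_matrices=False)
P = Vt[:2].T @ Vt[:2]                               # rank-2 projector
x_hat = y @ P

mse_noisy = np.mean((y - x) ** 2)
mse_denoised = np.mean((x_hat - x) ** 2)
print(mse_noisy, mse_denoised)                      # projection shrinks the error
```

The projection removes the noise in the eight directions orthogonal to the data plane; a diffusion model does the nonlinear analogue of this at every point in space.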
Gaussian mixture models serve as a useful middle ground, complex enough to illustrate nonlinear phenomena, simple enough to admit exact solutions. For a mixture of Gaussians, the Bayes-optimal denoiser has a closed-form expression: a weighted average of per-component denoisers, where the weights are the posterior probabilities of each component. This example makes the theory concrete. You can see exactly how the denoiser classifies by weighting the components, compresses by projecting onto each component’s subspace, and generates by running the reverse diffusion process.
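That closed form is short enough to write out. The sketch below implements it for a one-dimensional two-component mixture; all parameters, and the function name `denoise`, are illustrative choices of my own.

```python
import numpy as np

# Illustrative 1-D mixture: two Gaussian components (all parameters made up).
mus = np.array([-3.0, 3.0])     # component means
vs = np.array([0.25, 1.0])      # component variances
pis = np.array([0.5, 0.5])      # mixture weights
sigma = 1.0                     # noise level of the observation y = x + sigma*eps

def denoise(y):
    y = np.atleast_1d(np.asarray(y, dtype=float))[:, None]
    var = vs + sigma**2   # noisy marginal of component k is N(mu_k, v_k + sigma^2)
    # Posterior probability of each component given the noisy observation.
    logw = np.log(pis) - 0.5 * np.log(var) - 0.5 * (y - mus) ** 2 / var
    w = np.exp(logw - logw.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    # Per-component denoiser: shrink y toward that component's mean.
    per_comp = mus + (vs / var) * (y - mus)
    return (w * per_comp).sum(axis=1)   # posterior-weighted average

out = denoise([-3.5, 0.0, 3.5])
print(out)   # points near a mode are pulled toward that mode
```

The posterior weights do the classification, the per-component shrinkage does the compression, and iterating the whole thing down a noise schedule does the generation.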
Denoising is not a trick
Diffusion models are often presented as a clever engineering trick: add noise, then learn to reverse it. The compression thesis reframes them as a natural consequence of a deeper principle. If the goal of learning is to compress data by finding its low-dimensional structure, and if the most natural way to test whether you have found that structure is to corrupt the data and see if you can recover it, then denoising is not a trick. It is the method.
Whether you frame the problem as learning a denoiser, estimating a score function, or minimizing a coding rate, you end up with the same mathematical object. The unification is not a story imposed after the fact. It falls out of a single objective: compress the data by learning its low-dimensional structure.
And it has a deeper pedigree than machine learning. The diffusion process that corrupts data is formally identical to the heat equation: entropy increases as heat spreads. The reverse process is a time-reversed heat equation: entropy decreases as structure is recovered. Representation learning and thermodynamics turn out to be the same mathematics, and the score function plays the role of the gradient of the free energy.
Denoising distinguishes signal from noise. The next question is harder: how do you distinguish signal from signal? A denoiser knows that both a cat and a dog photograph are different from static. It does not know that a cat is different from a dog. Answering that question requires a different principle, lossy compression, and a different objective, the rate reduction principle, which is the mathematical heart of the compression framework.


