Intelligence Is Compression, Part 4: The Information Game
A single objective that explains DINO, contrastive learning, and classification
“We compress to learn, and we learn to compress.” – High-Dimensional Data Analysis with Low-Dimensional Models, Wright and Ma, 2022
This is Part 4 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major architectures of modern AI.
Every time you save a JPEG, you make a bargain: the image gets smaller, but some detail is lost. JPEG decides what to discard based on what the human eye can’t see. But there is a harder version of this question. What if the goal is not to reconstruct one image faithfully, but to tell one kind of image from another? A doctor reading an X-ray does not need every pixel. She needs the features that distinguish healthy tissue from a tumor. A self-driving car does not need a perfect photograph of the road. It needs to tell a pedestrian from a lamppost. The question shifts from “which details are imperceptible?” to “which details are irrelevant for discrimination?” This is the question representation learning must answer.
A good representation is a lossy code: one that preserves everything that distinguishes one category from another, and discards the rest. The challenge is finding a mathematical measure that captures how well a representation achieves this. That turns out to be harder than it sounds.
The measure that fails
Imagine eight points arranged on a line in two-dimensional space. In one arrangement, the points are evenly spaced. In another, they cluster tightly in two groups with a gap in between. These are clearly different structures. Any useful measure of “how well-organized is this data” should be able to tell them apart.
The three standard measures from statistics and information theory (dimension, volume, and entropy) cannot. Both arrangements have the same dimension: one. The same volume: zero, since a line has no area. The same entropy: three bits, since log₂ 8 = 3 when all eight points are equally likely. All three assign identical values to both arrangements. The structure is invisible to them.
The problem is that these measures were designed for data that fills a space, not data that lies on a thin surface inside a larger space. When data is low-dimensional, the standard measures degenerate: they return zero, or infinity, or the same value regardless of structure.
The resolution is to measure with finite precision. Instead of asking “how much information does this distribution contain?” you ask a more practical question: “how many bits do I need to encode each sample so that the reconstruction error stays below a fixed tolerance?” The encoding is lossy: you accept some error, because the goal is to measure structure, not to reconstruct every detail. In exchange you get a finite, meaningful number even for data that lives on a thin surface. This quantity is the lossy coding rate.
Return to the eight points. In the uniform arrangement, the points spread across a long line. To specify each point within tolerance ε, you must distinguish many positions along the full length. In the two-cluster arrangement, you first say which cluster (1 bit), then specify a position within a much shorter range. The clustered data is cheaper to encode because the structure makes it more compressible.
The standard measures missed this entirely. Why? Because they treat data points as abstract symbols: eight labels, arbitrarily assigned, with no notion of distance. Two points can be neighbors in space but get unrelated codes. The geometry is invisible. Lossy coding forces geometry back in. The moment you require that points within distance ε share a code, the encoding must respect proximity. Nearby points get the same code. Distant points get different codes. Structure becomes visible because the encoding is forced to reflect it.
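The contrast can be made concrete in a few lines of NumPy. This is a minimal sketch, not the book's implementation: it uses the log-det coding-rate estimate from the MCR² literature, R(Z) = ½ log₂ det(I + d/(nε²) ZZᵀ), centers each group before encoding it, and picks the point positions and tolerance ε arbitrarily for illustration.

```python
import numpy as np

def entropy_bits(n_points):
    # Discrete entropy of n equally likely symbols: blind to geometry.
    return np.log2(n_points)

def coding_rate(Z, eps=0.1):
    # Lossy coding rate (in bits) for the columns of Z at distortion eps,
    # using the log-det estimate R(Z) = 1/2 log2 det(I + d/(n eps^2) Z Z^T).
    d, n = Z.shape
    logdet = np.linalg.slogdet(np.eye(d) + d / (n * eps**2) * Z @ Z.T)[1]
    return 0.5 * logdet / np.log(2)  # convert nats to bits

def split_rate(Z, eps=0.1):
    # Encode the left and right halves separately (each centered on its own
    # mean), plus 1 bit per sample to say which group a point belongs to.
    groups = [Z[:, :4], Z[:, 4:]]
    rates = [coding_rate(g - g.mean(axis=1, keepdims=True), eps) for g in groups]
    return 0.5 * rates[0] + 0.5 * rates[1] + 1.0

# Eight points on a line in 2D: evenly spaced vs. two tight clusters.
uniform = np.vstack([np.linspace(-3.5, 3.5, 8), np.zeros(8)])
clustered = np.vstack([np.concatenate([np.linspace(-3.15, -2.85, 4),
                                       np.linspace(2.85, 3.15, 4)]),
                       np.zeros(8)])

for name, Z in [("uniform", uniform), ("clustered", clustered)]:
    print(f"{name:9s} entropy {entropy_bits(8):.1f} bits | "
          f"whole-set rate {coding_rate(Z):.2f} bits | "
          f"split rate {split_rate(Z):.2f} bits")
```

The discrete entropy is three bits for both arrangements, exactly as the text says. The lossy rates differ: splitting barely helps the uniform arrangement, but cuts the clustered arrangement's rate sharply, because "which cluster, then where within it" is a far shorter description there.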
Compact within, spread apart
To see how the lossy coding rate becomes a learning principle, consider a concrete case. You have a thousand images of cats and dogs, each encoded as a high-dimensional vector. Encode all thousand together, treating them as a single pool: the encoder must cover the full spread of both cats and dogs, so the lossy coding rate is high. Say 50 bits per sample. Now suppose you know which images are cats and which are dogs. Encode each group separately. The 500 cat images are similar to each other, varying only in ear shape, fur pattern, pose. Encoding a cat requires only describing where it sits within the space of cats. Say 15 bits. Dogs similarly: 18 bits. The weighted average is (500 · 15 + 500 · 18) / 1000 = 16.5 bits per sample.
The difference, 33.5 bits, is the rate reduction. It measures how much information the class structure contains. Maximizing this difference is the goal of representation learning.
This is the principle of maximal coding rate reduction, abbreviated MCR-squared. It is a single objective function that says: a good representation should make the total coding rate as large as possible while making the per-class coding rate as small as possible. In practice, this means two things. Between-class discrimination: features from different classes should be spread apart in the representation space, so that encoding them all together is expensive. Within-class compressibility: features from the same class should cluster tightly, so that encoding each class individually is cheap. The gap between the two is the measure of quality.
A representation that scores well on this objective has captured the structure of the data: the class boundaries, the categorical distinctions, the organization that makes a thousand images more than a pile of unrelated pixels.
With two classes this is straightforward. With a thousand, it is not. ImageNet has 1000 categories, and a typical network maps each image to a 512-dimensional feature vector. Each category is not a single point in that space but a low-dimensional subspace: golden retrievers vary along perhaps five directions (angle, lighting, fur shade), while sports cars vary along eight (color, angle, model, background). The rate reduction objective must arrange 1000 subspaces of different shapes and dimensions in 512-dimensional space so that each is internally compact and all are mutually spread apart. This is not a problem intuition can solve. It is a problem an objective function was built for.
Contrastive learning falls out
One of the most satisfying results is the connection to contrastive learning. When DINO was published in 2021, it surprised the field by producing excellent image features without any labels at all. The features were so well-organized that a simple linear classifier applied on top of them could compete with fully supervised systems.
DINO and related contrastive methods turn out to be approximate solutions to the rate reduction objective. They work because they are solving, roughly, the right optimization problem. The “push similar things together, push different things apart” heuristic that practitioners developed by intuition turns out to be an approximation to “maximize the coding rate reduction.” The theory does not just explain why contrastive learning works. It explains what, mathematically, contrastive learning is doing.
This is a pattern the series has traced before. In Part 2, sparse coding reproduced the receptive fields of the visual cortex. In Part 3, denoising reproduced the mechanism behind diffusion models. Here, the rate reduction principle reproduces the behavior of contrastive learning. Each time, a method that was discovered empirically turns out to be an approximate solution to the same underlying optimization problem. The convergence across methods, developed independently by different teams for different purposes, is the strongest evidence for the framework’s core claim: these are not different ideas. They are different approximations to the same idea.
The DINO result deserves a moment of context, because its surprise value is what makes the theoretical explanation meaningful. In April 2021, Mathilde Caron and a team at Facebook AI Research released a self-supervised method with a disarmingly simple recipe: train two copies of a vision network on different augmented views of the same image, with one copy slowly tracking the other’s weights. No labels. No contrastive pairs explicitly constructed. No adversarial training. The method was called DINO, and the features it learned were remarkable. A linear classifier placed on top of DINO features could match the accuracy of networks trained with millions of human-annotated labels. Even more striking, the attention maps of DINO models spontaneously segmented objects from backgrounds, a capability that no one had trained the network to exhibit. The AI community’s reaction was a mix of excitement and bafflement: it worked, but why? The rate reduction framework provides a mathematical answer. DINO’s training objective, which pushes augmented views of the same image together and allows representations of different images to spread apart, is an approximation to maximizing coding rate reduction. The mystery is not that DINO works. The mystery is that a principled objective, derived from information theory, predicts what an empirically discovered method does.
This is particularly important for the unsupervised case, where class labels are not available. The rate reduction principle extends to this setting by introducing a soft membership matrix: instead of assigning each data point to a single class, you assign probabilities of belonging to each of several latent groups. The objective becomes a joint optimization over both the representation and the group assignments. The resulting algorithm simultaneously learns to represent the data and to cluster it, because representation and clustering are, in this framework, two sides of the same coin.
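A sketch of this extension, with a soft membership matrix Π whose rows are probabilities over groups (the log-det rate estimate, the toy data, and ε are assumptions carried over for illustration; a real system would also optimize Π rather than fix it):

```python
import numpy as np

def soft_rate_reduction(Z, Pi, eps=0.5):
    # Z: (d, n) features. Pi: (n, k) soft memberships, each row summing to 1.
    # Compressed term, per group j:
    #   tr(pi_j)/(2n) * logdet(I + d/(tr(pi_j) eps^2) * Z diag(pi_j) Z^T)
    d, n = Z.shape
    total = 0.5 * np.linalg.slogdet(np.eye(d) + d / (n * eps**2) * Z @ Z.T)[1]
    compressed = 0.0
    for j in range(Pi.shape[1]):
        pj = Pi[:, j]
        tr = pj.sum()
        if tr < 1e-8:
            continue  # an empty group contributes nothing
        M = np.eye(d) + d / (tr * eps**2) * (Z * pj) @ Z.T  # Z diag(pj) Z^T
        compressed += tr / (2 * n) * np.linalg.slogdet(M)[1]
    return total - compressed

# Same kind of toy data: two groups on orthogonal subspaces.
rng = np.random.default_rng(0)
Z = np.zeros((16, 200))
Z[0:2, :100] = rng.normal(size=(2, 100))
Z[2:4, 100:] = rng.normal(size=(2, 100))

onehot = np.zeros((200, 2))           # memberships matching the true groups
onehot[:100, 0] = 1.0
onehot[100:, 1] = 1.0
uninformed = np.full((200, 2), 0.5)   # every point half in each group

print(f"one-hot memberships:    {soft_rate_reduction(Z, onehot):.2f}")
print(f"uninformed memberships: {soft_rate_reduction(Z, uninformed):.2f}")
```

Note what happens with uninformed memberships: assigning every point equally to every group just reconstructs the total rate, so the reduction is exactly zero. The objective earns a positive reduction only by discovering the clusters, which is why representation and clustering fall out of the same optimization.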
Packing clusters into space
Whether the classes are labeled in advance or discovered during training, the optimal solution has the same geometry. Think of the representation space as a room. Each class of data occupies a region of that room. The rate reduction objective says: make each region as small as possible, and spread the regions as far apart as possible. Cats in one tight corner, dogs in another, with maximum distance between them.
The total coding rate measures how much of the room the data occupies overall. The per-class coding rate measures how much each cluster occupies individually. Maximizing the difference means packing many small, tight clusters into the largest possible total space, with the clusters as far apart and as orthogonal to each other as possible. This is the spatial version of the bit-counting argument from the previous section: spreading the classes apart makes the total encoding expensive, while compressing each class makes the individual encodings cheap.
Shannon faced a version of this problem in 1948. He was designing communication channels: how many distinct messages can you transmit over a noisy phone line? Each message is encoded as a signal. If two signals are too similar, noise will make them indistinguishable. So you want each signal to occupy a small, tight region of the signal space, and you want the regions to be as far apart as possible. Shannon proved there is a fundamental limit to how many signals you can pack into a channel of fixed bandwidth, and that the optimal packing achieves it by making signals compact and spread apart.
The rate reduction principle is the same problem in a different setting. Instead of signals in a channel, you have classes in a representation space. Instead of noise corrupting messages, you have variation within each class. The objective is identical: pack as many distinguishable clusters as possible, each as tight as possible, into the available space. The information game is the same game. Shannon played it for communication. The compression thesis plays it for learning.
From objective to network
The rate reduction principle is an objective function: it tells you what a good representation looks like, but it does not yet tell you how to find one. For simple distributions, like mixtures of Gaussians, the optimal representation can be computed analytically. For complex distributions, like natural images, you need a neural network to transform the data into a representation that maximizes the rate reduction.
The next article takes this step. If you take the rate reduction objective and optimize it iteratively, unrolling each iteration into a layer of a neural network, what architecture emerges? The answer is a network with components that look remarkably like the attention heads and MLP blocks of a transformer. Not because someone designed it to look that way, but because the mathematics demands those specific operations. It is the most dramatic result in the compression thesis.


