Representations That Transfer
How neural networks learn internal representations that let them handle unseen cases. Why good representations make hard problems easy.
This is Chapter 6 of A Brief History of Artificial Intelligence.
In early 2013, Tomas Mikolov was staring at a list of numbers that made no sense—and perfect sense at the same time.
Mikolov, a Czech computer scientist at Google’s Mountain View campus, had spent months on a seemingly mundane problem: how to represent words as numbers. The standard approach was brute force. Assign each word a unique ID. “Cat” is word 4,537. “Dog” is word 12,891. Simple, but useless—the numbers captured nothing about what words meant. Cat was as numerically distant from dog as it was from democracy or quantum.
So Mikolov tried something different. He built a small neural network and trained it on a simple task: given a word, predict the words that tend to appear nearby. Feed it “cat” and have it guess “furry,” “pet,” “meow.” The network’s internal representation—the pattern of numbers it used to encode each word—was just a means to an end, a hidden layer on the way to prediction.
But when Mikolov examined those hidden representations, he found something interesting.
The network had organized words into a vast geometric space. Each word was a point, defined by hundreds of numbers—its coordinates in this abstract space. Words with similar meanings clustered together. “Cat” was near “dog” and “pet.” “King” was near “queen” and “monarch.” So far, unsurprising.
Then Mikolov tried an experiment that would become legendary.
He took the coordinates for “king,” subtracted the coordinates for “man,” added the coordinates for “woman,” and searched for the nearest word.
The answer was “queen.”
King − man + woman ≈ queen.
The network had learned, without anyone teaching it, that the relationship between “king” and “man” is the same as the relationship between “queen” and “woman.” Royalty was a direction in space. Gender was another direction. The geometry encoded analogy itself.
Mikolov tested more relationships. Paris − France + Italy ≈ Rome. The “capital of” relationship was a direction. Walking − walked + swam ≈ swimming. Verb tense was a direction. Bigger − big + small ≈ smaller. Comparison was a direction.
The network had discovered that meaning has geometry. Concepts aren’t just similar or different—they’re related by consistent spatial transformations. And it had learned this from nothing but word co-occurrence, from the statistical patterns in billions of words of text.
“We were shocked,” Mikolov later said. “We didn’t expect the structure to be so clean.”
This chapter is about what neural networks actually learn. Not the weights and biases—those are just numbers. But the representations: the internal structures that encode meaning, that capture patterns, that make generalization possible. Representations are where the magic happens. They’re why a system trained on specific examples can say something meaningful about examples it has never seen.
The Deep Puzzle
Here is the puzzle at the heart of machine learning: how can a system that only sees specific examples generalize to new ones?
A child sees perhaps a few hundred cats before the concept solidifies. Yet she can then recognize any cat—breeds she’s never encountered, poses she’s never seen, cats partially hidden behind furniture, cats in photographs, cats in cartoons. She hasn’t memorized individual cats. She’s learned something more abstract, something that applies to cats she’ll encounter for the rest of her life.
Neural networks face the same challenge, at scale. Train a network on a million images of cats and dogs. The goal isn’t to memorize those million images—any database could do that. The goal is to correctly classify the million-and-first image, one the network has never seen.
How is this possible? The network has only seen specific images. How can it say anything about images it hasn’t seen?
The answer is representation. The network doesn’t store raw images. It transforms images into internal representations that capture what matters for the task—the features that distinguish cats from dogs—while discarding what doesn’t matter, like the specific background or lighting. When a new image arrives, the network transforms it into this representation space. If the representation is good, similar things end up nearby. The new cat lands near other cats. Classification becomes geometry.
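In code, the idea reduces to geometry over feature vectors. The toy sketch below classifies a new point by finding the class whose training examples lie closest in representation space; the encoder that produces the vectors is assumed rather than shown, and the numbers are fabricated placeholders.

```python
# Toy sketch: classification as geometry in representation space.
# Assumes some pretrained encoder has already mapped images to feature vectors;
# the random vectors below stand in for those encoder outputs.
import numpy as np

def nearest_class(new_embedding, class_embeddings):
    """Assign the new point to the class whose average representation is closest."""
    best_label, best_dist = None, float("inf")
    for label, vectors in class_embeddings.items():
        centroid = vectors.mean(axis=0)                 # average representation of the class
        dist = np.linalg.norm(new_embedding - centroid)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label

rng = np.random.default_rng(0)
class_embeddings = {
    "cat": rng.normal(loc=0.0, scale=0.1, size=(50, 128)),   # pretend encoder outputs
    "dog": rng.normal(loc=1.0, scale=0.1, size=(50, 128)),
}
new_image = rng.normal(loc=0.05, scale=0.1, size=128)        # lands near the "cat" cluster
print(nearest_class(new_image, class_embeddings))            # -> cat
```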
But this just pushes the question back. What makes a representation good? Why do neural networks learn representations that generalize, rather than representations that just memorize the training data?
Seeing Into Networks
In 2013, the same year Mikolov discovered the geometry of words, two researchers at NYU set out to understand what image networks actually learn.
Matthew Zeiler, a PhD student of the computer vision researcher Rob Fergus, had worked with Geoffrey Hinton, one of the founders of deep learning, before coming to New York. Together, Zeiler and Fergus developed a technique for visualizing what each neuron in a deep network responds to. The method was clever: run the network backward, tracing which input patterns cause each internal neuron to activate.
What they found was a hierarchy, built layer by layer from simplicity to complexity.
Neurons in the first layer responded to edges. Simple boundaries between light and dark, oriented at various angles. Vertical edges. Horizontal edges. Diagonals. The network had independently discovered what neuroscientists had found in the visual cortex decades earlier: edge detection is fundamental to vision.
Neurons in the second layer combined edges into textures and corners. A corner detector fired when a vertical edge and a horizontal edge met at a point. A texture detector fired on repeating patterns of parallel edges. These neurons were building complexity from the edge primitives below.
Neurons in the third and fourth layers detected parts. Eyes—circular shapes with dark centers. Wheels—circles with spokes. Windows—rectangular grids. Fur—specific texture patterns. These weren’t object categories; they were components that appear across many objects. An eye detector fires on cats, dogs, and humans alike.
Neurons in the deepest layers responded to whole objects. Face detectors. Car detectors. Flower detectors. Building detectors. These neurons integrated the parts below into coherent object representations.
The hierarchy was learned, not programmed. Nobody told the network to detect edges first, then textures, then parts, then objects. It discovered this structure because hierarchical representation is useful—because this is how visual structure actually works. Objects are made of parts. Parts are made of textures and shapes. Shapes are made of edges. The network learned to see by learning these layers of abstraction.
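Zeiler and Fergus traced activations backward through the network; a much cruder look, but one anyone can reproduce, is to inspect a pretrained network's first-layer filters directly, since they typically resemble the oriented edge and color detectors described above. A sketch, assuming a recent version of torchvision:

```python
# Peek at the first-layer filters of a pretrained ResNet (not the Zeiler-Fergus
# deconvnet method, just the raw learned weights). Requires torchvision >= 0.13.
import torchvision

model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)
filters = model.conv1.weight.detach()        # shape (64, 3, 7, 7): 64 small RGB patches
print(filters.shape)

# Rescale each filter to [0, 1] so it can be displayed as a tiny image;
# plotted side by side, the 64 patches look like oriented edges and color blobs.
f = filters - filters.amin(dim=(1, 2, 3), keepdim=True)
f = f / f.amax(dim=(1, 2, 3), keepdim=True)
```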
Yann LeCun had glimpsed this principle a quarter century earlier, in his networks for reading handwritten digits. Early layers learned stroke detectors. Later layers learned digit shapes. But LeCun's networks were tiny by modern standards, trained on a few thousand images. The ImageNet networks, trained on millions of images, learned far richer hierarchies. The principle was the same, but the depth of structure was incomparably greater.
This hierarchy explains generalization. Two images of different cats—a tabby and a Persian, one sitting and one running, one in sunlight and one in shadow—might look very different as raw pixels. But their representations in the network’s deeper layers are similar. Both have cat-like faces, cat-like fur, cat-like body shapes. The network sees past surface variation to underlying structure. New cats are recognized because they share this structure with cats the network has seen before.
The Manifold Hypothesis
There’s a mathematical reason why learned representations work, and it’s one of the most profound ideas in modern machine learning.
Consider a simple grayscale image, 256 by 256 pixels. Each pixel takes a value from 0 to 255. The space of all possible images—every possible combination of pixel values—is unimaginably vast. There are more possible images than atoms in the observable universe, by a factor that has more digits than this book has pages.
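The arithmetic behind that claim is easy to verify. Each of the 65,536 pixels independently takes one of 256 values, so the number of possible images is 256 raised to the 65,536th power; a two-line computation shows how many digits that number has.

```python
# Back-of-the-envelope check of the counting claim.
import math

pixels = 256 * 256                            # 65,536 pixels
digits = int(pixels * math.log10(256)) + 1    # digits in 256 ** 65536
print(digits)                                 # about 157,827
# For comparison, the atoms in the observable universe number roughly 10 ** 80,
# a figure with only 81 digits.
```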
But almost none of those possible images look like anything. Set pixel values randomly and you get static: gray noise without structure. Natural images, such as photographs of faces, animals, and landscapes, occupy a vanishingly small fraction of the space of all possible images.
This is the manifold hypothesis: high-dimensional data lies on low-dimensional structure.
A manifold is simply a space that looks flat when you zoom in close enough. Think of the surface of the Earth: it seems flat locally even though it curves globally. The Earth exists in three-dimensional space, but its surface is two-dimensional; you can specify any point on it with just two numbers, latitude and longitude. The surface is a two-dimensional manifold embedded in three-dimensional space.
Natural images are similar. The same 256×256 grayscale image from our earlier example requires 65,536 pixel values to specify. But the actual variety of natural images—the ways faces can vary, the ways landscapes can differ—might be captured by far fewer underlying dimensions. The “manifold” of face images might be parameterized by things like age, gender, expression, lighting angle, head pose. A few dozen parameters might span most of the variation in millions of faces.
When a neural network learns good representations, it’s learning to map high-dimensional inputs onto this low-dimensional manifold. It’s discovering the true factors of variation—the small number of dimensions that actually matter. Once you’re on the manifold, generalization becomes simple: nearby points on the manifold are genuinely similar, because they share underlying structure.
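A toy experiment makes this concrete. The sketch below generates data that sits in a thousand dimensions but is secretly driven by just three underlying factors, then uses a simple linear analysis to recover that hidden dimensionality. Real image manifolds are curved and networks learn the mapping nonlinearly, so treat this as an illustration of the principle rather than the mechanism.

```python
# Toy illustration of the manifold idea: 1,000-dimensional data generated
# from only 3 underlying factors, plus a little noise.
import numpy as np

rng = np.random.default_rng(0)
n_samples, ambient_dim, true_dim = 500, 1000, 3

factors = rng.normal(size=(n_samples, true_dim))      # the "true" low-dimensional coordinates
mixing = rng.normal(size=(true_dim, ambient_dim))     # embed them in 1,000 dimensions
data = factors @ mixing + 0.01 * rng.normal(size=(n_samples, ambient_dim))

# The singular values reveal how many directions carry real variation:
# three large values, then a sharp drop to the noise floor.
singular_values = np.linalg.svd(data - data.mean(axis=0), compute_uv=False)
print(singular_values[:6].round(2))
```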
The manifold hypothesis offers an intuition for why deep learning works despite the curse of dimensionality, the observation that covering a high-dimensional space requires exponentially more data as the number of dimensions grows. If data actually lived throughout the full high-dimensional space, learning would be impossible. But data doesn't fill the space; it clusters on low-dimensional structures. Networks learn to find these structures.
Representations That Transfer
If representations capture genuine structure, they should be useful beyond the task they were trained for. A network that has learned edges, textures, and object parts for ImageNet classification has learned something about images in general—something that should help with other image tasks.
In 2014, Jason Yosinski and colleagues at Cornell tested this hypothesis. They took networks trained on ImageNet and measured how well the learned features transferred to completely different tasks.
The results transformed the field.
Networks trained on photographs of everyday objects—dogs, cars, furniture—produced features that dramatically improved performance on medical images. Chest X-rays, retinal scans, pathology slides: domains with completely different visual content benefited from representations learned on ImageNet. Edge detectors trained on photographs detected edges in X-rays. Texture analyzers trained on animal fur detected tissue textures in microscopy.
The practical impact was immediate. Training a deep network from scratch requires millions of labeled images, months of compute time, and specialized expertise. But fine-tuning a pretrained network—freezing the lower layers and retraining only the top—requires just hundreds or thousands of examples. A hospital with a few thousand labeled scans could build a diagnostic system by leveraging representations learned from millions of photographs.
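The recipe is short enough to sketch. Assuming PyTorch and torchvision are available, adapting a pretrained ImageNet network to a new two-class task looks roughly like this; the class count, dataset, and training loop are placeholders.

```python
# Sketch of transfer learning by fine-tuning only the top of a pretrained network.
import torch
import torch.nn as nn
import torchvision

model = torchvision.models.resnet18(weights=torchvision.models.ResNet18_Weights.DEFAULT)

# Freeze everything: the ImageNet-learned edge, texture, and part detectors stay fixed.
for param in model.parameters():
    param.requires_grad = False

# Replace only the final classification layer, sized for the new task.
num_classes = 2                                   # e.g., "disease" vs. "healthy" (placeholder)
model.fc = nn.Linear(model.fc.in_features, num_classes)

# Only the new layer's parameters are handed to the optimizer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()
# A short training loop over a few thousand labeled examples would go here.
```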
Transfer learning became the default approach. Stanford researchers used ImageNet features to detect diabetic retinopathy in eye scans. Conservation biologists used them to identify species in camera trap images. Satellite companies used them to analyze land use from orbital photographs. The representations transferred because visual structure is universal: edges are edges, textures are textures, whether you’re looking at a cat or a tumor or a forest.
The same principle revolutionized language. When BERT appeared in 2018, trained to fill in masked words across billions of words of text, its representations transferred to dozens of downstream tasks. Question answering, sentiment analysis, textual entailment, named entity recognition: tasks with different goals and different data all benefited from BERT's learned understanding of language structure.
Representations transfer because structure is real. The patterns that distinguish meaningful images from noise, meaningful sentences from gibberish, are the same patterns regardless of the specific task. Good representations capture these patterns. And patterns are everywhere.
Meaning as Geometry
Why did Mikolov’s word vectors work so well? The “king − man + woman ≈ queen” result wasn’t a trick or a coincidence. It reflected something real about how the network had organized knowledge.
When you train a network to predict words from context, it must learn which words are similar—which words can substitute for each other in similar contexts. “King” and “queen” appear in similar contexts: sentences about royalty, monarchy, rulers. So do “man” and “woman”: sentences about people, gender, relationships. The network places similar words nearby in its representation space.
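The training signal itself is simple to write down. The toy sketch below implements a stripped-down version of that objective on a made-up one-line corpus: each word's vector is nudged toward the vectors of words that appear near it and away from a randomly chosen word. The real system adds many refinements, so this illustrates the objective rather than reproducing Mikolov's code.

```python
# Toy word2vec-style training: predict-nearby-words, reduced to its essentials.
import numpy as np

rng = np.random.default_rng(0)
corpus = "the cat sat on the mat while the dog slept on the rug".split()
vocab = sorted(set(corpus))
idx = {w: i for i, w in enumerate(vocab)}
dim, lr = 16, 0.05
W_word = rng.normal(scale=0.1, size=(len(vocab), dim))   # word vectors being learned
W_ctx = rng.normal(scale=0.1, size=(len(vocab), dim))    # separate context vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

for _ in range(200):                                      # many passes over the tiny corpus
    for pos, center in enumerate(corpus):
        window = corpus[max(0, pos - 2):pos] + corpus[pos + 1:pos + 3]
        for ctx in window:
            c, o = idx[center], idx[ctx]
            n = int(rng.integers(len(vocab)))             # one random "negative" word
            g_pos = sigmoid(W_word[c] @ W_ctx[o]) - 1.0   # pull toward the true neighbor
            g_neg = sigmoid(W_word[c] @ W_ctx[n])         # push away from the random word
            grad_c = g_pos * W_ctx[o] + g_neg * W_ctx[n]
            W_ctx[o] -= lr * g_pos * W_word[c]
            W_ctx[n] -= lr * g_neg * W_word[c]
            W_word[c] -= lr * grad_c

# Rows of W_word are now the embeddings; trained on billions of words instead of
# thirteen, nearby rows end up belonging to words used in similar contexts.
```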
But the geometry goes deeper. The network learns that certain directions in space correspond to consistent transformations. The direction from “man” to “king” is the same as the direction from “woman” to “queen”—a direction that might be labeled “royalty” or “ruler of.” The direction from “France” to “Paris” is the same as from “Italy” to “Rome”—the “capital of” direction. The direction from “walk” to “walked” is the same as from “jump” to “jumped”—the past tense direction.
These directions weren’t programmed. They emerged from learning. The network discovered, purely from patterns of word co-occurrence, that concepts have geometric relationships. Analogy is arithmetic in representation space.
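Once vectors are in hand, the analogy test is a few lines of arithmetic. The sketch below assumes the vectors have already been loaded into a dictionary of arrays, say from a published word2vec or GloVe file; the loading step is omitted.

```python
# Analogy as arithmetic in representation space.
import numpy as np

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def analogy(vectors, a, b, c):
    """Return the word closest to vector(b) - vector(a) + vector(c)."""
    target = vectors[b] - vectors[a] + vectors[c]
    candidates = (w for w in vectors if w not in {a, b, c})   # exclude the query words
    return max(candidates, key=lambda w: cosine(vectors[w], target))

# With good vectors loaded into `vectors` (a dict of word -> numpy array):
# analogy(vectors, "man", "king", "woman")     -> "queen"
# analogy(vectors, "france", "paris", "italy") -> "rome"
```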
Later research extended this principle far beyond words. Image networks learn representations where directions correspond to visual attributes. Add a “glasses” direction to a face representation, and you get the same face wearing glasses. Add an “aging” direction, and the face grows older. Add a “smile” direction, and the expression changes. The geometry of meaning isn’t limited to language; it’s how neural networks organize knowledge in general.
This discovery—that concepts have geometry, that relationships are directions, that analogy is arithmetic—was foundational for modern AI. It showed that learned representations aren’t just useful; they’re meaningful. They capture structure that aligns with how humans think about concepts and relationships. The networks learned, without explicit instruction, to organize knowledge in ways that support reasoning.
What Representations Don’t Capture
Representation learning changed how we think about meaning. Instead of symbols and rules, neural networks organize concepts as points in high-dimensional space. Similar things lie close together. Dissimilar things are pushed apart. Meaning becomes geometry.
But geometry has limits.
Compositional generalization exposes one of them. Human concepts compose naturally. If you know red and you know square, you can imagine a red square without ever having seen one. Composition, for humans, is rule-based recombination.
Early neural networks struggled badly here. Train a model on red circles and blue squares, and it would often fail on red squares—a combination trivial for humans but absent from the training set. What the model learned were correlations embedded in its experience, not the generative rules behind them.
Modern networks do better. Scale helps. Attention helps. Self-supervised training exposes models to a vast range of contexts. Many combinations that once lay outside the training distribution are now merely rare. When the space is densely sampled, recombination often works.
But the improvement is geometric. Today’s models usually succeed by interpolating across a well-shaped manifold, not by applying explicit compositional rules. When a combination lies genuinely beyond the regions shaped by training—when extrapolation is required rather than interpolation—failures still appear, and often abruptly.
The gap has narrowed. It has not disappeared.
Causal reasoning. Representations capture correlations, not causes. A network might learn that umbrellas correlate with wet streets, but it doesn’t know that rain causes both. In the training distribution, this doesn’t matter—the correlations are reliable. But when conditions change, when the correlations break, networks fail in ways that humans wouldn’t. They’ve learned what goes together, not why.
Brittleness. Small changes to inputs can cause large changes in representations. Add carefully crafted noise to an image—noise invisible to humans—and the network’s classification can flip from “cat” to “toaster” with high confidence. The representations, while useful, don’t capture robustness. They’ve learned the typical patterns but not the essential ones.
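The "carefully crafted noise" has a standard recipe: the fast gradient sign method, from the Goodfellow paper cited in the notes, perturbs every pixel a tiny step in whichever direction most increases the network's loss. A minimal sketch in PyTorch, with the trained model and the labeled input assumed:

```python
# Minimal sketch of the fast gradient sign method (FGSM).
import torch
import torch.nn.functional as F

def fgsm_perturb(model, image, label, epsilon=0.01):
    """Nudge every pixel a small step in the direction that increases the loss."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)      # how wrong is the model right now?
    loss.backward()                                   # gradient of the loss w.r.t. the pixels
    perturbed = image + epsilon * image.grad.sign()   # imperceptible but adversarial nudge
    return perturbed.clamp(0.0, 1.0).detach()

# `model`, `image` (a 1x3xHxW tensor), and `label` are placeholders here. After the
# call, model(fgsm_perturb(model, image, label)).argmax() can differ from the original
# prediction even though the two images look identical to a person.
```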
These limitations share a source. Representations learn from data, and they cannot transcend what the data contains. They find patterns, but they don’t understand why those patterns exist. They generalize to similar data, but they can’t reason about data that’s structured differently.
Closing: Structure, Everywhere
This chapter began with a researcher staring at numbers, discovering that meaning has geometry. It ends with a deeper question: what do neural networks actually learn?
They learn representations. Internal structures that capture patterns in data. Hierarchies of features, from edges to objects. Geometries of concepts, where relationships are directions in space. Manifolds of variation, where high-dimensional data reveals low-dimensional structure.
Good representations enable generalization because they capture what matters and discard what doesn’t. They enable transfer because structure is universal—the same edges, textures, and patterns appear across domains. They emerge from learning because learning is, fundamentally, the discovery of structure.
Mikolov’s word vectors encoded analogy as arithmetic. Zeiler and Fergus’s visualizations revealed hierarchies of features. Transfer learning showed that representations travel across domains. These weren’t separate discoveries; they were facets of the same deep truth.
The networks hadn’t memorized facts. They’d learned to recognize structure. And structure was everywhere.
Notes and Further Reading
On Word Vectors
Tomas Mikolov's Word2Vec papers, "Efficient Estimation of Word Representations in Vector Space" and "Distributed Representations of Words and Phrases and their Compositionality" (both 2013), introduced the techniques that revealed meaning as geometry. The "king − man + woman ≈ queen" result captured the popular imagination and launched an industry of embedding methods. For an accessible explanation, Christopher Olah's blog posts on word embeddings remain excellent.
On Feature Visualization
Matthew Zeiler and Rob Fergus’s “Visualizing and Understanding Convolutional Networks” (2014) systematically explored what each layer of a deep network learns. Chris Olah and colleagues at Google later created “Feature Visualization” (Distill, 2017), an interactive exploration that lets you see network features directly. These visualizations made the abstract concrete—you can literally see the edge detectors and texture analyzers that networks learn.
On the Manifold Hypothesis
The mathematical foundations trace to work on dimensionality reduction and topology. For intuition without mathematics, the key idea is simple: natural data has structure, and that structure is lower-dimensional than the space data lives in. Networks learn to find this structure. Yoshua Bengio’s “Representation Learning: A Review and New Perspectives” (2013) provides technical grounding.
On Transfer Learning
Jason Yosinski’s “How transferable are features in deep neural networks?” (2014) established that ImageNet representations improve performance across domains. The practical impact was transformative—transfer learning made deep learning accessible to fields without massive datasets. The success of pretrained language models (BERT, GPT, etc.) extended the principle to text.
On Limitations
For compositional generalization, Brenden Lake's work on the SCAN benchmark and his paper with Marco Baroni, "Generalization without Systematicity" (2018), demonstrate failures that highlight what networks don't learn. For adversarial examples, Ian Goodfellow's "Explaining and Harnessing Adversarial Examples" (2015) explores the brittleness of learned representations. Gary Marcus offers a persistent philosophical critique of what neural networks actually understand.