The Paradigm Shift
The turning point. Why showing examples beats writing rules, and how the paradigm shifted from specification to training.
This is Chapter 4 of A Brief History of Artificial Intelligence.
In 1958, a psychologist named Frank Rosenblatt built something that would change everything—though it would take fifty years for anyone to realize it.
The machine was called the Perceptron. It filled a room at the Cornell Aeronautical Laboratory—photocells for eyes, electric motors for adjustable connections, all wired directly from input to output. The Perceptron is the simplest type of neural network, consisting of a single layer of weights that connect inputs to outputs. The inputs themselves aren’t considered a layer—they’re just the incoming data. Rosenblatt would show it images—crude black and white patterns on cards—and the machine would learn to recognize them.
Not by being programmed to recognize them. By being shown examples.
The Perceptron started knowing nothing. You’d show it a card with a pattern and tell it what it was. The machine would guess, get it wrong, and adjust its internal settings slightly. Show it another card. Another adjustment. Thousands of examples, thousands of tiny adjustments. Gradually, through nothing but exposure to examples and feedback, the machine learned to classify patterns it had never seen before.
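To make that training loop concrete, here is a minimal sketch of the perceptron learning rule in modern Python. The tiny dataset and the learning-rate value are illustrative choices, not a reconstruction of Rosenblatt's hardware.

```python
def perceptron_train(examples, epochs=20, lr=0.1):
    """Guess, compare with the label, nudge the weights: the perceptron learning rule."""
    n_inputs = len(examples[0][0])
    w = [0.0] * n_inputs      # one adjustable weight per input
    b = 0.0                   # bias, i.e. an adjustable threshold
    for _ in range(epochs):
        for x, target in examples:                     # target is 0 or 1
            guess = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            error = target - guess                     # -1, 0, or +1
            w = [wi + lr * error * xi for wi, xi in zip(w, x)]
            b += lr * error                            # one tiny adjustment per example
    return w, b

# A linearly separable pattern (logical OR): the rule finds a separating line.
examples = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 1)]
print(perceptron_train(examples))   # the learned weights and bias define that line
```

Each wrong guess nudges the weights a little toward the correct answer; thousands of such nudges add up to a classifier that nobody explicitly programmed.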
This was revolutionary. Every AI system before the Perceptron had been explicitly programmed. Researchers would analyze a problem, figure out the rules, write code to implement those rules. The machine did exactly what it was told, nothing more.
The Perceptron learned. You didn’t program its behavior—you trained it through experience.
Rosenblatt understood what this meant. In 1958, at a Navy press conference announcing the Perceptron, he made predictions so bold that The New York Times described the machine as “the embryo of an electronic computer” that the Navy “expects will be able to walk, talk, see, write, reproduce itself and be conscious of its existence.”
The press ate it up. Other researchers were less impressed—particularly Marvin Minsky and Seymour Papert, two of AI’s leading figures who were convinced that symbolic approaches were the future.
What Rosenblatt had built was a learning machine. The first real one. But it would take decades, a devastating critique, near extinction, patient resurrection, and a fundamental technological breakthrough before learning machines would prove him right.
The First Death
The Perceptron worked, within its limits. It could learn to distinguish simple patterns—circles from squares, horizontal lines from vertical lines, the letter X from the letter O. For problems where you could draw a single straight line to separate categories, it worked perfectly.
But for anything more complex, it failed. The classic example was XOR—exclusive or. Given two inputs, output true if one or the other is true, but not both.
Visualize it this way: imagine a piece of graph paper. Put white dots in the top-left and bottom-right corners. Put black dots in the top-right and bottom-left corners—a checkerboard pattern. Now try to draw a single straight line that separates all the white dots from all the black dots.
You can’t. Draw a line to isolate the top-left white dot, and you’ll cut right through the other corners. You’d need two lines, or a curved line, or something more complex.
Rosenblatt’s Perceptron was the mathematical equivalent of only being able to draw one straight line. It could learn any pattern where a single line could separate the categories—but the checkerboard defeated it. It couldn’t solve XOR. No single-layer neural network could. This was a mathematical fact.
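For readers who want the algebra behind that fact: a single-line classifier outputs “true” exactly when a weighted sum w1·x1 + w2·x2 + b is greater than zero (this notation is modern shorthand, not Rosenblatt's or Minsky's). XOR would require all four of the following conditions at once:

```latex
\begin{aligned}
(0,0)\ \text{must map to false}: &\quad b \le 0 \\
(1,0)\ \text{must map to true}:  &\quad w_1 + b > 0 \\
(0,1)\ \text{must map to true}:  &\quad w_2 + b > 0 \\
(1,1)\ \text{must map to false}: &\quad w_1 + w_2 + b \le 0
\end{aligned}
```

Add the two middle lines and you get w1 + w2 + 2b > 0; since b ≤ 0, that forces w1 + w2 + b > 0, which contradicts the last line. No choice of weights works. This is the algebraic version of the checkerboard picture.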
In 1969, Minsky and Papert published “Perceptrons,” a rigorous mathematical analysis of what single-layer neural networks could and couldn’t learn. The book was careful and correct: it showed the limitations of simple neural networks clearly.
But it was widely interpreted as showing that neural networks in general were a dead end.
Minsky and Papert didn’t quite say that. They acknowledged that multi-layer networks might overcome these limitations. But they also suggested—correctly, at the time—that nobody knew how to train multi-layer networks effectively. And they emphasized the limitations loudly enough that funding for neural network research evaporated almost overnight.
This was 1969. The AI winter wouldn’t begin for another five years, but for neural networks, winter came early. Research moved to symbolic AI. The few researchers who continued working on neural networks found themselves marginalized, unfunded, unfashionable.
Frank Rosenblatt never saw neural networks’ resurrection. He died in a boating accident in 1971, at age 43. The Perceptron seemed to die with him.
But a handful of researchers kept working. Quietly, persistently, convinced that learning from examples was the right approach even if the immediate path forward wasn’t clear.
They would be proven right. But not yet.
The Missing Piece
The problem Minsky and Papert identified was real: single-layer neural networks—connecting input directly to output, like the Perceptron—had fundamental limitations. The solution seemed obvious: add more layers. Create networks where simple pattern detectors in early layers feed into more complex pattern detectors in later layers, building up hierarchical representations of increasing abstraction.
In principle, multi-layer networks could learn anything. The question was how to train them.
This wasn't just intuition—it was mathematics. In 1989, George Cybenko proved what's now called the Universal Approximation Theorem: a neural network with even one hidden layer—unlike the original Perceptron which had none—can approximate any continuous function to arbitrary precision, given enough neurons. Kurt Hornik and colleagues extended this result shortly after. The mathematics was clear: neural networks are universal function approximators.
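Stated a little more formally, in roughly the form Cybenko proved it (for inputs confined to a bounded region such as the unit cube):

```latex
G(x) \;=\; \sum_{i=1}^{N} \alpha_i\, \sigma\!\left(w_i^{\top} x + b_i\right),
\qquad
\max_{x \in [0,1]^n} \bigl|\, G(x) - f(x) \,\bigr| \;<\; \varepsilon
```

For any continuous target function f and any tolerance ε, some finite number of hidden units N, with suitable input weights wᵢ, biases bᵢ, output weights αᵢ, and a sigmoidal activation σ, gets within ε of f everywhere on that region. The theorem promises that such a network exists; it is silent on how large N must be or how to find the weights.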
What this means in practical terms: Any pattern you can describe mathematically—any relationship between inputs and outputs, any classification boundary, any transformation—can be learned by a neural network with sufficient capacity. The network doesn’t need to be given the function explicitly. It can discover it through training.
This was profound. It meant that the limitations of single-layer perceptrons weren’t fundamental to neural networks—they were specific to the single-layer architecture. With depth, networks gained universal representational power. They could, in principle, learn to recognize any pattern, approximate any function, represent any relationship.
But “in principle” hid enormous practical challenges. The theorem guaranteed that a solution exists—that there’s some configuration of weights that would work. It didn’t say how to find those weights. It didn’t address how much data you’d need, how long training would take, or whether the learning process would even converge to a good solution.
Most critically, it didn’t solve the credit assignment problem.
Training a single-layer perceptron was straightforward. You knew which output neuron fired, you knew what it should have done, and you could adjust the weights directly. But in a multi-layer network, the hidden layers—the neurons between input and output—posed a problem. When the network made a mistake, how much was each hidden neuron responsible? How should you adjust weights you couldn’t directly observe?
This was the credit assignment problem. When something goes wrong, how do you assign credit or blame to the internal components?
The Universal Approximation Theorem told you that a solution existed somewhere in the space of possible weight configurations. But it didn’t tell you how to navigate that vast space to find it.
The answer came from several researchers working independently in the 1970s and early 1980s. The key insight: you could use calculus. Specifically, you could use the chain rule to propagate the error backward through the network, from output to input, calculating how much each weight contributed to the final error.
This technique came to be called backpropagation. The basic idea had been discovered and rediscovered multiple times—by Paul Werbos in his 1974 PhD thesis (largely ignored), by others in the late 1970s, and most influentially by David Rumelhart, Geoffrey Hinton, and Ronald Williams in a 1986 paper that would become one of the most cited papers in AI.
Backpropagation solved the credit assignment problem. You could now train multi-layer networks. You could learn hierarchical representations. The limitations Minsky and Papert identified could be overcome.
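To make the mechanics concrete, here is a minimal sketch of backpropagation training a tiny two-layer network on XOR, the very problem that defeated the single-layer Perceptron. The layer sizes, learning rate, and iteration count are illustrative choices rather than anyone's published recipe.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)        # XOR targets

W1 = rng.normal(size=(2, 4))                            # input -> hidden layer
b1 = np.zeros(4)
W2 = rng.normal(size=(4, 1))                            # hidden -> output
b2 = np.zeros(1)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
lr = 0.5

for step in range(20000):
    # Forward pass: compute the network's current guesses.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: the chain rule traces the error from the output back
    # through the hidden layer, assigning each weight its share of the blame.
    d_out = (out - y) * out * (1 - out)                 # error signal at the output
    d_h = (d_out @ W2.T) * h * (1 - h)                  # error signal at the hidden layer

    # Nudge every weight against its contribution to the error.
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0)

# Typically prints values close to [0, 1, 1, 0]; an unlucky random start can
# stall in a poor solution, in which case a different seed usually fixes it.
print(out.round(2).ravel())
```

The hidden layer is what the single-layer Perceptron lacked: it lets the network carve the checkerboard with more than one line. Backpropagation is what makes those hidden weights trainable at all.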
Neural networks weren’t dead. They just needed backpropagation to truly come alive.
The Quiet Resurrection
The 1986 paper by Rumelhart, Hinton, and Williams didn’t cause an immediate revolution. The AI winter was in full effect. Funding was scarce. Symbolic AI, despite its problems, still dominated. Neural networks were still unfashionable.
But a small community of researchers took notice. Throughout the late 1980s and 1990s, neural networks with backpropagation proved useful for specific tasks. Handwritten digit recognition. Speech recognition. Control systems. These were practical applications, not toy problems.
Yann LeCun, working at Bell Labs, developed convolutional neural networks—specialized architectures for image recognition that exploited the structure of visual data. By the mid-1990s, these networks were reading zip codes on mail for the US Postal Service. Real work, real deployment.
But these were still narrow applications. The networks were small, shallow—only a few layers deep. They worked for specific problems but showed no sign of achieving anything like general intelligence.
Two fundamental problems remained: computation and data.
The computation problem: Training neural networks required enormous numbers of calculations. Adjusting millions of weights through millions of examples meant billions or trillions of mathematical operations. The computers of the 1990s could barely handle it. Training even moderately sized networks took days or weeks.
The data problem: Learning from examples requires lots of examples. But collecting, labeling, and organizing training data was expensive and time-consuming. You needed thousands of images each carefully labeled, thousands of speech samples each transcribed. For most problems, that much labeled data simply didn’t exist.
These weren’t theoretical problems. They were practical limitations that kept neural networks confined to specific niches. The approach was right, but the technology wasn’t ready.
What neural network researchers needed was more computational power and more data. A lot more of both.
They would get both. But not until the 2000s.
The Conditions Align
Two technological developments changed everything: GPUs and the internet.
GPUs—Graphics Processing Units—were designed for video games. They needed to render millions of pixels at high frame rates, which meant performing millions of simple calculations in parallel. It turned out that the matrix multiplications required for neural network training were exactly the kind of operations GPUs did well.
Around 2009, researchers started using GPUs for deep learning. Training that took weeks on CPUs now took days or hours. Networks that were impractically large suddenly became feasible. The computation bottleneck was breaking.
The internet provided data. Millions of images posted online. Videos uploaded every minute. Text from websites, books, articles. User interactions, click patterns, search queries. The internet was generating training data at unprecedented scale, and much of it was accessible.
In 2009, Fei-Fei Li and her team at Stanford released ImageNet—a labeled-image dataset that would eventually grow to more than 14 million images across roughly 20,000 categories. Building it took years of effort and thousands of hours of human labor, much of it crowdsourced labeling. But once created, it gave researchers an unprecedented resource for training and testing image recognition systems.
ImageNet came with an annual competition: who could build the system that most accurately classified images from a held-out test set? The competition started in 2010. The best systems, using traditional computer vision techniques, achieved around 72% accuracy.
Then came 2012.
The Breakthrough and a Rebranding
In September 2012, Geoffrey Hinton’s team from the University of Toronto entered the ImageNet competition with a deep convolutional neural network they called AlexNet. It wasn’t the first deep neural network for image recognition, but it was the first to combine several key ingredients: a deeper architecture, GPU training, clever regularization techniques, and an activation function called ReLU, short for Rectified Linear Unit. The rule is simple: if the input is negative, output zero; if zero or positive, pass it through unchanged. That simplicity, replacing the more complex functions used before, made networks faster to train and less prone to getting stuck during learning.
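In symbols (with the smoother sigmoid it displaced shown for comparison; the sigmoid’s flattening tails are one reason deeper networks trained with it tended to get stuck):

```latex
\mathrm{ReLU}(x) = \max(0,\, x),
\qquad
\text{sigmoid}(x) = \frac{1}{1 + e^{-x}}
```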
The results were shocking. AlexNet achieved 85% accuracy, crushing the previous year’s winner by more than 10 percentage points. It wasn’t a marginal improvement—it was a leap.
More importantly, the errors AlexNet made looked different. Previous systems made bizarre mistakes—calling a dog a car, a tree a building. These were failures of understanding that suggested the system had no idea what it was looking at.
AlexNet’s mistakes were more subtle. It would confuse similar dog breeds, mix up species of cats, mistake types of vehicles. These were the kinds of mistakes a human might make—errors that suggested the system actually was learning something meaningful about the visual world.
The computer vision community took notice immediately. By 2013, almost every competitive entry used deep learning. By 2015, deep learning systems surpassed human-level performance on ImageNet. The paradigm had shifted.
Hinton had a name for what was happening: “Deep Learning.” He’d been using the term since 2006, when he published a paper showing how to train networks with many layers. The name was deliberate—a rebranding. “Neural Networks” reminded people of failed promises and the Minsky-Papert critique. “Deep Learning” emphasized what was new: depth. Layer upon layer, each learning representations more abstract than the last. Pixels become edges become textures become objects become scenes. The depth wasn’t metaphorical. It was the whole point.
And ImageNet was just the beginning. The same approach—deep neural networks trained on large datasets with GPU acceleration—started working for everything. Speech recognition. Machine translation. Game playing. Natural language processing.
Each breakthrough followed a similar pattern: someone would take a problem that had resisted decades of traditional AI approaches, throw a deep neural network at it with enough data and computation, and suddenly achieve results that seemed impossible a few years earlier.
The learning approach wasn’t just working. It was working better than anyone had expected.
Why Learning Wins
To understand why learning from examples beats programming rules, consider what these systems actually do.
Programming intelligence means specifying behavior in advance. The programmer analyzes the problem, identifies the relevant features, writes explicit rules for how to combine those features, and implements those rules in code. This works beautifully when the programmer’s analysis is correct and the problem is well-understood.
But most interesting problems—vision, language, motor control—are too complex for this approach. The “rules” are too numerous, too context-dependent, too subtle. We can’t articulate them because we don’t consciously know them. When you recognize a friend’s face, you’re not following a checklist of features. You’re doing something more holistic, more pattern-based, more implicit.
Learning from examples means extracting patterns from data. You don’t tell the system what features matter or how to combine them. You show it examples and let it discover the patterns. The system learns internal representations—ways of organizing and transforming information—that capture the statistical regularities in the training data.
The Universal Approximation Theorem provides the mathematical foundation for why this works. Neural networks aren’t just good at learning certain types of functions—they can approximate any function. This means that if there’s a pattern in the data, a relationship between inputs and outputs, a neural network can learn to capture it. The capacity is there, guaranteed by mathematics.
This works because:
1. Patterns exist that we can’t articulate. A child learns to speak by hearing language, not by studying grammar rules. The patterns of language are real and learnable, but they’re too complex and subtle to fully specify. Neural networks can learn these patterns the same way humans do—through exposure and practice. And the Universal Approximation Theorem guarantees they have the capacity to represent these patterns, even if we can’t write them down explicitly.
2. Hierarchical abstraction is powerful. Deep networks learn hierarchies of features. Early layers learn simple patterns (edges, textures, basic shapes). Middle layers combine these into parts (eyes, wheels, door handles). Late layers combine parts into objects (faces, cars, buildings). This hierarchical organization mirrors how we think the brain works. Each layer learns a function that transforms representations, and the theorem tells us these transformations can approximate any necessary mapping.
3. Transfer and generalization emerge automatically. When a network learns to recognize cats, it develops internal representations useful for recognizing other furry animals. When it learns language, it develops representations that capture syntax and semantics. These representations transfer to related tasks without additional programming. The network discovers general-purpose functions that work across contexts.
4. Scale unlocks capability. With enough data and computation, learning systems discover patterns and representations that even their creators don’t fully understand. The systems aren’t just memorizing—they’re extracting the underlying structure of the problem domain. The Universal Approximation Theorem says the capacity is there; sufficient data and training reveal that capacity.
The theorem explains the potential; training data and algorithms determine what gets realized. You could, in principle, hand-craft weights to implement any function. But you’d need to know exactly what function to implement. Learning discovers that function from examples. It navigates the vast space of possible weight configurations to find ones that work—not by being told the answer but by optimizing performance on training data.
This is fundamentally different from programming. You’re not implementing your understanding of the problem. You’re creating a system that develops its own understanding through experience—discovering functions that you couldn’t have specified explicitly but that the network can learn to approximate.
The Cost of Learning
This shift from programming to learning wasn’t free. It came with trade-offs that we’re still grappling with.
First, opacity. Programmed systems are transparent. You can read the code, trace the logic, understand exactly why a decision was made. Neural networks are opaque. You can inspect the weights, watch the activations, but you can’t easily explain why the network classified an image as a cat or translated a sentence a particular way. The representations are distributed across millions of parameters. The decision-making is emergent, not explicit.
Second, brittleness of a different kind. Symbolic systems failed on inputs they weren’t programmed for. Neural networks fail on inputs that are adversarially chosen—carefully constructed to exploit weaknesses in what the network learned. Show a neural network an image of a cat with imperceptible noise added, and it might confidently classify it as a dog. The network is vulnerable in ways we don’t fully understand.
Third, data dependence. Learning systems are only as good as their training data. If the data contains biases, the system learns those biases. If the data doesn’t cover certain cases, the system won’t handle those cases well. You can’t fix this by editing code—you have to change the training data and retrain, which is expensive.
Fourth, computational cost. Training state-of-the-art neural networks requires enormous computational resources. GPT-3, released in 2020, cost an estimated $4-12 million to train. Three years later, the cost of training GPT-4 soared to over $100 million. Only large companies or well-funded research labs can afford to train the largest models. This concentrates power in ways that symbolic AI never did.
Fifth, theoretical uncertainty. We don’t have a complete theory of why deep learning works as well as it does. We have intuitions, empirical observations, and partial mathematical analyses. But we’re building systems whose capabilities sometimes surprise even their creators. We’re doing engineering ahead of science.
These costs are real. They’re not reasons to reject learning—the results speak for themselves—but they’re reasons to proceed thoughtfully. We’ve traded explicit specification for emergent capability, transparency for power, and theoretical understanding for practical results.
Whether those trade-offs are worth it depends on your perspective. But by 2012, it was clear that learning worked in ways that programming never had.
The Paradigm Shifts
What changed between 1969 (when neural networks seemed dead) and 2012 (when they revolutionized AI)?
Not the fundamental idea. Rosenblatt’s insight—that machines could learn from examples—was right in 1958. Backpropagation, discovered in the 1970s and refined in the 1980s, provided the training algorithm. The core concepts were old.
What changed was:
Technology caught up to the idea. GPUs provided the computation needed to train large networks. The internet provided the data needed to train them effectively. These weren’t improvements to neural networks—they were changes to the environment that made neural networks practical.
Researchers persisted. Hinton, LeCun, Bengio, and others kept working on neural networks through the AI winter, through years of funding scarcity and professional marginalization. They believed learning was the right approach even when the field had moved on. Their persistence paid off when the technology was finally ready.
Results compounded. AlexNet’s success in 2012 proved deep learning worked. Other researchers adopted the approach. Each success made the paradigm more credible, attracting more researchers, leading to more breakthroughs. The field tipped—from skepticism to acceptance, from fringe to mainstream, in just a few years.
The shift was conceptual as much as technical. The deep learning revolution wasn’t just about better algorithms or more data. It was about accepting that intelligence might not be something you could specify explicitly. That learning from examples, rather than encoding rules, might be the path to general intelligence.
This was the paradigm shift. Not a new technique but a new way of thinking about the problem. Intelligence isn’t programmed—it emerges from learning. You don’t implement it—you cultivate it.
What We Learned
The shift from programming to learning taught us several things about intelligence:
Intelligence is statistical, not logical. Intelligence doesn’t operate on formal symbols through explicit rules. It recognizes patterns in high-dimensional data. It’s fundamentally about probabilities, correlations, and learned associations. Logic is something intelligence can do, not what intelligence is.
Representations are central. What you learn matters less than how you represent what you learn. Neural networks learn hierarchical representations—internal structures for organizing information. Good representations make hard problems easy. The quality of learned representations determines capability.
Learning is gradual and emergent. You don’t get intelligence from carefully crafted rules. You get it from millions of small adjustments, each improving performance slightly. Capability emerges from this process in ways that aren’t predictable beforehand. Small quantitative changes (more data, more layers, more training) lead to qualitative shifts in capability.
Data is the bottleneck. The limiting factor isn’t algorithms—the basic algorithms have been around for decades. It’s data. Systems learn from examples, so the quality and quantity of examples determine what can be learned. This makes data collection, curation, and labeling critical.
Theory lags practice. We can build systems that work before we fully understand why they work. This is uncomfortable for scientists but common in engineering. Empirical progress outpaces theoretical understanding, and that’s okay.
Scale matters more than we thought. Bigger networks, more data, longer training—these quantitative increases lead to qualitative improvements. There’s a scaling hypothesis: if learning approaches work better at scale, and we don’t know where they top out, then scaling might be the path to greater intelligence. This remains contentious but undeniably powerful.
The State of Play
By 2015, deep learning had transformed AI. The systems that dominated computer vision, speech recognition, machine translation, and natural language processing were all based on neural networks trained on large datasets.
But limitations remained. The networks were good at specific tasks they were trained for but couldn’t generalize broadly. They needed large labeled datasets, which were expensive to create. They were opaque and sometimes fragile. They learned correlations but didn’t truly “understand” in any deep sense.
Still, the paradigm had shifted. The question was no longer “Can learning work?” but “How far can learning take us?”
The next breakthroughs would push further. Transformers would unlock natural language understanding at scale. Reinforcement learning would enable systems that didn’t just predict but acted. Self-supervised learning would reduce data requirements. Each advance built on the foundation: learning from examples, not programming rules.
But all of these developments traced back to the core insight. To Rosenblatt’s Perceptron in 1958. To the patient researchers who kept working through the winters. To the technological developments that made massive computation feasible. And to the realization, arrived at through decades of trying and failing, that intelligence couldn’t be programmed but could be learned.
In retrospect, the shift seems obvious. Of course biological intelligence emerges from learning. Of course artificial intelligence would require the same. Of course explicit programming couldn’t capture the subtle, context-dependent, implicit knowledge required for general intelligence.
But it wasn’t obvious at the time. The most brilliant minds in AI spent decades pursuing symbolic approaches, convinced that intelligence was logical reasoning that could be formalized and programmed. They achieved remarkable things—theorem provers, expert systems, chess programs—before hitting fundamental limits.
The learning approach seemed crude by comparison. It was bottom-up where symbolic AI was top-down, statistical where symbolic AI was logical, implicit where symbolic AI was explicit. It required giving up control, accepting opacity, and trusting that patterns would emerge from data.
It worked.
Intelligence, it turned out, couldn’t be programmed. But it could be learned.
That realization changed everything. What comes next builds on this foundation. But this was the breakthrough that mattered most—the paradigm shift that made modern AI possible.
Notes & Further Reading
The Perceptron:
Rosenblatt, F. (1958). “The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain.” Psychological Review, 65(6), 386-408. Rosenblatt’s original paper is remarkably readable. His 1958 press conference and the subsequent media coverage are documented in various AI histories.
The Critique:
Minsky, M., & Papert, S. (1969). Perceptrons: An Introduction to Computational Geometry. MIT Press. This book was mathematically rigorous and technically correct about single-layer perceptron limitations. Its broader impact on the field was more about perception than content—many people who didn’t read it carefully assumed it proved neural networks were fundamentally limited.
Backpropagation:
Rumelhart, D.E., Hinton, G.E., & Williams, R.J. (1986). “Learning Representations by Back-propagating Errors.” Nature, 323, 533-536. This is the paper that made backpropagation famous, though the technique had been discovered earlier, notably by Werbos (1974), and rediscovered by others through the 1970s and 1980s.
Universal Approximation Theorem:
Cybenko, G. (1989). “Approximation by Superpositions of a Sigmoidal Function.” Mathematics of Control, Signals, and Systems, 2(4), 303-314. The original proof that neural networks with a single hidden layer can approximate any continuous function. Hornik, K., et al. (1989) extended this result. The theorem provides the mathematical foundation for why neural networks can, in principle, learn anything—though it doesn’t address the practical challenges of actually finding good solutions through training.
The Deep Learning Breakthrough:
Krizhevsky, A., Sutskever, I., & Hinton, G.E. (2012). “ImageNet Classification with Deep Convolutional Neural Networks.” Advances in Neural Information Processing Systems (NeurIPS). The AlexNet paper that launched the deep learning revolution. The results were so far ahead of previous approaches that the field couldn’t ignore them.
ImageNet:
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). “ImageNet: A Large-Scale Hierarchical Image Database.” CVPR. The dataset that enabled the deep learning revolution in computer vision. Building it required enormous effort—crowdsourcing labels for millions of images.
Why Deep Learning Works:
This remains partially mysterious. Some useful perspectives:
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. The standard textbook.
Zhang, C., et al. (2017). “Understanding Deep Learning Requires Rethinking Generalization.” ICLR. Shows that standard explanations don’t fully account for deep learning’s success.
The Pioneers:
Geoffrey Hinton, Yann LeCun, and Yoshua Bengio shared the 2018 Turing Award for their contributions to deep learning. Their persistence through the AI winter was essential. Hinton in particular kept working on neural networks when they were deeply unfashionable.
GPUs and Deep Learning:
The use of GPUs for neural network training started around 2009-2010 (Raina et al., Ciresan et al., and others). This wasn’t a new algorithm—it was recognizing that existing algorithms could run much faster on hardware designed for video games.


