Intelligence as Performance
This is Chapter 10 of A Brief History of Artificial Intelligence.
We began this book with Alan Turing and a question: Can machines think? Now, having traced the journey from symbolic AI through neural networks, from the winters of failure to the spring of deep learning, we return to that question—with seventy years of hindsight and systems that finally force us to confront it.
On a spring morning in 1950, Turing sat in his office at the University of Manchester, surrounded by the hum of computing machinery. The Manchester Mark 1—one of the world’s first stored-program computers—occupied the floor below, a room-sized maze of vacuum tubes, Williams tube memory, and magnetic drum storage. Turing had helped design it. Now he was thinking about something else entirely.
He was writing a paper that would define a field for the next seventy years.
“I propose to consider the question, ‘Can machines think?’” The opening line was deliberately provocative. Turing knew the question invited endless philosophical debate—debates about consciousness, about souls, about what “thinking” really meant. He wanted to sidestep all of that.
So he proposed a game. Put a human interrogator in one room, a machine in another, and let them communicate through text. If the interrogator cannot reliably tell which is which, Turing argued, we should credit the machine with intelligence. Don’t ask what’s happening inside. Ask only what comes out.
For decades, this seemed like a distant hypothetical. No machine came close to fooling anyone. Then, in late 2022, ChatGPT arrived. Within months, GPT-4 was scoring in the 90th percentile on the bar exam, passing medical licensing tests, acing the GRE. The Turing Test, once a thought experiment, was suddenly a practical reality.
But passing the test revealed something Turing hadn’t anticipated. A system could perform intelligence—flawlessly, convincingly—while doing something that looked nothing like understanding. The question wasn’t whether machines could pass the test. It was what passing actually proved.
The Imitation Game
Turing called his test “the imitation game,” and the name was precise. The machine’s job was to imitate a human well enough to fool the interrogator. Success meant indistinguishability. Nothing more, nothing less.
This was a radical move. Turing was a mathematician, precise in his thinking, and he knew that questions about consciousness and inner experience were probably unanswerable. You can’t peer into another mind. You can’t verify someone else’s subjective experience. Even with humans, we infer understanding from behavior—from what people say and do, not from direct access to their thoughts.
So why not apply the same standard to machines? If behavior is all we can measure, make behavior the test. A machine that behaves indistinguishably from a thinking being should be treated as one. Turing was, in essence, a behaviorist about intelligence: outputs were what mattered.
The proposal attracted immediate criticism. Couldn’t a machine fake it? Couldn’t clever programming produce the appearance of intelligence without the reality? Turing acknowledged the objection but wasn’t troubled by it. “May not machines carry out something which ought to be described as thinking but which is very different from what a man does?” he asked. Perhaps machine thinking would be different from human thinking. It would still be thinking.
For nearly seventy years after Turing’s death in 1954, the debate remained academic. The programs that attempted his test were brittle and easily exposed. ELIZA, a 1966 chatbot created by Joseph Weizenbaum at MIT, could briefly fool users by reflecting their statements back as questions—“I feel sad” became “Why do you feel sad?”—but any sustained conversation revealed its emptiness. Weizenbaum was horrified when people took ELIZA seriously. He had built a parlor trick, not a mind.
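The trick is simple enough to sketch. What follows is an illustrative toy in Python, not Weizenbaum’s program, which ran on a scripted set of keywords and decomposition rules; but the reflection at its heart worked roughly like this:

```python
import re

# A toy ELIZA-style responder: reflect the user's statement back as a question.
# This is a sketch of the technique only; Weizenbaum's original used a richer
# script of keywords, ranks, and reassembly rules.

REFLECTIONS = {"i": "you", "my": "your", "am": "are", "me": "you"}

def reflect(fragment: str) -> str:
    """Swap first-person words for second-person ones."""
    return " ".join(REFLECTIONS.get(word, word) for word in fragment.lower().split())

def respond(statement: str) -> str:
    match = re.match(r"i feel (.*)", statement, re.IGNORECASE)
    if match:
        return f"Why do you feel {reflect(match.group(1))}?"
    match = re.match(r"i am (.*)", statement, re.IGNORECASE)
    if match:
        return f"How long have you been {reflect(match.group(1))}?"
    return "Please tell me more."

print(respond("I feel sad"))                  # Why do you feel sad?
print(respond("I am worried about my job"))   # How long have you been worried about your job?
```

A handful of patterns, a lookup table, and the illusion of a listener.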
Then large language models arrived, and everything changed.
A Lawyer’s Mistake
In May 2023, Steven Schwartz faced a problem. A partner at the New York law firm Levidow, Levidow & Oberman, Schwartz was working on an aviation injury case—a client who had been hurt by a serving cart on an Avianca flight. He needed legal precedents, cases that would support his client’s claims.
Schwartz was sixty-two years old, a lawyer for over thirty years. But legal research had always been tedious, and there was a new tool that everyone was talking about. He opened ChatGPT.
“Can you find relevant case law for an aviation injury claim under the Montreal Convention?”
The AI responded confidently. It provided case names, citations, summaries of holdings. Varghese v. China Southern Airlines. Shaboon v. Egyptair. Petersen v. Iran Air. The citations were properly formatted. The summaries sounded authoritative. Schwartz incorporated them into his brief and filed it with the federal court.
Two weeks later, Avianca’s lawyers filed a response that began with an unusual statement: they could not locate several of the cases Schwartz had cited. Not because the cases were obscure—because the cases didn’t exist.
Judge P. Kevin Castel, a federal judge in the Southern District of New York, ordered Schwartz to produce the cases. Schwartz went back to ChatGPT, which cheerfully provided more details about the fabricated precedents. He submitted these to the court as proof the cases were real.
They weren’t. Not one of them. ChatGPT had invented case names, invented judges, invented legal reasoning—all with the fluency and format of genuine legal research. When Schwartz finally tried to find the cases in legal databases, he discovered the truth: he had submitted a brief full of hallucinations.
At the sanctions hearing, Judge Castel was incredulous. “The Court is presented with an unprecedented circumstance,” he wrote in his opinion. A lawyer had relied on artificial intelligence that “made up cases out of whole cloth.” Schwartz was sanctioned $5,000 and publicly reprimanded.
But the deeper issue wasn’t Schwartz’s carelessness. It was what the incident revealed about the nature of AI performance. ChatGPT had passed every surface test of legal competence. The citations looked real. The reasoning sounded real. A thirty-year veteran of the bar couldn’t tell the difference. The performance was flawless.
The performance was also completely empty. The system had produced the form of legal knowledge without any grasp of legal reality. It had performed legal research without doing legal research. The imitation was perfect. The thing being imitated was absent.
The Chinese Room
Three decades before ChatGPT, a philosopher named John Searle had predicted exactly this problem.
Searle was a professor at Berkeley, known for his work on speech acts and intentionality—the question of how minds relate to the world. In 1980, AI was in its first wave of hype. Programs like SHRDLU could manipulate blocks in a simulated world, responding to English commands. Researchers were proclaiming that machines would soon understand language as well as humans.
Searle thought this was nonsense, and he devised a thought experiment to show why.
Imagine, he said, a person locked in a room with a rulebook for manipulating Chinese characters. Messages in Chinese come through a slot. The person, who doesn’t understand a word of Chinese, looks up each input character in the rulebook, follows the instructions exactly, and produces output characters. To someone outside the room, the outputs are perfect Chinese—grammatically correct, semantically appropriate, indistinguishable from a native speaker’s responses.
Does the person in the room understand Chinese?
Searle’s answer was emphatic: no. The person is manipulating symbols according to rules. They don’t know that 猫 means cat, that 北京 means Beijing, that they’re discussing philosophy or ordering dinner. They’re executing a procedure. The outputs are correct. Understanding is absent.
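The procedure Searle imagines can be sketched in a few lines. The rulebook below is a hypothetical two-entry toy, standing in for the vastly larger one his scenario assumes:

```python
# A toy "rulebook": purely syntactic rules pairing input strings with output strings.
# The entries are hypothetical placeholders; Searle's point is that following them
# requires no knowledge of what any of the characters mean.
RULEBOOK = {
    "你好吗？": "我很好，谢谢。",              # "How are you?" -> "I'm fine, thanks."
    "你喜欢哲学吗？": "是的，我很喜欢哲学。",   # "Do you like philosophy?" -> "Yes, very much."
}

def chinese_room(message: str) -> str:
    """Look the message up and copy out the prescribed reply. Nothing more happens."""
    return RULEBOOK.get(message, "请再说一遍。")  # "Please say that again."

print(chinese_room("你好吗？"))  # fluent Chinese comes out; no understanding anywhere inside
```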
Searle was attacking what he called “strong AI”—the claim that an appropriately programmed computer doesn’t merely simulate a mind but actually is a mind. If the program is right, proponents argued, understanding genuinely occurs. The Chinese Room was designed to demolish this claim. A system could produce perfect outputs while understanding nothing. The program could be flawless. The understanding could be absent.
The argument ignited decades of debate. Some critics argued that while the person doesn’t understand Chinese, perhaps the system as a whole—person plus rulebook—does understand. Searle considered this “systems reply” absurd: where would the understanding be located? Others questioned whether any rulebook could capture true Chinese competence. The debate filled philosophy journals, generating more heat than resolution.
But ChatGPT has made the Chinese Room concrete. When the AI produces legal citations, it’s manipulating symbols according to patterns. When it writes poetry or explains quantum physics, it’s manipulating symbols according to patterns. The patterns are more complex than Searle’s rulebook—billions of parameters instead of explicit rules—but the function is similar. Outputs that match what an understanding system would produce. Generated without any apparent understanding.
Searle maintained his position for over four decades, insisting until his death in 2025 that symbol manipulation could never constitute genuine understanding. Large language models, he argued, were simply more sophisticated Chinese Rooms—faster, more capable, but still fundamentally empty. The outputs had become more impressive. The absence of understanding, in his view, remained unchanged.
The question is whether he was right—or whether something has changed since 1980 that his thought experiment didn’t anticipate.
The Reasoning Illusion
In 2022, researchers at Google made a discovery that complicated the picture further.
Jason Wei and his colleagues had been experimenting with prompting techniques—ways of asking language models to produce better answers. They found something surprising: prompting a model to reason step by step, writing out intermediate steps before giving its final answer, dramatically improved performance. Math problems that stumped the model when asked directly became solvable when it was prompted to reason first.
They called the technique “chain-of-thought prompting,” and it seemed like evidence for genuine reasoning. The model wasn’t just pattern-matching; it was thinking through problems, step by step, like a human would.
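In practice the technique is nothing more than a change of prompt. Here is a minimal sketch, with arithmetic word problems adapted from Wei et al.’s paper; the call to an actual model is left abstract:

```python
# Chain-of-thought prompting changes the prompt, not the model.
# `query_model` is a placeholder for whatever language-model API is in use.

QUESTION = (
    "A cafeteria had 23 apples. It used 20 to make lunch and bought 6 more. "
    "How many apples does it have?"
)

# Direct prompting: ask for the answer outright.
direct_prompt = f"Q: {QUESTION}\nA:"

# Chain-of-thought prompting: show one worked example whose answer spells out
# its intermediate steps, then pose the real question in the same format.
cot_prompt = (
    "Q: Roger has 5 tennis balls. He buys 2 more cans of tennis balls. "
    "Each can has 3 tennis balls. How many tennis balls does he have now?\n"
    "A: Roger started with 5 balls. 2 cans of 3 tennis balls each is 6 balls. "
    "5 + 6 = 11. The answer is 11.\n\n"
    f"Q: {QUESTION}\nA:"
)

# answer = query_model(cot_prompt)  # the model tends to continue with steps of its own
```

A later variant dropped the worked example entirely, simply appending “Let’s think step by step” to the question, with much the same effect.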
But subsequent research revealed something troubling. In 2023, researchers at Anthropic and elsewhere began systematically studying what happened inside models when they produced chain-of-thought reasoning. What they found was a gap between the reasoning displayed and the reasoning performed.
Sometimes models produced correct answers with incorrect reasoning—arriving at the right destination through the wrong path. Sometimes they produced plausible-sounding reasoning that was disconnected from their actual processing. The shown work didn’t reliably reflect the work being done.
“It’s like a student who writes down steps to show the teacher, but actually solved the problem a different way,” explained Miles Turpin, one of the researchers behind this work. “The reasoning is a performance. It’s not necessarily false—but it’s generated to look right, not because it’s how the model actually computed the answer.”
This phenomenon, sometimes called “unfaithful chain-of-thought,” reveals the gap between performance and process. When humans show our work, the work usually reflects our thinking. When models show their work, the shown work is generated by the same process that generates everything else: predicting what plausible text would follow. Plausible reasoning and actual reasoning are different things.
The models are performing reasoning. Whether they’re doing reasoning is another question.
What Tests Actually Test
GPT-4’s performance on professional examinations forced a reckoning with what tests actually measure.
The bar exam is designed to ensure that lawyers understand law well enough to practice it. The medical boards are designed to ensure that doctors understand medicine well enough to treat patients. These tests work because they’re hard to fake. A human who can answer complex questions about contract law probably understands contract law. The test measures visible outputs and infers invisible competence.
But AI breaks this inference. GPT-4 passes the bar exam without ever having represented a client, argued a case, or felt the weight of legal responsibility. It passes medical boards without examining a patient, smelling infection, or delivering bad news to a family. The correct answers are present. The experiential foundation is absent.
This doesn’t mean the tests are useless for humans. A human who passes the bar still demonstrates legal knowledge, because humans can’t pattern-match their way through law school. But the tests reveal their own nature: they measure performance, not understanding. They always did. We just assumed the two went together.
There’s a principle in social science called Goodhart’s Law: when a measure becomes a target, it ceases to be a good measure. If we optimize AI to pass tests, we get AI that passes tests—which may be different from AI that understands. The tests worked as measures of understanding only as long as nothing optimized directly for them. Now something does, and the measure has broken.
What would a test that can’t be gamed look like? Perhaps one that requires physical presence—examining an actual patient, arguing in an actual courtroom. Perhaps one that requires real-time adaptation to novel situations. Perhaps one that probes not just answers but the reasoning that produces them.
Or perhaps all tests can only measure performance. Perhaps that’s what tests are.
The Spectrum of Understanding
The debate over AI understanding often assumes a binary: either a system understands or it doesn’t. Searle’s Chinese Room demands a yes-or-no answer. But perhaps understanding exists on a spectrum.
Consider different levels. There’s behavioral competence—producing appropriate outputs in context. GPT-4 has this abundantly: it responds to questions in ways that match what an understanding system would do. There’s functional competence—using concepts productively, applying them to new situations, reasoning with them. GPT-4 has some of this, at least within domains similar to its training. There’s grounded understanding—connection to experience, to the physical world, to the felt sense of what concepts mean. GPT-4 lacks this entirely.
Perhaps these aren’t different kinds of understanding but different degrees. A child understands physics less than a physicist, but both understand it somewhat. A tourist understands a foreign city less than a native, but both can navigate. Maybe AI systems sit somewhere on this spectrum—more than mere lookup, less than full human comprehension.
This view has an uncomfortable implication. If understanding is a spectrum, where do humans sit? We also have gaps between performance and comprehension. Students pass tests without deep understanding all the time. Experts apply procedures without grasping foundations. We confabulate explanations for decisions we made intuitively. We pattern-match more than we like to admit.
How much of human intelligence is performance? How much is pattern recognition dressed up as reasoning? The AI mirror reflects these questions back at us. If we’re troubled by systems that perform without understanding, maybe we should examine how much of our own cognition works the same way.
Why Performance Works
If performance and understanding are separable, why does performance work so well? Why can systems that manipulate symbols pass bar exams and fool experts?
The answer has to do with the structure of intelligence itself.
When humans think—really think—we produce artifacts. Text, speech, diagrams, code. These artifacts aren’t random. They reflect the structure of thought: how arguments build, how evidence supports conclusions, how concepts relate. A legal brief reflects legal reasoning. A medical diagnosis reflects clinical thinking. A proof reflects mathematical logic.
A language model trained on these artifacts learns their structure. It learns what legal briefs look like, how medical explanations flow, what mathematical reasoning sounds like. In learning these patterns, it captures something about thought itself—not the inner experience of thinking, but the external form it takes.
This is why performance works: intelligence is structured, and structure can be imitated. A system that masters the form of legal reasoning can produce text indistinguishable from a lawyer’s—not because it reasons like a lawyer, but because lawyers’ reasoning has a characteristic shape, and that shape can be learned.
The next chapter explores this more deeply: intelligence as compression. To predict text well, a model must find patterns—regularities that capture how ideas relate to other ideas. These patterns are the structure of thought made visible. Capturing them is capturing something real, even if it’s not the same as understanding.
The performance of intelligence works because intelligence leaves traces. A perfect imitation of those traces looks exactly like the real thing. Whether it is the real thing—whether form suffices for substance—is a question that may not have a clean answer.
The Limits of Performance
Performance has limits, and the limits reveal what performance lacks.
Novel situations expose the gap. AI systems excel when new problems resemble training data—when answers lie within the space of examples they’ve seen. They struggle when problems require genuine extrapolation, when solutions lie outside the distribution. A system can perform brilliantly on variations of familiar problems and fail completely on genuinely new ones.
Adversarial examples show this starkly. Add imperceptible noise to an image, and a vision system’s confident classification can flip from “cat” to “toaster.” The noise exploits the gap between pattern-matching and understanding. A system that understood cats wouldn’t be fooled by invisible perturbations. A system that performs cat-recognition can be.
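The recipe for such a perturbation is strikingly simple. The sketch below uses the fast gradient sign method, a standard textbook attack; the tiny untrained model is a stand-in so the snippet runs on its own, whereas real attacks target trained vision networks:

```python
import torch
import torch.nn.functional as F

# Fast gradient sign method (FGSM): nudge every pixel a tiny step in the
# direction that increases the classifier's loss on the true label.
# The linear model here is an untrained stand-in so the sketch is self-contained.
model = torch.nn.Sequential(torch.nn.Flatten(), torch.nn.Linear(28 * 28, 10))

image = torch.rand(1, 1, 28, 28)   # stand-in for a real input image, values in [0, 1]
label = torch.tensor([3])          # its true class
epsilon = 0.01                     # perturbation budget: far too small to see

image.requires_grad_(True)
loss = F.cross_entropy(model(image), label)
loss.backward()

# The adversarial image differs from the original by at most epsilon per pixel.
adversarial = (image + epsilon * image.grad.sign()).clamp(0.0, 1.0).detach()

print((adversarial - image.detach()).abs().max().item())  # <= epsilon
```

The change to each pixel is bounded by epsilon, invisible on a real photograph, yet it is chosen pixel by pixel to push the model’s loss upward.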
Compositional generalization reveals similar limits. Humans combine concepts freely: if you know “red” and “square,” you can imagine a red square you’ve never seen. AI systems struggle with novel combinations outside their training experience. They perform within the space of familiar examples but fail to extrapolate beyond it.
These failures don’t prove that AI lacks understanding—perhaps they just indicate incomplete understanding. But they suggest that performance and understanding, even if related, are not identical. Perfect performance within one domain can coexist with failure outside it. Understanding, whatever it is, should generalize more robustly.
Closing: The Question Sharpened
This chapter began with Alan Turing in Manchester, proposing a test that would define a field. It ends with systems that pass his test—and a question sharper than ever.
Turing asked whether machines could think. He proposed that behavior was the answer: if a machine behaves indistinguishably from a thinking being, call it thinking. For seventy years, no machine came close. Now machines pass bar exams, fool lawyers, and produce outputs that experts can’t distinguish from human work.
Searle argued that performance wasn’t enough—that the Chinese Room showed how symbols could be manipulated without understanding. We now have vast Chinese Rooms, processing billions of parameters, producing perfect outputs with no apparent inner life.
But perhaps the dichotomy is too simple. Perhaps understanding is not binary but spectral. Perhaps humans, too, perform more than we understand. Perhaps the question “does it really understand?” has no clean answer because understanding itself is not a clean concept.
What we know is this: intelligence can be performed. The performance is good enough to pass professional exams, good enough to fool thirty-year veterans, good enough to produce work that millions of people find useful. Whether the performance is accompanied by comprehension—whether anything lies beneath the outputs—remains an open question.
Turing, who died before any machine came close to passing his test, might have appreciated the irony. He asked whether machines could think. Seventy years later, we have machines that act as if they can. The performance is flawless. What lies beneath it remains unknown.
Notes and Further Reading
On Turing and His Test
Turing’s “Computing Machinery and Intelligence” (1950) remains astonishingly fresh, anticipating objections that would occupy philosophers for decades. For biography, Andrew Hodges’ “Alan Turing: The Enigma” captures both the mathematics and the tragedy of Turing’s life. The Manchester computing context illuminates why Turing was thinking about machine intelligence at precisely that moment.
On the Chinese Room
Searle’s “Minds, Brains, and Programs” (1980) sparked one of philosophy’s most sustained debates. The collection “The Philosophy of Artificial Intelligence” (edited by Margaret Boden) includes the original paper and major responses. Searle continued to comment on AI developments throughout his career, maintaining until his death in 2025 that symbol manipulation cannot constitute genuine understanding.
On the Schwartz Case
Judge Castel’s June 2023 sanctions order in Mata v. Avianca is publicly available and worth reading for its precise documentation of what went wrong. The case has become canonical in discussions of AI hallucination and has already influenced how bar associations think about AI use in legal practice.
On Chain-of-Thought and Faithful Reasoning
Wei et al.’s “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models” (2022) introduced the technique. Subsequent work, including research from Anthropic on “unfaithful” chain-of-thought, has complicated the picture. The gap between displayed reasoning and actual processing remains an active research area.
On Standardized Test Performance
OpenAI’s technical reports for GPT-4 document exam performance in detail, including methodology for fair comparison with human scores. The results have prompted discussions about what standardized tests actually measure—discussions that predate AI but have become newly urgent.


