The Rise of Agents, Part 4: The Harness
The model is commodity. The harness is where the agent lives or dies.
In February 2026, Mitchell Hashimoto, co-creator of Terraform and founder of HashiCorp, published a blog post describing a habit he had developed while working with AI agents. Every time an agent made a mistake, instead of just fixing the output, he would engineer a permanent fix into the agent’s environment. A new constraint. A better tool. A clearer instruction. A checklist the agent had to run before finishing. He called this habit engineering the harness.
Within weeks, OpenAI, Anthropic, LangChain, and Martin Fowler had all published substantial treatments of the same idea. By March 2026, “harness engineering” was an emerging discipline with primary sources from three major labs, a growing practitioner literature, and a precise claim at its center. The claim is that the system around the agent matters more than the agent. Not in some abstract sense. Measurably, in ways that dominate model capability as the primary driver of reliability.
The vocabulary is 2026. The practice is older. Every production agent since ReAct has had some version of a harness: system prompts, tool definitions, error handlers, retry logic, memory scaffolding. What changed in early 2026 was that the practice acquired a name, a unified description, and three labs’ worth of evidence that the practice mattered more than most of the field had previously acknowledged.
This article is about that claim, the evidence for it, and what it implies about where agent capability actually comes from. The harness is the part of the agent that gets built, not the part that gets trained. And in 2026, it has become the part that determines whether an agent works.
What Is a Harness
“Harness” as a term of art in agent engineering sits at a specific level of the stack. It is not the model. It is not the application. It is what goes between them.
A useful mental picture. The model can reason in natural language, call tools, and produce outputs. The application wants to do something in the world: fix bugs, answer questions, build websites, execute trades. The harness is the engineered system that connects these. System prompts that orient the model to its task. Tool definitions that expose external capabilities. Middleware that modifies the model’s inputs and outputs before and after each call. Memory systems that persist context across turns. Verification loops that check outputs. Sub-agent delegation patterns. Error handling. Context management. All of this, plus the logic that threads it together.
LangChain’s engineering team describes the harness as having, in their phrasing, a lot of knobs: system prompts, tools, hooks, middleware, skills, sub-agent delegation, memory systems, and more. OpenAI’s Codex team frames the harness as three categories. Context engineering, meaning what information the agent sees. Architectural constraints, meaning what rules and boundaries apply. Lifecycle management, meaning how the agent operates across time and sessions. Both decompositions point at the same thing. The harness is the engineered environment in which the agent does its work.
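To make the shape concrete, here is a minimal sketch of what those knobs might look like when gathered into one object. Every name in it is illustrative, not any lab’s actual API; it is simply the rough anatomy the two decompositions above describe.

```python
# Illustrative anatomy of a harness. Names are hypothetical, not a real framework's API.
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Harness:
    system_prompt: str                                              # orients the model to its task
    tools: dict[str, Callable] = field(default_factory=dict)       # external capabilities the model can call
    middleware: list[Callable] = field(default_factory=list)       # rewrites inputs/outputs around each call
    memory: dict = field(default_factory=dict)                     # state persisted across turns and sessions
    verifiers: list[Callable] = field(default_factory=list)        # checks run against outputs before acceptance
    subagents: dict[str, "Harness"] = field(default_factory=dict)  # delegation targets, each with its own harness

    def prepare_context(self, task: str, history: list[str]) -> str:
        """Context engineering: decide what the model sees on this particular call."""
        ctx = [self.system_prompt, f"Task: {task}"] + history[-20:]
        for mw in self.middleware:          # each middleware may add, trim, or rewrite context
            ctx = mw(ctx)
        return "\n".join(ctx)
```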
None of this is model capability in the strict sense. The same model can operate in many different harnesses. And here is where the 2026 evidence gets interesting.
The Evidence
In February 2026, LangChain’s engineering team published an experiment. They had a coding agent built on GPT-5.2-Codex that scored 52.8 percent on Terminal Bench 2.0, a result outside the Top 30. The team wanted to improve it. They did not change the model. They changed only the harness. Over a few weeks of iterative work, they moved the score from 52.8 percent to 66.5 percent. Same model. Same weights. Same training. Different environment around the model. The agent moved from outside the Top 30 to the Top 5 on the benchmark.
The 13.7 percentage point improvement came from a specific set of moves. Context middleware that mapped the agent’s working directory on startup so it did not waste time and tokens discovering it. Prompting changes that forced the agent to verify its own work against task specifications rather than re-reading its own code and declaring it fine. A reasoning budget allocation that spent high compute on planning and verification and moderate compute on implementation, because maximum reasoning at every step caused timeout failures. Middleware to detect and break repetition loops. Explicit onboarding about how the agent’s code would be tested programmatically, so it wrote code that could pass those tests.
None of these are model changes. All of them are harness changes. The cumulative effect on benchmark score was larger than what typical new-model releases deliver.
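To give one of these moves concrete shape, here is a hypothetical sketch of the loop-breaking idea, assuming the harness records each tool call as a name-and-arguments pair. It is not LangChain’s middleware, just the kind of small, targeted check that category contains.

```python
# Hypothetical sketch of repetition-loop detection. Assumes the harness logs each
# tool call as a (tool_name, serialized_args) pair; not LangChain's actual code.
from collections import Counter

def detect_repetition(recent_calls: list[tuple[str, str]], window: int = 10, limit: int = 3) -> bool:
    """True if any identical tool call appears more than `limit` times in the
    last `window` calls, a sign the agent is stuck repeating itself."""
    counts = Counter(recent_calls[-window:])
    return any(n > limit for n in counts.values())

def break_loop(context: list[str]) -> list[str]:
    """On detection, inject a corrective instruction rather than letting the
    agent repeat the same action until the task times out."""
    return context + [
        "You appear to be repeating the same action without new results. "
        "Stop, restate what remains to be done, and try a different approach."
    ]
```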
OpenAI’s evidence is more dramatic in scale. In early February 2026, the Codex team published a report from an internal experiment. For five months, a small team of engineers had been building a production software product, one million lines of code, with zero lines manually written. Every line, every test, every CI configuration, every documentation file, was written by Codex agents. The engineers’ job was to design the harness. When Codex made a mistake, they asked what capability was missing, and built it into the environment. What abstractions should agents reach for. What conventions should they follow. What background tasks should continuously enforce code quality. The product shipped, deployed, broke, and got fixed, like any other production system. The team’s estimate was that they built it in about one-tenth the time it would have taken to write the code by hand.
The OpenAI report reads as an engineering diary, not a marketing document. The tone throughout is that the model was always capable. What they had to build was the environment in which the capability could be exercised reliably. The report describes specific failures and the harness fixes applied. A giant instruction file became a directory of targeted docs, because a single monolithic file crowded out actual task context. A one-off quality check became a scheduled background task, because human taste captured once and enforced continuously works better than catching drift in periodic bursts. A hand-off from one agent session to another became a structured artifact rather than a conversation summary, because agents are better at reading fresh state than inheriting someone else’s context.
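A structured hand-off artifact is easier to picture than to describe. The sketch below is an assumption about what such an artifact might contain, not the format the report uses; the point is that the next session reads fresh, structured state instead of a prose summary of someone else’s conversation.

```python
# Hypothetical hand-off artifact written at the end of one agent session and read
# at the start of the next. The fields are assumptions, not OpenAI's format.
handoff = {
    "goal": "Add rate limiting to the public API",
    "done": ["token-bucket module merged", "unit tests for the bucket pass"],
    "remaining": ["wire the limiter into the request pipeline", "load-test at target traffic"],
    "invariants": ["rate-limited requests return 429 with a Retry-After header"],
    "verify_with": "run the rate-limit test suite and the load-test script before closing",
}
# The next agent starts from this state rather than inheriting a compressed
# transcript of the previous session's context.
```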
Anthropic’s contribution arrived in March 2026. A three-agent harness for long-running autonomous coding. A Planner expands a short product prompt into a fuller specification, deliberately leaving implementation details unspecified, because early over-specification cascades into downstream errors. A Generator implements features one sprint at a time, writing code and tests. An Evaluator runs Playwright-based browser automation to interact with the running application and score it against the sprint’s contract, criteria negotiated with the Generator before code is written. If evaluation fails, the sprint fails and the cycle repeats with revised scope.
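As control flow, the sprint loop looks roughly like the sketch below. The functions are stand-ins for full agent sessions and the names are assumptions, not Anthropic’s implementation; what matters is the order of operations and the failure path.

```python
# Control-flow sketch of a Planner / Generator / Evaluator sprint loop.
# The stubs stand in for full agent sessions; this is not Anthropic's code.
from dataclasses import dataclass

@dataclass
class Report:
    passed: bool
    notes: str = ""

def plan(prompt: str) -> list[str]:
    # Planner: expand a short product prompt into sprints, leaving details open
    return [f"sprint 1 derived from: {prompt}"]

def generate(sprint: str, contract: str) -> str:
    # Generator: implement the sprint, writing code and tests
    return f"build for {sprint}"

def evaluate(build: str, contract: str) -> Report:
    # Evaluator: exercise the running application (e.g. via browser automation)
    # and score it against the contract agreed before code was written
    return Report(passed=True)

def run_project(product_prompt: str, max_sprints: int = 20) -> None:
    for sprint in plan(product_prompt)[:max_sprints]:
        contract = f"acceptance criteria negotiated for {sprint}"
        while True:
            report = evaluate(generate(sprint, contract), contract)
            if report.passed:
                break                                               # sprint accepted, move to the next
            sprint = f"{sprint} (rescoped after: {report.notes})"   # failed sprint repeats with revised scope
```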
Anthropic tested this architecture against a solo agent on the same task: build a 2D retro game engine. The solo agent produced something that technically launched in twenty minutes for about nine dollars. The three-agent harness ran for six hours, cost about two hundred dollars, and produced a richer, more polished, more functional application. The gap was not a marginal improvement in quality. It was a change in what kind of output the system was capable of producing at all.
Three labs. Three independent demonstrations. The pattern is the same. In fixed-model experiments, harness improvements produced larger capability gains than any model upgrade in the same period. This is the evidence for the harness claim.
Why the Model Is Not Enough
There is a specific reason the harness matters this much. It has to do with what the model actually is.
A language model, in the strict sense, is a stateless function. You pass it tokens. It returns tokens. Each call is independent. The model has no memory of previous calls, no awareness of where in a longer task it sits, no way to track its own progress, no persistent understanding of its environment. Everything the model knows about the current situation must be packed into its context window for each call.
An agent, in the functional sense, needs almost everything a stateless function does not have. Memory of what it has done. Awareness of its goals. Ability to recover from errors. Knowledge of what tools exist and when to use them. Judgment about when the task is complete. These are not model capabilities. They are system capabilities. The harness is what supplies them.
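The gap is easiest to see side by side. In the sketch below, call_model stands in for a bare, stateless model call; everything wrapped around it, the memory, the progress tracking, the tool routing, the error recovery, the completion judgment, is harness. The names are illustrative, not a real API.

```python
# Illustrative only: what the harness supplies around a stateless model call.
def call_model(prompt: str) -> str:
    return "..."        # tokens in, tokens out; no memory, no goals, no sense of progress

def run_agent(goal: str, tools: dict, max_steps: int = 50) -> str:
    memory: list[str] = []                            # the model keeps nothing; the harness does
    for step in range(max_steps):                     # the model does not know where it is in the task
        prompt = f"Goal: {goal}\nSo far:\n" + "\n".join(memory[-10:])
        action = call_model(prompt)
        if action.startswith("DONE"):                 # completion judgment enforced outside the model
            return action
        try:
            result = tools.get(action, lambda: "unknown tool")()   # tool routing lives in the harness
        except Exception as err:
            result = f"error: {err}"                  # so does error recovery
        memory.append(f"step {step}: {action} -> {result}")
    return "stopped: step budget exhausted"           # and so does knowing when to give up
```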
This becomes acute as task length grows. A model answering a single question works fine without a harness. A model running fifty tool calls to complete a task does not. Each call consumes context. By the fiftieth call, the model’s view of what it was asked to do in the first place may be compressed, displaced by intermediate results, or contaminated by irrelevant details. The industry term for this is context durability. How well does the model follow its original instructions after its hundredth tool call. The answer, for any frontier model in 2026, is: not well enough without help.
Context durability is a harness problem. Approaches vary. Some harnesses run summarization passes that compress history and preserve key facts. Some use context resets where the session is cleared entirely and the next agent picks up from structured artifacts rather than inheriting prior context. Some use scheduled re-anchoring, where the original goal is reinjected into the context at regular intervals to prevent drift. None of these are model improvements. All of them are harness improvements. All of them address the gap between what a stateless function does naturally and what a functional agent needs.
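Scheduled re-anchoring, for instance, can be as small as the sketch below, assuming the harness counts tool calls as it goes. The interval and the wording are assumptions, not a published implementation.

```python
# Hypothetical re-anchoring middleware: reinject the original goal every N tool
# calls so it survives compression and displacement. The interval is an assumption.
def reanchor(context: list[str], original_goal: str, call_count: int, every: int = 10) -> list[str]:
    if call_count > 0 and call_count % every == 0:
        return context + [f"Reminder of the original task, verbatim: {original_goal}"]
    return context
```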
The same pattern holds for other agent properties: tool use consistency, error recovery, goal persistence, output verification, multi-step planning. All of them live in the harness. What the agent is, experienced by a user, is mostly what the harness is. The model is the substrate. The harness is where the agent lives.
Coding as the Experimental Ground
Every concrete example so far has been a coding agent. Coding has become the proving ground for harness engineering because it is the domain where the experimental apparatus is cleanest. Code compiles or does not compile. Tests pass or do not pass. A pull request is either merged by an automated check or it is not. A benchmark like Terminal Bench 2.0 runs tasks in containers with clear pass-fail outcomes. The feedback signals are abundant, objective, and fast. For a discipline like harness engineering, which depends on iterating against measurable outcomes, this matters enormously. It is much harder to iterate on a legal research agent or a customer support agent where ground truth is subjective and feedback is slow.
The situation resembles how molecular biology used fruit flies in the twentieth century. Drosophila is not interesting for its own sake. Its short generation time, cheap maintenance, and well-characterized genetics made it the species against which hypotheses could be tested quickly. Genetics is not a science about fruit flies. It used them. Harness engineering is not a discipline about coding agents. It uses them.
The structural findings transfer. Context engineering: not coding-specific. Verification loops: not coding-specific. Sub-agent decomposition: not coding-specific. The reasoning sandwich pattern, meaning high compute at planning, moderate at execution, high at verification, which took LangChain’s agent from 52.8 to 66.5, is not Claude-specific or GPT-specific or Codex-specific. It is a property of how attention-based models trade reasoning depth against execution latency. It applies wherever language agents work on bounded tasks with timeouts.
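Expressed as a harness setting rather than a model property, the sandwich is just a per-phase budget, something like the sketch below. The phase names and budget labels are assumptions, and how “high” maps onto a given model’s reasoning controls is provider-specific.

```python
# Sketch of a per-phase reasoning budget; labels and mapping are assumptions.
REASONING_BUDGET = {
    "plan":      "high",    # spend compute deciding what to do and in what order
    "implement": "medium",  # maximum reasoning on every edit risks timeout failures
    "verify":    "high",    # spend compute checking the result against the spec
}

def budget_for(phase: str) -> str:
    return REASONING_BUDGET.get(phase, "medium")
```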
Later articles in the series cover agent applications beyond coding, including physical embodiment in Part 7. The principles established here carry forward. The coding examples are the experimental base.
The Inversion
There is a rhetorical shift worth naming. The conventional wisdom about agents, from 2022 through much of 2024, was that better models produce better agents. If your agent underperforms, wait for the next model. The implicit assumption was that model capability is the binding constraint.
The 2026 evidence inverts this. In February and March of 2026, three frontier models were released in twenty-three days: GPT-5.4, Gemini 3.1 Ultra, and Grok 4.20. The capability gap between top labs compressed to weeks. Meanwhile, the capability gap widened between agents that used these models and agents that used them better. A LangChain agent on the same model as a competing agent could score 13.7 points higher because the harness was better. A Claude Opus model scored 64.9 percent in one evaluation framework and 57.6 percent in another, on the same benchmark, because the harness differed. More than seven percentage points from the harness alone.
The industry shorthand for this is: the model is commodity, the harness is moat. A startup or an enterprise team cannot reliably out-compete frontier labs on model capability, because frontier labs have the compute, data, and talent concentration needed to train frontier models. What teams can compete on is harness engineering. Trace analysis, failure mode cataloging, middleware design, sub-agent architecture, verification patterns. All of this is available to any team with enough discipline to iterate systematically. And all of this, in 2026, pays off more per unit of effort than waiting for better models.
This is the inversion. Conventional wisdom: agent capability comes from model capability. Current evidence: agent capability comes mostly from harness capability, given a frontier-grade model as substrate. The substrate matters. Frontier models are what make harness engineering worthwhile. But between two teams working with the same substrate, the one with the better harness wins.
Self-Improving Harnesses
One thread in the OpenAI report deserves attention as a glimpse of where this is heading. When the Codex team noticed their agents drifting from preferred code patterns, they did not just write documentation about the preferred patterns. They set up background Codex tasks to continuously scan the codebase for deviations and open targeted refactoring pull requests. The harness itself had agents in it, whose job was to maintain the harness’s invariants. Agents improving the environment in which other agents worked.
This is a specific and limited form of self-improvement. The agents are not deciding what the invariants should be. Humans decided that. What the agents do is enforce the invariants continuously and cheaply, so human taste captured once propagates across all future code without requiring human attention to each line. Call this harness-level self-improvement, or agents improving their own tools.
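The shape of such a background task is worth making explicit, because it shows the division of labor. The fields below are hypothetical, not Codex’s configuration format; what matters is that the invariant is human-authored and the enforcement is continuous.

```python
# Hypothetical background enforcement task. Humans define the invariant once;
# agents enforce it continuously. Field names are assumptions, not Codex's format.
background_task = {
    "name": "enforce-error-handling-pattern",
    "schedule": "nightly",
    "invariant": "API handlers use the shared result type instead of raising raw exceptions.",
    "scan": "search the codebase for handlers that deviate from the invariant",
    "on_violation": "open a targeted refactoring pull request, one per module, citing the invariant",
    "decided_by": "humans",   # agents enforce the rule; they did not choose it
}
```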
Hold this carefully. A system where agents continuously improve their working environment looks, in isolation, like something approaching autonomy. But the direction of improvement is set by humans. The agents are optimizing against criteria someone else defined. Self-improvement at this level is powerful execution, not self-direction. The series will return to this distinction in Parts 7 and 8, where the stakes get higher. For now: the 2026 harness contains agents that improve the harness. That much is real. What the harness cannot do, yet, is decide what the harness should be.
One more observation about this pattern. Part 2 described how reinforcement learning shapes language model behavior during training, making the model helpful, honest, careful in ways the base model is not. The harness carries this work forward at runtime. When OpenAI’s background agents enforce code quality invariants, or when Anthropic’s Evaluator agent scores a Generator’s output against pre-negotiated criteria, the harness is doing alignment work at runtime that RL did at training time. The model comes shaped from the training process. The harness re-shapes it continuously as the model operates, for things RL could not anticipate or could not shape reliably. Alignment is not only a training-time phenomenon. It is increasingly a runtime phenomenon, built into the harness, acting on every agent step.
Trust and What It Costs
The harness claim has an edge the primary sources do not always emphasize. If the harness is what makes the agent reliable, then trusting the agent means trusting the harness. And the harness is complex, evolving, often opaque.
Consider what it means to deploy a coding agent with full commit access to a production repository. What the agent does, in any moment, is the joint product of the model’s output, the system prompt, the tool configuration, the middleware, the memory system, the verification logic, and the sub-agent delegation structure. The user cannot see most of this directly. What the user sees is the agent’s behavior. When the behavior is good, the user trusts the system. When the behavior goes wrong, the cause could be anywhere in the stack.
Harness engineers have responded to this with an emerging practice: traces. Every agent action, every tool call, every reasoning step is logged to an observability system. When the agent fails, the trace is the evidence. LangChain’s iterative improvement loop runs on traces. OpenAI’s debugging practice runs on traces. Anthropic publishes detailed engineering writeups that effectively are annotated traces.
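A trace, at its simplest, is one structured record per agent step. The schema below is an assumption, not any particular observability product’s format; the value comes from every action being written down somewhere a human, or another agent, can replay.

```python
# Minimal sketch of per-step trace logging; the schema is an assumption.
import json
import time

def log_step(trace_file, step: int, phase: str, tool: str, args: dict, result_summary: str) -> None:
    record = {
        "ts": time.time(),
        "step": step,
        "phase": phase,              # e.g. plan / implement / verify
        "tool": tool,
        "args": args,
        "result": result_summary,    # truncated; full payloads stored elsewhere
    }
    trace_file.write(json.dumps(record) + "\n")   # one JSON line per agent action
```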
Traces are valuable. They are also not sufficient. A trace tells you what happened. It does not tell you why the harness was configured to make that the likely behavior, or what unseen choices in the harness shaped the option space within which the agent selected. The answer to “why did the agent do this” in a production harness is often: because the harness made it likely. Getting to the root cause requires inspecting not just the trace but the harness design. And the harness design, for any serious production agent, is complex enough that understanding it is someone’s full-time job.
This is a kind of trust architecture the industry is still working out. How much harness transparency should a customer demand. What parts of a harness are proprietary versus safety-relevant. Whether third-party harness audits will become standard. These questions do not have settled answers in 2026. What is settled is that they exist, and that their answer determines what trusting an agent actually means.
The Discipline Takes Shape
Pull back. Mitchell Hashimoto’s blog post in February 2026 named something that had been happening without a name. Within weeks, three major labs published their own treatments. Within a month, practitioners were publishing pattern libraries. By April 2026, a mid-career engineer could be described as a harness engineer and other practitioners would know what that meant. The discipline has a name, primary sources, worked examples, and a growing theoretical frame.
What the discipline does not yet have is stability. Harness patterns that work for GPT-5.2-Codex may not work for the next frontier model. Patterns that work for coding may not transfer cleanly to legal work or customer support or embodied agents. The field is in a period of active invention, where best practices are being discovered and codified and then sometimes invalidated as models change. This is appropriate for a discipline two months old. The appropriate stance for practitioners is to engineer harnesses that are, in the LangChain team’s phrase, rippable: designed to be rebuilt as the underlying model capabilities shift.
What will remain stable, probably, is the principle the discipline rests on. Agents are systems, not models. System properties matter as much as model properties. Reliability is built into the environment around the agent, not optimized out of the agent itself. This is true whether the current harness patterns hold or evolve. It will be true of the next generation of agent engineering too.
What Agents Are
The three articles before this one traced how language agents came to exist. This article is about how they become reliable in practice. The answer is not something internal to the model. It is the engineered system around the model. The harness takes a model that would, on its own, drift and hallucinate and give up prematurely, and turns it into something that ships production code, runs multi-hour autonomous sessions, and completes complex real-world tasks.
This is why the 2026 evidence matters for the series’ larger argument. If agents were mostly model, then the story of agents would be the story of better models, and the questions at the summit would be questions about what models can and cannot do. Instead, agents are mostly systems wrapped around models. The story of agents is the story of what systems can make models into. And the questions at the summit, which the series will reach in the later articles, are partly questions about what systems cannot yet make models into, regardless of how much harness effort is applied.
Part 5 turns to another development that has been reshaping this picture in parallel. Inference itself is becoming a site of agent behavior. Reasoning models that think for long stretches before producing an output are not just models with better training. They are models that run differently at inference time. If the harness is the environment around the agent, inference-time reasoning is the environment inside the agent. Both matter. Both are changing.
The Rise of Agents is an eight-part series. Next, Part 5: “Inference as Agency.”


