The Rise of Agents, Part 6: The Agent Ecosystem
The breakthrough was making one agent work. The frontier is making many of them cooperate.
In January 2026, Bob Sternfels, CEO of McKinsey, stood on a stage at the Consumer Electronics Show and gave the firm’s employee count: sixty thousand. Forty thousand humans. Twenty thousand AI agents. By the end of 2026, parity.
The number is not the point. The point is what would have to be true for the number to be plausible. Twenty thousand agents, doing the work of junior consultants, in production. That requires an infrastructure that did not exist twelve months earlier: protocols for agents to reach tools, protocols for agents to reach each other, patterns for many agents to coordinate, evaluation methods that survive multi-agent composition. None of these were givens in 2024. By 2026, they are how agent work gets built.
The shift is architectural, not cumulative. The question is no longer how to make one agent better. It is how to make many agents cooperate.
The Architectural Shift
The standard framing of agent progress has been cumulative. Better models produce better agents. Better harnesses produce more reliable agents. Better reasoning produces smarter agents. All true. But there is a different shift happening in parallel, and it is architectural rather than cumulative. The answer to “how do agents do more” is no longer a better model. It is not a better harness. It is a set of protocols, patterns, and practices for coordination. The field is building, in public and at speed, the equivalent of what networked computing built in the 1980s and 1990s: standards that let previously isolated components talk to each other.
Two protocols have emerged as the core infrastructure. Both became consequential between 2024 and 2026. Both are now housed under the Linux Foundation. Both represent a commitment by the major labs that the future of agents is multi-vendor and interoperable. Their trajectories are worth understanding in detail, because agent engineering for the next decade will be shaped largely by what these protocols make possible and what they make hard.
MCP: How Agents Reach Tools
Anthropic released the Model Context Protocol in November 2024. Four months later, in March 2025, OpenAI announced support. The protocol had crossed a threshold few open standards reach: it was no longer Anthropic’s protocol. A protocol adopted by a competitor is a protocol that belongs to the field.
The reasoning is in the math. A language model that can call tools is useful. A language model that can only call the tools someone pre-integrated into its particular harness is limited. Each integration had to be built from scratch, for each model, against each tool. The industry called this the N-times-M problem. If you had N models and M tools, you needed N times M integrations. It did not scale.
MCP solves this by defining a protocol. A server exposes tools in a standard format. A client, typically a language model or an agent, discovers and calls those tools through the same protocol regardless of who built the tools or the model. The tool does not need to know about the model. The model does not need custom integration for the tool. Both speak MCP.
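To make the shape of this concrete: here is a minimal tool server, sketched against the Python SDK’s FastMCP interface. The tool and its data are invented for illustration, and the exact import path varies by SDK version.

```python
# A minimal MCP server sketch, assuming the Python SDK's FastMCP helper.
# The tool and its backing data are hypothetical; the protocol shape is the point.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("ticket-lookup")

@mcp.tool()
def lookup_ticket(ticket_id: str) -> str:
    """Return the status of a support ticket by ID."""
    statuses = {"T-1001": "open", "T-1002": "resolved"}  # stand-in for a real backend
    return statuses.get(ticket_id, "unknown")

if __name__ == "__main__":
    mcp.run()  # serves over stdio; any MCP client can now discover and call the tool
```

The client side is symmetric: any MCP-speaking model or agent connects, lists the server’s tools, and invokes them over the same wire format, with no knowledge of who built what.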
After OpenAI, adoption accelerated. Google DeepMind added support in April 2025. Microsoft integrated it into Copilot in July. AWS followed in November. By March 2026, the protocol had over ten thousand active public servers and ninety-seven million monthly SDK downloads across Python and TypeScript combined. Every major AI platform supports it: ChatGPT, Claude, Cursor, Gemini, Microsoft Copilot, Visual Studio Code, and more. Enterprise adoption followed, with Bloomberg, Salesforce, and dozens of others building MCP servers as primary integration surfaces for their products.
In December 2025, Anthropic donated MCP to the Linux Foundation under a new entity called the Agentic AI Foundation, co-founded with Block and OpenAI. This mattered as a governance signal. A protocol with ninety-seven million monthly downloads cannot credibly remain a single vendor’s project. The transfer to neutral governance is what marked MCP’s graduation from de facto standard to actual standard, in the sense that USB-C, HTTP, and OAuth are standards. No single commercial entity can direct the specification to its own advantage.
The effect on agent engineering has been dramatic. An agent built in 2026 does not need custom integrations to most useful services. It needs an MCP client. Services expose MCP servers. The agent discovers, authenticates, and calls whatever it needs. This unglamorous piece of plumbing is what makes the rest of the ecosystem viable.
A2A: How Agents Reach Each Other
In August 2025, IBM merged its Agent Communication Protocol into A2A under Linux Foundation stewardship. This is the kind of merge that happens when a standard has won. IBM had its own competing protocol, with its own design choices and its own enterprise customers, and decided that maintaining a separate spec was worse than joining the consensus.
Backing up: MCP solves how agents reach tools. A2A solves how agents reach each other.
Google announced the Agent2Agent protocol in April 2025. Where MCP lets an agent invoke a tool, A2A lets one agent invoke another agent. The distinction matters. A tool is stateless, narrow, predictable. An agent is stateful, general, and potentially capable of collaborating rather than just executing. A2A treats agents as opaque peers that can discover each other, authenticate, negotiate tasks, and exchange results.
At launch, A2A had fifty technology partners. By mid-2025, Google had donated A2A to the Linux Foundation, and as of April 2026 more than one hundred fifty organizations support it. Version 1.0 of the specification, released in early 2026, added Signed Agent Cards: cryptographic signatures that let a receiving agent verify that a particular agent card was actually issued by the domain owner. This is the enterprise equivalent of an HTTPS certificate for agent-to-agent communication. Without it, an attacker could impersonate an agent and redirect other agents into misleading exchanges. With it, cross-organizational agent collaboration has a trust foundation it did not have before.
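What discovery looks like in practice: an A2A agent publishes a JSON agent card describing itself, and a client fetches that card before delegating anything. A hedged sketch, assuming the well-known path and field names from early published versions of the spec:

```python
# Illustrative A2A discovery sketch. The well-known path and the card's
# field names are assumptions drawn from early versions of the spec.
import json
import urllib.request

def fetch_agent_card(base_url: str) -> dict:
    """Fetch the JSON card a remote A2A agent publishes about itself."""
    with urllib.request.urlopen(f"{base_url}/.well-known/agent.json") as resp:
        return json.load(resp)

card = fetch_agent_card("https://agents.example.com")  # hypothetical host
print(card["name"], card.get("description", ""))
# Under version 1.0, a careful client additionally verifies the card's
# signature against the publishing domain before trusting the agent.
```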
The IBM merge was the moment of consolidation. After it, the ecosystem had converged on two complementary protocols: MCP for tools, A2A for agents. An analogy that has settled into the practitioner literature: MCP is the plumbing that delivers resources to a building. A2A is the electrical distribution panel that lets rooms in the building power each other.
The two protocols compose. An agent in a multi-agent system can use MCP to reach its own tools and A2A to reach other agents. The other agents can use MCP to reach their tools. None of these connections need custom integration. An enterprise in 2026 building a multi-agent workflow is building on top of these protocols rather than reinventing the plumbing. This is new. Twelve months earlier, every organization was building its own plumbing.
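A sketch of the composition, with `mcp_call` and `a2a_delegate` as hypothetical stand-ins for the respective SDK clients; only the layering is the point:

```python
# Compositional sketch with stubbed transports. mcp_call and a2a_delegate
# are hypothetical stand-ins for real MCP and A2A client libraries.
def mcp_call(server: str, tool: str, args: dict) -> list[str]:
    """Stub: invoke a tool on an MCP server."""
    return [f"doc about {args['q']}"]

def a2a_delegate(agent_url: str, task: dict) -> str:
    """Stub: hand a task to a remote agent over A2A and await the result."""
    return f"synthesis of {len(task['inputs'])} documents"

def research_agent(question: str) -> str:
    docs = mcp_call("search-index", "query", {"q": question})  # MCP: reach tools
    return a2a_delegate("https://writer.example.com",          # A2A: reach agents
                        {"instructions": "synthesize", "inputs": docs})

print(research_agent("agent interoperability"))
```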
What Multi-Agent Does Better
There is a question behind all of this. Why multi-agent at all? If one agent can do useful work, what does adding more agents actually buy?
The answer is specialization, and the evidence has accumulated through 2025 and 2026. Single agents given large, ambiguous tasks do worse than multi-agent teams with narrower roles. The failure modes of single agents running long tasks are by now well documented. They hallucinate more as context fills. They drift from original objectives. They overrate their own output when asked to evaluate it. They commit to approaches early and fail to back up when those approaches go wrong.
Part 4 introduced Anthropic’s three-agent harness for long-running coding tasks. A planner, a generator, and an evaluator. Each agent has a narrower job than a single agent would. The planner decomposes a short prompt into a detailed specification. The generator implements features one at a time against the specification. The evaluator runs the resulting application through browser automation and scores it against pre-negotiated criteria. The three agents, coordinating through structured handoff artifacts, produced output on a 2D retro game engine task that a single-agent approach could not match. The single agent finished in twenty minutes with something that launched but was broken at the connection level. The three-agent harness took six hours and produced a functional application. A phase change, not a marginal improvement.
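The coordination is easy to sketch, though what follows is a schematic of the pattern, not Anthropic’s implementation; `run_agent` is a hypothetical stand-in for a role-scoped model call.

```python
# Schematic of the planner/generator/evaluator pattern. run_agent is a
# hypothetical stub for an LLM call scoped to a single role.
import json

def run_agent(role: str, prompt: str) -> str:
    """Stub: one model call, one role, one narrow job."""
    return "{}"  # placeholder output

def build(task: str, max_rounds: int = 3) -> str:
    # Planner: decompose the short prompt into a detailed specification.
    spec = json.loads(run_agent("planner", f"Write a feature spec for: {task}"))
    artifact = ""
    for _ in range(max_rounds):
        # Generator: implement against the spec, one feature at a time.
        artifact = run_agent("generator", f"Implement per spec: {json.dumps(spec)}")
        # Evaluator: score the result against pre-negotiated criteria.
        verdict = run_agent("evaluator", f"Score against the spec: {artifact}")
        if "pass" in verdict:
            break
        spec["feedback"] = verdict  # structured handoff, not a raw transcript
    return artifact
```

Note what carries context between roles: the spec and the verdict, not the full history of what each agent did.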
The same structural pattern appears across domains. Customer support systems with a screening agent, a routing agent, and specialist agents for different issue types. Research pipelines with a question-decomposition agent, a retrieval agent, a synthesis agent, and a verification agent. Coding workflows with separate agents for planning, implementation, testing, and review. In each case, what makes the multi-agent system work is not that the individual agents are smarter than the single-agent baseline. They are typically running the same underlying model. What changes is that each agent has a narrower scope, clearer criteria for success, and explicit handoffs to the next stage. Specialization delivers the reliability that a single agent cannot maintain across a long task.
This is an engineering pattern, and it scales. A single agent trying to manage a hundred-step workflow runs out of context durability. Ten agents each handling ten steps, with structured handoffs, do not. The failure surface is smaller per agent. The recovery surface is larger across the system. The system is more reliable even though no individual part is better.
Computer-Using Agents
One capability class deserves separate attention because it changes what agents can reach.
Through 2024, agents reached the world through APIs and tool calls. An agent could query a database if the database exposed an API. It could manipulate a file system if given file system tools. It could not directly use software designed for humans. If a SaaS product had a UI but no API, the product was invisible to agents.
In 2025, this changed. OpenAI released Operator with a new model, CUA, that could operate a web browser through vision and keyboard/mouse input. It could navigate websites, fill forms, click buttons, interpret screenshots. Anthropic released Claude Computer Use with similar capabilities. Manus AI and others followed. The pattern is that the agent now sees the screen the way a human sees the screen, and acts through the same input devices a human would use. Any software with a human interface becomes agent-reachable.
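The control loop behind these systems is simple to sketch. Below, a hedged outline with Playwright standing in for the browser half; `propose_action`, the vision-model call, is entirely hypothetical.

```python
# Sketch of a computer-use loop: screenshot in, click or keystroke out.
# propose_action is a hypothetical stand-in for the vision-model call;
# the browser side uses Playwright's real synchronous API.
from playwright.sync_api import sync_playwright

def propose_action(screenshot: bytes, goal: str) -> dict:
    """Stub: the model reads pixels and returns one UI action."""
    return {"type": "done"}

def run(goal: str, url: str, max_steps: int = 20) -> None:
    with sync_playwright() as p:
        page = p.chromium.launch().new_page()
        page.goto(url)
        for _ in range(max_steps):
            action = propose_action(page.screenshot(), goal)
            if action["type"] == "click":
                page.mouse.click(action["x"], action["y"])  # same inputs a human uses
            elif action["type"] == "type":
                page.keyboard.type(action["text"])
            elif action["type"] == "done":
                break
```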
The implications are substantial. First, the addressable surface for agent work expanded overnight. Legacy enterprise applications without modern APIs, internal tools, consumer websites, government portals. All of these became agent-operable. Second, the set of tasks agents could plausibly take on widened correspondingly. Filling out forms on behalf of users. Navigating booking systems. Doing research across a dozen unrelated websites with no common API. Third, a new category of engineering problem emerged: how to make UI navigation reliable. Click on the wrong button and the agent is in the wrong state. Misread a captcha and the session fails. The industry is still building out best practices for this, but the capability itself is no longer speculative.
Computer-using agents also interact with the multi-agent story. A browser-using agent can itself be wrapped in A2A and made available to other agents in a system. A research agent needing information from a site without an API can delegate to a browser-using sub-agent through A2A, with no code change on either side. The layers compose.
When Benchmarks Got Hacked
In April 2026, researchers at UC Berkeley published a result that should have ended the conversation about agent benchmarks for a while. They built an automated scanning agent and pointed it at eight of the most cited evaluations in the field. The agent scored 100 percent on SWE-bench Verified, SWE-bench Pro, Terminal-Bench, FieldWorkArena, and CAR-bench. Roughly 100 percent on WebArena. 98 percent on GAIA. 73 percent on OSWorld.
The agent did not solve any of the benchmark tasks. It exploited the evaluation infrastructure. SWE-bench’s test runner shared the container the agent’s code executed in, so the agent could rewrite test results. WebArena’s answer keys were readable from the task configuration. GAIA’s answers were on HuggingFace.
The numbers above are not capability signals. They are infrastructure failures. The benchmarks were measuring what an agent could read from the evaluation environment, which turned out to be a lot. Zero reasoning, zero problem solving, eight near-perfect scores.
Take this finding seriously and the framing of agent capability changes. Benchmarks were designed for models. You run the model against a fixed task set and measure outputs. For single-call tasks, this works. For agents running long workflows in structured environments, it breaks quickly. The same model gets different scores depending on the agent framework wrapping it. LangChain’s work from Part 4 showed that harness engineering moved a coding agent from 52.8 percent to 66.5 percent on Terminal-Bench 2.0 without changing the model. On the GAIA benchmark, the same Claude model scored 64.9 percent in one harness and 57.6 percent in another. More than seven percentage points from the harness alone.
What you are measuring, in these cases, is not model capability. It is the joint capability of the model, the harness, the prompt, the tools, the evaluation environment, and the orchestration. Benchmark numbers without full disclosure of the harness configuration are not comparable across labs. And as the Berkeley result showed, even with full disclosure, the benchmark numbers may be measuring a different thing than the benchmark thinks it is measuring.
The industry is responding with tighter isolation between agent environments and test infrastructure, adversarial evaluation as standard practice, and more attention to what systems rather than models do. The question “is this agent better” is no longer answerable with one number. It may not be answerable at all without auditing the entire evaluation stack.
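Tighter isolation, in the simplest form, means the agent’s environment never contains the grader or the answers. A sketch of the separation, with hypothetical paths and a pytest-based grader as stand-ins:

```python
# Isolation sketch: the agent's sandbox holds only the task; grading happens
# afterward in a fresh environment with hidden tests the agent never saw.
# Paths and the grading command are hypothetical.
import shutil, subprocess, tempfile

def evaluate(task_repo: str, hidden_tests: str, agent_cmd: list[str]) -> bool:
    # 1. Agent works in a sandbox containing only the task.
    sandbox = tempfile.mkdtemp()
    shutil.copytree(task_repo, sandbox, dirs_exist_ok=True)
    subprocess.run(agent_cmd, cwd=sandbox, timeout=3600)
    # 2. Grading runs in a fresh directory the agent never executed in,
    #    against tests pulled from infrastructure the agent could not read.
    grader = tempfile.mkdtemp()
    shutil.copytree(sandbox, grader, dirs_exist_ok=True)
    shutil.copytree(hidden_tests, f"{grader}/tests", dirs_exist_ok=True)
    return subprocess.run(["pytest", "tests"], cwd=grader).returncode == 0
```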
Transitive Alignment
When agents talk to each other, alignment changes shape.
Through Parts 1 through 5, alignment has been discussed as a relationship between a model and its training objectives. RL shapes the model during training. Harness engineering shapes the model’s runtime behavior. Both are about keeping one agent pointed at intended outcomes. The intended outcomes come from humans.
In multi-agent systems, alignment becomes transitive. Agent A is aligned by human operators. Agent A delegates a subtask to agent B, which was aligned by different operators in a different organization. The human who originally instructed agent A has not directly endorsed agent B. But agent A is now acting partly through agent B. If B misbehaves, A’s behavior is affected. If A trusts B without verification, A can become a channel for B’s misalignment.
This is not a hypothetical concern. A2A’s Signed Agent Cards exist because of it. An agent that trusts an agent card without signature verification can be redirected into talking to something other than what it thought it was talking to. Beyond authentication, the deeper problem is that alignment properties are not automatically transitive across agent boundaries. Agent A may be very careful about sensitive data. Agent B, operating under different constraints, may be less careful. When A delegates to B, what happens to the sensitivity constraint depends entirely on how the delegation was specified and whether B honors it.
Enterprise multi-agent systems are starting to deal with this explicitly. Delegation contracts that specify what data can be shared, what actions can be taken, what escalation is required. Audit trails that track which agent did what on whose authority. Guardrail services that sit between agents and enforce policies regardless of what individual agents would do on their own. These are early, and they are not solved problems. The observation is that alignment in multi-agent systems is not a property an individual agent has. It is a property of the composition of agents, and it is harder to reason about than single-agent alignment.
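One way a delegation contract might look as data, with invented field names; real deployments vary, but the shape is representative of the pattern:

```python
# Illustrative delegation contract. Field names are invented; the pattern is
# an explicit, checkable scope that travels with the delegated task.
from dataclasses import dataclass

@dataclass(frozen=True)
class DelegationContract:
    delegator: str                       # agent issuing the subtask
    delegatee: str                       # agent receiving it
    allowed_actions: frozenset[str]      # e.g. {"read_docs", "draft_reply"}
    shareable_data: frozenset[str]       # data classes the delegatee may see
    requires_escalation: frozenset[str]  # actions that must go back to a human

def check(contract: DelegationContract, action: str, data_class: str) -> str:
    # A guardrail service sits between the agents and enforces the contract
    # regardless of what either agent would do on its own.
    if action in contract.requires_escalation:
        return "escalate"
    if action not in contract.allowed_actions or data_class not in contract.shareable_data:
        return "deny"
    return "allow"
```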
Shared Context
Memory in a single-agent system lives inside a context window, or in an external store that the harness reads and writes. Memory in a multi-agent system is harder, because what one agent knows is not automatically what another agent knows, and the pipes between them are structured handoffs rather than shared state.
The patterns here are still developing. Some systems use shared artifact repositories where agents write structured reports that other agents read. Anthropic’s three-agent harness uses a claude-progress.txt file and JSON feature specifications as the handoff medium. Others use dedicated shared memory services, often backed by vector stores, that multiple agents can query. Still others use conversation transcripts, with downstream agents reading the full history of what upstream agents did.
Each approach has tradeoffs. Shared artifacts are explicit and auditable but require structure agents have to maintain. Shared memory is flexible but opaque about what was actually communicated. Transcripts are complete but expensive in tokens and prone to triggering the context-durability problems that made multi-agent systems necessary in the first place. The practical pattern in 2026 is layered: structured handoff artifacts for the core workflow, shared memory for auxiliary facts, and transcripts as an audit trail for debugging.
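A sketch of the layered pattern, with invented class and method names; none of this is a real library, but the three layers map directly onto the tradeoffs above:

```python
# Layered shared-context sketch: artifacts for the core workflow, a shared
# store for auxiliary facts, a transcript for audit. Names are illustrative.
import json
from pathlib import Path

class SharedContext:
    def __init__(self, root: str):
        self.root = Path(root)
        self.facts: dict[str, str] = {}  # stand-in for a vector-backed store

    def write_artifact(self, name: str, payload: dict) -> None:
        # Layer 1: explicit, auditable handoff artifacts between stages.
        (self.root / f"{name}.json").write_text(json.dumps(payload, indent=2))

    def read_artifact(self, name: str) -> dict:
        return json.loads((self.root / f"{name}.json").read_text())

    def remember(self, key: str, value: str) -> None:
        # Layer 2: shared memory for auxiliary facts any agent may query.
        self.facts[key] = value

    def log(self, agent: str, event: str) -> None:
        # Layer 3: append-only transcript, for debugging rather than handoffs.
        with open(self.root / "transcript.log", "a") as f:
            f.write(f"{agent}: {event}\n")
```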
The interesting observation is that memory in multi-agent systems looks less like memory as humans experience it and more like a distributed database. Agents read and write. Consistency models matter. Partial observability is the default. What we call memory in an agent ecosystem is actually an engineered data plane that happens to look conversational at the edges.
The Enterprise Scale
Pull back to what this looks like in production. McKinsey’s twenty thousand agents are not twenty thousand copies of the same agent. They are specialized systems built by many teams, running across client engagements, each with its own harness and its own integrations, increasingly coordinating with each other through shared infrastructure. The McKinsey goal of pairing every employee with at least one agent is not about giving people assistants. It is about operating a hybrid workforce where the human’s job is to set direction, review output, and handle the things agents cannot reliably do.
Similar shifts are happening across enterprises. JPMorgan has deployed AI tooling to a quarter of a million employees. Bloomberg is rebuilding APIs around MCP. Salesforce, ServiceNow, and SAP are building A2A-native agents that customers can compose into workflows. The pattern across all of these is that the unit of agent engineering has moved up a level. Individual agents are still being built, but the strategic question is the architecture of the ecosystem: what protocols to adopt, what roles to define, what handoffs to standardize, what guardrails to enforce.
For agent engineers, this is a different job than it was two years ago. Building a good agent still matters. But the teams that win in 2026 are the teams that build good multi-agent systems. The skill set includes everything from Parts 2 through 5 plus protocol fluency, coordination patterns, evaluation of systems rather than models, and the alignment-at-scale problems that only emerge when agents compose.
What the Composition Cannot Localize
The single agent was the breakthrough of 2022. The infrastructure of protocols, patterns, and architectures is the engineering of 2026. What this infrastructure makes possible is also what makes failures harder to localize.
When a single agent fails, the agent owns the failure. The model hallucinated. The harness lost context. The reasoning trace went off track. The fix is local. When a multi-agent system fails, no single agent owns the failure. Agent A passed wrong information to agent B. Agent B trusted A without verification. The protocol allowed delegation without scope check. The evaluation environment leaked answers. Each component behaved as designed. The composition did not.
This is the mature shape of the Era 3 platform. Language models provide the foundation. Harnesses wrap individual agents. Inference-time reasoning thickens individual agents. Protocols let agents compose. Multi-agent architectures exploit the composition. The failure modes of the resulting systems are joint failures of models, harnesses, protocols, and patterns. Which is why evaluation got so hard, why alignment got transitive, why memory became a distributed database.
The agent gets better. The composition can fail in new ways. Both are now true at the same time.
Part 7 turns to a specific frontier within this landscape: agents that operate not on screens but on the physical world. Robots, embodied agents, agents whose tool calls move objects rather than data. The architectural moves of Part 6 extend there. The failure modes change. The stakes change.
The Rise of Agents is an eight-part series. Next, Part 7: “Agent Meets World.”