The Rise of Agents, Part 3: The ReAct Moment
The loop was forty years old. The substrate was new. The combination started a new era.
In October 2022, a paper appeared on arXiv: “ReAct: Synergizing Reasoning and Acting in Language Models.” The authors, a Princeton and Google Brain team led by Shunyu Yao, proposed something simple. Have a language model produce interleaved thoughts, actions, and observations. Think about what to do. Do it. See what happened. Think again. Repeat until done.
The paper ran the approach on a few benchmarks. On ALFWorld, a text-based simulation of household tasks, ReAct prompting beat specialized reinforcement learning systems by 34 percentage points of absolute success rate. On WebShop, a simulated online shopping environment, it beat them by 10. The improvements came with one or two in-context examples. No fine-tuning, no policy learning, no reward engineering. Just the loop, a language model, and a problem.
What the paper showed was that language models could act as agents if you let them reason out loud about what they were doing. What the paper did not show, because it was not the paper’s subject, was that the loop it described was older than most of its readers realized. The deliberate-act-observe structure had been sitting in agent architectures for forty years, waiting for something it could run on.
That is the subject of this article. Not ReAct itself, which is now canonical. The earlier loops that failed, the specific change that made the 2022 version succeed, and what the success reveals about what language agents actually are.
The Loop Before ReAct
The basic structure of a reasoning agent is not new. An agent that deliberates about its situation, acts, observes the result, and deliberates again is a structure that the symbolic agent era made explicit and built extensively.
SOAR, developed at Carnegie Mellon starting in the early 1980s, is one version. An agent has a working memory representing the current state, a set of production rules that propose actions, a mechanism for selecting among proposed actions, and a decision cycle that executes the selection and updates working memory. The cycle runs continuously. Each pass through the cycle corresponds to a moment of deliberation and action.
ACT-R, also from Carnegie Mellon, is another version with a different theoretical grounding. PRS, developed at SRI International, and its descendants dMARS (from the Australian AI Institute) and JACK are a third, built around the BDI architecture introduced in Part 1. In all of these, an agent is a loop. State comes in. Reasoning happens. Actions go out. Observations come back. The loop runs again.
The loop was the structure agent researchers agreed on. The disagreements were about what went inside it. What representations, what inference mechanisms, what forms of memory, what architectures for selecting among competing possibilities. Entire subfields debated these. But the loop itself was taken for granted.
Robotics control stacks arrived at the same structure from a different direction. A sense-plan-act cycle, running at whatever frequency the hardware demanded. Sense the environment through cameras and proprioception. Plan the next motion. Execute it. Repeat. The specific architectures varied wildly, but the cycle was the same as the symbolic planners and the cognitive architectures. Different communities, working independently, converged on the same loop because the loop was what the problem demanded.
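The skeleton all of these traditions converged on can be sketched in a few lines. This is an illustrative toy, not any particular architecture: the `Counter` environment, the `deliberate` function, and the `run_agent` name are all made up for the sketch, and the slot `deliberate` fills is exactly where the traditions diverged (production rules, plan libraries, learned policies).

```python
class Counter:
    """Toy environment: an integer the agent can increment."""
    def __init__(self):
        self.value = 0
    def observe(self):
        return self.value
    def act(self, action):
        if action == "increment":
            self.value += 1

def deliberate(goal, state):
    # Trivial stand-in for deliberation: act until the goal count is reached.
    return "increment" if state < goal else None

def run_agent(goal, environment, max_steps=10):
    """The shared skeleton: sense, deliberate, act, sense again."""
    state = environment.observe()
    for _ in range(max_steps):
        action = deliberate(goal, state)
        if action is None:             # deliberation decided the goal is met
            break
        environment.act(action)        # change the world
        state = environment.observe()  # see what happened, loop again
    return state
```

Swap the body of `deliberate` and you move between traditions without touching the loop, which is the point: the cycle was settled; its contents were not.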
The loop worked. What ran inside it did not.
Symbolic loops failed because the representations inside them could not scale to the real world. The frame problem, from Part 1. The loops worked on toy problems and specific domains. They broke down in open environments. Not because the loop structure was wrong, but because the ingredients the loop had to work with were insufficient. A SOAR agent could reason beautifully about blocks-world stacking and completely fail at a kitchen task it had not been hand-modeled for. A BDI agent’s plan library was only as rich as the set of plans a human had written for it. The loop could think, but it could only think with what had been put inside it.
Learning agents mostly did not have explicit loops at all. Reinforcement learning systems had a policy that mapped observations to actions, trained end-to-end to maximize reward. The deliberation, such as it was, happened inside the neural network weights, invisibly. An RL system acting in an environment looks, from the outside, like a fast loop of observation-action-observation-action. But there is no explicit moment where the agent considers what to do. The “thinking” is folded into the policy.
This worked for tasks where the policy could be learned from interaction. It failed everywhere else. Any attempt to use RL for open-world tasks inherited the Era 2 failure: no prior knowledge, no pre-existing reasoning capability, nothing to fall back on when the trained policy encountered something outside its training distribution.
By the early 2020s, the loop had two failing traditions. The symbolic tradition, which had the right structure but the wrong representations. The learning tradition, which had the wrong structure but powerful representations. Neither had the combination.
This is worth sitting with. The loop itself was not in dispute. Forty years of agent research had converged on think-act-observe as the correct skeleton of an agent. The problem was that the skeleton needed muscle and blood to function, and neither tradition had figured out how to supply both. Symbolic systems could think carefully about what they represented but could not represent enough. Learning systems could absorb enormous amounts of data but could not think carefully about what they had absorbed. Both traditions knew this was the problem. Neither had a path to solving it.
What the field needed was a way to get rich representations into something that could reason over them in the loop. That is what a pretrained language model is. The representations are in the weights, implicit and enormous. The reasoning happens through the model in every forward pass. Put this inside the loop, and suddenly both sides of the old dichotomy are satisfied at once.
The 2022 Change
What ReAct showed was that the old loop, applied to a pretrained language model, worked.
Not “worked better than before.” Worked in a way that had no precedent. A frozen language model, prompted with a few examples of the think-act-observe pattern, could solve tasks that reinforcement learning systems specifically designed for those tasks could not solve. A generic loop on a generic model outperformed specialized approaches with years of tuning behind them.
The reason is the one Part 2 ended on. The loop did not change. The substrate did.
Every step in the ReAct loop runs through a language model. The thought step is the model producing natural language about the current situation, what the goal is, and what might be done next. The action step is the model producing a structured command, typically a tool call. The observation step is the environment’s response, handed back to the model as more text. The next thought step happens with all of this in context.
A concrete trace makes the pattern legible. The ReAct paper’s HotpotQA examples look roughly like this. The question is asked. The model thinks: to answer this I need to find out X, and I can search Wikipedia for X. The model emits an action: Search[X]. The environment returns a Wikipedia snippet. The model thinks: this tells me Y but not Z, I should search for Z. The model emits another action: Search[Z]. The loop continues until the model thinks: I now have enough to answer, and emits Finish[answer]. The reasoning is legible. The actions are legible. The observations are legible. The whole trajectory can be read by a human and understood.
What makes the loop work is that every step benefits from what the language model already knows. The thought is grounded in the model’s understanding of the world, its implicit knowledge of planning, its training-derived familiarity with how humans handle problems like the one at hand. The action is chosen based on the model’s knowledge of which tools exist, what they do, and when each is appropriate. The observation is interpreted with the model’s understanding of what the response means.
A symbolic agent running the same loop would hit the frame problem. Its representations would be too thin to support the thought step. An RL agent would have no native capacity for the thought step at all. The language model brings everything the earlier agents lacked. The loop does nothing new. The loop’s contents do everything new.
This is why ReAct was a moment rather than an invention. The loop had existed. The model had existed. What had not existed was the observation that you could just put them together and have the thing work.
Why the Loop Mattered
There is a subtle point about why the loop matters at all, given that modern language models can do so much without one.
A language model without a loop is a text generator. You give it a prompt. It produces tokens. It stops. Whatever reasoning it does is internal, compressed into the forward pass that turns input tokens into output tokens. The model can solve problems by reasoning about them in this compressed way, and with chain-of-thought prompting it can extend its reasoning across multiple output tokens. But the reasoning happens inside one generation, not across multiple turns of engagement with the world.
The loop breaks generation into turns. Within a turn, the model can reason. Between turns, the environment responds. The model’s reasoning on turn N has access to what happened on turns 1 through N-1. If an action produced an unexpected result, the next reasoning step knows. If a tool returned an error, the next reasoning step knows. If a plan needs to be revised, revision is possible because reasoning is happening inside a loop that keeps going.
The hallucination case is the clearest example. A language model asked a factual question it does not know will often produce a plausible-sounding but wrong answer. There is no mechanism inside a single forward pass to distinguish knowing from confabulating. The model generates whatever tokens the distribution favors, and for questions at the edge of its knowledge, the distribution favors something that sounds right. Chain-of-thought reasoning makes this worse as often as better: the model reasons confidently through steps that are individually hallucinated, compounding the error.
The loop breaks the pattern. The model in a ReAct loop can decide to look something up before answering. The action is Search[X]. The observation is what the world says about X. The next thought step is informed by what the world said, not by what the model thought the world might say. Hallucinations still happen, but now they happen against a backdrop of actual data the model is being asked to integrate. The correction mechanism is not perfect, but it is a structural fix for a structural problem: a stateless generator checking its own claims against something that is not itself.
Without the loop, there is no opportunity to revise. The model produces a plan, or an answer, or an action, and that is the output. Any mistake is propagated. Any missing information is hallucinated. The ReAct paper’s original argument was partly about this. Chain-of-thought alone is vulnerable to propagating errors in its own reasoning. The loop, because it involves checking the world between reasoning steps, is a correction mechanism.
A language agent is what you get when you put a language model inside this correction mechanism. The mechanism is old. The model is new. The combination is what the field has been building on ever since.
What Came Next
The 2022 paper was the moment of recognition. What followed was the rapid build-out of everything the moment enabled.
Prompt scaffolds that implemented the ReAct pattern became standard. Frameworks like LangChain and LlamaIndex emerged to make the pattern easier to deploy. Tool-calling conventions, which started as ad-hoc prompt engineering, became protocols, most visibly MCP. Agent loops got more elaborate: separate reasoning and planning modules, explicit memory systems, multi-agent architectures where different agents played different roles in the overall loop.
The post-training moves covered in Part 2 started targeting the loop directly. Reasoning models trained to think longer within a turn. Agentic post-training shaping behaviors like tool use, error recovery, and goal persistence across turns. The loop went from prompt-engineered to trained-in.
By 2026, the industry has converged on what the ReAct paper’s structure looks like in production. A language model, post-trained for agentic behavior, running inside a harness that manages tool calls, memory, and error recovery. The thought-action-observation cycle is still recognizable. But it now sits inside infrastructure that did not exist in 2022, and the next four articles in this series are about what that infrastructure looks like.
Part 4 covers the harness itself. Part 5 covers what happens when reasoning moves from prompt-time to inference-time. Part 6 covers what happens when multiple agents run loops together. Part 7 covers what happens when the loop extends beyond text into the physical world.
What the Moment Revealed
Two things worth holding onto from the ReAct moment.
First, the continuity with earlier agent research is real, not rhetorical. The agent researchers of the 1980s and 1990s were not wrong about what an agent is. They were right about the structure. What they lacked was the substrate. The field spent decades perfecting the loop with ingredients that could not support it, then had to wait for ingredients that could. The current agent era is a continuation of that earlier work, not a break from it.
Second, the loop is still the loop. Modern harnesses are elaborate. Multi-agent architectures are elaborate. The infrastructure around language agents is elaborate. But the core structure is still deliberate-act-observe-deliberate. Every production agent today runs this cycle. The complexity is in what each step does and how the environment is managed between them. The shape of agent operation has not changed.
What has changed, and what the series will trace from here, is what this loop can be made to do when you keep pushing on it. Better harnesses. Longer reasoning. More agents. More environments. The substrate keeps improving. The loop keeps scaling. What happens at the limit of that scaling is the question the series is built toward.
A Quiet Pivot
The ReAct moment was quiet, in the way that many pivotal moments are quiet. A paper on arXiv, a few benchmark improvements, a few lines of code demonstrating the approach. Within months it was everywhere. Within two years it was the foundation everyone was building on.
The paper’s contribution was not the loop. The loop was ancient. The paper’s contribution was the observation that the loop now had something to run on, and the demonstration that when it did, it worked. Part 4 looks at what the field built once this observation sank in. If Part 3 is about the moment the loop finally worked, Part 4 is about everything that started being built to make it work better.
The Rise of Agents is an eight-part series. Next, Part 4: “The Harness.”