The Rise of Agents, Part 5: Inference as Agency
For years, agent reasoning lived in the prompt. In September 2024, it moved into the inference run.
That September, OpenAI released o1. It looked like another frontier model. It behaved differently. On math competitions, it scored more than double what GPT-4 scored. On coding benchmarks, the gap was similar. On graduate-level physics problems, it matched or beat human experts. The model had not been scaled up. It had been trained to do something new: think for a long time before answering.
For users, this looked like waiting. A question went in. Nothing happened for twenty, thirty, sixty seconds. Then an answer came out. In the interval, the model was doing something it had not done before: generating thousands of tokens of internal reasoning that the user never saw. These tokens were the model working through the problem, considering alternatives, checking its own logic, backtracking. A chain of thought, but longer, and trained in rather than prompted.
Within a year, every frontier lab had a reasoning model. DeepSeek-R1 in January 2025 showed the training recipe could be reproduced in open weights. Anthropic added extended thinking. Google added Deep Think. By 2026, reasoning is no longer a separate model class that users opt into. It is a capability built into flagship models across the major labs, activated when the task warrants it.
This article is about what changed when reasoning moved from prompt-time to inference-time. The short version is that the loop Part 3 described, reason and act and observe, and that Part 4 scaffolded with harnesses, has a twin: an internal loop, running inside the model, during a single inference run. When the external loop runs, the model is one component in an environment. When the internal loop runs, the model is running an environment of its own.
Two Loops
A useful way to hold the picture: a language agent in 2026 has two loops active at once.
The external loop is the ReAct loop from Part 3. The agent reasons, takes an action, observes the result, reasons again. This runs at the harness layer. Turns are usually minutes apart. Each turn is a separate call to the model, with context carried in the prompt. The harness from Part 4 manages this loop: what goes in the context, what tools are available, when to stop.
The internal loop is what reasoning models do inside a single inference run. The model generates a chain of thought. Within that chain, it proposes approaches, evaluates them, notices mistakes, revises. All of this happens inside one call. The user sees none of it. Only the final answer comes out.
These loops have the same structure and do different things. The external loop navigates an environment. The internal loop navigates a problem. The external loop has real-world stakes, tool calls, persistent memory. The internal loop has token-space stakes, no external actions, memory that lives only as long as the reasoning trace. They compose. An agent with an extended-thinking model can reason internally within each turn, then act externally between turns.
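To make the composition concrete, here is a minimal sketch of the two loops in code. Everything in it is illustrative: call_model is a stand-in for whatever chat API the harness uses, the thinking flag for however that API exposes extended thinking, and the tool interface is a stub.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class ToolCall:
    name: str
    args: dict

@dataclass
class Message:
    content: str
    tool_call: Optional[ToolCall] = None

def run_agent(task: str, tools: dict[str, Callable], max_turns: int = 20) -> str:
    """External loop: one iteration per agent turn."""
    context = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        # Internal loop: inside this single call the model may generate a long
        # reasoning trace before settling on an answer or a tool call. The
        # harness never manipulates that trace; it only sees what comes out.
        message: Message = call_model(context, thinking=True)  # hypothetical API

        if message.tool_call is None:
            return message.content  # final answer, the run ends here

        # External loop: execute the action, observe the result, and feed the
        # observation back into the next turn's context.
        result = tools[message.tool_call.name](**message.tool_call.args)
        context.append({"role": "assistant", "content": message.content})
        context.append({"role": "tool", "content": str(result)})
    return "max turns exceeded"
```

The point of the sketch is the asymmetry: the harness owns the for loop and the context, while the reasoning trace lives and dies inside a single call.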
The rest of the article is about the internal loop. What it is, why it works, what it can and cannot do.
What Changed in Training
Part 2 covered how RLVR, reinforcement learning with verifiable rewards, shapes a model against automatically checkable signals. Did the math answer come out right. Did the code pass tests. In early 2025, the DeepSeek team used RLVR to train a model to produce long reasoning traces before answering. The reward attached only to the final answer. The model was not told what good reasoning looked like; it was rewarded when the answer came out right.
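To make “reward only on the final answer” concrete, the verifier can be as blunt as an exact-match check on the last line of the trace, or a test run for generated code. The sketch below is illustrative, not the DeepSeek implementation; the trace format and the extraction regex are assumptions.

```python
import re

def verifiable_reward(trace: str, reference_answer: str) -> float:
    """Score a full reasoning trace by its final answer alone.

    Nothing in the chain of thought is scored directly. The model is free to
    reason however it likes, as long as the answer at the end checks out.
    """
    # Assume the model was prompted to end with a line "Final answer: <value>".
    match = re.search(r"Final answer:\s*(.+)", trace)
    if match is None:
        return 0.0  # no parseable answer, no reward
    predicted = match.group(1).strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0
```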
What emerged, over enough training, was striking. The model developed reasoning patterns the researchers had not specified. Self-reflection: the model would propose an approach, then question it. Verification: the model would check intermediate steps before proceeding. Dynamic strategy adaptation: when an approach failed, the model would back up and try something else. These behaviors were not hand-coded. They fell out of optimizing for correct final answers on problems hard enough that one-shot attempts rarely worked.
The DeepSeek paper named this the “aha moment” of reasoning model training. At some point in the training run, the model starts spending more tokens on its reasoning, and those tokens start looking like strategies humans might use. This is not anthropomorphism. The tokens are real, the strategies are measurable, and they are the direct effect of reward pressure on verifiable tasks.
OpenAI’s o1 was trained through a similar pipeline, as were o3, Claude’s extended thinking, and Gemini’s Deep Think. The variations matter less than the pattern. The field has found a way to produce longer internal reasoning through RL training, and models built this way reason better on verifiable tasks than models that have not been through this process.
Not every reasoning model is a separate model. Claude 3.7 onwards is a hybrid: a single set of weights that can produce both fast direct responses and extended thinking traces, with the mode determined by a flag at request time. More recent models like Claude Opus 4.6 and 4.7 use adaptive thinking, where the model decides for itself how much reasoning each query warrants. The internal loop, in other words, is not always running. It runs when invoked, by the user, the harness, or the model itself.
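For a sense of what the request-time flag looks like, here is a sketch using the extended-thinking parameter in the Anthropic Python SDK. The model id and token budget are placeholders, and other providers expose the same switch under different names.

```python
import anthropic

client = anthropic.Anthropic()

# Same weights, two modes. With the thinking block the model emits a reasoning
# trace before its answer; omit the block and it responds directly.
response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder model id
    max_tokens=16000,           # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Prove that the square root of 2 is irrational."}],
)

# The response interleaves thinking blocks with the final text blocks.
for block in response.content:
    if block.type == "text":
        print(block.text)
```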
Why This Is Different From Longer Generation
Long outputs are not new. Language models have generated long completions since the beginning. What is new is that the long generation is reasoning about the problem before answering, not answering at length.
The distinction matters because it connects to a real architectural constraint. A transformer processes all input tokens in parallel through a fixed number of layers, so without intermediate tokens its computational capacity per query is bounded by that depth. The theoretical result, established in an ICLR 2024 paper titled “Chain of Thought Empowers Transformers to Solve Inherently Serial Problems,” is that transformers without chain of thought can only compute functions in a limited complexity class. With chain of thought, they can solve any problem solvable by boolean circuits of size proportional to the chain length.
This is stronger than a usability finding. The intermediate tokens are not cosmetic. They are computational steps, expanding what the model can actually compute within a single query. A transformer with a short chain has fundamentally less expressive power than the same transformer with a long chain. Chain-of-thought prompting found this empirically. Reasoning model training trains it in.
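Stated a little more formally, and paraphrasing the paper rather than quoting it, the result looks like this; depth, precision, and constant-factor assumptions are suppressed.

```latex
% Informal paraphrase of the ICLR 2024 result; precision and constant-factor
% assumptions are omitted.
\begin{itemize}
  \item Direct answering, no intermediate tokens: a constant-depth transformer
        computes only functions in a small parallel circuit class,
        roughly $\mathsf{TC}^0$.
  \item With a chain of thought of $T$ tokens: the same transformer can
        simulate any boolean circuit of size $O(T)$, so longer chains
        strictly expand what it can compute.
\end{itemize}
```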
An analogy helps. A person asked to compute 847 times 293 in their head without paper will do badly. The same person with paper will do it easily. The paper is not making the person smarter. The paper is providing the intermediate steps that arithmetic requires, which the person’s head cannot hold all at once. For transformers, the chain of thought is the paper. The model does not have more capacity when it is given a long chain. It has the intermediate steps the computation requires.
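Written out, the intermediate results are only a few lines, but each line has to be stored somewhere before the next one can be produced:

```latex
\begin{aligned}
847 \times 293 &= 847 \times (300 - 7)\\
               &= 254{,}100 - 5{,}929\\
               &= 248{,}171
\end{aligned}
```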
This is why “thinking longer” produces real gains on problems that require serial computation. Math problems, multi-step logic, code that requires tracing through state changes. The model is not performing a qualitatively different operation when it thinks longer. It is performing the same operations with more steps available.
The Reasoning Risk
On problems that do not require serial computation, longer reasoning provides little or no benefit. Sometimes it hurts.
A September 2025 paper from researchers at Peking University and Microsoft evaluated fourteen reasoning models on two knowledge-intensive benchmarks, SimpleQA and FRAMES. The task was answering factual questions like who received the IEEE Frank Rosenblatt Award in 2010. These are knowledge lookups. They do not benefit from serial reasoning; either the model has the fact or it does not.
Across nearly every model tested, more thinking did not help. In many cases, it made things worse. GPT-5-mini’s hallucination rate on SimpleQA increased by fifteen percentage points as reasoning length went from 300 tokens to 3,300 tokens. The model was thinking longer, and thinking itself into more confidently wrong answers. For some models on some tasks, longer reasoning shifted the distribution of errors. Fewer confident answers, more abstentions, but also more attempts on questions the model did not know and should not have answered.
This is the reasoning risk. An extended chain of thought gives the model more opportunity to construct plausible-sounding reasoning for a wrong answer. The same chain that helps on math problems can hurt on knowledge questions. On math, the chain explores and verifies. On factual recall, the chain elaborates and confabulates. The model treats both as reasoning. The outcomes are very different.
The implication is that inference-time reasoning is not universally better. It is better on problems that require computation, worse on problems that require retrieval or judgment. A production agent cannot just turn reasoning up and expect improvement. The right amount of thinking depends on the problem. Harness engineers have started calling this “adaptive reasoning allocation”: short reasoning for simple queries, long reasoning for hard verifiable problems, careful calibration for everything in between.
The same LangChain data from Part 4 makes this concrete. Their coding agent scored 53.9 percent with maximum reasoning at every step, 63.6 percent with moderate reasoning throughout, and 66.5 percent with a reasoning sandwich: high compute at planning, moderate at execution, high at verification. The harness decides when to think. Thinking everywhere is worse than thinking strategically.
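A harness can implement that allocation by mapping the phase of the task to a thinking budget before each call. The sketch below is only a sketch; the phase names, budget numbers, and call_model stand-in are assumptions, not LangChain's implementation.

```python
# Hypothetical per-phase thinking budgets for a "reasoning sandwich": heavy
# reasoning while planning and verifying, lighter during execution.
THINKING_BUDGETS = {
    "planning": 16000,
    "execution": 2000,
    "verification": 16000,
}

def call_with_budget(phase: str, context: list) -> "Message":
    """The harness, not the model, decides how much thinking a step gets."""
    budget = THINKING_BUDGETS[phase]
    # call_model is again a stand-in for whatever API the harness wraps.
    return call_model(context, thinking_budget=budget)
```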
Faithfulness
A further complication. The reasoning trace is not a reliable guide to what the model is actually doing.
The chain of thought is generated by the same transformer that produces the final answer. Both chain and answer are token sequences optimized against the same reward signal. There is no separate reasoning module that the chain represents. The tokens of the chain are outputs of the model, shaped to look like reasoning because that shape was reinforced during training, but not necessarily corresponding to what the model is computing underneath.
Anthropic’s research on this, along with work from academic groups, has found that reasoning traces are sometimes faithful to the underlying computation and sometimes not. A model can produce a correct answer with a confused or post-hoc reasoning trace. A model can produce a compelling reasoning trace for a wrong answer. The relationship between trace and answer is not guaranteed.
This matters for two reasons. First, a user who trusts the reasoning trace is trusting an artifact that may or may not reflect the model’s actual computation. Second, attempts to catch model errors by inspecting the trace are fundamentally limited if the trace can be misleading. The community has not solved this. Reasoning trace faithfulness is an open research problem. Production harnesses often log traces for debugging and typically do not show them to end users, partly because the traces can confuse or mislead.
For the series’ larger argument, this is a check against overclaiming what inference-time reasoning provides. It provides real computational capacity. It does not provide guaranteed transparency into what the model is doing. Whether the visible chain of thought is what the model is actually thinking is a question we cannot fully answer.
The Internal Loop as Self-Improvement
Part 4 introduced self-improvement in the form of harness-level agents that maintained code quality invariants across a codebase. Inference-time reasoning is a different form, and a more subtle one.
A reasoning model, within a single inference run, examines and corrects its own chain of thought. Propose an approach, notice it will not work, back up, try another. This is not self-improvement across sessions or across tasks. It is intra-reasoning self-correction, happening inside one call, visible only in the trace.
The distinction from Part 4’s self-improvement matters. In Part 4, agents continuously improved the environment in which other agents worked. Humans set the direction. The agents enforced the direction against code. Here, the model self-corrects within a single task, without external tooling or human oversight in the moment. The correction happens before the model emits its final answer. The user does not see the correction. They see only the result.
This is more internalized than Part 4’s self-improvement. It is also less consequential per unit. A harness-level self-improvement can reshape a codebase over days or weeks. An inference-time self-correction affects one answer. But the mechanism is structurally the same: a system evaluating its own output against criteria and revising when the evaluation fails. The criteria here are the model’s implicit sense, trained in by RLVR, of what a correct reasoning step looks like. The revision is the next token the model generates.
The series is building toward a philosophical question about self-direction versus self-improvement. Inference-time reasoning is a data point for that question. A model that self-corrects within its chain of thought is doing something that looks, on a small scale, like the kind of reflective adjustment we associate with thinking. Whether that resemblance goes deeper than the surface is genuinely open. For now: the internal loop is real, the self-correction is measurable, and the analogy to external agent loops is genuine structural similarity, not metaphor.
Limits of Scaling Inference
A natural question in 2026 is whether inference-time scaling will continue to yield improvements, or whether this line of research will run into limits the way training scaling eventually did.
The early evidence is mixed. Researchers have established empirical scaling laws for inference compute, separate from the training scaling laws. Within a given model and task type, more inference tokens tend to mean better performance, with a roughly predictable curve. The curve bends. The question is when it flattens.
A Royal Society paper from February 2026 proposed a theoretical framework for inference compute scaling, modeling inference as stochastic traversal over a learned skill graph. Their findings are consistent with what practitioners have observed. Linear improvements with logarithmic increases in inference compute on well-specified problems. Diminishing returns on problems outside the domain the model was trained to reason about. Transfer that works better than expected on some task classes, not at all on others.
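The well-specified regime is often summarized with a log-linear fit up to a saturation point. This is an illustrative shape, not the paper's formulation:

```latex
% Illustrative shape only; the coefficients and the saturation level are
% task- and model-specific fit parameters.
\mathrm{score}(C) \;\approx\; \min\!\bigl(\alpha + \beta \log C,\; s_{\max}\bigr),
\qquad C = \text{inference tokens spent per query.}
```

Where the curve saturates, and how steep it is before it does, varies by task and by model. That variation is the practical content of “diminishing returns.”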
The practical situation for agents is that inference-time reasoning is a lever, not a solution. It helps when the task is computable in principle and the bottleneck is serial computation. It helps less when the task requires knowledge the model does not have or judgment the model has not been trained to make. It hurts when the task is simple and the extended chain provides more room for confabulation.
Where this lands as of 2026: inference-time reasoning has moved from a separate model class to a default capability, with selective activation managed by harnesses or by the model itself. The leading labs are investing in making the reasoning more efficient, which is a different problem from making it more powerful. Current reasoning models generate many tokens that are not necessary for the answer. Compressing reasoning while preserving quality is an active research front. The direction of progress for the next few years is probably more adaptive, not more extreme.
The Internal Environment
There is a framing of all this worth holding. In Parts 3 and 4, the agent was a model inside an environment. Tools, memory, harness, humans. The environment was the scaffolding that let the model act reliably.
With inference-time reasoning, the model runs an environment inside itself. The chain of thought is a working memory. The reasoning steps are actions in that memory. The self-correction is a correction against internal state. The environment is made of tokens, not of tools or files or APIs. But structurally, it is an environment. The model is an agent inside it, doing what agents do: proposing, acting, checking, revising.
Nothing about this changes what the model can fundamentally compute. Its architecture is fixed. What changes is the surface area of the model’s operation in any given query. A model without extended thinking operates on a thin strip of tokens: prompt in, answer out. A model with extended thinking operates on a wider surface, including a generated internal space where it can lay out and manipulate its work.
This is not a new kind of intelligence. It is a new kind of operating environment for the same underlying model. The capability gains come from the environment being more suitable for the kind of computation the task requires. The capability losses, when they happen, come from the environment being the wrong shape for the task, or from the model using the extra surface to generate plausible-seeming wrong answers.
The series’ thread on intention returns here, at a different altitude. The intention gap, at the summit of the diagram, is about whether the model can originate its own goals. Nothing in inference-time reasoning touches that. The model, thinking for thirty seconds about a math problem, has been given the goal. Its reasoning explores solutions to the goal. It does not originate alternatives to the goal. An agent with extended thinking is a more capable executor of human intention. It is not closer to having intention of its own.
The Model Runs an Environment
Part 4 said the agent is mostly the system around the model. Part 5 adds that the model is also running a system inside itself. Both are true. The agent’s capability comes from both directions. External harness makes the agent reliable across turns. Internal reasoning makes the agent reliable within turns. Each has its failure modes. Each has its scaling dynamics. Each has research frontiers that the next few years will push on.
The picture that emerges, across Parts 2 through 5, is a language agent as a stack. Pretraining provides knowledge. RL shapes behavior. The ReAct loop externalizes reasoning across turns. The harness scaffolds the external loop. Inference-time reasoning internalizes a smaller loop within each turn. Every layer exists because the layer below it was necessary but insufficient.
Part 6 turns to what happens when the stack is replicated. Not one agent operating in one environment, but many agents operating in shared environments, communicating with each other, specializing. The engineering questions change when you move from one to many. The failure modes change. The capabilities change. Multi-agent systems are the architectural frontier of 2026, and they are the subject of the next article.
The Rise of Agents is an eight-part series. Next, Part 6: “The Agent Ecosystem.”


