The Rise of Agents, Part 2: What Language Agents Inherited
Language agents did not replace reinforcement learning. They absorbed it.
In 2016, reinforcement learning looked like it might eat the world. AlphaGo beat Lee Sedol. Robotics papers were using RL to solve tasks that had defeated the field for decades. The narrative was: this is how agents will be built.
A decade later, the world is language-shaped. The agents that get funded and shipped are built around pretrained language models, not policy networks trained from pixels. The canonical Era 2 results have become historical artifacts.
The standard story of what happened in between is a story of replacement. Language models arrived, solved problems RL could not, and RL faded from agent research. This story is clean, easy to tell, and wrong.
Reinforcement learning did not fade from agent research. It moved. Every language agent in production today runs on a pretrained model that was then shaped by reinforcement learning. Every reasoning model that chains thoughts across inference time was trained with RL on the chains. Every coding agent that stays on task through long runs was post-trained with RL on the behaviors we call agentic. RL is not beneath language agents. It is inside them.
This article is about that inheritance. What RL did in Era 2, what it does in Era 3, and what the difference between those roles reveals about what language agents actually are.
What RL Did in Era 2
A brief reminder of the shape of RL before the language era.
In Era 2, the paradigm runs like this. You have an agent in some environment. The environment gives the agent observations and accepts actions. The agent’s job is to discover a policy, a mapping from observations to actions, that maximizes a reward signal over time. The agent starts knowing nothing. Through trial and error, guided by reward, it learns to act.
This is elegant and astonishingly general. The same algorithmic framework produces AlphaGo, robotic locomotion, and Atari game play. The agent does not need to be told how the world works. It discovers what works through interaction.
But the paradigm makes strong assumptions. The environment must give feedback. The reward signal must correlate with what you actually want. The state space, however large, must be tractable enough that trial and error can explore it. Most of all, the task must be specified at a level the algorithm can operate on. You do not ask a pure RL agent to book a flight. You ask it to minimize a loss over a trajectory in an explicit state space with explicit actions and explicit rewards.
In environments where these assumptions hold, RL is superhuman. In open worlds, where they do not, RL has nothing to start from. This is the second wall, and Part 1 covered it.
What matters for Part 2 is the next part of the story. The second wall did not fall because the field abandoned RL. The second wall fell because the field found a way to give RL something to start from.
The 2022 Synthesis
The breakthrough that defines Era 3 is not the language model alone. GPT-3 existed in 2020 and did not produce agents. It could complete text in impressive and often useful ways, but ask it to follow instructions, adopt a persona, or refuse to generate harmful content, and it would do so unreliably or not at all. The behaviors that make current agents useful, following instructions, refusing certain requests, staying on task, were not latent in scale alone. Something else had to happen.
What defines Era 3 is the combination of a pretrained language model with reinforcement learning from human feedback.
InstructGPT, published by OpenAI in early 2022, is the template. Take a pretrained language model, which has absorbed the shape of human text without any particular behavioral goal. Collect comparisons from human raters: given two responses, which do you prefer? Train a reward model on those comparisons. Use RL to fine-tune the language model so its outputs score well on the reward model.
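The reward-model step can be sketched numerically. A minimal version of the standard Bradley-Terry preference loss, assuming scalar scores from a hypothetical reward model (numpy only; the real pipeline trains a neural network to produce these scores):

```python
import numpy as np

def preference_loss(r_chosen, r_rejected):
    """Bradley-Terry loss on one pair of reward-model scores.

    The reward model is trained so the human-preferred response
    scores higher: loss = -log sigmoid(r_chosen - r_rejected).
    """
    return -np.log(1.0 / (1.0 + np.exp(-(r_chosen - r_rejected))))

# If the model already ranks the pair the way the rater did, the loss
# is small; if it ranks them backwards, the loss is large.
print(round(preference_loss(2.0, -1.0), 3))  # ~0.049
print(round(preference_loss(-1.0, 2.0), 3))  # ~3.049
```

Minimizing this over many comparisons gives a scalar judge of "which response do humans prefer," which the RL stage then optimizes against.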
The result is a model that behaves differently from the raw language model. It follows instructions. It refuses certain requests. It adopts the voice of a helpful assistant. The base model had all of these behaviors latent in the distribution of human text. RL pulled a specific behavioral pattern out of the distribution and made it dominant.
This is the synthesis. Pretraining provides the representations: the shape of human knowledge, the structure of language, the implicit model of how humans reason. RL provides the shaping: which behaviors, among all those the model could exhibit, it should exhibit.
Neither alone produces what we recognize as a language agent. A raw pretrained model will complete text in whatever direction seems statistically plausible. It will not follow instructions reliably. It will not refuse harmful requests. It will not act like an assistant. A reinforcement-learned system without pretraining cannot reason about open worlds at all, because it has no representations to reason with. The combination produces something new.
This is the inheritance. Not the algorithm itself, although that matters. The capacity to shape behavior on top of pretrained representations. A way to move a model from “does whatever is statistically likely” to “does what we ask in the ways we want.”
What RL Does Inside Language Agents
RL’s role inside language agents is not a single job. It has at least three, and the third is newer than most industry observers realize.
The first is alignment. This is what InstructGPT did, what Constitutional AI does, what RLHF and its descendants do in every modern language model pipeline. The model is trained to prefer helpful, honest, harmless responses over the alternatives. Anthropic’s RLAIF uses AI-generated feedback in place of human labels, which lets the technique scale. Direct Preference Optimization skips the reward model and optimizes preferences directly. These are variations on the same move: shape a language model’s behavior toward preferred outputs using learned or collected preferences.
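The "skip the reward model" move has a compact form. A sketch of the published DPO loss for one preference pair, in numpy, with illustrative log-probabilities (in practice these come from the policy and a frozen reference copy of it):

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    No explicit reward model: the implicit reward is beta times the
    policy's log-probability ratio against a frozen reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    return -np.log(1.0 / (1.0 + np.exp(-beta * margin)))

# The loss falls as the policy raises the chosen response's probability
# relative to the reference, and rises if it drifts the other way.
print(dpo_loss(-5.0, -9.0, -7.0, -7.0) < dpo_loss(-9.0, -5.0, -7.0, -7.0))
```

Same move as RLHF, shaping behavior toward preferred outputs, with the reward model folded into the objective.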
The second is reasoning. In late 2024, OpenAI released o1, a language model trained to produce extended chains of thought before producing answers. DeepSeek-R1 followed in January 2025 and showed the technique could be reproduced in open weights. The approach has come to be called Reinforcement Learning with Verifiable Rewards, or RLVR. Instead of training against human preferences, RLVR trains against automatically checkable signals: did the math answer come out right, did the code compile and pass tests. The reward is cheap and accurate, which means the training can run at scale.
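A verifiable reward needs no human in the loop. A toy version for math problems, assuming model outputs end with a final "answer:" line (the format and function name are illustrative, not from any lab's pipeline):

```python
def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the trace's final answer matches the ground truth.

    Unlike a learned reward model, this signal is cheap and exact,
    which is what lets RLVR-style training run at scale.
    """
    for line in reversed(model_output.strip().splitlines()):
        if line.lower().startswith("answer:"):
            return 1.0 if line.split(":", 1)[1].strip() == ground_truth else 0.0
    return 0.0  # no parseable answer counts as wrong

trace = "Let x = 3 * 4.\nThen x + 2 = 14.\nanswer: 14"
print(verifiable_reward(trace, "14"))  # 1.0
```

Production checkers are more careful (answer normalization, running unit tests in a sandbox), but the shape is the same: a binary, automatically computed signal.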
The result is a new category of model, sometimes called a large reasoning model. The architecture is the same as that of the language models these reasoning models are built from. The training recipe differs. A base model is exposed to verifiable problems, generates multiple reasoning traces, and is reinforced for traces that arrive at correct answers. Over enough training, the model develops what the DeepSeek paper calls emergent reasoning patterns: self-reflection, verification, dynamic strategy adaptation. These are not hand-coded. They fall out of rewarding correct final answers on problems hard enough that naive approaches do not suffice.
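The reinforce-the-correct-traces step can be sketched too. DeepSeek-R1's recipe uses a group-relative scheme (GRPO): sample several traces for the same problem, score each with the verifiable reward, and weight each trace by how it did relative to its group. A minimal numpy sketch of that advantage computation:

```python
import numpy as np

def group_advantages(rewards):
    """GRPO-style advantages: each trace's reward, normalized against
    the group of traces sampled for the same problem."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Four traces for one problem: only the last two reached the right answer.
# Correct traces get positive weight (pushed up), incorrect negative.
adv = group_advantages([0.0, 0.0, 1.0, 1.0])
print(adv)
```

No value network, no learned baseline: the group itself is the baseline, which is part of what makes the training cheap enough to run at scale.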
Chain-of-thought prompting asks a model to reason step by step. Chain-of-thought training teaches a model that reasoning step by step pays off. The difference is the difference between a hint and a habit. A prompted model can produce chain-of-thought output, but whether it actually reasons through the chain or hallucinates a plausible-looking one depends on luck. A trained model has been shaped, over thousands of RL steps, to treat extended reasoning as the default approach to hard problems. The reasoning is not always correct. But it is no longer optional.
The third is agentic behavior. Coding agents, web-browsing agents, tool-using agents. All of them are post-trained to exhibit the behaviors we call agentic. Stay on task. Use tools correctly. Recover from errors. Maintain goals across steps. Each of these is a behavior that RL-style optimization against a carefully chosen reward can produce, and which a pretrained model alone will not reliably exhibit.
This is visible in specific cases. Claude Code and similar coding agents show behaviors the underlying language models do not exhibit out of the base. They invoke tools in a specific call format. They wait for tool results before continuing. They interpret error messages and adjust course. They run tests and use the outputs to decide what to do next. These behaviors sit on top of the base model’s knowledge of code, but they are not automatic from that knowledge. They are trained in. The specific way a frontier coding agent uses its tools, the exact shape of its correction loop, the cadence of its status updates: all of this is the product of post-training choices that differ from lab to lab.
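The trained-in loop has a simple skeleton on the harness side. A minimal sketch, where `model` and `tools` are hypothetical stand-ins rather than any real API:

```python
def run_agent(model, tools, task, max_steps=10):
    """Minimal agent loop: the model proposes a tool call, the harness
    executes it, and the result goes back into the context.

    Assumed (illustrative) interface: model(history) returns either
    {"tool": name, "args": ...} or {"answer": ...}.
    """
    history = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        step = model(history)
        if "answer" in step:                         # model decides it is done
            return step["answer"]
        result = tools[step["tool"]](step["args"])   # may be an error message
        history.append({"role": "tool", "content": result})
    return None  # step budget exhausted

# The loop is trivial. What is not trivial is the model's side: emitting
# well-formed tool calls, waiting for results, and using them (including
# errors) to decide the next step. That part is trained in, not prompted in.
```

The harness code barely differs between products; the post-trained behavior running inside it is where the labs diverge.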
There are other roles. Reward models are used as filters in inference. Safety training leans on preference data. Fine-tuning for specific industry use cases often blurs into RL territory. The pattern across all of these is the same. Pretraining built a base of capability. RL shapes the base into something behaviorally specific.
The Transformation
What changed between Era 2 and Era 3 is not whether RL is used. It is what RL is applied to.
In Era 2, RL was applied to a blank agent. Start from nothing. Learn a policy from scratch. The agent’s entire competence came from the RL process. This is why Era 2 worked in closed environments and failed in open ones. There was no prior knowledge to start from.
In Era 3, RL is applied to a pretrained model. The model already has competence. It already has representations of the world. It already has implicit models of reasoning, planning, and language. RL does not build the competence. It shapes it.
This sounds like a technical detail. It is actually the whole story.
Consider the same RL algorithm applied in each setting. Proximal Policy Optimization, the standard algorithm for RLHF, was also a workhorse of Era 2 robotics and game-playing RL. The algorithm is the same. The difference is what it operates on. Applied to a neural network starting from random weights, PPO can learn to play Atari if the environment cooperates, and nothing more. Applied to a pretrained language model, the same algorithm can turn a text completer into an instruction follower, an answer generator into a reasoner, a language model into an agent. The algorithm did not gain new powers. The substrate did.
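The sameness is literal. PPO's clipped surrogate objective, sketched in numpy for a single action (in RLHF, one generated token), is the same expression whether the policy underneath maps pixels to joystick inputs or a prompt to a vocabulary distribution:

```python
import numpy as np

def ppo_objective(logp_new, logp_old, advantage, eps=0.2):
    """PPO clipped surrogate for one action.

    The probability ratio between new and old policy is clipped so a
    single update cannot move the policy too far from the one that
    generated the data.
    """
    ratio = np.exp(logp_new - logp_old)
    return np.minimum(ratio * advantage,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage)

# Ratio exp(0.5) ~ 1.65 exceeds the clip range, so the objective is
# capped at 1.2 * 2.0 = 2.4 rather than rewarding the full jump.
print(float(ppo_objective(-1.0, -1.5, advantage=2.0)))  # 2.4
```

Nothing in this expression knows about language. What changed between eras is the network whose log-probabilities feed into it.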
A technique that fails in open worlds because it has no prior knowledge becomes powerful in open worlds when you give it prior knowledge. The prior knowledge comes from pretraining. The shaping comes from RL. Neither alone produces today’s agents. Together they produce agents that exhibit behaviors neither could produce on its own.
This is why the second wall fell when it did. Not because RL was replaced. Because RL acquired a foundation it had never had before: a pretrained language model with open-world competence already inside it.
What This Tells Us About Language Agents
If RL is inside language agents, not beneath, not beside, then several things follow.
First, language agents are composite systems, not monolithic ones. When an agent does something surprising, the cause could sit in the pretrained weights, in the RL-shaped behaviors, in the prompt, in the harness, or in the interaction between all of these. Debugging requires distinguishing which layer produced the behavior. This is part of why language agent behavior is famously hard to reason about. The layers interact, and they interact in ways no layer alone predicts.
Second, the capabilities of language agents are not the capabilities of pretrained language models. Pretraining provides the raw material. RL turns the raw material into an agent. When industry observers marvel at what modern agents can do, they are marveling at something produced by a specific post-training process. Different RL choices produce different agents from the same base model. Claude’s persona is the product of Anthropic’s post-training. GPT’s persona is the product of OpenAI’s. The base models differ less than the personas suggest.
Third, the bottlenecks of language agents are partly RL bottlenecks. Getting an agent to follow complex instructions, to refuse certain requests, to maintain specific values, to reason in specific patterns. All of these are RL-shaped. They improve when post-training improves. When a lab claims a new model is better at some agentic task, the improvement is often an RL improvement, not a pretraining one. The base models have converged. The post-training is where the differentiation happens now.
Fourth, the limits of language agents are partly RL limits. What RL can shape is what we can specify a reward for. Alignment works because “helpful, honest, harmless” can be approximated by human preference data. Reasoning works because correctness can be checked automatically on math and code. Agentic behavior works because tool-use success has measurable outcomes. What RL cannot shape well is anything where we cannot construct a reward signal. This becomes important later in the series.
The External Inheritance
Part 3 turns to the other inheritance of the language agent era: the ReAct loop. If Part 2 is about what RL gave language agents internally, Part 3 is about what ReAct gave them externally. The capacity to interleave reasoning with action, observe results, adjust course. The loop that makes language agents visible as agents rather than text generators.
Deliberate, act, observe, deliberate again. This structure predates language models by decades. It sat in symbolic agent architectures, in BDI systems, in robotics control stacks. What changed in 2022 was what the loop ran on. Pretraining plus RL plus the ReAct loop is the technical stack of every Era 3 agent. Part 3 looks at what made the loop work where earlier versions of it had not.
The Shape of the Thing
Language agents are not large language models with tools bolted on. They are pretrained models shaped by reinforcement learning, running in a reasoning loop, embedded in a harness. Each layer matters. Remove any one of them and the agent does not exist in the form we have come to know.
Part 1 said language agents broke the second wall by inheriting the open-world competence of models trained on almost everything humans have written. That is true, but incomplete. The open-world competence comes from pretraining. The agentic character comes from RL. The wall fell to a combination, not a component.
The intention gap, above all of this, may or may not yield to better combinations. That is a question for later in the series. For now, it is enough to see that the combination is what matters. RL in agent research did not die. It moved inside. And the inside is where the interesting structure has been hiding.
The Rise of Agents is an eight-part series. Next, Part 3: “The ReAct Moment.”


