The Rise of Agents, Part 7: Agent Meets World
Software agents act on text. Physical agents act on a world that does not accept undo.
An agent operating a browser can make a mistake, back up, and try again. The wrong click costs a few seconds. An agent operating a robot arm holding a glass does not have this luxury. The wrong motion costs the glass. The cost is not recoverable with a retry. The world does not ship with an undo button.
This article is about the frontier where software agents, the subject of Parts 3 through 6, begin to operate in physical environments. Two research programs have been heading toward this frontier from opposite directions. Foundation models for robotics, approaching from the agent side: vision-language-action models that extend software agent capabilities into continuous motor control. World models, approaching from the robotics side: systems that predict how the physical world evolves, giving agents something to plan against before they act. In 2026, these two programs are starting to meet. Where they meet is what the next phase of agent engineering will be built on.
The specific question is what changes when the environment stops being text and starts being physics, and how the agent stack adapts. This is not a survey of robotics. It is the engineering question that the series has been building toward.
What Embodiment Adds
Start with the scope of the change. Everything the agents of Parts 3 through 6 did happened in environments that tolerate their failure modes. A tool call that errors returns an error. A hallucinated fact can be checked against a search result. A drifted plan can be caught by an evaluator before execution. All of these recoveries happen because the environment is patient. It waits while the agent thinks. It accepts retries. It allows actions to be undone.
The physical world is different in every one of these respects. It does not wait. Objects fall at the speed gravity dictates. The agent’s window to act is finite and set by the environment, not the agent. Actions are often irreversible in ways that matter: a glass that breaks is not repaired by a second attempt. Observations are noisy and partial: the camera sees one angle, proprioception gives approximate joint states, the world outside the sensor’s cone is inferred rather than read. And small errors compound in ways that software errors rarely do. A centimeter of misjudgment at the gripper propagates to a task failure two seconds later.
This is not a new observation. Moravec’s paradox named it in the 1980s. Tasks that feel easy for humans, like walking across a room or picking up a cup, are structurally harder for machines than tasks that feel hard, like playing chess. The easy tasks are easy for humans because they rest on billions of years of evolved perception and motor control. The hard tasks are hard for humans because symbolic reasoning is recent and unoptimized. When machines do only the recent part well, the old part becomes the bottleneck.
Software agents solved the recent part. Reasoning, planning, tool use, communication. The physical frontier is where the old part returns, and the question for agent engineering is how to extend the stack built in Parts 2 through 6 into a domain the stack was not originally designed for.
The Agent Side: Vision-Language-Action
In late 2024, a startup called Physical Intelligence released a robot foundation model that produced seven-degree-of-freedom joint velocities for robots to act on, fifty times per second. By late 2025, the company had raised $600 million in Series B funding led by CapitalG, bringing total funding near the billion-dollar mark.
The model is called π0. It is the clearest public example of one approach to extending the language agent stack into continuous motor control. The architecture is recognizably the language agent architecture from Parts 2 and 5, with one important modification. Where language agents output tokens that are interpreted as text or tool calls, π0 outputs continuous vectors that are interpreted as motor commands. The model inherits the semantic knowledge of its vision-language backbone: what a shirt is, how a kitchen is organized, what “fold this” means. It learns the rest by doing.
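To make the modification concrete, here is a minimal sketch of a continuous action head in PyTorch, with hypothetical dimensions throughout. π0 itself uses a flow-matching action expert that decodes chunks of future actions rather than a plain regression head, so read this as the structural idea, not the architecture:

```python
import torch
import torch.nn as nn

class ContinuousActionHead(nn.Module):
    """Maps a vision-language backbone's pooled hidden state to a
    chunk of continuous joint-velocity commands. All dimensions are
    hypothetical stand-ins, not Physical Intelligence's values."""
    def __init__(self, hidden_dim=2048, action_dim=7, chunk_len=50):
        super().__init__()
        self.chunk_len, self.action_dim = chunk_len, action_dim
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 1024),
            nn.GELU(),
            nn.Linear(1024, chunk_len * action_dim),
        )

    def forward(self, h):                  # h: (batch, hidden_dim)
        out = self.mlp(h)                  # (batch, chunk_len * action_dim)
        return out.view(-1, self.chunk_len, self.action_dim)

# Usage: the backbone encodes image + instruction; the head decodes motion.
head = ContinuousActionHead()
h = torch.randn(1, 2048)                   # stand-in for backbone features
actions = head(h)                          # (1, 50, 7) joint velocities
```

Everything upstream of the head is the familiar language agent stack; the change is in what the outputs mean.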
By 2026, π0 had progressed to π0.5, then π0.6. π0.5 added open-world generalization, enabling the same model to clean up an unfamiliar kitchen or fold unfamiliar laundry. π0.6 added RECAP, a training approach that mixes demonstration, correction, and reinforcement learning, doubling throughput on manipulation tasks and reducing failure rates over extended operation.
Physical Intelligence is not alone. Google DeepMind released Gemini Robotics in March 2025 and Gemini Robotics 1.5 in September 2025, with an architectural innovation the paper calls “Thinking Before Acting”: internal natural language reasoning that the model produces before emitting motor commands. This is the inference-time reasoning from Part 5 reaching into the physical domain. The agent thinks through the task in natural language, then acts. Ant Group’s LingBot-VLA, released in January 2026, demonstrated this approach at industrial scale, training on twenty thousand hours of real dual-arm robot data across nine configurations. NVIDIA’s GR00T family, Tesla’s Optimus models, Figure 02, 1X, Unitree, and others are all converging on vision-language-action as the dominant architecture for humanoid and dexterous manipulation.
The structural point across all of these is that the language agent’s stack, pretraining plus post-training plus inference-time reasoning plus harness scaffolding, generalizes to robot control. What it needs to add is the continuous action head, the training data of embodied trajectories, and the real-time inference constraints of physical control. Those are substantial engineering problems, but they are extensions of an existing stack, not a different stack. The agent side is reaching into the physical world with tools that were built for software.
The Robotics Side: World Models
Meta’s V-JEPA 2 was trained on a million hours of internet video. No robot data at all. After fine-tuning on sixty-two hours of robot trajectories from an open-source dataset, it could plan novel pick-and-place actions on Franka arms in two different labs, neither of which appeared in its training data.
This is the second direction reaching toward physical agency. Where vision-language-action models reach from the agent side, world models reach from the robotics side, learning to predict how scenes evolve before the agent acts in them.
A classic failure mode of even the best vision-language-action models is that they can react but not anticipate. The model sees the current state, decides on an action, acts, then sees the next state. This is the same ReAct loop from Part 3, running on physical inputs and outputs. It works when the task can be accomplished through immediate reaction. It fails when the task requires anticipation: knowing that placing a cup near the edge of a table will cause it to fall, knowing that grasping an object in a particular way will prevent it from being handed over, knowing that moving too fast through this door will hit the frame.
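In code, the reactive version is almost trivially simple, which is part of its appeal and the whole of its limitation. A minimal sketch, assuming hypothetical `policy`, `sensors`, and `actuators` interfaces:

```python
import time

CONTROL_HZ = 20  # hypothetical control frequency

def reactive_loop(policy, sensors, actuators):
    """See-act loop on physical I/O: no model of consequences,
    only a reaction to the current observation."""
    period = 1.0 / CONTROL_HZ
    while True:
        t0 = time.monotonic()
        obs = sensors.read()        # current state only
        action = policy(obs)        # decide from the present
        actuators.apply(action)     # act; consequences unmodeled
        # hold a fixed rate: the environment sets the clock, not the agent
        time.sleep(max(0.0, period - (time.monotonic() - t0)))
```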
What the robot needs is a model of the world that predicts what will happen before the action happens.
V-JEPA 2-AC, the action-conditioned version, is what the team deployed zero-shot on those Franka arms. The robots ran the same model weights and could pick and place objects using model-predictive control: sample candidate actions, predict their consequences through the world model, pick the action sequence whose predicted future best matches the goal. The robot never trained on those labs. It planned through its predictive world model instead.
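A sketch of that planning loop, simplified to random-shooting over a learned dynamics model. V-JEPA 2-AC actually plans in representation space with the cross-entropy method, so the sampling here is cruder than the real system's, and `encode` and `world_model` are hypothetical stand-ins:

```python
import numpy as np

def mpc_plan(world_model, encode, state_img, goal_img,
             horizon=5, n_candidates=256, action_dim=7):
    """Model-predictive control against a learned world model:
    sample, imagine, score, execute the best first action."""
    z = encode(state_img)                        # current latent state
    z_goal = encode(goal_img)                    # goal latent state
    candidates = np.random.uniform(              # hypothetical action bounds
        -1.0, 1.0, (n_candidates, horizon, action_dim))
    costs = np.zeros(n_candidates)
    for i, seq in enumerate(candidates):
        z_t = z
        for a in seq:                            # roll the model forward
            z_t = world_model.predict(z_t, a)    # imagined next latent
        costs[i] = np.linalg.norm(z_t - z_goal)  # distance to goal
    best = candidates[costs.argmin()]
    return best[0]                               # execute one step, replan
```

The loop runs again at the next timestep: act once, observe, replan. The robot never commits to an imagined future for longer than one control step.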
NVIDIA has been building the infrastructure side of this direction with Cosmos, a family of world foundation models for physical AI. Cosmos-Predict2.5, released in late 2025 and updated in early 2026, was trained on 200 million curated video clips and generates predicted future world states from text, image, or video prompts. Cosmos-Reason2 adds physical common sense reasoning with chain-of-thought grounding in embodied decision making. Cosmos-Transfer2.5 does sim-to-real and real-to-real world translation for training data generation. The family is positioned as infrastructure that robot developers can compose with their own VLA models to add prediction, synthetic data generation, and policy evaluation without having to build each capability from scratch.
Robonaissance has covered world model research in depth in the parallel "Roads to a Universal World Model" series. The key observation here is that world models are not decorative for robotics. They are the capability that lets an embodied agent act on something other than its immediate sensory state. Without a world model, the agent reacts. With a world model, the agent imagines.
Where the Two Sides Meet
Vision-language-action models can decide. World models can imagine. Neither alone is enough.
A VLA can interpret a command and produce an action, but it cannot reliably anticipate consequences that require longer-horizon prediction than the training distribution covers. A world model can predict how a scene evolves, but it does not on its own decide what to do about the prediction.
The emerging pattern in 2026 is to compose them. The VLA is the decision-making surface, generating candidate actions conditioned on the task. The world model is the imagination surface, simulating those candidates forward to evaluate consequences. A controller selects the candidate whose predicted outcome best matches the goal. This is model-predictive control, an idea that has been in robotics for decades, now running with learned components at scales that were not previously possible.
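A sketch of one step of the composition, with hypothetical interfaces throughout and latents assumed to be NumPy arrays. The VLA supplies candidates, the world model supplies imagined futures, and the controller does nothing but compare them:

```python
def compose_step(vla, world_model, encode, obs, instruction, goal_emb,
                 n_candidates=16, horizon=8):
    """One control step of the VLA-plus-world-model pattern:
    decide (VLA), imagine (world model), select (controller)."""
    # decision surface: the VLA samples plausible action sequences
    candidates = [vla.sample(obs, instruction, horizon=horizon)
                  for _ in range(n_candidates)]

    # imagination surface: roll each candidate through learned dynamics
    z0 = encode(obs)
    def imagined_cost(seq):
        z = z0
        for a in seq:
            z = world_model.predict(z, a)
        return float(((z - goal_emb) ** 2).sum())

    # controller: execute the candidate whose imagined future is best
    best = min(candidates, key=imagined_cost)
    return best[0]
```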
NVIDIA’s Cosmos Policy, introduced in early 2026, is one implementation of this composition. The system post-trains the Cosmos Predict-2 world foundation model for manipulation, producing a policy that generates actions using the world model’s learned dynamics. V-JEPA 2-AC is another. Physical Intelligence’s π line has been incorporating predictive elements into its flow matching architecture, though the company has not released a pure world-model-plus-VLA composition publicly.
The convergence is real, and it is happening fast. It is also not complete. The research frontier is working out how to train these compositions jointly rather than as separate components, how to let the VLA and the world model share representations rather than hand off between them, how to bound the compute cost of planning against learned dynamics at the frequencies physical control requires. A reasonable forecast is that the architectures winning in 2027 and 2028 will look unified rather than composed, but the intellectual direction is clear from where we sit in 2026.
Grounding: Language, Reasoning, Physics
A grounding thread runs across several earlier articles. The ReAct loop worked because language models arrived with linguistic grounding from training data. Inference-time reasoning changes the grounding question, because a long reasoning trace can be confidently wrong in new ways when the model has no external check on its chain of thought.
Physical grounding is the third axis. A language agent knows about the world because humans have written about it. Its grounding is indirect: symbols standing in for things, patterns standing in for events, descriptions standing in for experiences. A physically grounded agent knows about the world because it acts in it. Its grounding is direct: forces, frictions, collisions, rotations. These are not the same kind of knowing.
The ambition of embodied foundation models is to bridge the two. A VLA trained on robot trajectories learns the physics of its own body and end-effectors by doing. A world model trained on video learns the dynamics of scenes by prediction. Neither has the full breadth of linguistic knowledge about the world that a pretrained vision-language model brings. Combining them yields an agent whose grounding is partly linguistic, partly predictive, partly direct. This is a richer substrate than any one source provides, but it is not, yet, the grounding a physically competent human agent has. A three-year-old knows things about the world that no current model knows, and the gap is not obviously bridgeable just by scaling any of the current approaches.
This matters for the series thesis. The intention gap is partly a grounding question. An agent that cannot fully ground its knowledge in the physical world cannot have the kind of goals that arise from being a body in a world. Language agents operate on text. VLAs operate on a narrow slice of physical interaction. Neither has the grounding that full embodiment would require. Whether that gap closes is genuinely open.
Trust When Stakes Are Physical
The trust calibration thread from Part 4 and Part 6 takes a different shape in the physical domain.
A software agent’s mistake is usually recoverable. A commit can be reverted. A booking can be canceled. A hallucination in a research report can be caught in editing. The harness engineering that wraps production software agents is expensive partly because it is built to catch these mistakes before they propagate, but the underlying mistakes are not, typically, final.
Physical mistakes are sometimes final. A robot that drops the glass breaks the glass. A robot that collides with a person injures the person. A robot that misidentifies a fragile object as a sturdy one and applies too much force crushes it. There is no retry.
This changes trust calibration in specific ways. The threshold for human oversight drops. The surface area of consequence grows. The requirement for fail-safe behavior becomes non-optional: a robot that does not know what to do should stop, not improvise. The engineering around this looks different from the engineering around software harnesses. Hardware-level safety interlocks. Conservative motion planners that trade off capability for predictability. Explicit uncertainty estimation where a model can decline to act because its confidence is below threshold. None of this is optional in physical deployment, and all of it changes the composition of the harness from the software versions covered in Part 4.
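A sketch of what a decline-to-act gate might look like, using ensemble disagreement as the uncertainty estimate. The threshold, the interfaces, and the disagreement-to-confidence mapping are all hypothetical; real deployments calibrate each against the hardware:

```python
import numpy as np

CONF_THRESHOLD = 0.85   # hypothetical; calibrated per robot and task

def gated_execute(policies, obs, actuators):
    """Fail-safe execution with ensemble-disagreement uncertainty:
    below-threshold confidence halts the robot and escalates
    instead of improvising. Interfaces are hypothetical."""
    actions = np.stack([p(obs) for p in policies])   # (n_policies, action_dim)
    spread = actions.std(axis=0).mean()              # disagreement
    confidence = float(np.exp(-spread))              # maps into (0, 1]
    if confidence < CONF_THRESHOLD:
        actuators.hold_position()                    # stop, do not guess
        return "escalated_to_human"
    actuators.apply(actions.mean(axis=0))            # ensemble-mean action
    return "acted"
```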
The industry is just beginning to work out what multi-agent patterns look like in physical systems where one or more of the agents is embodied. Consider a planner agent that tasks an embodied executor while an evaluator checks the executor’s outputs: a failure in the executor is a physical failure, not a software rollback. The delegation contracts that Part 6 introduced, the audit trails, the guardrail services, all take on different meaning when the thing being delegated is action on objects with mass and momentum. A trusted robot is not the same category of object as a trusted coding agent, and the trust engineering for each is a separate discipline.
Self-Improvement at Physical Stakes
A software agent that modifies its own prompts based on past performance is cheap to roll back. A robot that modifies its own motion primitives based on past performance is not.
Earlier articles in the series traced two forms of agent self-improvement. OpenAI’s Codex team deployed agents that continuously refactored the codebase against principles humans had set. Reasoning models within a single inference run examined and corrected their own chains of thought. Both were forms of self-improvement where the direction was externally set and the improvement operated against criteria the system did not choose.
Physical embodiment raises the stakes of self-improvement qualitatively.
Consider a robot authorized to modify its own motion primitives based on what has worked in past operations. In software, the equivalent is cheap to roll back and low-consequence when wrong. In a physical robot, a self-modified motion primitive that reduces success rate or increases collision probability can cause damage before the modification is detected. The feedback loop that makes software self-improvement fast and cheap makes physical self-improvement fast and potentially dangerous.
This is not a reason to prevent physical self-improvement. It is a reason to engineer it differently. Physical self-improvement systems that exist already include agents that tune grasping strategies based on success rate, agents that update their model of an environment as they operate in it, agents that refine their motion primitives through reinforcement learning against observed outcomes. All of these are useful and actively deployed. All of them sit within careful safety envelopes that bound what kinds of changes the system can make without human review. The engineering of those envelopes is a research area of its own, and the stakes around getting it right are higher than the equivalent software engineering stakes.
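A sketch of what such an envelope might reduce to in code, with hypothetical bounds and fields. The structural point is that the bounds are set by humans, checked on every proposed update, and anything outside them waits for review:

```python
from dataclasses import dataclass

@dataclass
class PrimitiveUpdate:
    """A self-proposed change to a motion primitive's parameters.
    Fields are hypothetical, for illustration."""
    name: str
    old_params: dict
    new_params: dict
    trials: int                   # evidence behind the proposal

# envelope bounds set by humans, not by the agent (hypothetical values)
MAX_SPEED = 0.5                   # m/s
MAX_PARAM_DELTA = 0.10            # max fractional change per update
MIN_TRIALS = 200                  # evidence required before applying

def within_envelope(u: PrimitiveUpdate) -> bool:
    """Auto-apply only updates that stay inside the envelope;
    everything else is queued for human review."""
    if u.trials < MIN_TRIALS:
        return False
    if u.new_params.get("speed", 0.0) > MAX_SPEED:
        return False
    for key, new in u.new_params.items():
        old = u.old_params.get(key)
        if old and abs(new - old) / abs(old) > MAX_PARAM_DELTA:
            return False
    return True
```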
The self-improvement thread takes a particular shape here. Self-improvement at the harness layer is about tooling. Self-improvement within reasoning is about thought. Self-improvement at the physical layer is about the substrate of action itself. Whether any of these, taken together, are steps toward something different from execution is genuinely open. For now: the self-improvement in physical systems is real, the safety envelopes around it are still being engineered, and the trajectory is toward more self-improvement over time with correspondingly more sophisticated safety engineering around it.
At the Edge of Era 3
The Three Walls diagram from Part 1 placed physical embodiment at the boundary of Era 3, where the language agent platform meets the physical environment it did not originally inhabit. The earlier articles operated squarely inside that platform; the physical world is its edge. How the agent stack extends into the physical world is a live engineering question, not a settled one. The components in play, VLAs, world models, their emerging compositions, and physical-scale trust and self-improvement engineering, are the pieces currently being developed. None of them is fully mature. All of them are moving fast.
What is interesting about this frontier is how much of the agent stack carries forward. The ReAct loop generalizes. The harness engineering generalizes with modifications. The inference-time reasoning generalizes with latency budgets that software harnesses did not need to manage. The multi-agent patterns from Part 6 generalize into physical settings where specialization between a planner, an embodied executor, and an evaluator is even more pronounced than in software. Much of what makes a good embodied agent in 2026 is what made a good software agent in 2024, adapted for the constraints physical environments impose.
What does not generalize easily is the grounding. Language agents succeed partly because their environment is made of the same substrate they were trained on: text. Physical agents do not have this luxury. Their environment is made of physics, and the training data that grounds them in physics, whether from real robot trajectories or from video of the world evolving, is orders of magnitude more expensive to collect than text. This is the bottleneck that will likely determine how fast embodied agents close the capability gap with their software counterparts. Compute is not the constraint. Data is.
Above all of this is the summit on the Three Walls diagram. Intention. The engineering across seven articles has produced more and more capable agents in more and more environments. None of those engineering moves has addressed whether any of these agents have, or can have, their own intentions. The capability rises. The summit does not move.
The Rise of Agents is an eight-part series. Next, Part 8: “The Open Frontiers.”


