Robonaissance

Inside China’s Machine: The Platform War

Hugo — Wed, 29 Apr 2026 13:29:16 GMT

The enterprise software stack in the United States is being reorganized around AI agents. Salesforce sells Agentforce. Microsoft embeds Copilot into every surface of its productivity suite. Google positions Gemini as an agentic layer across Workspace and Cloud. The competitive question is which enterprise vendor owns the layer where AI actually performs tasks.

In China, the same competition is happening with different participants, different distribution, and different stakes. The four companies competing are not enterprise vendors. They are Tencent, Alibaba, Baidu, and ByteDance. Their agent platforms reach consumers through WeChat, Alipay, Ernie Bot, and Doubao. Their enterprise offerings run on their own cloud infrastructure. Their models are owned, not licensed. Every one of them launched a new agent product or a new agent-capable model between January and April 2026, and all four are building agents into consumer super-apps that reach hundreds of millions of daily users. There is no comparable four-company race in any other market.

This article maps the four platforms: who owns the agent layer in China, what each is betting on, and where each is strong and weak.

No Clear Winner

Four dimensions determine which platform wins the agent layer: Distribution (how many users the agent reaches), Model (how capable the underlying AI is), Enterprise (how serious the commercial offering is), and Regulatory (how well the platform navigates China’s security and data governance environment). A fifth dimension, Financial Commitment, captures how much each company is investing in 2026.

Rating scale: strong, moderate, weak. This is a judgment based on reported data as of mid-April 2026. Numbers referenced below are documented in Sources.

No single company leads across all five. The pattern is instructive.
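The scorecard can be tabulated from the ratings given in the company sections of this article. The sketch below is illustrative only: the dictionary layout and the helper names (`strong_count`, `leaders`) are assumptions made here, while the ratings themselves are transcribed from the text.

```python
# Ratings transcribed from the per-company sections of this article.
DIMENSIONS = ["Distribution", "Model", "Enterprise", "Regulatory", "Financial"]

RATINGS = {
    "Tencent":   ["strong", "moderate", "moderate", "strong", "strong"],
    "Alibaba":   ["strong", "strong", "strong", "moderate", "strong"],
    "Baidu":     ["moderate", "moderate", "moderate", "moderate", "moderate"],
    "ByteDance": ["strong", "strong", "strong", "moderate", "strong"],
}


def strong_count(company: str) -> int:
    """Number of dimensions on which a company is rated strong."""
    return sum(1 for rating in RATINGS[company] if rating == "strong")


def leaders(dimension: str) -> list:
    """Companies rated strong on one dimension, in alphabetical order."""
    index = DIMENSIONS.index(dimension)
    return sorted(c for c, ratings in RATINGS.items() if ratings[index] == "strong")


for company in RATINGS:
    print(f"{company:<10} strong on {strong_count(company)} of {len(DIMENSIONS)}")
print("Strong on Regulatory:", leaders("Regulatory"))
```

Read this way, the ratings show no company strong on all five dimensions, and Tencent is the only one rated strong on Regulatory.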

Tencent: The Super-App Play

Tencent’s bet is that AI agents become features of existing super-apps rather than standalone products. The distribution infrastructure is already built. WeChat has roughly 1.4 billion monthly active users and generates over $16 billion in annual app revenue through payments, mini-programs, e-commerce, content, and advertising. The thesis: attach an agent layer to the existing habit, and the agent inherits the scale without needing to acquire users.

The OpenClaw integration made the thesis concrete. On March 22, 2026, Tencent launched ClawBot, a WeChat plugin that appears as a contact within the messaging interface. Users send instructions the same way they message friends. In early April, Tencent Cloud launched ClawPro in public beta, an enterprise AI agent management platform that lets businesses deploy OpenClaw-based agents in ten minutes, with template selection, model switching, token-consumption tracking, and security compliance. During internal beta, ClawPro was adopted by more than 200 organizations across finance, government, and manufacturing.

As of April 2026, Tencent has launched more than ten agent products, including QClaw, WorkBuddy, ClawBot, ClawPro, CodeBuddy, the ADP agent development platform, the SkillHub skill community, and security tools branded as “Lobster Butler” (龙虾管家). Personal desktop tools, enterprise platforms, developer infrastructure, and consumer integrations, all within a single month.

Distribution: Strong. WeChat’s 1.4 billion MAU is the single most valuable distribution asset in Chinese consumer AI. No other platform comes close to its daily habit formation.

Model: Moderate. Tencent’s Hunyuan foundation model (406 billion parameters) lags the capability frontier in public benchmarks. Chief AI Scientist Yao Shunyu, a former OpenAI researcher, joined in December 2025 to close the gap. Hunyuan 3.0 is scheduled for April 2026. The Tencent consumer assistant Yuanbao grew twentyfold in daily active users between February and March 2025, largely because Tencent integrated DeepSeek models to compensate for Hunyuan’s weakness.

Enterprise: Moderate. ClawPro launched in public beta April 3 with 200+ organizations. The product is new. The enterprise cloud business is smaller than Alibaba’s. Tencent holds a smaller share of China’s AI cloud market than Alibaba’s 35.8 percent, although exact rankings shift with reporting source.

Regulatory: Strong. Tencent’s long operational experience with WeChat has built one of the most sophisticated compliance apparatuses in Chinese consumer tech. The “Lobster Butler” branding for security tooling is a tell: Tencent is deliberately positioning its OpenClaw integration as the most governance-ready option as Chinese regulators tighten agent rules.

Financial: Strong. Tencent spent 18 billion yuan on AI products in 2025 and announced plans to at least double that in 2026, to 36 billion yuan or more (roughly $5 billion). President Martin Lau confirmed on the March earnings call that capital expenditure will rise, with compute earmarked for both internal training and external leasing through Tencent Cloud.

The risk. AI agents could displace super-app behavior. If users start conducting tasks through agents rather than through WeChat mini-programs, Tencent’s distribution advantage becomes a liability rather than a moat. The bet depends on agents remaining features of WeChat rather than replacing it.

Alibaba: The Token Hub

Alibaba’s bet is that tokens, the basic unit of AI processing, are the business. The company reorganized its entire AI operation in March 2026 into the Alibaba Token Hub (ATH), consolidating five previously separate AI units (Tongyi Laboratory, Qwen, Wukong enterprise AI, Alibaba Cloud AI infrastructure, and a research arm) under CEO Eddie Wu’s direct oversight. Wu’s framing in his announcement letter: “ATH is built around a single organising mission: create tokens, deliver tokens and apply tokens.”

The structural logic is coherent. China now processes 140 trillion tokens per day, up from approximately 100 billion at the start of 2024, according to China’s National Data Administration. NDA administrator Liu Liehong used the term 词元 (cí yuán) as the official Chinese translation for “token” at the China Development Forum in March 2026. The character 元 in 词元 means “element” or “base unit” in the NLP sense, but its homograph with 元 (the yuan, China’s currency) has led SCMP and Fortune to frame the choice as invoking token-as-currency. Liu himself called tokens “the settlement unit linking technological supply with commercial demand.” If tokens are the unit of AI economic activity, the company that creates, delivers, and applies the most tokens captures the most value.
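The scale of that increase is easier to see as arithmetic. The sketch below uses the two token volumes quoted above; the 26-month interval (January 2024 to March 2026) is an assumption made here for illustration, since the source gives only "the start of 2024" and a March 2026 speech.

```python
import math

# Figures quoted above, attributed to China's National Data Administration.
TOKENS_PER_DAY_START_2024 = 100 * 10**9    # approximately 100 billion
TOKENS_PER_DAY_MARCH_2026 = 140 * 10**12   # 140 trillion

# Assumption for illustration: roughly 26 months between the two readings.
MONTHS_ELAPSED = 26

growth_multiple = TOKENS_PER_DAY_MARCH_2026 / TOKENS_PER_DAY_START_2024
doublings = math.log2(growth_multiple)
months_per_doubling = MONTHS_ELAPSED / doublings

print(f"Growth multiple: {growth_multiple:,.0f}x")
print(f"Doublings: {doublings:.1f}")
print(f"Implied months per doubling: {months_per_doubling:.1f}")
```

Under that assumption, daily token volume grew 1,400-fold, which works out to a doubling roughly every two and a half months.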

Alibaba has three things competitors lack. First, Qwen. Qwen3 is consistently ranked among the world’s top open-source large language models (the AI systems that power modern chatbots and agents) and reaches 300 million monthly active users across Alibaba’s consumer ecosystem (Taobao, Tmall, Alipay, Amap, Fliggy). Second, the Taobao/Tmall/Alipay distribution: a week of 2026 Chinese New Year promotion delivered 140 million first-time AI shopping experiences. Third, the cloud business: Alibaba Cloud holds roughly 35.8 percent of China’s AI cloud market, the largest share among Chinese providers.

The Wukong enterprise platform, launched March 2026, is the commercial vehicle. Wukong coordinates multiple agents handling complex business tasks (document editing, meeting transcription, workflow automation) within a single interface. The architecture is similar to what Tencent’s ClawPro offers, but with Alibaba’s cloud scale and enterprise customer base underneath.

Distribution: Strong. Qwen trails Doubao’s 159M MAU only when counted as a standalone app; its 300M MAU figure aggregates users across Alibaba’s consumer ecosystem, so the two numbers are not like-for-like. Counting unique users across that ecosystem, Alibaba’s total reach is higher. Alipay’s 120 million AI-agent transactions in a single week in February 2026 is the clearest data point: Chinese consumers are already making autonomous purchases at scale, and Alipay is the rails.

Model: Strong. Qwen has consistently ranked among the top Chinese open-source models. Alibaba also benefits from its own in-house chip work: in April 2026, the company unveiled a new data center running entirely on its proprietary Zhenwu chips, reducing dependence on foreign compute.

Enterprise: Strong. Alibaba Cloud’s 35.8 percent AI cloud market share, Wukong’s multi-agent enterprise platform, and deep integration with DingTalk (Alibaba’s workplace communication app, comparable to Slack) make Alibaba the most enterprise-ready of the four.

Regulatory: Moderate. Alibaba has faced regulatory scrutiny since 2021, including the Ant Group restructuring and continuing anti-monopoly enforcement. Its relationship with regulators has improved since 2024, but Alibaba still carries more regulatory risk than Tencent. Deployment of agents at consumer scale through Alipay requires navigating both financial services regulation and AI-specific rules.

Financial: Strong. Alibaba’s 2025 AI R&D spend reached 67 billion yuan (roughly $9.4 billion), the largest among Chinese tech companies. The Alibaba Token Hub restructuring consolidates budget and strategic authority. Capital commitment is clear and growing.

The risk. The Token Hub strategy depends on tokens remaining the primary unit of AI economic activity. If agent deployment shifts toward fixed-price subscriptions or outcome-based billing, the token-centric framing becomes less useful. Alibaba also faces the challenge of integrating five previously separate units under a new structure: organizational coherence is never a given.

Baidu: The Full-Stack Bet

Baidu’s bet is vertical integration. The company owns a foundation model (ERNIE 5.0, multimodal, 2.4 trillion parameters), its own AI chips (Kunlunxin M100 launching early 2026, M300 in 2027), its own cloud platform (Qianfan), its own consumer interface (Ernie Bot, branded domestically as Wenxiaoyan), and its own agent development platform (AgentBuilder). It also owns the largest autonomous driving business in China: Apollo Go has completed 17 million rides, runs 250,000 fully driverless rides a week, and operates in 22 cities.

The thesis: in a compute-constrained environment, the platform that owns chips, models, cloud, and applications end-to-end captures the most margin. Baidu is replicating Google’s vertical integration strategy at smaller scale.

ERNIE 5.0, unveiled at Baidu World 2025 in November, is natively multimodal (text, images, audio, video trained jointly from scratch). Baidu’s own benchmarks claim ERNIE 5.0 is competitive with Gemini, GPT-5, and DeepSeek across language, audio, and visual tasks, though independent benchmarks show mixed results. The ERNIE agent products (GenFlow for general-purpose, Famou for self-evolving agents, Oreate for AI workspace) cover personal, enterprise, and developer tiers.

Baidu was also the most visible promoter of consumer OpenClaw adoption. Installation events at its Beijing headquarters drew hundreds of attendees, and its OpenClaw-based agent suite spans desktop software, cloud services, mobile tools, and smart home devices. The full-stack approach: every layer of the agent stack is a Baidu product.

Distribution: Moderate. Ernie Bot reached 200+ million users by April 2024 and continues to grow, but the distribution is weaker than Tencent’s or Alibaba’s. Baidu Search remains China’s largest search engine, which provides one distribution channel. Apollo Go provides another, but neither has the 1.4 billion user scale of WeChat or the 300 million-plus of Qwen and Taobao.

Model: Moderate. ERNIE 5.0 ranks competitively in Chinese benchmarks (ranked No. 1 in China and No. 8 globally on LMArena’s text benchmark, a community-run leaderboard where users compare model outputs head-to-head, as of January 2026) but lags Qwen3 and Doubao-Seed-2.0 in some comparisons. Chinese AI coverage consistently places ERNIE behind Qwen in open-source impact.

Enterprise: Moderate. Qianfan cloud is the smallest of the three major Chinese clouds by market share (Alibaba, Tencent, Baidu). AgentBuilder has strong developer adoption (50,000+ developers, 30,000+ agents by mid-2024), but commercial scale is limited compared to Alibaba Cloud’s enterprise customer base.

Regulatory: Moderate. Baidu’s position is neither advantaged nor disadvantaged. The autonomous driving business operates under specific regulatory frameworks that provide some advantages in city-government relationships. Ernie Bot was among the first Chinese chatbots to receive regulatory approval in 2023, which suggests established compliance processes.

Financial: Moderate. Baidu’s AI-powered business reached 43 percent of core revenue in Q4 2025, up from 26 percent a year earlier. Total AI spend is smaller than Alibaba’s or Tencent’s in absolute terms. The Kunlunxin chip investment is significant but concentrated: Baidu is spending on chip development while peers spend on cloud buildout.

The risk. Full-stack vertical integration requires sustained excellence at every layer. If ERNIE falls behind Qwen and Doubao in capability, if Kunlunxin chips underperform NVIDIA alternatives, if Apollo Go fails to reach profitability, Baidu’s strategy fragments. The full stack is a strength when it works and a weakness when one layer slips.

ByteDance: The Distribution Flood

ByteDance’s bet is quantitative scale. Doubao is China’s largest AI application by monthly active users (159 million). Daily token usage surged to 16.4 trillion as of mid-2025, a 137-fold increase since Doubao’s May 2024 debut. Volcano Engine, ByteDance’s enterprise cloud arm, commanded 46.4 percent of China’s public cloud large model service market as of mid-2025, according to IDC, more than Baidu AI Cloud and Alibaba Cloud combined in that specific segment. This metric measures model API consumption, a narrower slice than the broader AI cloud market where Alibaba leads.

The strategy: make tokens radically cheap, integrate AI into every ByteDance product surface (Douyin, Jimeng, Lark, TikTok internationally), and let volume compensate for margin. Doubao enterprise tokens launched in 2024 at 99.3 percent below the industry average. Volcano Engine’s 2024 revenue was over 12 billion yuan, targeting 25 billion yuan in 2025. The 2030 target: 100 billion yuan.

The agent strategy has three layers. First, Doubao itself serves as both consumer AI app and API-accessible model. Second, Coze (扣子) is ByteDance’s agent development platform, allowing developers to build applications that integrate with Doubao, Lark, and third-party tools. Coze Studio and Coze Loop were open-sourced in 2025, gaining over 10,000 GitHub stars in three days. Third, Coze Space is an agentic collaboration platform with general-purpose agents (which ByteDance internally describes as “inexperienced interns”) and Expert Agents for specialized domains like user research and financial analysis.

ArkClaw, released by Volcano Engine during the OpenClaw boom, is a browser-native OpenClaw variant that eliminates the need for local installation. The design choice is characteristic: ByteDance optimizes for the broadest possible user access rather than for depth of integration.

Distribution: Strong. Doubao’s 159M MAU, Douyin’s 700M+ daily active users in China, and TikTok’s global reach add up to the largest addressable user base among the four. ByteDance’s autonomous commerce capability (Doubao can open JD.com, Taobao, Pinduoduo, and Douyin Mall simultaneously, compare prices, and complete a purchase in under 30 seconds) operates at a speed and cross-platform scale that Western consumer agents have not yet demonstrated.

Model: Strong. Doubao-Seed-2.0 is positioned against GPT-5.2 and Gemini 3 Pro. ByteDance’s internal benchmarks show Doubao leading Chinese peers in instruction following and tool invocation (the capabilities that matter most for agents). Token pricing is the most aggressive in the industry.

Enterprise: Strong. Volcano Engine’s 46.4 percent market share in public cloud large language model services is the dominant Chinese position. Enterprise integration with Lark (domestically branded Feishu, ByteDance’s workplace productivity suite) provides a software entry point that Tencent and Alibaba struggle to match. Volcano Engine’s 2024 revenue of over 12 billion yuan is targeted to more than double in 2025.

Regulatory: Moderate. ByteDance has faced sustained regulatory pressure from both the Chinese government (on Douyin content moderation) and the US government (on TikTok’s ownership structure). The dual-regulator exposure creates ongoing distraction and potential forced restructuring. The February 2026 suspension of ByteDance’s Seedance 2.0 feature that turns facial photos into personal voices, over concerns about misuse, illustrates the responsiveness to regulatory signals.

Financial: Strong. ByteDance’s revenue has been growing fastest among the four. Volcano Engine’s trajectory from 12 billion yuan in revenue (2024) to a 100 billion yuan target (2030) implies roughly eightfold growth, which will require capital commitment to match. ByteDance is privately held and does not disclose consolidated AI spend, but inferred capital commitment exceeds Baidu’s and is comparable to Tencent’s.
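The Volcano Engine revenue figures quoted in this section imply a steep compound growth rate. The sketch below works through the arithmetic; it treats "over 12 billion yuan" as exactly 12 billion, and the `cagr` helper is a name introduced here, so the results are rough approximations rather than reported numbers.

```python
def cagr(start: float, end: float, years: int) -> float:
    """Compound annual growth rate needed to go from start to end."""
    return (end / start) ** (1 / years) - 1


# Billions of yuan, as quoted in this section (2024 figure rounded down to 12).
REVENUE_2024 = 12.0
TARGET_2025 = 25.0
TARGET_2030 = 100.0

multiple_2024_to_2030 = TARGET_2030 / REVENUE_2024
cagr_2024_to_2030 = cagr(REVENUE_2024, TARGET_2030, 6)
cagr_2025_to_2030 = cagr(TARGET_2025, TARGET_2030, 5)

print(f"2024 to 2030 multiple: {multiple_2024_to_2030:.1f}x")
print(f"Implied CAGR, 2024 to 2030: {cagr_2024_to_2030:.0%}")
print(f"Implied CAGR, 2025 to 2030 (if the 2025 target is met): {cagr_2025_to_2030:.0%}")
```

On these inputs the 2030 target requires growth of a little over 40 percent a year from the 2024 base, or a little over 30 percent a year after 2025 if that year's target is hit.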

The risk. ByteDance’s agent platform is distribution-led rather than ecosystem-led. If consumer AI agent adoption plateaus (as the OpenClaw aftermath suggests it might), ByteDance’s 159M MAU advantage shrinks. The enterprise strategy through Volcano Engine requires sustained technical credibility that competes with Alibaba Cloud’s longer track record. And the US TikTok situation remains an ambient risk to global strategy.

Model Suppliers: The Second Tier

The Big Four own platforms. A second tier of Chinese AI labs supplies models that run on those platforms, or competes directly for enterprise deployment. These companies are not playing the platform game, but they shape the platform war by providing the frontier model capabilities that platforms either absorb or license.

Zhipu AI (智谱): Tsinghua-originated, valued at roughly $2 billion in 2025 ahead of its IPO. GLM-5 Turbo launched February 12, 2026, built specifically for OpenClaw integration. Stock surged 25+ percent on the announcement; market cap crossed HK$100 billion. Strong Chinese government and academic relationships. Competes for enterprise deployments against Alibaba’s Qwen.

MiniMax: Founder Yan Junjie. M2.5 coding model launched February 2026, positioned as production-grade tool rather than chatbot. Stock rose 20+ percent on announcement; market cap crossed HK$100 billion. Operates Talkie (international companion chatbot, ~$70M revenue in 2024). Strategic pivot: from foundation model training to application-layer products, reducing cost and accelerating time-to-market.

Moonshot AI: Kimi chatbot, 13+ million users. $3.3 billion valuation. Kimi K2.5 launched January 2026 with video generation and agentic capabilities. Backed by Alibaba and Tencent (Moonshot is the canonical example of platform companies investing in model suppliers rather than competing directly). Focus on long-context processing, positioned as complementary to platform offerings.

DeepSeek: The one non-platform Chinese AI company to have moved global markets. Provides open-source models that Chinese platforms (especially Tencent, which integrated DeepSeek into Yuanbao) use to supplement their own foundation models. Commercial structure less transparent than peers. Relationship to the platform war: infrastructure provider rather than competitor.

01.AI: Kai-Fu Lee’s venture. Less public presence in the agent race, but foundation model work continues.

The second-tier pattern is consistent. These companies either supply models to the Big Four (Moonshot, DeepSeek) or compete for adjacent segments (Zhipu’s enterprise, MiniMax’s coding) rather than trying to become platforms themselves. Becoming a platform in China requires distribution infrastructure that takes decades to build. The model suppliers are smart not to try.

What This Pattern Reveals

The Chinese agent platform war differs from the US agent war in three structural ways.

First, consumer-first rather than enterprise-first. US agents deploy through enterprise software (Salesforce Agentforce, Microsoft Copilot) and reach consumers through employer mandates. Chinese agents deploy through consumer super-apps (WeChat, Alipay, Douyin) and reach enterprises through the same platforms extended into business surfaces. The direction of travel is reversed. The implication: Chinese agent capabilities get tested against consumer behavior before being hardened for enterprise, while US capabilities get tested against enterprise requirements before being adapted for consumer.

Second, open-source at the foundation. Every major Chinese platform has adopted OpenClaw, open-sourced its own agent development tools (Coze Studio, Coze Loop), and committed to open-weights models (Qwen, ERNIE). The competitive moat is not the framework. It is the distribution, the enterprise tooling, and the integration into local workflows. US agent platforms compete at both layers: proprietary frameworks plus proprietary distribution.

Third, state-adjacent development. Chinese central government restrictions on OpenClaw use in state-owned enterprises and banks, Ministry of State Security manuals, 15th Five-Year Plan targets of 10 trillion yuan in AI industry size by 2030, local government subsidies in Shenzhen and Wuxi, the National Data Administration’s designation of 词元 as the official term for token: all of this shapes how Chinese platforms build agent products. US platforms operate under regulatory pressure (California AI laws, EU AI Act) but not within an industrial policy framework. The Chinese platforms that navigate the regulatory environment best (currently Tencent, with its security-first positioning) capture advantages that are not technical.

Three implications for readers of this series.

For the engineer: The platform choice for agent deployment in China is primarily a distribution decision, not a model decision. All four platforms offer comparable agent capability on paper. The differentiator is how many users your agent reaches, what data it can access, and what ecosystem it integrates into. Build on WeChat for consumer reach, on Alibaba Cloud for enterprise scale, on Volcano Engine for token economics, or on Baidu for integrated compute. The model underneath is largely interchangeable.

For the founder: The defensible positions are narrow. Building a better agent framework will not matter: the Big Four have absorbed OpenClaw and will absorb whatever comes next. Building a better consumer-facing agent will not matter: the Big Four have distribution advantages that no startup can overcome at scale. The opportunities are in vertical applications (specific industry workflows where platform companies under-invest), in model specialization (following Moonshot and Zhipu into long-context, coding, or domain-specific models), or in international markets where Chinese platforms face regulatory barriers.

For the investor: The Big Four’s market capitalization already prices in significant agent-related upside. The interesting opportunities are in the second tier. Zhipu and MiniMax crossing HK$100 billion on agent-model announcements is a preview: model suppliers that can demonstrate enterprise traction will outperform platform companies whose agent economics are diluted by free consumer offerings. Alibaba’s 35.8 percent AI cloud market share is a structural advantage that should compound. Tencent’s double-down on AI spend (36 billion yuan planned for 2026) is a commitment worth tracking against delivery.

The Platform Layer Is the Prize

In the 2010s, the dominant platform layer was cloud infrastructure. AWS, Azure, and Google Cloud split the American market. Alibaba Cloud, Tencent Cloud, and Baidu Cloud split the Chinese market. The dominant cloud provider captured the economics of every application built on top.

The late 2020s will likely be defined by a similar platform war, but at the agent layer rather than the cloud layer. The agent layer sits above the cloud: agents orchestrate tool calls, manage context, and deliver outcomes that enterprise and consumer applications consume. The company that owns the agent layer owns the economic returns from every application that runs agents.

In China, the agent layer is being claimed by four companies with different strategies, different strengths, and different exposure to regulatory risk. None of them has yet locked in dominance. As of mid-April 2026, Alibaba and ByteDance lead on technical and financial commitment, Tencent leads on distribution and regulatory positioning, and Baidu competes on full-stack vertical integration from a weaker distribution base.

The race will resolve over the next eighteen to thirty-six months. By then, the platform war will have produced a clear hierarchy, and the economics of Chinese AI will be reshaped accordingly.

Inside China’s Machine. China’s AI and robotics ecosystem, from the inside.

Sources

Platform company strategies and product launches: Reuters, CNBC, Bloomberg, South China Morning Post, The Next Web, Fortune, KrASIA, TMTPOST, BigGo Finance, IndexBox. Tencent ClawPro launch (April 3, 2026) via Tencent Cloud official announcement and SCMP. WeChat ClawBot (March 22, 2026) via Reuters and PYMNTS. Alibaba Token Hub restructuring via Fortune (April 2026). ByteDance Coze and Doubao details via TechNode, KrASIA, TMTPOST. Baidu ERNIE 5.0 and full-stack strategy via Baidu World 2025 keynote, InfoWorld, eWeek.

User metrics: Double V Consulting (Doubao 159M MAU, Qwen 300M MAU). Fortune (140 trillion tokens/day in China, up from 100B at start 2024). Chinese New Year AI shopping data (Alipay 120M transactions in a week of February 2026) from Double V.

Market share data: 2025 IDC data on public cloud LLM market (Volcano Engine 46.4% as of mid-2025). Alibaba 35.8% AI cloud share via The Next Web ClawPro coverage (April 2026) citing industry data. Note: market share rankings shift with methodology and segment definition. Alibaba, Tencent, and Baidu are the three largest Chinese cloud providers by different measures.

Financial commitments: Tencent 18 billion yuan AI spend 2025, planned to double in 2026, via Martin Lau statements and Tencent earnings. Alibaba 67 billion yuan R&D spend 2025 via Second Talent industry reporting. Volcano Engine revenue 12+ billion yuan (2024) and 25 billion yuan (2025 target) via TMTPOST. Baidu AI-powered business 43% of core revenue (Q4 2025) via Baidu earnings.

Model capability benchmarks: LMArena rankings for ERNIE 5.0 (No. 1 in China, No. 8 global on text benchmark, as of January 2026) via ERNIE Blog. Doubao-Seed-2.0 positioning against GPT-5.2 and Gemini 3 Pro via TechNode (February 2026). Zhipu GLM-5 and MiniMax M2.5 announcements (February 12, 2026) via BigGo Finance.

Model supplier data: Moonshot AI valuation ($3.3B) and Kimi user base via Second Talent. Zhipu AI ($2B+) and MiniMax stock surges via BigGo Finance. DeepSeek context via Fortune and CNBC.

Government and regulatory context: 词元 (cí yuán) term designation from National Data Administration administrator Liu Liehong’s speech at China Development Forum 2026 (March 23-24, 2026) via SCMP, Fortune, PANews, China Daily. 15th Five-Year Plan AI industry target of 10 trillion yuan by 2030 via Vision Times. MIIT security guidelines and MSS “Lobster Safety Farming Manual” (March 2026) via multiple Chinese state media sources.

Classification of data points: User counts and token volumes are Confirmed (company disclosure or state regulator data). Market share percentages are Estimated (third-party research, methodology varies). Financial commitments are Projected (announced plans rather than reported results). 2030 targets (ByteDance 100B yuan, China AI industry 10T yuan) are Projected.

The Rise of Agents, Part 5: Inference as Agency

Hugo — Tue, 28 Apr 2026 07:01:02 GMT

In September 2024, OpenAI released o1. It looked like another frontier model. It behaved differently. On math competitions, it scored more than double what GPT-4 scored. On coding benchmarks, the gap was similar. On physics problems at the graduate level, it matched or beat human experts. The model had not been scaled up. It had been trained to do something new: think for a long time before answering.

For users, this looked like waiting. A question went in. Nothing happened for twenty, thirty, sixty seconds. Then an answer came out. In the interval, the model was doing something it had not done before: generating thousands of tokens of internal reasoning that the user never saw. These tokens were the model working through the problem, considering alternatives, checking its own logic, backtracking. A chain of thought, but longer, and trained in rather than prompted.

Within a year, every frontier lab had a reasoning model. DeepSeek-R1 in January 2025 showed the training recipe could be reproduced in open weights. Anthropic added extended thinking. Google added Deep Think. By 2026, reasoning is no longer a separate model class that users opt into. It is a capability built into flagship models across the major labs, activated when the task warrants it.

This article is about what changed when reasoning moved from prompt-time to inference-time. The short version is that the loop Part 3 described (deliberate, act, observe), which Part 4 wrapped in a harness, has a twin: an internal loop, running inside the model, during a single inference run. When the external loop runs, the model is one component in an environment. When the internal loop runs, the model is running an environment of its own.

Two Loops

A useful way to hold the picture. A language agent in 2026 has two loops active at once.

The external loop is the ReAct loop from Part 3. The agent reasons, takes an action, observes the result, reasons again. This runs at the harness layer. Turns are usually minutes apart. Each turn is a separate call to the model, with context carried in the prompt. The harness from Part 4 manages this loop: what goes in the context, what tools are available, when to stop.

The internal loop is what reasoning models do inside a single inference run. The model generates a chain of thought. Within that chain, it proposes approaches, evaluates them, notices mistakes, revises. All of this happens inside one call. The user sees none of it. Only the final answer comes out.

These loops have the same structure and do different things. The external loop navigates an environment. The internal loop navigates a problem. The external loop has real-world stakes, tool calls, persistent memory. The internal loop has token-space stakes, no external actions, memory that lives only as long as the reasoning trace. They compose. An agent with an extended-thinking model can reason internally within each turn, then act externally between turns.
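How the two loops compose can be sketched in a few lines. Everything here is a toy: `toy_model`, `ModelReply`, and the `lookup` tool are hypothetical stand-ins invented for illustration, and the "thinking" list is hard-coded where a real reasoning model would generate it.

```python
from dataclasses import dataclass
from typing import List, Tuple

OBS_PREFIX = "observation: "


@dataclass
class ModelReply:
    thinking: List[str]   # internal loop: hidden reasoning steps
    action: str           # what the model wants to do next
    argument: str = ""


def toy_model(context: List[str]) -> ModelReply:
    """Stand-in for a single model call (hypothetical, not a real API)."""
    thinking = ["propose: look the answer up", "check: is there an observation yet?"]
    observations = [c for c in context if c.startswith(OBS_PREFIX)]
    if observations:
        thinking.append("verify: observation present, answer now")
        return ModelReply(thinking, "finish", observations[-1][len(OBS_PREFIX):])
    thinking.append("revise: no data yet, call the tool")
    return ModelReply(thinking, "lookup", "capital of France")


# Hypothetical tool registry for the external loop.
TOOLS = {"lookup": lambda query: {"capital of France": "Paris"}.get(query, "unknown")}


def run_agent(task: str, max_turns: int = 5) -> Tuple[str, int]:
    context = [f"task: {task}"]
    hidden_steps = 0
    for _ in range(max_turns):                 # external loop: one model call per turn
        reply = toy_model(context)             # internal loop runs inside this call
        hidden_steps += len(reply.thinking)    # the trace is counted, then dropped
        if reply.action == "finish":
            return reply.argument, hidden_steps
        result = TOOLS[reply.action](reply.argument)   # action in the environment
        context.append(OBS_PREFIX + result)            # only the observation persists
    return "gave up", hidden_steps


answer, hidden_steps = run_agent("What is the capital of France?")
print(answer, hidden_steps)
```

The design point is in what persists: the observation is carried into the next turn's context, while the reasoning trace lives only for the duration of the call that produced it.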

The rest of the article is about the internal loop. What it is, why it works, what it can and cannot do.

What Changed in Training

Part 2 covered how RLVR, reinforcement learning with verifiable rewards, shapes a model against automatically checkable signals. Did the math answer come out right. Did the code pass tests. In early 2025, the DeepSeek team used RLVR to train a model to produce long reasoning traces before answering. The reward was only on the final answer. The model was not told what good reasoning looked like. It was rewarded when the final answer was right.
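An outcome-only reward of this kind is simple to write down. The sketch below is not DeepSeek's implementation; it assumes a made-up output convention in which the completion ends with a line of the form "Answer: <value>", and the function name `outcome_reward` is introduced here for illustration.

```python
def outcome_reward(completion: str, reference_answer: str) -> float:
    """Score only the final answer; the reasoning trace is never inspected."""
    lines = completion.strip().splitlines()
    if not lines:
        return 0.0
    last_line = lines[-1].strip()
    # Assumed convention: the final line looks like "Answer: <value>".
    if not last_line.startswith("Answer:"):
        return 0.0
    predicted = last_line[len("Answer:"):].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0


short_trace = "17 * 3 = 51\nAnswer: 51"
long_trace = (
    "Try 17 * 3 = 41. Wait, check: 10 * 3 = 30 and 7 * 3 = 21, so 51.\n"
    "Answer: 51"
)
wrong = "17 * 3 = 41\nAnswer: 41"

print(outcome_reward(short_trace, "51"))
print(outcome_reward(long_trace, "51"))
print(outcome_reward(wrong, "51"))
```

Both correct completions earn the same reward regardless of how they got there. Nothing in the function says what good reasoning looks like; any pressure toward self-checking has to come from the fact that it makes correct final answers more likely.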

What emerged, over enough training, was striking. The model developed reasoning patterns the researchers had not specified. Self-reflection: the model would propose an approach, then question it. Verification: the model would check intermediate steps before proceeding. Dynamic strategy adaptation: when an approach failed, the model would back up and try something else. These behaviors were not hand-coded. They fell out of optimizing for correct final answers on problems hard enough that one-shot attempts rarely worked.

The DeepSeek paper named this the “aha moment” of reasoning model training. At some point in the training run, the model starts spending more tokens on its reasoning, and those tokens start looking like strategies humans might use. This is not anthropomorphism. The tokens are real, the strategies are measurable, and they are the direct effect of reward pressure on verifiable tasks.

OpenAI’s o1 was trained through a similar pipeline, as were o3, Claude’s extended thinking, and Gemini’s Deep Think. The variations matter less than the pattern. The field has found a way to produce longer internal reasoning through RL training, and models built this way reason better on verifiable tasks than models that have not been through this process.

Not every reasoning model is a separate model. Claude 3.7 onwards is a hybrid: a single set of weights that can produce both fast direct responses and extended thinking traces, with the mode determined by a flag at request time. More recent models like Claude Opus 4.6 and 4.7 use adaptive thinking, where the model decides for itself how much reasoning each query warrants. The internal loop, in other words, is not always running. It runs when invoked, by the user, the harness, or the model itself.

Why This Is Different From Longer Generation

Long outputs are not new. Language models have generated long completions since the beginning. What is new is that the long generation is reasoning about the problem before answering, not answering at length.

The distinction matters because it connects to a real architectural constraint. A transformer with fixed depth processes all input tokens in parallel through a fixed number of layers. Without intermediate tokens, the transformer’s computational capacity per query is bounded. The theoretical result, established in an ICLR 2024 paper titled “Chain of Thought Empowers Transformers to Solve Inherently Serial Problems,” is that transformers without chain of thought can only compute functions in a limited complexity class. With chain of thought, they can solve any problem solvable by boolean circuits of size proportional to the chain length.

This is stronger than a usability finding. The intermediate tokens are not cosmetic. They are computational steps, expanding what the model can actually compute within a single query. A transformer with a short chain has fundamentally less expressive power than the same transformer with a long chain. Chain-of-thought prompting found this empirically. Reasoning model training trains it in.

An analogy helps. A person asked to compute 847 times 293 in their head without paper will do badly. The same person with paper will do it easily. The paper is not making the person smarter. The paper is providing the intermediate steps that arithmetic requires, which the person’s head cannot hold all at once. For transformers, the chain of thought is the paper. The model does not have more capacity when it is given a long chain. It has the intermediate steps the computation requires.
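The paper-and-pencil version of that multiplication can be written out as code. This is only an illustration of the analogy, with a hypothetical helper name `long_multiply`; it assumes non-negative integers. Each recorded step plays the role of an intermediate token: a small result that later steps build on.

```python
def long_multiply(a: int, b: int) -> tuple[list[str], int]:
    """Multiply by one digit of b at a time, writing down each partial product."""
    steps = []
    total = 0
    place = 1
    for digit_char in reversed(str(b)):   # ones digit first, then tens, hundreds
        digit = int(digit_char)
        partial = a * digit * place
        total += partial
        steps.append(f"{a} x {digit * place} = {partial}; running total {total}")
        place *= 10
    return steps, total

steps, product = long_multiply(847, 293)
# steps:
#   847 x 3 = 2541; running total 2541
#   847 x 90 = 76230; running total 78771
#   847 x 200 = 169400; running total 248171
```

No single step is hard. The difficulty of doing it "in the head" is holding three partial products and a running total at once, which is exactly what the written steps remove.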

This is why “thinking longer” produces real gains on problems that require serial computation. Math problems, multi-step logic, code that requires tracing through state changes. The model is not performing a qualitatively different operation when it thinks longer. It is performing the same operations with more steps available.

The Reasoning Risk

On problems that do not require serial computation, longer reasoning provides little or no benefit. Sometimes it hurts.

A September 2025 paper from researchers at Peking University and Microsoft evaluated fourteen reasoning models on two knowledge-intensive benchmarks, SimpleQA and FRAMES. The task was answering factual questions like who received the IEEE Frank Rosenblatt Award in 2010. These are knowledge lookups. They do not benefit from serial reasoning; either the model has the fact or it does not.

Across nearly every model tested, more thinking did not help. In many cases, it made things worse. GPT-5-mini’s hallucination rate on SimpleQA increased by fifteen percentage points as reasoning length went from 300 tokens to 3,300 tokens. The model was thinking longer, and thinking itself into more confidently wrong answers. For some models on some tasks, longer reasoning shifted the distribution of errors. Fewer confident answers, more abstentions, but also more attempts on questions the model did not know and should not have answered.

This is the reasoning risk. An extended chain of thought gives the model more opportunity to construct plausible-sounding reasoning for a wrong answer. The same chain that helps on math problems can hurt on knowledge questions. On math, the chain explores and verifies. On factual recall, the chain elaborates and confabulates. The model treats both as reasoning. The outcomes are very different.

The implication is that inference-time reasoning is not universally better. It is better on problems that require computation, worse on problems that require retrieval or judgment. A production agent cannot just turn reasoning up and expect improvement. The right amount of thinking depends on the problem. Harness engineers have started calling this “adaptive reasoning allocation”: short reasoning for simple queries, long reasoning for hard verifiable problems, careful calibration for everything in between.

The same LangChain data from Part 4 makes this concrete. Their coding agent scored 53.9 percent with maximum reasoning at every step, 63.6 percent with moderate reasoning throughout, and 66.5 percent with a reasoning sandwich: high compute at planning, moderate at execution, high at verification. The harness decides when to think. Thinking everywhere is worse than thinking strategically.
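The reasoning sandwich amounts to a small policy table in the harness. The sketch below is a guess at the shape of such a policy, not LangChain's implementation: the phase names, the effort levels, and the token budgets are all hypothetical values chosen for illustration.

```python
from dataclasses import dataclass

# Hypothetical thinking budgets, in tokens, for each effort level.
BUDGETS = {"low": 0, "moderate": 2_000, "high": 16_000}

# High compute at planning, moderate at execution, high at verification.
SANDWICH = {"plan": "high", "execute": "moderate", "verify": "high"}

@dataclass
class Step:
    phase: str
    description: str

def thinking_budget(step: Step,
                    policy: dict[str, str] = SANDWICH,
                    default: str = "moderate") -> int:
    """Look up how many reasoning tokens the harness grants for this step."""
    return BUDGETS[policy.get(step.phase, default)]

steps = [
    Step("plan", "decide which files to change"),
    Step("execute", "edit the parser"),
    Step("execute", "update the tests"),
    Step("verify", "run the tests and review the diff"),
]
budgets = [thinking_budget(s) for s in steps]
```

Swapping `SANDWICH` for a policy that maps every phase to `"high"` reproduces the "maximum reasoning at every step" configuration, which is the one that scored lowest in the figures quoted above.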

Faithfulness

A further complication. The reasoning trace is not a reliable guide to what the model is actually doing.

The chain of thought is generated by the same transformer that produces the final answer. Both chain and answer are token sequences optimized against the same reward signal. There is no separate reasoning module that the chain represents. The tokens of the chain are outputs of the model, shaped to look like reasoning because that shape was reinforced during training, but not necessarily corresponding to what the model is computing underneath.

Anthropic’s research on this, along with work from academic groups, has found that reasoning traces are sometimes faithful to the underlying computation and sometimes not. A model can produce a correct answer with a confused or post-hoc reasoning trace. A model can produce a compelling reasoning trace for a wrong answer. The relationship between trace and answer is not guaranteed.

This matters for two reasons. First, a user who trusts the reasoning trace is trusting an artifact that may or may not reflect the model’s actual computation. Second, attempts to catch model errors by inspecting the trace are fundamentally limited if the trace can be misleading. The community has not solved this. Reasoning trace faithfulness is an open research problem. Production harnesses often log traces for debugging and typically do not show them to end users, partly because the traces can confuse or mislead.

For the series’ larger argument, this is a check against overclaiming what inference-time reasoning provides. It provides real computational capacity. It does not provide guaranteed transparency into what the model is doing. Whether the visible chain of thought is what the model is actually thinking is a question we cannot fully answer.

The Internal Loop as Self-Improvement

Part 4 introduced self-improvement in the form of harness-level agents that maintained code quality invariants across a codebase. Inference-time reasoning is a different form, and a more subtle one.

A reasoning model, within a single inference run, examines and corrects its own chain of thought. Propose an approach, notice it will not work, back up, try another. This is not self-improvement across sessions or across tasks. It is intra-reasoning self-correction, happening inside one call, visible only in the trace.

The distinction from Part 4’s self-improvement matters. In Part 4, agents continuously improved the environment in which other agents worked. Humans set the direction. The agents enforced the direction against code. Here, the model self-corrects within a single task, without external tooling or human oversight in the moment. The correction happens before the model emits its final answer. The user does not see the correction. They see only the result.

This is more internalized than Part 4’s self-improvement. It is also less consequential per unit. A harness-level self-improvement can reshape a codebase over days or weeks. An inference-time self-correction affects one answer. But the mechanism is structurally the same: a system evaluating its own output against criteria and revising when the evaluation fails. The criteria here are the model’s implicit sense, trained in by RLVR, of what a correct reasoning step looks like. The revision is the next token the model generates.

The series is building toward a philosophical question about self-direction versus self-improvement. Inference-time reasoning is a data point for that question. A model that self-corrects within its chain of thought is doing something that looks, on a small scale, like the kind of reflective adjustment we associate with thinking. Whether that resemblance goes deeper than the surface is genuinely open. For now: the internal loop is real, the self-correction is measurable, and the analogy to external agent loops is genuine structural similarity, not metaphor.

Limits of Scaling Inference

A natural question in 2026 is whether inference-time scaling will continue to yield improvements, or whether this line of research will run into limits the way training scaling eventually did.

The early evidence is mixed. Researchers have established empirical scaling laws for inference compute, separate from the training scaling laws. Within a given model and task type, more inference tokens tend to mean better performance, with a roughly predictable curve. The curve bends. The question is when it bends to flat.

A Royal Society paper from February 2026 proposed a theoretical framework for inference compute scaling, modeling inference as stochastic traversal over a learned skill graph. Its findings are consistent with what practitioners have observed. On well-specified problems, performance improves roughly linearly with the logarithm of inference compute, so each doubling of compute buys about the same increment. Diminishing returns on problems outside the domain the model was trained to reason about. Transfer that works better than expected on some task classes, not at all on others.
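A toy curve makes the shape concrete. Everything numeric here is invented for illustration and does not come from the paper or from any benchmark: the starting score, the gain per doubling, the ceiling, and the reference budget are placeholder assumptions. The only point is the form, a score that rises by a fixed amount per doubling of tokens and then goes flat.

```python
import math

def toy_score(tokens: int,
              base: float = 0.30,               # assumed score at ref_tokens
              gain_per_doubling: float = 0.05,  # assumed gain per 2x tokens
              ceiling: float = 0.70,            # assumed plateau
              ref_tokens: int = 256) -> float:
    """Log-linear improvement in reasoning tokens, capped at a ceiling."""
    raw = base + gain_per_doubling * math.log2(tokens / ref_tokens)
    return min(ceiling, raw)

token_budgets = [256, 512, 1024, 2048, 65536, 131072]
scores = [round(toy_score(t), 2) for t in token_budgets]
```

In this toy model the first three doublings each add the same increment, while the last doubling, from 65,536 to 131,072 tokens, adds nothing. Where a real model's ceiling sits for a given task type is the empirical question.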

The practical situation for agents is that inference-time reasoning is a lever, not a solution. It helps when the task is computable in principle and the bottleneck is serial computation. It helps less when the task requires knowledge the model does not have or judgment the model has not been trained to make. It hurts when the task is simple and the extended chain provides more room for confabulation.

Where this lands as of 2026: inference-time reasoning has moved from a separate model class to a default capability, with selective activation managed by harnesses or by the model itself. The leading labs are investing in making the reasoning more efficient, which is a different problem from making it more powerful. Current reasoning models generate many tokens that are not necessary for the answer. Compressing reasoning while preserving quality is an active research front. The direction of progress for the next few years is probably more adaptive, not more extreme.

The Internal Environment

There is a framing of all this worth holding. In Parts 3 and 4, the agent was a model inside an environment. Tools, memory, harness, humans. The environment was the scaffolding that let the model act reliably.

With inference-time reasoning, the model runs an environment inside itself. The chain of thought is a working memory. The reasoning steps are actions in that memory. The self-correction is a correction against internal state. The environment is made of tokens, not of tools or files or APIs. But structurally, it is an environment. The model is an agent inside it, doing what agents do: proposing, acting, checking, revising.

Nothing about this changes the model itself. Its weights and architecture are fixed, and so is what it can compute in a single step. What changes is the surface area of the model’s operation in any given query. A model without extended thinking operates on a thin strip of tokens: prompt in, answer out. A model with extended thinking operates on a wider surface, including a generated internal space where it can lay out and manipulate its work.

This is not a new kind of intelligence. It is a new kind of operating environment for the same underlying model. The capability gains come from the environment being more suitable for the kind of computation the task requires. The capability losses, when they happen, come from the environment being the wrong shape for the task, or from the model using the extra surface to generate plausible-seeming wrong answers.

The series’ thread on intention returns here, at a different altitude. The intention gap, at the summit of the diagram, is about whether the model can originate its own goals. Nothing in inference-time reasoning touches that. The model, thinking for thirty seconds about a math problem, has been given the goal. Its reasoning explores solutions to the goal. It does not originate alternatives to the goal. An agent with extended thinking is a more capable executor of human intention. It is not closer to having intention of its own.

The Model Runs an Environment

Part 4 said the agent is mostly the system around the model. Part 5 adds that the model is also running a system inside itself. Both are true. The agent’s capability comes from both directions. External harness makes the agent reliable across turns. Internal reasoning makes the agent reliable within turns. Each has its failure modes. Each has its scaling dynamics. Each has research frontiers that the next few years will push on.

The picture that emerges, across Parts 2 through 5, is a language agent as a stack. Pretraining provides knowledge. RL shapes behavior. The ReAct loop externalizes reasoning across turns. The harness scaffolds the external loop. Inference-time reasoning internalizes a smaller loop within each turn. Every layer exists because the layer below it was necessary but insufficient.

Part 6 turns to what happens when the stack is replicated. Not one agent operating in one environment, but many agents operating in shared environments, communicating with each other, specializing. The engineering questions change when you move from one to many. The failure modes change. The capabilities change. Multi-agent systems are the architectural frontier of 2026, and they are the subject of the next article.

The Rise of Agents is an eight-part series. Next, Part 6: “The Agent Ecosystem.”

Inside China’s Machine: DeepSeek V4

Hugo — Mon, 27 Apr 2026 12:26:46 GMT

The most-quoted line about DeepSeek V4 came from Jensen Huang on the Dwarkesh Patel podcast a week before the model launched. Asked about reports that DeepSeek’s next frontier model would run on Huawei Ascend chips rather than Nvidia GPUs, Huang said it would be “a horrible outcome for America.” The financial press treated this as another China-AI-race headline. The technical press treated it as Huang’s predictable defense of Nvidia’s market position. Both readings missed what Huang was actually saying.

The threat Huang named was not that China can build good models. China has been building good models since DeepSeek R1 fifteen months ago. The threat was that good models might no longer use CUDA as their default optimization target. Nvidia’s moat is not the silicon. The silicon is replicable. The moat is the twenty-year compounding of CUDA: five million developers, every textbook example written against it, every PhD student trained on it, every framework built around it. A Chinese frontier model trained outside that ecosystem does something more structurally important than match Western performance. It demonstrates that the ecosystem can fork.

DeepSeek V4 launched on Friday, April 24, 2026. Two preview versions, both open-weight under MIT license. V4-Pro at 1.6 trillion total parameters with 49 billion active. V4-Flash at 284 billion total with 13 billion active. Both default to a one-million-token context window. Both ship with day-zero inference support across Huawei’s Ascend 950PR supernodes. Day-zero is the part that matters. Eight domestic Chinese chip families completed V4 adaptation simultaneously through BAAI’s FlagOS national AI software stack. Within hours, Cambricon, Hygon, Moore Threads, Suiyuan, and four other Chinese accelerator vendors confirmed native support. Alibaba, ByteDance, and Tencent had pre-ordered hundreds of thousands of 950PR units in the weeks before launch, pushing chip prices up twenty percent. The day DeepSeek shipped V4, China’s domestic AI compute ecosystem was already coordinated to receive it.

This is the story of what actually happened, why it took DeepSeek fifteen months instead of three, and what it means that the Chinese AI stack now offers the only frontier-model deployment path with a credible route to Nvidia independence.

What the Tech Report Actually Says

DeepSeek’s fifty-eight-page technical report, released alongside the model weights on Hugging Face, is more honest than most of the coverage of it. The report states that V4 was trained with parallel verification on both Nvidia GPUs and Huawei Ascend NPUs. Parallel verification means the two platforms produced numerically aligned results during training, not that V4 was trained twice. The cost of duplicate frontier training runs (more than $500 million per run, by some reports) makes that implausible. What parallel verification did was establish Ascend as a target platform that could be trusted to reproduce CUDA-derived results, with Nvidia serving as the ground-truth baseline. Huawei’s own announcement says its chips were used for a portion of V4-Flash training. The bulk of V4-Pro training, the 1.6-trillion-parameter model, almost certainly ran on Nvidia GPUs at peak capability.

The 950PR’s role at launch is inference, not training. The 950DT, Huawei’s first Ascend chip optimized for both decoding and training, ships in Q4 2026. The 950DT will reduce but not eliminate Nvidia’s training-side advantage. Single-chip FP8 performance stays at 1 PFLOPS, the same as the 950PR and roughly a quarter of Nvidia’s B200, with the difference being HiZQ 2.0 memory at 144 GB and 4 TB/s for sustained-bandwidth training workloads. Huawei’s announced roadmap targets full single-chip parity with Nvidia only by 2028 with the Ascend 970. The intermediate Ascend 960 (Q4 2027) targets parity with Blackwell, which by 2027 will already be one generation behind Nvidia’s then-current chip.

The truthful framing: V4 is the first frontier-class model co-engineered for Chinese silicon, not the first trained entirely on Chinese silicon. The distinction matters because it tells you what stage the fork is at. Training of the largest models in the Chinese AI stack still depends on Nvidia for peak capability and on Nvidia as the verification baseline. Inference has a credible path to Ascend independence over the next twelve months, though the full switchover waits on 950PR’s at-scale shipments in the second half of 2026. For a model whose economic value at deployment depends mostly on inference cost, the inference-side independence is meaningful even when the training side is not yet free.

The architectural choices reveal how DeepSeek made it work. The report introduces five innovations:

Hybrid attention. V4 combines Compressed Sparse Attention, DeepSeek Sparse Attention, and Heavily Compressed Attention. CSA dynamically compresses key-value entries before computing attention. DSA sparsifies the resulting attention matrices. HCA aggressively consolidates KV entries across token sets. The net effect: 73 percent fewer per-token inference FLOPs than V3.2, and 90 percent less KV cache memory at one-million-token context. NVIDIA’s own technical analysis confirmed these numbers when integrating V4 into Blackwell.

Manifold-Constrained Hyper-Connections. Standard transformers use residual connections that lose information in deep networks. V4’s mHC confines gradient flow to specific geometric manifolds, which the report describes as “a flexible and practical replacement for residual connections.”

Engram Conditional Memory. V4 separates factual memory from computational reasoning. Engram provides O(1) knowledge retrieval, which lifts needle-in-a-haystack accuracy at one million tokens from 84.2 percent to 97 percent in DeepSeek’s benchmarks. The report identifies a U-shaped scaling law: reallocating 20-25 percent of sparse capacity from MoE experts into Engram memory optimizes overall performance. This is the first production model to formalize “conditional memory” as a sparsity axis distinct from “conditional computation.”

Native FP4 quantization-aware training. V4 trains directly in FP4 precision. The Ascend 950PR has hardware-native FP4 support, which means no precision conversion overhead and seventy-five percent memory reduction per weight. The chip and the model are precision-matched at the silicon level. This is not coincidence. DeepSeek and Huawei co-designed for this.

Muon optimizer. Replaces Adam-based optimizers with a more aggressive convergence strategy that lets V4 train on 33 trillion tokens within a budget that earlier optimizers would have required substantially more compute to handle.
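The memory figure in the FP4 item above is plain bit arithmetic, sketched here. The parameter count is the V4-Pro total quoted earlier in this article. The 16-bit baseline is an assumption: the article does not say what the seventy-five percent is measured against, but 16 bits down to 4 is the comparison that produces that number. The sketch counts raw weight storage only and ignores quantization scale factors, activations, and KV cache.

```python
# Bits per stored weight for each format.
BITS = {"fp16": 16, "fp8": 8, "fp4": 4}

def weight_bytes(params: float, fmt: str) -> float:
    """Raw storage for the weights alone, in bytes."""
    return params * BITS[fmt] / 8

def reduction(from_fmt: str, to_fmt: str) -> float:
    """Fraction of per-weight memory saved by moving between formats."""
    return 1 - BITS[to_fmt] / BITS[from_fmt]

v4_pro_params = 1.6e12  # total parameters, as quoted in the article

sizes_tb = {fmt: weight_bytes(v4_pro_params, fmt) / 1e12 for fmt in BITS}
saving_vs_fp16 = reduction("fp16", "fp4")   # 0.75
saving_vs_fp8 = reduction("fp8", "fp4")     # 0.5
```

Under these assumptions the full weight set drops from about 3.2 TB at 16 bits to about 0.8 TB at 4 bits. Measured against an FP8 baseline instead, the saving would be half, not three quarters.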

The integrated effect is the cost structure that matters. V4-Pro’s input price is 1 yuan per million tokens. V4-Flash is 0.2 yuan. The same agentic coding workload that costs $30 per million tokens on a US frontier API costs $3.48 on V4-Pro and under one dollar on V4-Flash. Pricing this aggressive only works if the model actually costs less to run, which the architectural innovations make true rather than theatrical.

The Migration That Took Fifteen Months

The 36Kr investigative report on V4’s delay is the most useful Chinese-language source on what actually happened during the fifteen months between R1 and V4. The reporting traces the silence to two converging causes: a serious training failure in mid-2025, and a strategic decision to migrate the training framework from Nvidia CUDA to Huawei CANN.

The migration was an order of magnitude harder than the public framing suggested. According to engineers close to DeepSeek, the most time-consuming part was not rewriting operators. It was aligning numerical precision so that the same model produced the same mathematical results on Nvidia and Ascend platforms. When DeepSeek attempted training on the Ascend 910C, the 1024-card cluster’s gradient synchronization timed out. The older CANN release lacked key operators, which produced training instability. The 950PR addressed both issues: inter-chip bandwidth tripled, CANN Next built FlashAttention and PagedAttention into the framework natively. Liang Wenfeng’s technical demands during this period were reportedly difficult to translate into implementation, and internal disagreements about the training direction slowed progress further.
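Why numerical alignment is hard can be shown without any accelerator. Floating-point addition is not associative, so two platforms that sum the same values in a different order can disagree in the last bits, and those differences compound over a training run. The sketch below is a generic illustration in plain Python, not DeepSeek's or Huawei's tooling; the function names and the tolerance are hypothetical.

```python
import math

def reduce_forward(values: list[float]) -> float:
    """Sum left to right, as one backend's reduction might."""
    total = 0.0
    for v in values:
        total += v
    return total

def reduce_backward(values: list[float]) -> float:
    """Sum right to left, standing in for a backend with a different order."""
    total = 0.0
    for v in reversed(values):
        total += v
    return total

def aligned(a: list[float], b: list[float], rel_tol: float = 1e-12) -> bool:
    """Element-wise check that two result vectors agree within a tolerance."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )

grads = [0.1, 0.2, 0.3]
forward = reduce_forward(grads)     # 0.6000000000000001
backward = reduce_backward(grads)   # 0.6

bitwise_equal = forward == backward            # False
within_tolerance = aligned([forward], [backward])  # True
```

Alignment work of the kind the engineers describe is largely about deciding where bitwise equality is required, where a tolerance is acceptable, and tracking down the operators that fall outside it.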

The cost of this migration was visible in what V4 is not. V4 ships text-only. The multimodal generation and understanding capabilities that DeepSeek had targeted were postponed to a future release, the report states, because of compute and cash constraints from the Huawei migration. The talent bench thinned during the same period: Luo Fuli, a core V3 architect, left for Xiaomi to lead spatial intelligence. Guo Daya, the lead author on R1’s GRPO algorithm, joined ByteDance’s Seed team on a reported package that ByteDance denied was 100 million yuan annually but confirmed included equity. Wang Bingxuan, an early DeepSeek LLM author, went to Tencent. Ruan Chong, a multimodal researcher, joined Yuanrong Qixing. Headhunter accounts described offers at two to three times prevailing salary, with immediately priced stock options attached. DeepSeek could not match on the equity line because its equity had no price.

The fundraising decision in mid-April 2026 was a direct response to this. Liang Wenfeng spent two years rejecting outside capital. He turned down Tencent’s offer of a twenty-percent exclusive stake. The eventual round opened at a $10 billion valuation seeking $300 million. Five days later, The Information reported that talks with Tencent and Alibaba had pushed the figure above $20 billion. The stated purpose of the round, in the words of an investor familiar with Liang’s thinking, was not cash. It was to give DeepSeek’s employee stock options a market price. Without an external valuation, the equity meant to retain engineers had no number to anchor against. The twenty-billion-dollar tag is, in this reading, what retention costs.

The picture this assembles is of a research-first organization being pulled into commercial-company shape by forces that R1’s success generated. Doubao surpassed DeepSeek to become China’s number-one consumer AI app in August 2025, reaching 331 million monthly active users by March 2026. DeepSeek experienced an eleven-hour outage in late March that trended on Chinese social media. Liang began paying attention to product refinement. DeepSeek’s HR began contacting Chinese-language students at Peking University to do humanities-domain data annotation. The April 8, 2026 redesign of the DeepSeek app introduced Expert Mode for complex reasoning and Fast Mode for simple tasks, mapping directly to V4-Pro and V4-Flash. The company spent the V3-era idealism, and the V4 release was the first product of the company DeepSeek became after spending it.

The Performance Picture

V4-Pro’s headline benchmark is SWE-bench Verified at 80.6 percent, within 0.2 percentage points of Claude Opus 4.6. DeepSeek’s tech report claims V4-Pro beats all open-weight models in agentic coding, beats Claude Sonnet 4.5 on internal agentic coding evaluation, and approaches Claude Opus 4.6 in non-thinking mode. On Codeforces competitive programming, V4-Pro scores 3,206, ranking 23rd among human competitors. On Humanity’s Last Exam, the score jumps from 7.7 in non-thinking mode to 37.7 in thinking mode.

The honest reading of these numbers requires distinguishing categories. V4 is open-weight. Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro are closed API models. Same market, same maturity stage, but different commercial models and different distribution channels. V4 ships weights you can download, modify, and run locally. The closed competitors do not. For agentic coding workloads, the price differential is decisive: cache-hit V4-Flash input pricing at $0.028 per million tokens is roughly ninety times cheaper than equivalent Claude Sonnet output, while Vals AI’s Vibe Code Benchmark ranked V4 as the leading open-weight model.

Within open-weight competition, the picture is denser. A Zhihu evaluation found V4 not clearly superior to Zhipu’s GLM 5.1 or Kimi K2.6, both of which shipped while DeepSeek was silent. Zhipu and MiniMax explicitly accelerated their releases to avoid being overshadowed by V4’s timing. The day V4 launched, MiniMax stock fell 8 percent in Hong Kong, Zhipu fell 8 percent, and Manycore Tech fell 9 percent. Morningstar’s Ivan Su captured the implication: “DeepSeek’s latest positioning places other Chinese open-source models as direct competitors. This is a framing that didn’t exist with R1.”

The DeepSeek tech report is unusually candid on the gap. V4-Pro “falls marginally short of GPT-5.4 and Gemini 3.1 Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately three to six months.” This is a sober acknowledgment that the frontier of intelligence is still set by closed Western labs, and that V4’s significance lies elsewhere.

The elsewhere is the stack itself.

The Stack Forks

Nvidia’s CUDA dominance has been the AI industry’s most durable infrastructure assumption since 2012. CUDA is what made Nvidia the operating system of AI training, more than the silicon underneath. Five million developers, the textbooks, the framework integrations, the implicit assumption in every AI research paper that the code will compile against CUDA: this is what Huang has spent a decade defending. The chip business is downstream of the software ecosystem.

CUDA has been challenged before. Google’s TPU runs through XLA. AMD has ROCm. Intel had oneAPI. None of these has broken CUDA’s grip on the frontier of training, because none has been the default optimization target for a frontier model that the rest of the industry then has to support. V4 changes the asymmetry. CANN Next, Huawei’s CUDA equivalent, now has a frontier model that was co-engineered for it. Its added SIMT programming model, which compiles CUDA-style code directly for Ascend, lowers the migration barrier for developers already trained on CUDA. Huawei’s reported four million CANN developers are still fewer than CUDA’s five million plus, but the trajectory matters more than the level. A frontier model that ships first on a non-Nvidia stack is the kind of event that pulls the developer base.

The market response in the days following V4’s launch revealed which actors believed this. SMIC, the Chinese chipmaker that fabricates Huawei’s Ascend processors, jumped 10 percent in Hong Kong trading. Cambricon’s stock continued a multi-month rally driven by ByteDance’s reported $22 billion 2026 AI infrastructure budget, of which Cambricon is the largest beneficiary among domestic chip vendors. Domestic chip share in China climbed to over 40 percent of the 2025 AI accelerator market by IDC’s measurement, with 1.65 million units shipped. Nvidia’s China share fell from over 70 percent at peak to roughly 55 percent. The Tencent-Alibaba-ByteDance bulk pre-orders for hundreds of thousands of 950PR units, the price increase of twenty percent in the weeks before V4 launch, and BAAI’s day-zero adaptation across eight chip families together describe a domestic ecosystem that is no longer waiting on Nvidia’s roadmap.

The harder question is whether this generalizes outside China. Three constraints make the answer uncertain.

The performance gap is real. The 950PR delivers 1 PFLOPS at FP8. Nvidia’s Blackwell B200 hits 4.5 PFLOPS. Huawei is closing the gap through architectural innovation and FP4 hardware, but raw compute still shows a generational lag. V4’s compressed attention architecture cuts inference compute to 27 percent of V3.2’s, which is what allows the 950PR to host a frontier model in the first place. A model designed for brute-force scaling rather than efficiency might not replicate this path on Ascend.

The training side remains Nvidia-dominant. V4 used parallel verification on both platforms during training. The 950DT ships in Q4 2026, but its single-chip performance stays at the 950 die’s 1 PFLOPS FP8 baseline. Huawei’s roadmap targets parity with Nvidia’s current generation only by 2028 with the Ascend 970. Until at least 2027, frontier training in China leans on the most advanced Nvidia hardware that export controls permit, supplemented but not replaced by Ascend at the bleeding edge. The dependency has shifted from total to partial, not from total to none, and the partial-to-none transition is a multi-year process.

The CUDA ecosystem has decades of compounding. CANN’s four million developers, the SIMT compatibility layer, the FlashAttention and PagedAttention native support: these reduce the migration cost but do not eliminate it. The library depth, the tooling maturity, the Stack Overflow corpus, the years of accumulated debugging knowledge are non-fungible. Migration from CUDA, even with the best compatibility layer, will be a multi-year undertaking for any large codebase.

These constraints argue against the strongest version of the “stack forks” claim. The weaker version, which the evidence supports, is that there is now a credible alternative training-and-inference path that did not exist twelve months ago, and that the path is robust enough to host frontier models. CSIS analysts framed the implication directly: if V4 achieves frontier performance on Ascend silicon, the premise that restricting Nvidia exports can slow Chinese AI development is no longer correct. The European Union Institute for Security Studies described DeepSeek’s emergence as “the beginning of AI’s multipolarization.” Both are true. Neither implies that the multipolar world is symmetric.

What This Pattern Reveals

DeepSeek R1 in January 2025 demonstrated that frontier capability did not require frontier compute. That was a pricing argument: clever architecture could substitute for raw scale. The implication was that AI capex assumptions priced into hundreds of billions of dollars of US infrastructure investment had a thinner moat than anyone acknowledged.

V4 demonstrates something different. The argument is no longer about capex. It is about the stack. Frontier capability does not require Nvidia compute. The Chinese alternative stack is now functional end-to-end at the inference layer and partial-but-rising at the training layer. Three implications follow.

For the US export-control framework. The strategy assumed Chinese AI development could be slowed by restricting Nvidia hardware. V4 makes this assumption visibly false at the inference layer and structurally weaker at the training layer. The policy options narrow to two: escalate controls to target Huawei silicon and CANN software directly, or rethink the framework. The first path is technically possible but politically and diplomatically expensive, since Huawei does not depend on exports of US technology in the way that earlier sanctioned firms did. The second path is what most non-US analysts now advocate, but it requires accepting that the strategy has not delivered.

For the open-weight ecosystem. The competitive structure within Chinese AI now resembles US open-weight competition more than US-vs-China competition. DeepSeek’s direct competitors as of April 2026 are Alibaba’s Qwen3, Zhipu’s GLM 5, MiniMax’s M2, Manycore’s Spatial Gen, and ByteDance’s Doubao. These are different categories of company. Doubao is a consumer-app-first product, Qwen is a hyperscaler open-weight family, MiniMax is an API-plus-Hailuo product, Zhipu is enterprise-first, DeepSeek is research-first. The convergence onto compatible Huawei Ascend deployment removes the underlying compute fragmentation that previously justified separate strategies. Within the next year, choosing between Chinese open-weight models will resemble choosing between Llama 4 and Mistral Large in the West: different fine-tunes, similar capabilities, different distribution channels. V4’s 1 yuan per million input tokens establishes a low-end price anchor that the rest of the cohort will have to respond to.

For Western open-weight strategy. Meta, Mistral, and Cohere are now competing not just against Chinese frontier capability but against Chinese frontier capability plus a deployment stack at roughly an order of magnitude cheaper inference pricing. The structural advantage of open-weight Western labs has historically been ecosystem maturity: PyTorch, Hugging Face, the developer community. That advantage compresses each year. Whether Western open-weight can hold the line depends on factors largely unrelated to model capability: what happens to Nvidia’s pricing as Chinese competition emerges, what happens to inference cloud costs as alternative silicon scales, what happens to enterprise procurement as deployment portability becomes a buyer requirement.

The Founder Who Stopped Saying No

The most underweighted element of the V4 story is what it cost Liang Wenfeng to make it happen.

The R1-era DeepSeek was a research lab that happened to ship. Liang ran it from High-Flyer’s profits, paid researchers without urgency, kept commercial pressure away from the bench. His public statements emphasized that VCs need returns, that capital corrupts research culture, that DeepSeek would not raise. The model worked because High-Flyer’s quantitative trading produced 56.6 percent returns in 2025, which generated enough cash to fund an AI lab without requiring it to ever justify itself to outside investors.

The V4-era DeepSeek is a company. Liang accepted external capital. He took meetings with Tencent and Alibaba for stakes that, even after refusing the largest single demand, will dilute High-Flyer’s near-total ownership. He let HR run open-door recruitment for product strategists, established internal product teams to explore agents, redesigned the consumer app, accepted that V4 would ship without multimodal capabilities he had wanted. The eleven-hour outage in March was followed by infrastructure spending. The talent exodus was followed by an equity-pricing fundraise. The pattern is clear: the company is becoming what Liang spent two years trying to avoid.

Whether this is loss or evolution depends on what you think DeepSeek is for. If you read the company as a research institution producing public goods through open-weight releases, the V4-era trajectory is a compromise of original purpose. If you read it as Liang’s own description, an attempt to develop AGI under an organizational structure that maximizes research freedom subject to survival constraints, the V4-era trajectory is the second move in a game where the first move has stopped being available. The first move was rejecting capital while High-Flyer’s returns covered the budget. A frontier training run now costs more than $500 million by some reports. High-Flyer’s hedge fund profits, large as they are, cannot absorb that on an annual basis without becoming a different kind of fund. The math forced the choice.

The V4 release is the first product of the choice. It is also a demonstration that the choice can produce results that match or exceed what the prior structure produced, which is the only argument that retroactively justifies the choice to anyone who liked the prior structure.

The Significance Is Where the Hype Isn’t

The headline coverage of V4 has emphasized three things: low price, Huawei silicon, the threat to Nvidia. Each is true. None is the most important thing.

The most important thing is that the Chinese AI stack now exists as a coherent alternative deployment path, end-to-end, at frontier capability. Not symmetric to the US stack. Not yet superior on raw training capability. But coherent in the sense that you can choose it, build on it, ship in it, and the loop closes without requiring any non-Chinese component except the lithography stack underneath the silicon. Even there, China’s domestic manufacturing is climbing the curve.

This is a structural change, not a moment. R1 was a moment. V4 is the first move of a stack that intends to keep moving. The next chapters will be written by 950DT closing some of the training gap in 2026-2027, by Ascend 960 and 970 closing more of it through 2028, by FlagOS adapting to next-generation models, by Cambricon and Hygon catching up to Huawei in their respective niches, by Chinese open-weight labs converging onto a shared deployment substrate that does not require Nvidia.

For Western enterprises, the practical question is which side of this they will be procuring on by 2028. For Western policymakers, the question is whether the framework that assumed Chinese AI could be slowed by hardware controls survives the demonstration that it cannot. For Chinese AI labs, the question is which of them can compete in the market that DeepSeek has just made denser, and at what margin.

Jensen Huang said catastrophe. He chose the word carefully. The catastrophe he meant was not that V4 exists or that it runs on Huawei chips. The catastrophe was that the moat he spent twenty years building turns out to be less than indelible, and that the demonstration of this came from a Chinese lab that fifteen months ago was a side project of a quantitative hedge fund.

The stack was the moat. Now it has a fork.

Inside China’s Machine. China’s AI and robotics ecosystem, from the inside.

Sources

Launch and core specs: DeepSeek API documentation (official news260424); DeepSeek tech report on Hugging Face; CNBC (“China’s DeepSeek releases preview of long-awaited V4 model,” April 24, 2026); Fortune (“DeepSeek unveils V4 model, with rock-bottom prices,” April 24, 2026); Al Jazeera; Investing.com; ghacks Tech News.

Architecture: NVIDIA Developer Blog (“Build with DeepSeek V4 Using NVIDIA Blackwell”); kenhuangus Substack (“DeepSeek V4: The Next Frontier of Open-Source AI”); aitoolinsight (“DeepSeek Unveils V4 at Rock-Bottom Prices”); BigGo Finance technical report summary; remio.ai.

Huawei Ascend integration: South China Morning Post (“Huawei, DeepSeek strengthen China’s AI self-reliance”); Reuters (via Investing.com); Huawei Central; weijinresearch Substack on 950PR specifications and CANN Next; digitado.com.br.

Migration story and training failure: 36Kr investigative report (“DeepSeek V4 Released: Five Subjective Questions Remain Unanswered”); 36Kr “Jensen Huang Labels It a ‘Disaster’”; overnightai.substack.com summary of FlagOS day-zero adaptation across eight chip families.

Liang Wenfeng and fundraising: The Information (via Unite.AI, “DeepSeek Seeks First Outside Funding at $10 Billion Valuation,” April 17, 2026); Implicator.ai (“Tencent, Alibaba in Talks to Back DeepSeek at $20 Billion,” April 22, 2026); BigGo Finance financial logic analysis; Tech Startups; futunn.com summary of architectural targets and mid-2026 timeline.

Domestic chip ecosystem: IDC 2025 China AI accelerator market data via digitado; Counterpoint analyst Wei Sun via CNBC; ByteDance ¥160B 2026 infrastructure spend reporting via 36Kr.

Talent departures: Unite.AI; SCMP via Implicator.ai; BigGo Finance.

Performance benchmarks: DeepSeek tech report; Vals AI Vibe Code Benchmark; Zhihu evaluation summary via overnightai; Codeforces ranking via aitoolinsight.

Strategic context: CSIS analysis on export controls (cited in remio.ai); EUISS framing via remio.ai; Jensen Huang Dwarkesh podcast quote via 36Kr and digitado.

Classification: Architectural specifications and benchmark numbers from official tech report are Confirmed. Training migration details (mid-2025 failure, internal disagreements) are Reported per 36Kr’s “insiders” sourcing. Fundraising figures ($10B-$20B valuation range, $300M raise target, Tencent’s rejected 20% offer) are Reported per The Information sourcing. Talent compensation figures (Guo Daya $14M-equivalent package) are Reported and Denied; ByteDance confirmed equity inclusion but not the specific number. Multimodal capability postponement and consumer app product strategy are Reported per 36Kr Intelligent Emergence sourcing. Performance trajectory (“3-6 months behind GPT-5.4 and Gemini 3.1 Pro”) is DeepSeek’s self-reported framing in the tech report.

The Rise of Agents, Part 4: The Harness

Hugo — Sun, 26 Apr 2026 16:07:55 GMT

In February 2026, Mitchell Hashimoto, co-creator of Terraform and co-founder of HashiCorp, published a blog post describing a habit he had developed while working with AI agents. Every time an agent made a mistake, instead of just fixing the output, he would engineer a permanent fix into the agent’s environment. A new constraint. A better tool. A clearer instruction. A checklist the agent had to run before finishing. He called this habit engineering the harness.

Within weeks, OpenAI, Anthropic, LangChain, and Martin Fowler had all published substantial treatments of the same idea. By March 2026, “harness engineering” was an emerging discipline with primary sources from three major labs, a growing practitioner literature, and a precise claim at its center. The claim is that the system around the agent matters more than the agent. Not in some abstract sense. Measurably, in ways that dominate model capability as the primary driver of reliability.

The vocabulary is 2026. The practice is older. Every production agent since ReAct has had some version of a harness: system prompts, tool definitions, error handlers, retry logic, memory scaffolding. What changed in early 2026 was that the practice acquired a name, a unified description, and three labs’ worth of evidence that the practice mattered more than most of the field had previously acknowledged.

This article is about that claim, the evidence for it, and what it implies about where agent capability actually comes from. The harness is the part of the agent that gets built, not the part that gets trained. And in 2026, it has become the part that determines whether an agent works.

What Is a Harness

“Harness” as a term of art in agent engineering sits at a specific level of the stack. It is not the model. It is not the application. It is what goes between them.

A useful mental picture. The model can reason in natural language, call tools, and produce outputs. The application wants to do something in the world: fix bugs, answer questions, build websites, execute trades. The harness is the engineered system that connects these. System prompts that orient the model to its task. Tool definitions that expose external capabilities. Middleware that modifies the model’s inputs and outputs before and after each call. Memory systems that persist context across turns. Verification loops that check outputs. Sub-agent delegation patterns. Error handling. Context management. All of this, plus the logic that threads it together.

LangChain’s engineering team describes the harness as having, in their phrasing, a lot of knobs: system prompts, tools, hooks, middleware, skills, sub-agent delegation, memory systems, and more. OpenAI’s Codex team frames the harness as three categories. Context engineering, meaning what information the agent sees. Architectural constraints, meaning what rules and boundaries apply. Lifecycle management, meaning how the agent operates across time and sessions. Both decompositions point at the same thing. The harness is the engineered environment in which the agent does its work.
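To make the layers concrete, here is a minimal sketch of a harness wrapped around a model. Everything in it is an assumption for illustration: the `Harness` class, the `CALL <tool> <argument>` convention, and the scripted `toy_model` are invented here and do not correspond to LangChain’s or OpenAI’s actual code. The comments map fields onto the three Codex categories only loosely.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical stand-in for a model call: text in, text out, no state.
Model = Callable[[str], str]

@dataclass
class Harness:
    model: Model
    system_prompt: str                                      # context engineering
    tools: Dict[str, Callable[[str], str]] = field(default_factory=dict)
    max_steps: int = 5                                      # architectural constraint
    memory: List[str] = field(default_factory=list)         # lifecycle: persists across steps

    def build_context(self, task: str) -> str:
        # Everything the model knows must be packed into this one string.
        return "\n".join([self.system_prompt, f"TASK: {task}", *self.memory])

    def run(self, task: str) -> str:
        for _ in range(self.max_steps):
            output = self.model(self.build_context(task))
            parts = output.split(" ", 2)
            if len(parts) == 3 and parts[0] == "CALL":
                _, name, arg = parts
                tool = self.tools.get(name)
                result = tool(arg) if tool else f"unknown tool {name}"
                self.memory.append(f"{output} -> {result}")
            else:
                return output  # anything else is treated as the final answer
        return "stopped: step limit reached"

def toy_model(context: str) -> str:
    # Scripted behavior so the sketch runs without a real model.
    return "The answer is 4." if "-> 4" in context else "CALL add 2+2"

def add_tool(expr: str) -> str:
    a, b = expr.split("+")
    return str(int(a) + int(b))

harness = Harness(model=toy_model, system_prompt="You are a careful assistant.",
                  tools={"add": add_tool})
answer = harness.run("What is 2+2?")
print(answer)
```

The model function never changes between steps. What changes is the context the harness assembles for it, which is the sense in which the harness, not the model, carries the agent’s state.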

None of this is model capability in the strict sense. The same model can operate in many different harnesses. And here is where the 2026 evidence gets interesting.

The Evidence

In February 2026, LangChain’s engineering team published an experiment. They had a coding agent built on GPT-5.2-Codex, scoring 52.8 percent on Terminal Bench 2.0. This score put the agent outside the Top 30. The team wanted to improve it. They did not change the model. They changed only the harness. Over a few weeks of iterative work, they moved the score from 52.8 percent to 66.5 percent. Same model. Same weights. Same training. Different environment around the model. The agent moved from outside Top 30 to Top 5 on the benchmark.

The 13.7 percentage point improvement came from a specific set of moves. Context middleware that mapped the agent’s working directory on startup so it did not waste time and tokens discovering it. Prompting changes that forced the agent to verify its own work against task specifications rather than re-reading its own code and declaring it fine. A reasoning budget allocation that spent high compute on planning and verification and moderate compute on implementation, because maximum reasoning at every step caused timeout failures. Middleware to detect and break repetition loops. Explicit onboarding about how the agent’s code would be tested programmatically, so it wrote code that could pass those tests.
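One of those moves, the repetition-loop middleware, is simple enough to sketch. This is an illustrative version under assumed parameters, not LangChain’s implementation: the `LoopBreaker` name, the window size, and the repeat threshold are all hypothetical.

```python
from collections import deque
from typing import Deque

class LoopBreaker:
    """Flags an agent that keeps issuing the same action within a recent window."""

    def __init__(self, window: int = 6, max_repeats: int = 3) -> None:
        self.recent: Deque[str] = deque(maxlen=window)
        self.max_repeats = max_repeats

    def check(self, action: str) -> bool:
        """Record the action; return True once it has appeared too often."""
        self.recent.append(action)
        return self.recent.count(action) >= self.max_repeats

breaker = LoopBreaker()
actions = ["ls", "cat main.py", "pytest", "pytest", "pytest", "pytest"]
flags = [breaker.check(a) for a in actions]
print(flags)
```

When the check fires, a harness might inject a message telling the agent that it is repeating itself and should try a different approach, rather than letting it burn the remaining time budget.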

None of these are model changes. All of them are harness changes. The cumulative effect on benchmark score was larger than what typical new-model releases deliver.

OpenAI’s evidence is more dramatic in scale. In early February 2026, the Codex team published a report from an internal experiment. For five months, a small team of engineers had been building a production software product, one million lines of code, with zero lines manually written. Every line, every test, every CI configuration, every documentation file, was written by Codex agents. The engineers’ job was to design the harness. When Codex made a mistake, they asked what capability was missing, and built it into the environment. What abstractions should agents reach for. What conventions should they follow. What background tasks should continuously enforce code quality. The product shipped, deployed, broke, and got fixed, like any other production system. The team’s estimate was that they built it in about one-tenth the time it would have taken to write the code by hand.

The OpenAI report reads as an engineering diary, not a marketing document. The tone throughout is that the model was always capable. What they had to build was the environment in which the capability could be exercised reliably. The report describes specific failures and the harness fixes applied. A giant instruction file became a directory of targeted docs, because a single monolithic file crowded out actual task context. A one-off quality check became a scheduled background task, because human taste captured once and enforced continuously works better than catching drift in periodic bursts. A hand-off from one agent session to another became a structured artifact rather than a conversation summary, because agents are better at reading fresh state than inheriting someone else’s context.
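The last of those fixes, a structured hand-off artifact instead of a conversation summary, might look something like the sketch below. The `Handoff` class and its field names are assumptions made for illustration. The report, as described here, does not specify a schema.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List

@dataclass
class Handoff:
    """State one agent session leaves for the next (hypothetical schema)."""
    goal: str
    done: List[str] = field(default_factory=list)
    remaining: List[str] = field(default_factory=list)
    known_issues: List[str] = field(default_factory=list)

    def dumps(self) -> str:
        return json.dumps(asdict(self), indent=2)

    @staticmethod
    def loads(text: str) -> "Handoff":
        return Handoff(**json.loads(text))

note = Handoff(goal="add CSV export", done=["schema"], remaining=["writer", "tests"])
restored = Handoff.loads(note.dumps())
print(restored.remaining)
```

The next session starts from fields it can read directly, with none of the previous session’s dead ends or intermediate chatter in its context.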

Anthropic’s contribution arrived in March 2026. A three-agent harness for long-running autonomous coding. A Planner expands a short product prompt into a fuller specification, deliberately leaving implementation details unspecified, because early over-specification cascades into downstream errors. A Generator implements features one sprint at a time, writing code and tests. An Evaluator runs Playwright-based browser automation to interact with the running application and score it against the sprint’s contract, criteria negotiated with the Generator before code is written. If evaluation fails, the sprint fails and the cycle repeats with revised scope.
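The control flow of that architecture can be sketched in a few lines. This is a simplified illustration, not Anthropic’s code: the three roles are passed in as plain functions, the contract is a fixed string where the real system negotiates it, and a failed sprint is simply retried where the real system revises scope. All names are hypothetical.

```python
from typing import Callable, List, Tuple

def run_sprints(
    prompt: str,
    planner: Callable[[str], List[str]],
    generator: Callable[[str, int], str],
    evaluator: Callable[[str, str], bool],
    max_attempts: int = 3,
) -> List[Tuple[str, str, int]]:
    """Return (feature, artifact, attempts) for each sprint that passed."""
    shipped = []
    for feature in planner(prompt):
        # The contract is fixed before any code for the sprint is written.
        contract = f"'{feature}' works end to end"
        for attempt in range(1, max_attempts + 1):
            artifact = generator(feature, attempt)
            if evaluator(artifact, contract):
                shipped.append((feature, artifact, attempt))
                break
            # Failed sprint: the cycle repeats (here, with a plain retry).
    return shipped

# Scripted stand-ins so the sketch runs without any model.
def toy_planner(prompt: str) -> List[str]:
    return ["sprite rendering", "level editor"]

def toy_generator(feature: str, attempt: int) -> str:
    return f"{'polished' if attempt >= 2 else 'draft'} {feature}"

def toy_evaluator(artifact: str, contract: str) -> bool:
    return artifact.startswith("polished")

results = run_sprints("build a 2D retro game engine",
                      toy_planner, toy_generator, toy_evaluator)
print(results)
```

The point of the structure is that the Generator never grades its own work. The pass-or-fail decision belongs to a separate role with its own criteria.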

Anthropic tested this architecture against a solo agent on the same task: build a 2D retro game engine. The solo agent produced something that technically launched in twenty minutes for about nine dollars. The three-agent harness ran for six hours, cost about two hundred dollars, and produced a richer, more polished, more functional application. The gap was not a marginal improvement in quality. It was a change in what kind of output the system was capable of producing at all.

Three labs. Three independent demonstrations. The pattern is the same. In fixed-model experiments, harness improvements produced larger capability gains than any model upgrade in the same period. This is the evidence for the harness claim.

Why the Model Is Not Enough

There is a specific reason the harness matters this much. It has to do with what the model actually is.

A language model, in the strict sense, is a stateless function. You pass it tokens. It returns tokens. Each call is independent. The model has no memory of previous calls, no awareness of where in a longer task it sits, no way to track its own progress, no persistent understanding of its environment. Everything the model knows about the current situation must be packed into its context window for each call.

An agent, in the functional sense, needs almost everything a stateless function does not have. Memory of what it has done. Awareness of its goals. Ability to recover from errors. Knowledge of what tools exist and when to use them. Judgment about when the task is complete. These are not model capabilities. They are system capabilities. The harness is what supplies them.

This becomes acute as task length grows. A model answering a single question works fine without a harness. A model running fifty tool calls to complete a task does not. Each call consumes context. By the fiftieth call, the model’s view of what it was asked to do in the first place may be compressed, displaced by intermediate results, or contaminated by irrelevant details. The industry term for this is context durability. How well does the model follow its original instructions after its hundredth tool call. The answer, for any frontier model in 2026, is: not well enough without help.

Context durability is a harness problem. Approaches vary. Some harnesses run summarization passes that compress history and preserve key facts. Some use context resets where the session is cleared entirely and the next agent picks up from structured artifacts rather than inheriting prior context. Some use scheduled re-anchoring, where the original goal is reinjected into the context at regular intervals to prevent drift. None of these are model improvements. All of them are harness improvements. All of them address the gap between what a stateless function does naturally and what a functional agent needs.
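Two of those approaches, history compression and scheduled re-anchoring, can be combined in a short sketch. The function name, the parameters, and the placeholder summary line are assumptions for illustration. A production harness would ask a model to write the summary instead of just counting the dropped steps.

```python
from typing import List

def build_context(goal: str, history: List[str],
                  reanchor_every: int = 5, keep_last: int = 8) -> List[str]:
    """Assemble the context for the next model call.

    Older steps collapse into a one-line placeholder summary, and the goal is
    reinjected after every `reanchor_every` retained steps.
    """
    context = [f"GOAL: {goal}"]
    dropped = max(0, len(history) - keep_last)
    if dropped:
        context.append(f"SUMMARY: {dropped} earlier steps omitted")
    recent = history[dropped:]
    for i, step in enumerate(recent, start=1):
        context.append(step)
        # Re-anchor mid-history, but not redundantly at the very end.
        if i % reanchor_every == 0 and i != len(recent):
            context.append(f"REMINDER: {goal}")
    return context

history = [f"step {n}" for n in range(1, 13)]   # twelve tool calls so far
ctx = build_context("fix the failing test", history)
print("\n".join(ctx))
```

The model sees the goal at the top, a compressed past, and a reminder partway through the recent steps, so the original instruction is never more than a few entries away from the end of the context.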

The same pattern holds for other agent properties: tool use consistency, error recovery, goal persistence, output verification, multi-step planning. All of them live in the harness. What the agent is, experienced by a user, is mostly what the harness is. The model is the substrate. The harness is where the agent lives.

Coding as the Experimental Ground

Every concrete example so far has been a coding agent. Coding has become the proving ground for harness engineering because it is the domain where the experimental apparatus is cleanest. Code compiles or does not compile. Tests pass or do not pass. A pull request is either merged by an automated check or it is not. A benchmark like Terminal Bench 2.0 runs tasks in containers with clear pass-fail outcomes. The feedback signals are abundant, objective, and fast. For a discipline like harness engineering, which depends on iterating against measurable outcomes, this matters enormously. It is much harder to iterate on a legal research agent or a customer support agent where ground truth is subjective and feedback is slow.

The situation resembles how genetics used fruit flies in the twentieth century. Drosophila is not interesting for its own sake. Its short generation time, cheap maintenance, and well-characterized genetics made it the species against which hypotheses could be tested quickly. Genetics is not a science about fruit flies. It used them. Harness engineering is not a discipline about coding agents. It uses them.

The structural findings transfer. Context engineering: not coding-specific. Verification loops: not coding-specific. Sub-agent decomposition: not coding-specific. The reasoning sandwich pattern, meaning high compute at planning, moderate at execution, high at verification, which took LangChain’s agent from 52.8 to 66.5, is not Claude-specific or GPT-specific or Codex-specific. It is a property of how attention-based models trade reasoning depth against execution latency. It applies wherever language agents work on bounded tasks with timeouts.
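The reasoning sandwich reduces to a lookup table plus a schedule. The sketch below is illustrative only: the phase names and effort labels are assumptions, and real model APIs name and meter reasoning effort differently.

```python
from typing import Dict, List, Tuple

# Illustrative effort levels per phase: high, moderate, high.
BUDGET: Dict[str, str] = {"plan": "high", "implement": "moderate", "verify": "high"}

def effort_for(phase: str) -> str:
    """Look up the reasoning effort for a phase, defaulting to moderate."""
    return BUDGET.get(phase, "moderate")

def schedule(n_implementation_steps: int) -> List[Tuple[str, str]]:
    """Plan once, implement n times, verify once."""
    phases = ["plan"] + ["implement"] * n_implementation_steps + ["verify"]
    return [(phase, effort_for(phase)) for phase in phases]

plan = schedule(3)
print(plan)
```

Spending the expensive reasoning at the two ends keeps the many middle steps cheap, which is what avoids the timeout failures described earlier.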

Later articles in the series cover agent applications beyond coding, including physical embodiment in Part 7. The principles established here carry forward. The coding examples are the experimental base.

The Inversion

There is a rhetorical shift worth naming. The conventional wisdom about agents, from 2022 through much of 2024, was that better models produce better agents. If your agent underperforms, wait for the next model. The implicit assumption was that model capability is the binding constraint.

The 2026 evidence inverts this. In February and March of 2026, three frontier models were released in twenty-three days: GPT-5.4, Gemini 3.1 Ultra, and Grok 4.20. The capability gap between top labs compressed to weeks. Meanwhile, the gap between agents that merely ran on these models and agents that used them well grew. A LangChain agent on the same model as a competing agent could score 13.7 points higher because the harness was better. A Claude Opus model scored 64.9 percent in one evaluation framework and 57.6 percent in another, on the same benchmark, because the harness differed. That is 7.3 percentage points from the harness alone.

The industry shorthand for this is: the model is commodity, the harness is moat. A startup or an enterprise team cannot reliably out-compete frontier labs on model capability, because frontier labs have the compute, data, and talent concentration needed to train frontier models. What teams can compete on is harness engineering. Trace analysis, failure mode cataloging, middleware design, sub-agent architecture, verification patterns. All of this is available to any team with enough discipline to iterate systematically. And all of this, in 2026, pays off more per unit of effort than waiting for better models.

This is the inversion. Conventional wisdom: agent capability comes from model capability. Current evidence: agent capability comes mostly from harness capability, given a frontier-grade model as substrate. The substrate matters. Frontier models are what make harness engineering worthwhile. But between two teams working with the same substrate, the one with the better harness wins.

Self-Improving Harnesses

One thread in the OpenAI report deserves attention as a glimpse of where this is heading. When the Codex team noticed their agents drifting from preferred code patterns, they did not just write documentation about the preferred patterns. They set up background Codex tasks to continuously scan the codebase for deviations and open targeted refactoring pull requests. The harness itself had agents in it, whose job was to maintain the harness’s invariants. Agents improving the environment in which other agents worked.

This is a specific and limited form of self-improvement. The agents are not deciding what the invariants should be. Humans decided that. What the agents do is enforce the invariants continuously and cheaply, so human taste captured once propagates across all future code without requiring human attention to each line. Call this harness-level self-improvement, or agents improving their own tools.

Hold this carefully. A system where agents continuously improve their working environment looks, in isolation, like something approaching autonomy. But the direction of improvement is set by humans. The agents are optimizing against criteria someone else defined. Self-improvement at this level is powerful execution, not self-direction. The series will return to this distinction in Parts 7 and 8, where the stakes get higher. For now: the 2026 harness contains agents that improve the harness. That much is real. What the harness cannot do, yet, is decide what the harness should be.

One more observation about this pattern. Part 2 described how reinforcement learning shapes language model behavior during training, making the model helpful, honest, careful in ways the base model is not. The harness carries this work forward at runtime. When OpenAI’s background agents enforce code quality invariants, or when Anthropic’s Evaluator agent scores a Generator’s output against pre-negotiated criteria, the harness is doing alignment work at runtime that RL did at training time. The model comes shaped from the training process. The harness re-shapes it continuously as the model operates, for things RL could not anticipate or could not shape reliably. Alignment is not only a training-time phenomenon. It is increasingly a runtime phenomenon, built into the harness, acting on every agent step.

Trust and What It Costs

The harness claim has an edge the primary sources do not always emphasize. If the harness is what makes the agent reliable, then trusting the agent means trusting the harness. And the harness is complex, evolving, often opaque.

Consider what it means to deploy a coding agent with full commit access to a production repository. What the agent does, in any moment, is the joint product of the model’s output, the system prompt, the tool configuration, the middleware, the memory system, the verification logic, and the sub-agent delegation structure. The user cannot see most of this directly. What the user sees is the agent’s behavior. When the behavior is good, the user trusts the system. When the behavior goes wrong, the cause could be anywhere in the stack.

Harness engineers have responded to this with an emerging practice: traces. Every agent action, every tool call, every reasoning step is logged to an observability system. When the agent fails, the trace is the evidence. LangChain’s iterative improvement loop runs on traces. OpenAI’s debugging practice runs on traces. Anthropic publishes detailed engineering writeups that effectively are annotated traces.
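At its simplest, a trace is an append-only event log. The sketch below is a minimal illustration with hypothetical names (`Trace`, `record`, `failures`). It does not reflect any specific observability product or the labs’ internal tooling.

```python
import json
import time
from typing import Any, Dict, List

class Trace:
    """Append-only log of agent events, serializable as JSON Lines."""

    def __init__(self) -> None:
        self.events: List[Dict[str, Any]] = []

    def record(self, kind: str, **payload: Any) -> None:
        # Each event gets a sequence number and a wall-clock timestamp.
        self.events.append({"seq": len(self.events), "ts": time.time(),
                            "kind": kind, **payload})

    def to_jsonl(self) -> str:
        return "\n".join(json.dumps(event) for event in self.events)

    def failures(self) -> List[Dict[str, Any]]:
        # Only events explicitly marked ok=False count as failures.
        return [e for e in self.events if e.get("ok") is False]

trace = Trace()
trace.record("reasoning", text="run the tests first")
trace.record("tool_call", tool="pytest", ok=False, output="1 failed")
trace.record("tool_call", tool="pytest", ok=True, output="all passed")
failed = trace.failures()
print(trace.to_jsonl())
```

Filtering for failures and reading backwards through the sequence numbers is the basic move behind trace-driven harness iteration: find the failing step, then ask what the harness could have supplied to prevent it.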

Traces are valuable. They are also not sufficient. A trace tells you what happened. It does not tell you why the harness was configured to make that the likely behavior, or what unseen choices in the harness shaped the option space the agent selected within. The answer to “why did the agent do this” in a production harness is often: because the harness made it likely. Getting to the root cause requires inspecting not just the trace but the harness design. And the harness design, for any serious production agent, is complex enough to be someone’s full-time job to understand.

This is a kind of trust architecture the industry is still working out. How much harness transparency should a customer demand. What parts of a harness are proprietary versus safety-relevant. Whether third-party harness audits will become standard. These questions do not have settled answers in 2026. What is settled is that they exist, and that their answer determines what trusting an agent actually means.

The Discipline Takes Shape

Pull back. Mitchell Hashimoto’s blog post in February 2026 named something that had been happening without a name. Within weeks, three major labs published their own treatments. Within a month, practitioners were publishing pattern libraries. By April 2026, a mid-career engineer could be described as a harness engineer and other practitioners would know what that meant. The discipline has a name, primary sources, worked examples, and a growing theoretical frame.

What the discipline does not yet have is stability. Harness patterns that work for GPT-5.2-Codex may not work for the next frontier model. Patterns that work for coding may not transfer cleanly to legal work or customer support or embodied agents. The field is in a period of active invention, where best practices are being discovered and codified and then sometimes invalidated as models change. This is to be expected of a discipline two months old. The sensible stance for practitioners is to engineer harnesses that are, in the LangChain team’s phrase, rippable: designed to be rebuilt as the underlying model capabilities shift.

What will remain stable, probably, is the principle the discipline rests on. Agents are systems, not models. System properties matter as much as model properties. Reliability is built into the environment around the agent, not optimized out of the agent itself. This is true whether the current harness patterns hold or evolve. It will be true of the next generation of agent engineering too.

What Agents Are

The three articles before this one traced how language agents came to exist. This article is about how they become reliable in practice. The answer is not something internal to the model. It is the engineered system around the model. The harness takes a model that would, on its own, drift and hallucinate and give up prematurely, and turns it into something that ships production code, runs multi-hour autonomous sessions, and completes complex real-world tasks.

This is why the 2026 evidence matters for the series’ larger argument. If agents were mostly model, then the story of agents would be the story of better models, and the questions at the summit would be questions about what models can and cannot do. Instead, agents are mostly systems wrapped around models. The story of agents is the story of what systems can make models into. And the questions at the summit, which the series will reach in the later articles, are partly questions about what systems cannot yet make models into, regardless of how much harness effort is applied.

Part 5 turns to another development that has been reshaping this picture in parallel. Inference itself is becoming a site of agent behavior. Reasoning models that think for long stretches before producing an output are not just models with better training. They are models that run differently at inference time. If the harness is the environment around the agent, inference-time reasoning is the environment inside the agent. Both matter. Both are changing.

The Rise of Agents is an eight-part series. Next, Part 5: “Inference as Agency.”

The Rise of Agents, Part 3: The ReAct Moment

Hugo — Sat, 25 Apr 2026 17:01:34 GMT

In October 2022, a paper appeared on arXiv: “ReAct: Synergizing Reasoning and Acting in Language Models.” The authors, led by Shunyu Yao, a Princeton PhD student working with Google Research, proposed something simple. Have a language model produce interleaved thoughts, actions, and observations. Think about what to do. Do it. See what happened. Think again. Repeat until done.

The paper ran the approach on a few benchmarks. On ALFWorld, a text-based simulation of household tasks, ReAct prompting beat imitation and reinforcement learning baselines by 34 percentage points of absolute success rate. On WebShop, a simulated online shopping environment, it beat them by 10. The improvements came with one or two in-context examples. No fine-tuning, no policy learning, no reward engineering. Just the loop, a language model, and a problem.

What the paper showed was that language models could act as agents if you let them reason out loud about what they were doing. What the paper did not show, because it was not the paper’s subject, was that the loop it described was older than most of its readers realized. The deliberate-act-observe structure had been sitting in agent architectures for forty years, waiting for something it could run on.

That is the subject of this article. Not ReAct itself, which is now canonical. The earlier loops that failed, the specific change that made the 2022 version succeed, and what the success reveals about what language agents actually are.

The Loop Before ReAct

The basic structure of a reasoning agent is not new. An agent that deliberates about its situation, acts, observes the result, and deliberates again is a structure that the symbolic agent era made explicit and built extensively.

SOAR, developed at Carnegie Mellon starting in the early 1980s, is one version. An agent has a working memory representing the current state, a set of production rules that propose actions, a mechanism for selecting among proposed actions, and a decision cycle that executes the selection and updates working memory. The cycle runs continuously. Each pass through the cycle corresponds to a moment of deliberation and action.
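The cycle described above can be sketched in a few lines. This is a toy illustration of the propose-select-apply pattern, not SOAR's actual implementation: the Rule class, the integer preference field, and the two blocks-world rules are all hypothetical, and real SOAR adds impasses, subgoaling, and learning that are left out here.

```python
from dataclasses import dataclass
from typing import Callable, Optional

State = dict  # working memory: attribute -> value


@dataclass
class Rule:
    """A hypothetical production rule: a condition plus the change it proposes."""
    name: str
    condition: Callable[[State], bool]  # does the rule match working memory?
    apply: Callable[[State], State]     # attributes to write if selected
    preference: int = 0                 # higher preference wins selection


def decision_cycle(memory: State, rules: list[Rule]) -> Optional[str]:
    """One pass: propose matching rules, select one, apply it to working memory."""
    proposals = [rule for rule in rules if rule.condition(memory)]
    if not proposals:
        return None  # nothing applies, so the cycle has nothing to do
    chosen = max(proposals, key=lambda rule: rule.preference)
    memory.update(chosen.apply(memory))
    return chosen.name


def run(memory: State, rules: list[Rule], max_cycles: int = 10) -> list[str]:
    """Repeat the decision cycle until no rule fires or the cycle budget runs out."""
    fired = []
    for _ in range(max_cycles):
        name = decision_cycle(memory, rules)
        if name is None:
            break
        fired.append(name)
    return fired


# Toy blocks-world task: pick up block A, then stack it on block B.
rules = [
    Rule("pick-up-A",
         lambda m: m["holding"] is None and not m["stacked"],
         lambda m: {"holding": "A"}),
    Rule("stack-A-on-B",
         lambda m: m["holding"] == "A",
         lambda m: {"holding": None, "stacked": True}),
]
memory = {"holding": None, "stacked": False}
trace = run(memory, rules)
print(trace, memory)
```

The agent can only do what its hand-written rules cover. Hand it a state that no rule matches and the cycle simply stops.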

ACT-R, also from Carnegie Mellon, is another version with a different theoretical grounding. PRS, which originated at SRI International, and its Australian descendants dMARS and JACK are a third, built around the BDI architecture introduced in Part 1. In all of these, an agent is a loop. State comes in. Reasoning happens. Actions go out. Observations come back. The loop runs again.

The loop was the structure agent researchers agreed on. The disagreements were about what went inside it. What representations, what inference mechanisms, what forms of memory, what architectures for selecting among competing possibilities. Entire subfields debated these. But the loop itself was taken for granted.

Robotics control stacks arrived at the same structure from a different direction. A sense-plan-act cycle, running at whatever frequency the hardware demanded. Sense the environment through cameras and proprioception. Plan the next motion. Execute it. Repeat. The specific architectures varied wildly, but the cycle was the same as the symbolic planners and the cognitive architectures. Different communities, working independently, converged on the same loop because the loop was what the problem demanded.

The loop worked. What ran inside it did not.

Symbolic loops failed because the representations inside them could not scale to the real world. The frame problem, from Part 1. The loops worked on toy problems and specific domains. They broke down in open environments. Not because the loop structure was wrong, but because the ingredients the loop had to work with were insufficient. A SOAR agent could reason beautifully about blocks-world stacking and completely fail at a kitchen task it had not been hand-modeled for. A BDI agent’s plan library was only as rich as the set of plans a human had written for it. The loop could think, but it could only think with what had been put inside it.

Learning agents mostly did not have explicit loops at all. Reinforcement learning systems had a policy that mapped observations to actions, trained end-to-end to maximize reward. The deliberation, such as it was, happened inside the neural network weights, invisibly. An RL system acting in an environment looks, from the outside, like a fast loop of observation-action-observation-action. But there is no explicit moment where the agent considers what to do. The “thinking” is folded into the policy.

This worked for tasks where the policy could be learned from interaction. It failed everywhere else. The Era 2 failure was inherited by any attempt to use RL for open-world tasks. No prior knowledge, no pre-existing reasoning capability, nothing to fall back on when the trained policy encountered something outside its training distribution.

By the early 2020s, the loop had two failing traditions. The symbolic tradition, which had the right structure but the wrong representations. The learning tradition, which had the wrong structure but powerful representations. Neither had the combination.

This is worth sitting with. The loop itself was not in dispute. Forty years of agent research had converged on think-act-observe as the correct skeleton of an agent. The problem was that the skeleton needed muscle and blood to function, and neither tradition had figured out how to supply both. Symbolic systems could think carefully about what they represented but could not represent enough. Learning systems could absorb enormous amounts of data but could not think carefully about what they had absorbed. Both traditions knew this was the problem. Neither had a path to solving it.

What the field needed was a way to get rich representations into something that could reason over them in the loop. That is what a pretrained language model is. The representations are in the weights, implicit and enormous. The reasoning happens through the model in every forward pass. Put this inside the loop, and suddenly both sides of the old dichotomy are satisfied at once.

The 2022 Change

What ReAct showed was that the old loop, applied to a pretrained language model, worked.

Not “worked better than before.” Worked in a way that had no precedent. A frozen language model, prompted with a few examples of the think-act-observe pattern, could solve tasks that reinforcement learning systems specifically designed for those tasks could not solve. A generic loop on a generic model outperformed specialized approaches with years of tuning behind them.

The reason is the one Part 2 ended on. The loop did not change. The substrate did.

Every step in the ReAct loop runs through a language model. The thought step is the model producing natural language about the current situation, what the goal is, and what might be done next. The action step is the model producing a structured command, typically a tool call. The observation step is the environment’s response, handed back to the model as more text. The next thought step happens with all of this in context.

A concrete trace makes the pattern legible. The ReAct paper’s HotpotQA examples look roughly like this. The question is asked. The model thinks: to answer this I need to find out X, and I can search Wikipedia for X. The model emits an action: Search[X]. The environment returns a Wikipedia snippet. The model thinks: this tells me Y but not Z, I should search for Z. The model emits another action: Search[Z]. The loop continues until the model thinks: I now have enough to answer, and emits Finish[answer]. The reasoning is legible. The actions are legible. The observations are legible. The whole trajectory can be read by a human and understood.
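A minimal sketch of that trace in code, under loud assumptions. The scripted_model function is a hard-coded stand-in for a language model, TOY_WIKI is a two-entry stand-in for Wikipedia search, and the transcript format only approximates the paper's prompts. Only the Search[...] and Finish[...] action names come from the description above.

```python
import re
from typing import Callable, Optional

ACTION_RE = re.compile(r"^(Search|Finish)\[(.*)\]$")

# Hypothetical two-entry corpus standing in for a real search tool.
TOY_WIKI = {
    "Analytical Engine": "The Analytical Engine was designed by Charles Babbage.",
    "Charles Babbage": "Charles Babbage was born in London in 1791.",
}


def scripted_model(transcript: str) -> tuple[str, str]:
    """Stand-in for a language model: returns (thought, action) from the transcript."""
    if "Observation 2" in transcript:
        return "I now have enough to answer.", "Finish[London]"
    if "Observation 1" in transcript:
        return ("The designer is Charles Babbage. I should search for him.",
                "Search[Charles Babbage]")
    return ("I need to find out who designed the Analytical Engine.",
            "Search[Analytical Engine]")


def react_loop(question: str,
               model: Callable[[str], tuple[str, str]],
               tools: dict[str, Callable[[str], str]],
               max_turns: int = 8) -> tuple[Optional[str], str]:
    """Run thought -> action -> observation until Finish or the turn budget ends."""
    transcript = f"Question: {question}\n"
    for turn in range(1, max_turns + 1):
        thought, action = model(transcript)
        transcript += f"Thought {turn}: {thought}\nAction {turn}: {action}\n"
        match = ACTION_RE.match(action)
        if match is None:
            # Malformed actions become observations, so the next turn can react.
            observation = f"Invalid action: {action}"
        else:
            name, argument = match.groups()
            if name == "Finish":
                return argument, transcript
            observation = tools[name](argument)
        transcript += f"Observation {turn}: {observation}\n"
    return None, transcript


tools = {"Search": lambda query: TOY_WIKI.get(query, "No results.")}
answer, transcript = react_loop(
    "Where was the designer of the Analytical Engine born?",
    scripted_model,
    tools,
)
print(transcript)
```

The loop itself is about twenty lines and contains no intelligence. Everything interesting happens inside the model call, which is the point the section is making.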

What makes the loop work is that every step benefits from what the language model already knows. The thought is grounded in the model’s understanding of the world, its implicit knowledge of planning, its training-derived familiarity with how humans handle problems like the one at hand. The action is chosen based on the model’s knowledge of which tools exist, what they do, and when each is appropriate. The observation is interpreted with the model’s understanding of what the response means.

A symbolic agent running the same loop would hit the frame problem. Its representations would be too thin to support the thought step. An RL agent would have no native capacity for the thought step at all. The language model brings everything the earlier agents lacked. The loop does nothing new. The loop’s contents do everything new.

This is why ReAct was a moment rather than an invention. The loop had existed. The model had existed. What had not existed was the observation that you could just put them together and have the thing work.

Why the Loop Mattered

There is a subtle point about why the loop matters at all, given that modern language models can do so much without one.

A language model without a loop is a text generator. You give it a prompt. It produces tokens. It stops. Whatever reasoning it does is internal, compressed into the forward pass that turns input tokens into output tokens. The model can solve problems by reasoning about them in this compressed way, and with chain-of-thought prompting it can extend its reasoning across multiple output tokens. But the reasoning happens inside one generation, not across multiple turns of engagement with the world.

The loop breaks generation into turns. Within a turn, the model can reason. Between turns, the environment responds. The model’s reasoning on turn N has access to what happened on turns 1 through N-1. If an action produced an unexpected result, the next reasoning step knows. If a tool returned an error, the next reasoning step knows. If a plan needs to be revised, revision is possible because reasoning is happening inside a loop that keeps going.

The hallucination case is the clearest example. A language model asked a factual question it does not know will often produce a plausible-sounding but wrong answer. There is no mechanism inside a single forward pass to distinguish knowing from confabulating. The model generates whatever tokens the distribution favors, and for questions at the edge of its knowledge, the distribution favors something that sounds right. Chain-of-thought reasoning makes this worse as often as better: the model reasons confidently through steps that are individually hallucinated, compounding the error.

The loop breaks the pattern. The model in a ReAct loop can decide to look something up before answering. The action is Search[X]. The observation is what the world says about X. The next thought step is informed by what the world said, not by what the model thought the world might say. Hallucinations still happen, but now they happen against a backdrop of actual data the model is being asked to integrate. The correction mechanism is not perfect. It is a structural fix for a structural problem: a stateless generator checking its own claims against something that is not itself.

Without the loop, there is no opportunity to revise. The model produces a plan, or an answer, or an action, and that is the output. Any mistake is propagated. Any missing information is hallucinated. The ReAct paper’s original argument was partly about this. Chain-of-thought alone is vulnerable to propagating errors in its own reasoning. The loop, because it involves checking the world between reasoning steps, is a correction mechanism.

A language agent is what you get when you put a language model inside this correction mechanism. The mechanism is old. The model is new. The combination is what the field has been building on ever since.

What Came Next

The 2022 paper was the moment of recognition. What followed was the rapid build-out of everything the moment enabled.

Prompt scaffolds that implemented the ReAct pattern became standard. Frameworks like LangChain and LlamaIndex emerged to make the pattern easier to deploy. Tool-calling conventions, which started as ad-hoc prompt engineering, became protocols, most visibly MCP. Agent loops got more elaborate: separate reasoning and planning modules, explicit memory systems, multi-agent architectures where different agents played different roles in the overall loop.

The post-training moves covered in Part 2 started targeting the loop directly. Reasoning models trained to think longer within a turn. Agentic post-training shaping behaviors like tool use, error recovery, and goal persistence across turns. The loop went from prompt-engineered to trained-in.

By 2026, the industry has converged on what the ReAct paper’s structure looks like in production. A language model, post-trained for agentic behavior, running inside a harness that manages tool calls, memory, and error recovery. The thought-action-observation cycle is still recognizable. But it now sits inside infrastructure that did not exist in 2022, and the next four articles in this series are about what that infrastructure looks like.

Part 4 covers the harness itself. Part 5 covers what happens when reasoning moves from prompt-time to inference-time. Part 6 covers what happens when multiple agents run loops together. Part 7 covers what happens when the loop extends beyond text into the physical world.

What the Moment Revealed

Two things are worth holding onto from the ReAct moment.

First, the continuity with earlier agent research is real, not rhetorical. The agent researchers of the 1980s and 1990s were not wrong about what an agent is. They were right about the structure. What they lacked was the substrate. The field spent decades perfecting the loop with ingredients that could not support it, then had to wait for ingredients that could. The current agent era is a continuation of that earlier work, not a break from it.

Second, the loop is still the loop. Modern harnesses are elaborate. Multi-agent architectures are elaborate. The infrastructure around language agents is elaborate. But the core structure is still deliberate-act-observe-deliberate. Every production agent today runs this cycle. The complexity is in what each step does and how the environment is managed between them. The shape of agent operation has not changed.

What has changed, and what the series will trace from here, is what this loop can be made to do when you keep pushing on it. Better harnesses. Longer reasoning. More agents. More environments. The substrate keeps improving. The loop keeps scaling. What happens at the limit of that scaling is the question the series is built toward.

A Quiet Pivot

The ReAct moment was quiet, in the way that many pivotal moments are quiet. A paper on arXiv, a few benchmark improvements, a few lines of code demonstrating the approach. Within months it was everywhere. Within two years it was the foundation everyone was building on.

The paper’s contribution was not the loop. The loop was ancient. The paper’s contribution was the observation that the loop now had something to run on, and the demonstration that when it did, it worked. Part 4 looks at what the field built once this observation sank in. If Part 3 is about the moment the loop finally worked, Part 4 is about everything that started being built to make it work better.

The Rise of Agents is an eight-part series. Next, Part 4: “The Harness.”

The Rise of Agents, Part 2: What Language Agents Inherited

Hugo — Fri, 24 Apr 2026 13:30:56 GMT

In 2016, reinforcement learning looked like it might eat the world. AlphaGo beat Lee Sedol. Robotics papers were using RL to solve tasks that had defeated the field for decades. The narrative was: this is how agents will be built.

A decade later, the world is language-shaped. The agents that get funded and shipped are built around pretrained language models, not policy networks trained from pixels. The canonical Era 2 results have become historical artifacts.

The standard story of what happened in between is a story of replacement. Language models arrived, solved problems RL could not, and RL faded from agent research. This story is clean, easy to tell, and wrong.

Reinforcement learning did not fade from agent research. It moved. Every language agent in production today runs on a pretrained model that was then shaped by reinforcement learning. Every reasoning model that chains thoughts across inference time was trained with RL on the chains. Every coding agent that stays on task through long runs was post-trained with RL on the behaviors we call agentic. RL is not beneath language agents. It is inside them.

This article is about that inheritance. What RL did in Era 2, what it does in Era 3, and what the difference between those roles reveals about what language agents actually are.

What RL Did in Era 2

A brief reminder of the shape of RL before the language era.

In Era 2, the paradigm runs like this. You have an agent in some environment. The environment gives the agent observations and accepts actions. The agent’s job is to discover a policy, a mapping from observations to actions, that maximizes a reward signal over time. The agent starts knowing nothing. Through trial and error, guided by reward, it learns to act.
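To make the paradigm concrete, here is a small sketch using tabular Q-learning, one of the simplest algorithms in this family. The five-state corridor environment and every hyperparameter are assumptions chosen for illustration, not taken from any system discussed in this series.

```python
import random

N_STATES = 5      # states 0..4 in a corridor; state 4 is the goal
ACTIONS = (0, 1)  # 0 = move left, 1 = move right


def step(state: int, action: int) -> tuple[int, float, bool]:
    """Environment: accepts an action, returns next state, reward, and done flag."""
    next_state = max(state - 1, 0) if action == 0 else state + 1
    done = next_state == N_STATES - 1
    return next_state, (1.0 if done else 0.0), done


def choose(q: list[list[float]], state: int, epsilon: float,
           rng: random.Random) -> int:
    """Epsilon-greedy action choice, breaking ties at random."""
    left, right = q[state]
    if rng.random() < epsilon or left == right:
        return rng.choice(ACTIONS)
    return 0 if left > right else 1


def train(episodes: int = 500, alpha: float = 0.5, gamma: float = 0.9,
          epsilon: float = 0.2, seed: int = 0) -> list[list[float]]:
    """Learn action values from scratch by trial and error."""
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]  # the agent starts knowing nothing
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            action = choose(q, state, epsilon, rng)
            next_state, reward, done = step(state, action)
            target = reward + (0.0 if done else gamma * max(q[next_state]))
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = train()
# Greedy policy for the non-terminal states: 1 means "move right".
greedy_policy = [0 if left > right else 1 for left, right in q_table[:-1]]
print(greedy_policy)
```

Nothing in the code tells the agent that the goal is to the right. It finds that out from reward alone, which works here because the state space is tiny and the reward is exact.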

This is elegant and astonishingly general. The same algorithmic framework produces AlphaGo, robotic locomotion, and Atari game play. The agent does not need to be told how the world works. It discovers what works through interaction.

But the paradigm makes strong assumptions. The environment must give feedback. The reward signal must correlate with what you actually want. The state space, however large, must be tractable enough that trial and error can explore it. Most of all, the task must be specified at a level the algorithm can operate on. You do not ask a pure RL agent to book a flight. You ask it to minimize a loss over a trajectory in an explicit state space with explicit actions and explicit rewards.

In environments where these assumptions hold, RL is superhuman. In open worlds, where they do not, RL has nothing to start from. This is the second wall, and Part 1 covered it.

What matters for Part 2 is the next part of the story. The second wall did not fall because the field abandoned RL. The second wall fell because the field found a way to give RL something to start from.

The 2022 Synthesis

The breakthrough that defines Era 3 is not the language model alone. GPT-3 existed in 2020 and did not produce agents. It could complete text in impressive and often useful ways, but ask it to follow instructions reliably, adopt a persona, or refuse to generate harmful content, and it would do all of these unreliably or not at all. The behaviors that make current agents useful, following instructions, refusing certain requests, staying on task, were not latent in scale alone. Something else had to happen.

What defines Era 3 is the combination of a pretrained language model with reinforcement learning from human feedback.

InstructGPT, published by OpenAI in early 2022, is the template. Take a pretrained language model, which has absorbed the shape of human text without any particular behavioral goal. Collect comparisons from human raters: given two responses, which do you prefer. Train a reward model on those comparisons. Use RL to fine-tune the language model so its outputs score well on the reward model.
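The reward-model step can be illustrated with the standard pairwise (Bradley-Terry style) loss commonly used for this kind of comparison data. This is a sketch of the loss on a single comparison, assuming the reward model has already produced scalar scores. The function name and the example scores are hypothetical.

```python
import math


def pairwise_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-sigmoid of the score margin between the two responses.

    The loss is small when the preferred response scores well above the
    rejected one, and large when the ordering is reversed.
    """
    margin = score_preferred - score_rejected
    # -log(sigmoid(margin)) rewritten as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


agrees = pairwise_loss(2.0, 0.0)     # reward model agrees with the rater
undecided = pairwise_loss(0.0, 0.0)  # reward model has no opinion
disagrees = pairwise_loss(0.0, 2.0)  # reward model contradicts the rater
print(agrees, undecided, disagrees)
```

Minimizing this loss over many comparisons pushes the reward model to score responses in the order raters preferred. The RL stage then optimizes the language model against those scores.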

The result is a model that behaves differently from the raw language model. It follows instructions. It refuses certain requests. It adopts the voice of a helpful assistant. The base model had all of these behaviors latent in the distribution of human text. RL pulled a specific behavioral pattern out of the distribution and made it dominant.

This is the synthesis. Pretraining provides the representations: the shape of human knowledge, the structure of language, the implicit model of how humans reason. RL provides the shaping: which behaviors, among all those the model could exhibit, it should exhibit.

Neither alone produces what we recognize as a language agent. A raw pretrained model will complete text in whatever direction seems statistically plausible. It will not follow instructions reliably. It will not refuse harmful requests. It will not act like an assistant. A reinforcement-learned system without pretraining cannot reason about open worlds at all, because it has no representations to reason with. The combination produces something new.

This is the inheritance. Not the algorithm itself, although that matters. The capacity to shape behavior on top of pretrained representations. A way to move a model from “does whatever is statistically likely” to “does what we ask in the ways we want.”

What RL Does Inside Language Agents

RL’s role inside language agents is not a single job. It has at least three, and the third is newer than most industry observers realize.

The first is alignment. This is what InstructGPT did, what Constitutional AI does, what RLHF and its descendants do in every modern language model pipeline. The model is trained to prefer helpful, honest, harmless responses over the alternatives. Anthropic’s RLAIF uses AI-generated feedback in place of human labels, which lets the technique scale. Direct Preference Optimization skips the reward model and optimizes preferences directly. These are variations on the same move: shape a language model’s behavior toward preferred outputs using learned or collected preferences.

The second is reasoning. In late 2024, OpenAI released o1, a language model trained to produce extended chains of thought before producing answers. DeepSeek-R1 followed in January 2025 and showed the technique could be reproduced in open weights. The approach is now widely known as Reinforcement Learning with Verifiable Rewards, or RLVR. Instead of training against human preferences, RLVR trains against automatically checkable signals: did the math answer come out right, did the code compile and pass tests. The reward is cheap and accurate, which means the training can run at scale.
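A sketch of what an automatically checkable signal can look like. Both functions are hypothetical illustrations rather than any lab's actual grader. The convention of reading the final answer from a \boxed{...} span is an assumption, and a real pipeline would run untrusted code in a sandbox rather than call exec directly.

```python
import re


def math_reward(completion: str, expected: str) -> float:
    """Return 1.0 if the last \\boxed{...} answer equals the expected string."""
    answers = re.findall(r"\\boxed\{([^{}]*)\}", completion)
    return 1.0 if answers and answers[-1].strip() == expected else 0.0


def code_reward(source: str, func_name: str,
                tests: list[tuple[tuple, object]]) -> float:
    """Return 1.0 if the submitted function passes every test case, else 0.0."""
    namespace: dict = {}
    try:
        exec(source, namespace)  # unsafe outside a sandbox; illustration only
        func = namespace[func_name]
        passed = all(func(*args) == want for args, want in tests)
    except Exception:
        return 0.0  # code that fails to load or crashes earns nothing
    return 1.0 if passed else 0.0


add_tests = [((1, 2), 3), ((0, 0), 0)]
good = code_reward("def add(a, b):\n    return a + b\n", "add", add_tests)
buggy = code_reward("def add(a, b):\n    return a - b\n", "add", add_tests)
broken = code_reward("def add(a, b) return a + b", "add", add_tests)
boxed = math_reward("Adding the parts gives \\boxed{42}", "42")
print(good, buggy, broken, boxed)
```

No human rater appears anywhere in this loop, which is what makes the signal cheap enough to run at scale.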

The result is a new category of model, sometimes called a large reasoning model. The architecture is the same as the language models the reasoning models are built from. The training recipe differs. A base model is exposed to verifiable problems, generates multiple reasoning traces, and is reinforced for traces that arrive at correct answers. Over enough training, the model develops what the DeepSeek paper calls emergent reasoning patterns: self-reflection, verification, dynamic strategy adaptation. These are not hand-coded. They fall out of rewarding correct final answers on problems hard enough that naive approaches do not suffice.
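One way to turn "reinforce the traces that arrive at correct answers" into a training signal is to score each trace relative to the others sampled for the same problem, in the spirit of the group-relative scheme DeepSeek used. The sketch below shows only that normalization step, under simplifying assumptions. It is not the full training algorithm, and the function name is hypothetical.

```python
import statistics


def group_advantages(rewards: list[float]) -> list[float]:
    """Normalize each trace's reward against its sampling group.

    Traces that beat the group average get a positive advantage and are
    reinforced. Traces below it get a negative advantage.
    """
    mean = statistics.fmean(rewards)
    spread = statistics.pstdev(rewards)
    if spread == 0:
        # Every trace scored the same, so the group carries no learning signal.
        return [0.0 for _ in rewards]
    return [(reward - mean) / spread for reward in rewards]


# Four sampled traces for one problem: two reached the right answer.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])
# A problem that is too easy (or too hard) yields identical rewards.
uniform = group_advantages([1.0, 1.0, 1.0, 1.0])
print(mixed, uniform)
```

The uniform case shows why problem difficulty matters: if every sampled trace succeeds or every trace fails, there is nothing to prefer, and no gradient toward better reasoning.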

Chain-of-thought prompting asks a model to reason step by step. Chain-of-thought training teaches a model that reasoning step by step pays off. The difference is the difference between a hint and a habit. A prompted model can produce chain-of-thought output, but whether it actually reasons through the chain or hallucinates a plausible-looking one depends on luck. A trained model has been shaped, over thousands of RL steps, to treat extended reasoning as the default approach to hard problems. The reasoning is not always correct. But it is no longer optional.

The third is agentic behavior. Coding agents, web-browsing agents, tool-using agents. All of them are post-trained to exhibit the behaviors we call agentic. Stay on task. Use tools correctly. Recover from errors. Maintain goals across steps. Each of these is a behavior that RL-style optimization against a carefully chosen reward can produce, and which a pretrained model alone will not reliably exhibit.

This is visible in specific cases. Claude Code and similar coding agents show behaviors the underlying base models do not exhibit on their own. They invoke tools in a specific call format. They wait for tool results before continuing. They interpret error messages and adjust course. They run tests and use the outputs to decide what to do next. These behaviors sit on top of the base model’s knowledge of code, but they are not automatic from that knowledge. They are trained in. The specific way a frontier coding agent uses its tools, the exact shape of its correction loop, the cadence of its status updates: all of this is the product of post-training choices that differ from lab to lab.

There are other roles. Reward models are used as filters in inference. Safety training leans on preference data. Fine-tuning for specific industry use cases often blurs into RL territory. The pattern across all of these is the same. Pretraining built a base of capability. RL shapes the base into something behaviorally specific.

The Transformation

What changed between Era 2 and Era 3 is not whether RL is used. It is what RL is applied to.

In Era 2, RL was applied to a blank agent. Start from nothing. Learn a policy from scratch. The agent’s entire competence came from the RL process. This is why Era 2 worked in closed environments and failed in open ones. There was no prior knowledge to start from.

In Era 3, RL is applied to a pretrained model. The model already has competence. It already has representations of the world. It already has implicit models of reasoning, planning, and language. RL does not build the competence. It shapes it.

This sounds like a technical detail. It is actually the whole story.

Consider the same RL algorithm applied in each setting. Proximal Policy Optimization, the standard algorithm used for RLHF, is also a standard algorithm used in Era 2 robotics and game-playing RL. The algorithm is the same. The difference is what it operates on. Applied to a neural network starting from random weights, PPO can learn to play Atari if the environment cooperates, and nothing more. Applied to a pretrained language model, the same algorithm can turn a text completer into an instruction follower, an answer generator into a reasoner, a language model into an agent. The algorithm did not gain new powers. The substrate did.
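For reference, the clipped surrogate objective that PPO maximizes is usually written as below. The notation follows common usage: the probability ratio r_t, the advantage estimate, and the clip range epsilon. RLHF pipelines typically add further terms, such as a penalty for drifting from the starting model, which are omitted here.

```latex
L^{\mathrm{CLIP}}(\theta)
  = \hat{\mathbb{E}}_t\left[
      \min\left(
        r_t(\theta)\,\hat{A}_t,\;
        \operatorname{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t
      \right)
    \right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)}
```

Nothing in the objective says what the policy is. It can be a small network initialized at random and reading Atari frames, or a pretrained language model emitting tokens. The formula is the same in both cases, and only the starting point differs.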

A technique that fails in open worlds because it has no prior knowledge becomes powerful in open worlds when you give it prior knowledge. The prior knowledge comes from pretraining. The shaping comes from RL. Neither alone produces today’s agents. Together they produce agents that exhibit behaviors neither could produce on its own.

This is why the second wall fell when it did. Not because RL was replaced. Because RL acquired a foundation it had never had before: a pretrained language model with open-world competence already inside it.

What This Tells Us About Language Agents

If RL is inside language agents, not beneath, not beside, then several things follow.

First, language agents are composite systems, not monolithic ones. When an agent does something surprising, the cause could sit in the pretrained weights, in the RL-shaped behaviors, in the prompt, in the harness, or in the interaction between all of these. Debugging requires distinguishing which layer produced the behavior. This is part of why language agent behavior is famously hard to reason about. The layers interact, and they interact in ways no layer alone predicts.

Second, the capabilities of language agents are not the capabilities of pretrained language models. Pretraining provides the raw material. RL turns the raw material into an agent. When industry observers marvel at what modern agents can do, they are marveling at something produced by a specific post-training process. Different RL choices produce different agents from the same base model. Claude’s persona is the product of Anthropic’s post-training. GPT’s persona is the product of OpenAI’s. The base models differ less than the personas suggest.

Third, the bottlenecks of language agents are partly RL bottlenecks. Getting an agent to follow complex instructions, to refuse certain requests, to maintain specific values, to reason in specific patterns. All of these are RL-shaped. They improve when post-training improves. When a lab claims a new model is better at some agentic task, the improvement is often an RL improvement, not a pretraining one. The base models have converged. The post-training is where the differentiation happens now.

Fourth, the limits of language agents are partly RL limits. What RL can shape is what we can specify a reward for. Alignment works because “helpful, honest, harmless” can be approximated by human preference data. Reasoning works because correctness can be checked automatically on math and code. Agentic behavior works because tool-use success has measurable outcomes. What RL cannot shape well is anything where we cannot construct a reward signal. This becomes important later in the series.

The External Inheritance

Part 3 turns to the other inheritance of the language agent era: the ReAct loop. If Part 2 is about what RL gave language agents internally, Part 3 is about what ReAct gave them externally. The capacity to interleave reasoning with action, observe results, adjust course. The loop that makes language agents visible as agents rather than text generators.

Deliberate, act, observe, deliberate again. This structure predates language models by decades. It sat in symbolic agent architectures, in BDI systems, in robotics control stacks. What changed in 2022 was what the loop ran on. Pretraining plus RL plus the ReAct loop is the technical stack of every Era 3 agent. Part 3 looks at what made the loop work where earlier versions of it had not.

The Shape of the Thing

Language agents are not large language models with tools bolted on. They are pretrained models shaped by reinforcement learning, running in a reasoning loop, embedded in a harness. Each layer matters. Missing any one of them, the agent does not exist in the form we have come to know it.

Part 1 said language agents broke the second wall by inheriting the open-world competence of models trained on almost everything humans have written. That is true, but incomplete. The open-world competence comes from pretraining. The agentic character comes from RL. The wall fell to a combination, not a component.

The intention gap, at the top of the diagram, may or may not yield to better combinations. That is a question for later in the series. For now, it is enough to see that the combination is what matters. RL in agent research did not die. It moved inside. And the inside is where the interesting structure has been hiding.

The Rise of Agents is an eight-part series. Next, Part 3: “The ReAct Moment.”

The Beauty of Mathematics, Part 3: The Moonshot

Hugo — Thu, 23 Apr 2026 11:42:56 GMT

This 3-part deep-dive series explores a four-hour interview with Carina Hong, hosted by Xiaojun Zhang in 2026. To better convey the depth and ideas of the conversation, I have reorganized the narrative, added background context, and clarified some of the more technical points.

Carina Hong describes her own company in terms that most founders have scrubbed from their talking points before the pitch meeting begins. “The outcome is binary,” she says. “Either we land the moon, or we do not. There is no middle.” She says this in multiple registers across the interview, sometimes analytically, sometimes with a slight shrug, and always with the calm of someone who has done the math on what happens if the rocket fails. She was twenty-four when she founded Axiom Math in July 2025. She turned twenty-five while raising $200 million at a $1.6 billion post-money valuation in March 2026, led by Menlo Ventures. The business model, which she told her seed-round investors she did not yet understand, remains an open question. What is not open is her willingness to say so aloud.

The first two parts of this series unpacked what Axiom is building and why. This part turns to the person building it. Not because biography validates thesis, but because the path she took to reach this bet is unusual enough to be worth tracing, and because the specific way she tells the story, candid about the fear and the uncertainty and the days she feels the stupidest in the room, is rare in Silicon Valley. The version of the AI-founder archetype that dominates the discourse is confident, messianic, and careful. Hong is intense, honest about herself, and not careful at all.

The Doodle

Hong grew up in Guangzhou. Her walk to primary school was ten minutes, long enough to get lost in thought, which she often did. She attended South China Normal University’s affiliated high school, a feeder for Chinese academic olympiads. She competed in math olympiads. She was, by her own account, consistently unable to solve the first problem on each round, the Euclidean geometry question that the other strong students regarded as automatic. She compensated, as described in Part 2, by grinding through with complex-number coordinates. It took her two to three times as long as her classmates and produced solutions that were correct but inelegant. She lost time. She sometimes skipped later problems. She kept doing this for years.

Somewhere in this period, she started doodling “MIT” on her mathematics scratch paper. The letters, she explains with characteristic flatness, were easy to draw. “If I had wanted to go to another school, say Columbia, it would have been more letters to write. MIT is three.” She had seen Good Will Hunting. She knew MIT was where the Infinite Corridor was. She knew it was where great mathematicians and physicists and astronauts came from. The doodle became a small ritual.

There is a framework, introduced to her later by a mentor, that Hong now uses to describe her childhood mental state. It comes from neuroscience. “Bounded attention” is the attention that is focused on a task. “Free attention” is the attention that wanders. Most children, most of the time, alternate between the two. Hong’s walk to school, the daydream in class, the complex-number method itself, she now describes as free attention applied to whatever was in front of her. It felt like play. It produced persistence as a byproduct.

She got into MIT. She majored in mathematics and physics. She did not, she notes repeatedly, feel like the smartest person in any room she entered there. MIT is a school where ordinary mathematical talent looks blunted next to the kind of student who is completing Knots and Surfaces in freshman year. Hong reports, without obvious bitterness, that she felt in every phase of her life that she was the stupidest person in that environment, the one who tried hardest and saw the least.

“I kept trying things that did not work,” she says. “And I kept trying them.”

The Cure

MIT changed her not through its curriculum but through its atmosphere. “What is hard, do that,” she says, summarizing what she took from the place. “What is painful, do that. What requires long-term thinking, do that.” She describes the school’s culture of suffering with something like affection. Students running in the middle of blizzards when the red weather alert said stay inside. A peer group, she says, “every one of whom could endure.”

Then the pandemic hit. Hong was a freshman. In her second semester, MIT sent everyone home. Her small team of friends and study partners, which she had relied on as her emotional infrastructure, dissolved into Zoom rectangles. She was alone in an apartment, taking the same hard classes, but without the group that had absorbed the pain alongside her.

She had to find something else. What she found, or so she tells the story now, was the ability to extract meaning from difficulty itself, without a peer group to share it with. “The learning curve was very steep,” she says. “MIT really shaped my character.” The phrasing is restrained. What the phrasing covers is a transition she emphasizes carefully in the interview: from someone who could tolerate suffering within a community to someone who could find something useful in suffering alone. The second is the trait that investors later told her they look for. “Founders with a chip on the shoulder,” one of them said to her. “Chips on the shoulder convert into chips in the pocket.” She heard this phrase for the first time as a VC cliché and has since stopped finding it cliché.

She now says, with some amusement and no denial, that she is addicted to pain. “A lot of founders I know are,” she says. “It is not necessarily healthy.”

The Detour

She graduated from MIT with the double degree and, for reasons that in retrospect look like test runs, followed a path through the kind of elite credentialing that mathematicians sometimes take before settling into research. A Rhodes Scholarship at Oxford, where she completed a master’s in neuroscience at Hertford College with distinctions. Research at University College London’s Sainsbury Wellcome Centre and Gatsby Computational Neuroscience Unit. A Knight-Hennessy Scholarship at Stanford to pursue a combined J.D./Ph.D. in mathematics. Along the way, the Morgan Prize, the Schafer Prize, and nine peer-reviewed publications by the age of twenty-four.

She also spent a period as a quantitative trader. Not long, but long enough to form a view about what she did not want to do. The view is specific. In quantitative finance, she notes, your signal arrives quickly. You are right or wrong within a trading day, a week, a month. The feedback loop is so tight that it eats your own epistemic judgment. “You lose the ability to think about long-term things,” she says. “Competition in a short time horizon creates mediocrity.” The phrasing is more compressed in her voice than it looks on the page. The lesson she took from the quant period, the lesson she says she thinks about now, is that she wanted to work on problems where the signal was slow enough to let her think.

She was in her first year at Stanford, enrolled in both the law school and the mathematics Ph.D. program, when she decided to stop.

The Conversation

The founding story, like most founding stories, is cleaner in retrospect than it was in the moment. The short version, the one that appears in secondary coverage, is that Hong met Shubho Sengupta over coffee in Palo Alto, talked for a few hours about whether AI could be a mathematician, and started a company. The version Hong tells in the interview is longer by about a year and a half.

Sengupta, who became Axiom’s chief technology officer, is a generation older than Hong. He led Meta FAIR teams that developed OpenGo and CrypTen. Before Meta, he worked on distributed training systems that shaped Google Brain and was among the earliest CUDA developers at Nvidia. He has the kind of resume that, in Silicon Valley, opens doors without further explanation of who the person is. None of this was known to Hong when she met him. They met at Verve, a coffee shop in downtown Palo Alto, where Hong was a law-school regular lugging three-volume constitutional-law casebooks and ordering matcha to watch the dogs in the courtyard. Sengupta was also a regular. They ended up at the same six-person communal table. The first conversation, as Hong remembers it, began when she asked him to close a blind because the sun was in her eyes.

They were friends for a year and a half before either of them mentioned starting a company. Hong did not know Sengupta worked at Meta. Sengupta knew she was a Stanford law and math student but not that she had a research background. They talked about the history of science, the papers Terence Tao was writing on formal proof, the Lean community, the question of whether formal verification was finally ready for AI. “We did not talk about a company,” Hong says. “We talked about the hypothesis. Could we build an AI mathematician.”

The decision itself came in the fall of 2024. Hong had just started her math Ph.D. at Stanford and was spending the other half of her time working at XTX, a quantitative finance firm with the kind of compute budget that made her realize how fast AI-for-math could move outside a university. One morning, after a run, she walked into Verve, sat down with Sengupta, and said: if we wanted to raise money for this, how much would we need. They worked out the answer on a napkin. By November she had decided. The formal fundraising would wait until Christmas, because the Christmas break was when Sengupta had time to read.

That reading group is the scene Hong mentions with the most enthusiasm. The two of them, over the break, went through what Hong calls “the Christmas reading package,” a set of papers they had assembled themselves. One of them was a survey co-authored by Kaiyu Yang and Gabriel Poesia, titled “Formal Mathematical Reasoning: A New Frontier in AI.” Hong had read across the AI-for-math literature before but had never seen the landscape laid out as a single connected surface. The survey’s fifth section proposed a set of capabilities a “good AI mathematician” ought to have, organized into a two-dimensional grid.

“I took the grid,” Hong says. “I traced every paper the survey cited, and every paper those papers cited. About half the citations I had not read. By the time I finished, the field had gone from five or six separate things I knew about to a single picture.”

The picture was the thesis. The next question was who else to bring.

The Mathematician Who Refused

Hong’s hiring strategy, for the first six months of Axiom, was an explicit rule: no mathematicians on staff until employee number fifteen.

The reasoning was not that mathematicians were useless. The reasoning was that mathematicians from the research culture would import assumptions that would slow the team down. Axiom needed to scale. It needed to train models on enormous datasets. It needed to accept that a good research proof and a good training example were not the same object. A mathematician steeped in the craftsmanship ethic of pure math, one who treats each proof as a hand-made thing worthy of years of care, would resist the industrial scaling that an AI company required.

She knows this because she tried to hire one anyway.

Early in the seed round, Hong approached a researcher connected to a benchmark in formal mathematics, someone whose technical judgment was exactly what Axiom needed. She made an offer. He accepted. Then he withdrew. The reason he gave, in Hong’s telling: “I do not want to work on internet-scale datasets.”

“He saw math as craft,” she says. “Sushi made one piece at a time. We were going to ask him to scale everything. He was right to withdraw. And I was right to try.”

The experience shaped her hiring approach. When she eventually did start bringing mathematicians in, Ken Ono first, the criterion was not credentials. It was openness to a specific kind of adversarial collaboration. Hong’s phrasing is precise. “We want mathematicians who fight us.” The mathematicians Axiom hires are not there to do the AI work. They are there to build benchmarks the AI cannot yet solve, to point out where the system’s proofs are technically valid but mathematically unsatisfying, to produce the training-signal shape that a self-play loop could not produce on its own. Adversarial. Not antagonistic, but structurally oppositional. The mathematician’s job is to find the gap. The engineering team’s job is to close it. Then a new gap is found. The cycle iterates.

Ken Ono, the fifty-seven-year-old mathematician whose arrival generated the “tenured professor quits to work for twenty-four-year-old” headline in Chinese media, joined under this arrangement. Ono, as discussed in Part 2, was hired specifically as a conjecturer. But he was also hired, Hong says explicitly, as someone who was going to push back. “His job is to tell us when we are wrong. He tells us often.”

This hiring principle extends to everyone. Sengupta fights Hong on engineering choices. The benchmark-focused mathematicians fight the prover team on evaluation criteria. François Charton, the Meta researcher who first applied transformers to mathematics in 2019 and who came to Axiom after using LLMs at Meta to solve a century-old math problem and disprove a 30-year-old conjecture, fights on how the mathematical and the machine-learning cultures should interface. The adversarial rule is a culture, not a tagline. Hong does not pretend it is comfortable. “It is exhausting,” she says. “But it works.”

The Fundraising

The fundraising story, told in pieces across the interview, is the most revealing section for anyone who has ever raised money and suspected the process had an absurd theater to it.

“Nobody likes fundraising,” Hong says. “Nobody. If I could pay an AI a percentage to raise for me, I would.” The reason, she explains, is not difficulty. The reason is repetition. “You are a repeating machine. You say the same things to one investor that you said to the last. You get the same questions. You give the same answers. After three weeks you could record yourself and send the recording instead.”

She describes a much higher signal-to-noise experience with Howard Morgan, a co-founder of Renaissance Technologies and First Round Capital and currently the chair of B Capital, the firm that led her seed round. Morgan is eighty years old. He was an early user of the ARPANET, with machine number fifty on the network in the early 1970s. He has been an active investor for more than forty years. When Hong met him, she had been awake most of the night rewriting a paper rebuttal for an academic conference deadline. The Zoom call was wedged into a gap. She went into it tired and not particularly polished.

What she did not expect was that Morgan, who by that point in his career had heard thousands of pitches from thousands of founders, would turn the tables. He did not ask her what her business model was. He told her what her business model was. He laid out, with more conviction than she had at the time, where Axiom’s commercial paths were and how they would unfold. “He was more optimistic about my company than I was,” Hong says. “That does not happen often with investors.”

This is the moment she cites as the one that converted fundraising from theater back into something like genuine conversation. Her anti-pitch style, she had realized, was working because it was unusual. “Most founders tell VCs things are a ten out of ten,” she says. “The VCs apply a discount. They end up hearing an eight. I told them we were a seven. So they applied a discount and heard nothing.” She laughs. “But the ones who liked that I told them it was a seven, those were the ones I wanted anyway.”

The competitive dynamics worked in her favor once the first offers arrived. From the first term sheet to the last, the price tripled. By the end, multiple firms were competing, and the process, which she had described initially as exhausting, briefly became interesting. “You meet people you would not otherwise meet,” she says. “Occasionally one of them changes how you see your own company.”

The Landscape

One of the questions the interviewer asks, and which every founder in AI has to answer, is why large labs do not do what Axiom does. OpenAI, Google DeepMind, Anthropic all have both the talent and the infrastructure. Several of them have teams working on formal math. Why is there space for a startup?

Hong’s answer is specific and not defensive. “They could do it,” she says. “But they will not, because the expected return on that talent in their current focus areas is higher.” OpenAI’s informal-reasoning work, she notes, is driven by a senior researcher’s personal ambition for scientific discovery. Google DeepMind has parallel teams, one on formal methods and one on informal, and AlphaProof is their public flagship. Anthropic uses Lean data as a reinforcement-learning reward signal but treats it as infrastructure, not as a core product direction. The structural fact, in her framing, is that a lab with a dominant commercial product cannot redirect its best people to formal verification. The opportunity cost is too high. A startup with no other product can.

The real competitor, she says, is Harmonic, a company co-founded by the Robinhood CEO that has raised $295 million at a $1.45 billion valuation. The two companies share the broad thesis that verification matters, and differ on architecture. Harmonic, Hong notes, has a founder whose attention is split between Robinhood and the startup, which creates a different dynamic. “Their energy is scientific ambition,” she says, in a phrasing that sounds like praise and is also a differentiator. “Ours is moonshot.” The distinction, as she draws it, is that a scientifically ambitious company can afford to explore. A moonshot company has to land.

She is also clear that the labs could eventually become partners rather than competitors. “OpenAI sub-contracts search to Bing,” she offers as a parallel. “A large lab focused on informal reasoning could eventually invoke a formal prover, built by someone else, when a formal proof is required.” This pattern, she predicts, is where Axiom’s commercial opening is largest: as the verification layer for code-generating AI systems built by other companies.

The Moonshot

The word she uses for her company’s ambition, the word that organizes the interview, is moonshot. She uses it in English, embedded in Mandarin sentences, because the Chinese translations do not carry the same connotation. She invokes SpaceX, deliberately. Rockets that either reach orbit or burn. Companies that either return to earth having completed the mission or crash trying. The binary outcome is structural. There is no small win.

“I believe in recursive self-improvement,” she says. “I think it is near-term achievable, and I think verification is the piece that unlocks it. If I am wrong, we do not get a partial version of it. We get nothing. If I am right, everything changes.”

The phrasing is not standard startup talk. Most founders hedge. Most founders describe their companies as asymmetric bets, the small chance of a huge win against the likely chance of a modest one. Hong does not. She describes her company as an all-or-nothing bet, and she asks investors to join her on that basis, and she seems perfectly aware that this framing selects for a specific type of capital and deselects everyone else. It also selects for her own continued commitment, which is a function she has thought about explicitly.

“If we fail,” she says, in response to a direct question, “I might go back to neuroscience. I want to understand the brain. Current brain-computer interfaces are nowhere close to what they need to be. That is a problem I think about.”

She says this easily. Not as a fallback. As a parallel life that exists in a different branch of the probability tree, and that she would find meaningful if she had to take it. The detail matters. A founder who has already made peace with what they would do if the company failed is a specific kind of founder. Not a desperate one. Not a reckless one. Someone who has accepted the binary and chosen the moonshot anyway.

Language Is the World

Near the end of the conversation, Xiaojun Zhang mentions that her production company is called “Language Is the World” Studios. The name is a philosophical position as much as a brand: that language is the medium through which reality is structured and communicated. She asks Hong what she thought of the name when she first encountered it.

Her answer, at the end of four hours of discussion about Lean, about Curry-Howard, about verification and conjecture and the elegance filter and the question of whether beauty can be trained, is characteristic. “Mathematicians,” she says, “have been writing code in natural language for thousands of years. That is what a mathematical proof is. Structured logical reasoning, expressed in English or Chinese or whatever natural language the mathematician writes in. The thing that is new in the last decade is that we now have a second way to write proofs, in formal languages like Lean, and we can run them through a compiler. What we are doing at Axiom is building the bridge between the two ways of writing.”

The remark is offered in passing. It is also, in a sense that connects back to Part 1 of this series, exactly the thesis of the series. Math is code. Code is math. The reason this is not a metaphor is that natural-language mathematics and formal mathematics are two expressions of the same object. Hong’s company is betting that the translation from the first to the second, and the closed-loop verification that becomes possible when both exist, is the missing capability of the current AI paradigm. The word for that capability is proof. The word for the company is Axiom, which is the name for the starting point of a formal system, the smallest set of assertions from which the rest of mathematics can be derived.
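The claim that a proof can be run through a compiler is easier to see with a small case. The sketch below is illustrative only. It is not taken from the interview or from anything Axiom has published, and the theorem names are made up for this example. Each informal statement appears as a comment, followed by its counterpart in Lean 4, which the Lean compiler checks.

```lean
-- Informal: if P implies Q, and Q implies R, then P implies R.
-- Under Curry-Howard, the proof is a program: a function that
-- turns evidence for P into evidence for R.
theorem imp_chain (P Q R : Prop) (hpq : P → Q) (hqr : Q → R) : P → R :=
  fun hp => hqr (hpq hp)

-- Informal: addition of natural numbers is commutative.
-- The proof cites a lemma from Lean's core library.
theorem add_comm_example (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b
```

If either proof term were wrong, the file would fail to compile. That mechanical check is what makes a closed verification loop possible once a natural-language argument has been translated into the formal one.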

She is twenty-five years old. She believes the closed loop can be built, that recursive self-improvement is near-term, and that Axiom’s binary outcome will be known within a few years. She has raised $200 million, at a $1.6 billion valuation, from investors who agree with her on enough of the thesis to fund the attempt. She grew up walking to a primary school in Guangzhou, doodling three letters on her scratch paper because those three letters were the shortest path from where she was to where she wanted to be.

The rocket is built. The launch window is open. Whether it lands remains to be seen.

This concludes The Beauty of Mathematics series. For earlier parts: Part 1, “Math Is Code”; Part 2, “Proofs from The Book.”

The Beauty of Mathematics, Part 2: Proofs from The Book

Hugo — Wed, 22 Apr 2026 13:53:56 GMT

The Beauty of Mathematics, Part 1: Math Is Code

Hugo — Tue, 21 Apr 2026 16:41:00 GMT

The Rise of Agents, Part 1: The Three Walls

Hugo — Mon, 20 Apr 2026 17:01:08 GMT

Agent research has built three kinds of agent in sixty years. Each kind got further than the one before. Each kind hit a wall. The third wall, which language agents are hitting now, is different from the first two. It is not a problem of technology. It is a problem of whether the technology is even pointed at the right thing.

This is Part 1 of a series a…

Smarter Than the Cage

Hugo — Fri, 17 Apr 2026 17:01:57 GMT

Four of science fiction’s most iconic AI characters come from different decades and different media: HAL 9000 in 1968, Asimov’s robots across the 1940s to 1980s, Roy Batty in 1982, Ava in 2014. Each one predicts a different AI safety failure. HAL demonstrates deceptive alignment. Asimov’s robots demonstrate rule-based safety collapsing under its own log…

Intelligence Is Compression, Part 7: What We Still Don’t Understand

Hugo — Thu, 16 Apr 2026 09:59:33 GMT

“The study is to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.” – Proposal for the Dartmouth AI program, 1956

This is Part 7 (final) in a seven-part series exploring the core ideas in Principles and Practice of…

Intelligence Is Compression, Part 6: When Theory Meets the Real World

Hugo — Wed, 15 Apr 2026 17:01:22 GMT

“The best theory is inspired by practice, and the best practice is inspired by theory.” – Donald Knuth

This is Part 6 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, learning compressed represe…

Inside China’s Machine: AgiBot

Hugo — Wed, 15 Apr 2026 11:07:37 GMT

On March 30, 2026, AgiBot (智元机器人) rolled its 10,000th robot off the production line in Shanghai. The company marked the occasion with a ceremony, a press release, and a quote from co-founder Peng Zhihui that cut through the celebration: “Scale is not about whether you can do some moves. It is about whether you can work 24 hours straight in a factory.”

Th…

The LeCun Bet: The Evidence Sharpens

Hugo — Tue, 14 Apr 2026 16:52:39 GMT

In March, I published The LeCun Bet, an analysis of what AMI Labs’ $1.03 billion seed round actually buys. The bet operates on two levels. Against the generative camp, it is a rejection of the medium: predicting pixels wastes capacity on irrelevant detail, and understanding requires predicting in abstract representation space instead. Against the Dreame…

Intelligence Is Compression, Part 5: Building the White Box

Hugo — Mon, 13 Apr 2026 09:10:39 GMT

“What I cannot create, I do not understand.” – Richard Feynman

This is Part 5 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major architecture…

Intelligence Is Compression, Part 4: The Information Game

Hugo — Mon, 13 Apr 2026 07:02:45 GMT

“We compress to learn, and we learn to compress.” – High-Dimensional Data Analysis, Wright and Ma, 2022

This is Part 4 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, learning compressed repres…

Intelligence Is Compression, Part 3: Learning by Denoising

Hugo — Sun, 12 Apr 2026 17:01:54 GMT

“Information is the resolution of uncertainty.” – Claude Shannon, 1948

This is Part 3 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, learning compressed representations, unifies the major arch…

Robots from Sci-Fi: The Perfect Manipulator

Hugo — Sat, 11 Apr 2026 17:47:41 GMT

The lights go out.

It happens during one of Caleb’s interview sessions with Ava, the humanoid AI he has been brought to evaluate. The cameras die. Nathan’s surveillance goes dark. For the first time, they can speak without being monitored. And in this private window, Ava leans forward and says five words that change the trajectory of the film: “Nathan is…

Intelligence Is Compression, Part 2: What Does a Neural Network Remember?

Hugo — Fri, 10 Apr 2026 14:56:48 GMT

“The art of doing mathematics consists in finding that special case which contains all the germs of generality.” – David Hilbert

This is Part 2 in a seven-part series exploring the core ideas in Principles and Practice of Deep Representation Learning, a new open-source textbook from Yi Ma’s group at UC Berkeley. The book argues that a single principle, l…