Inside China’s Machine: DeepSeek V4
Two Open-Weight Models. Eight Chip Families. One Frontier Co-Engineered for Non-Nvidia Silicon. The Stack Was the Moat. Now It Has a Fork.
The most-quoted line about DeepSeek V4 came from Jensen Huang on the Dwarkesh Patel podcast a week before the model launched. Asked about reports that DeepSeek’s next frontier model would run on Huawei Ascend chips rather than Nvidia GPUs, Huang said it would be “a horrible outcome for America.” The financial press treated this as another China-AI-race headline. The technical press treated it as Huang’s predictable defense of Nvidia’s market position. Both readings missed what Huang was actually saying.
The threat Huang named was not that China can build good models. China has been building good models since DeepSeek R1 fifteen months ago. The threat was that good models might no longer use CUDA as their default optimization target. Nvidia’s moat is not the silicon. The silicon is replicable. The moat is the twenty-year compounding of CUDA: five million developers, every textbook example written against it, every PhD student trained on it, every framework built around it. A Chinese frontier model trained outside that ecosystem does something more structurally important than match Western performance. It demonstrates that the ecosystem can fork.
DeepSeek V4 launched on Friday, April 24, 2026. Two preview versions, both open-weight under MIT license. V4-Pro at 1.6 trillion total parameters with 49 billion active. V4-Flash at 284 billion total with 13 billion active. Both default to a one-million-token context window. Both ship with day-zero inference support across Huawei’s Ascend 950PR supernodes. Day-zero is the part that matters. Eight domestic Chinese chip families completed V4 adaptation simultaneously through BAAI’s FlagOS national AI software stack. Within hours, Cambricon, Hygon, Moore Threads, Suiyuan, and four other Chinese accelerator vendors confirmed native support. Alibaba, ByteDance, and Tencent had pre-ordered hundreds of thousands of 950PR units in the weeks before launch, pushing chip prices up twenty percent. The day DeepSeek shipped V4, China’s domestic AI compute ecosystem was already coordinated to receive it.
This is the story of what actually happened, why it took DeepSeek fifteen months instead of three, and what it means that the Chinese AI stack now offers the only frontier-model deployment path with a credible route to Nvidia independence.
What the Tech Report Actually Says
DeepSeek’s fifty-eight-page technical report, released alongside the model weights on Hugging Face, is more honest than most of the coverage of it. The report states that V4 was trained with parallel verification on both Nvidia GPUs and Huawei Ascend NPUs. Parallel verification means the two platforms produced numerically aligned results during training, not that V4 was trained twice. The economic cost of duplicate frontier training runs (more than $500 million per run, by some reports) makes that physically implausible. What parallel verification did was establish Ascend as a target platform that could be trusted to reproduce CUDA-derived results, with Nvidia serving as the ground-truth baseline. Huawei’s own announcement says its chips were used for a portion of V4-Flash training. The bulk of V4-Pro training, the 1.6-trillion-parameter model, almost certainly ran on Nvidia GPUs at peak capability. The 950PR’s role at launch is inference, not training. The 950DT, Huawei’s first Ascend chip optimized for both decoding and training, ships in Q4 2026. The 950DT will reduce but not eliminate Nvidia’s training-side advantage: its single-chip FP8 performance stays at 1 PFLOPS, the same as the 950PR and roughly a quarter of Nvidia’s B200, and what changes is the memory, HiZQ 2.0 at 144 GB and 4 TB/s, aimed at sustained-bandwidth training workloads. Huawei’s announced roadmap targets full single-chip parity with Nvidia only by 2028 with the Ascend 970. The intermediate Ascend 960 (Q4 2027) targets parity with Blackwell, which by 2027 will already be one generation behind Nvidia’s then-current chip.
The truthful framing: V4 is the first frontier-class model co-engineered for Chinese silicon, not the first trained entirely on Chinese silicon. The distinction matters because it tells you what stage the fork is at. Training of the largest models in the Chinese AI stack still depends on Nvidia for peak capability and on Nvidia as the verification baseline. Inference has a credible path to Ascend independence over the next twelve months, though the full switchover waits on 950PR’s at-scale shipments in the second half of 2026. For a model whose economic value at deployment depends mostly on inference cost, the inference-side independence is meaningful even when the training side is not yet free.
The architectural choices reveal how DeepSeek made it work. The report introduces five innovations:
Hybrid attention. V4 combines Compressed Sparse Attention, DeepSeek Sparse Attention, and Heavily Compressed Attention. CSA dynamically compresses key-value entries before computing attention. DSA sparsifies the resulting attention matrices. HCA aggressively consolidates KV entries across token sets. The net effect: 73 percent fewer per-token inference FLOPs than V3.2, and 90 percent less KV cache memory at one-million-token context. Nvidia’s own technical analysis confirmed these numbers when integrating V4 into Blackwell.
Manifold-Constrained Hyper-Connections. Standard transformers use residual connections that lose information in deep networks. V4’s mHC confines gradient flow to specific geometric manifolds, which the report describes as “a flexible and practical replacement for residual connections.”
Engram Conditional Memory. V4 separates factual memory from computational reasoning. Engram provides O(1) knowledge retrieval, which lifts needle-in-a-haystack accuracy at one million tokens from 84.2 percent to 97 percent in DeepSeek’s benchmarks. The report identifies a U-shaped scaling law: reallocating 20-25 percent of sparse capacity from MoE experts into Engram memory optimizes overall performance. This is the first production model to formalize “conditional memory” as a sparsity axis distinct from “conditional computation.”
Native FP4 quantization-aware training. V4 trains directly in FP4 precision. The Ascend 950PR has hardware-native FP4 support, which means no precision conversion overhead and seventy-five percent memory reduction per weight. The chip and the model are precision-matched at the silicon level. This is not coincidence. DeepSeek and Huawei co-designed for this.
Muon optimizer. Replaces Adam-based optimizers with a more aggressive convergence strategy, which lets V4 train on 33 trillion tokens within a compute budget that earlier optimizers would have substantially exceeded.
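Two of the numbers above compound at long context: the 90 percent KV-cache reduction and FP4’s four-times-smaller weight storage. Here is a back-of-envelope sketch of what that means at a one-million-token window; the layer count, head count, and head dimension are illustrative placeholders, not figures from the V4 report.

```python
# Back-of-envelope KV-cache sizing at a 1M-token context window.
# Layer/head/dim values are illustrative placeholders, NOT from the V4 report;
# only the ~90% cache reduction and the FP16->FP4 byte ratio come from the text.
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bytes_per_elem):
    # 2x for separate key and value tensors at every layer
    return 2 * tokens * layers * kv_heads * head_dim * bytes_per_elem

tokens = 1_000_000
baseline = kv_cache_bytes(tokens, layers=60, kv_heads=8, head_dim=128,
                          bytes_per_elem=2)  # FP16 storage
print(f"uncompressed FP16 KV cache: {baseline / 2**30:.0f} GiB")

compressed = baseline * 0.10             # ~90% fewer cached entries (claimed)
compressed_fp4 = compressed * (0.5 / 2)  # FP4 is 0.5 bytes/elem vs FP16's 2
print(f"compressed FP4 KV cache:    {compressed_fp4 / 2**30:.1f} GiB")
```

At these placeholder dimensions the cache drops from hundreds of GiB to single-digit GiB, which is roughly the difference between a one-million-token window being a multi-node problem and fitting on a single accelerator.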
The integrated effect is the cost structure that matters. V4-Pro’s input price is 1 yuan per million tokens. V4-Flash is 0.2 yuan. The same agentic coding workload that costs $30 per million tokens on a US frontier API costs $3.48 on V4-Pro and under one dollar on V4-Flash. Pricing this aggressive only works if the model actually costs less to run, which the architectural innovations make true rather than theatrical.
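The pricing claims reduce to simple arithmetic. A small sketch, assuming an exchange rate of roughly 7.2 yuan per dollar (my assumption, not a figure from the article); the $30 and $3.48 workload costs are the article’s own.

```python
# Converting the yuan-denominated API prices to USD per million tokens.
CNY_PER_USD = 7.2  # assumed exchange rate, not a figure from the article

v4_pro_input_usd = 1.0 / CNY_PER_USD    # V4-Pro: 1 yuan per million input tokens
v4_flash_input_usd = 0.2 / CNY_PER_USD  # V4-Flash: 0.2 yuan per million input tokens
print(f"V4-Pro input:   ${v4_pro_input_usd:.3f} per million tokens")
print(f"V4-Flash input: ${v4_flash_input_usd:.3f} per million tokens")

# Workload-level ratio using the article's own figures:
# $30 per million tokens on a US frontier API vs $3.48 on V4-Pro.
ratio_pro = 30.0 / 3.48
print(f"V4-Pro is roughly {ratio_pro:.1f}x cheaper on that workload")
```

The V4-Flash conversion lands at about three cents per million input tokens, the order of magnitude at which always-on agentic loops become economical rather than a line item to ration.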
The Migration That Took Fifteen Months
The 36Kr investigative report on V4’s delay is the most useful Chinese-language source on what actually happened during the fifteen months between R1 and V4. The reporting traces the silence to two converging causes: a serious training failure in mid-2025, and a strategic decision to migrate the training framework from Nvidia CUDA to Huawei CANN.
The migration was an order of magnitude harder than the public framing suggested. According to engineers close to DeepSeek, the most time-consuming part was not rewriting operators. It was aligning numerical precision so that the same model produced the same mathematical results on Nvidia and Ascend platforms. When DeepSeek attempted training on the Ascend 910C, the 1024-card cluster’s gradient synchronization timed out, and the older CANN release lacked key operators, which produced training instability. The 950PR addressed both issues: inter-chip bandwidth tripled, and CANN Next builds FlashAttention and PagedAttention into the framework natively. Liang Wenfeng’s technical demands during this period were reportedly difficult to translate into implementation, and internal disagreements about the training direction slowed progress further.
The cost of this migration was visible in what V4 is not. V4 ships text-only. The multimodal generation and understanding capabilities that DeepSeek had targeted were postponed to a future release, the report states, because of compute and cash constraints from the Huawei migration. The talent bench thinned during the same period: Luo Fuli, a core V3 architect, left for Xiaomi to lead spatial intelligence. Guo Daya, the lead author on R1’s GRPO algorithm, joined ByteDance’s Seed team on a reported package that ByteDance denied was 100 million yuan annually but confirmed included equity. Wang Bingxuan, an early DeepSeek LLM author, went to Tencent. Ruan Chong, a multimodal researcher, joined Yuanrong Qixing. Headhunter accounts described offers at two to three times prevailing salary, with immediately priced stock options attached. DeepSeek could not match on the equity line because its equity had no price.
The fundraising decision in mid-April 2026 was a direct response to this. Liang Wenfeng spent two years rejecting outside capital. He turned down Tencent’s offer of a twenty-percent exclusive stake. The eventual round opened at a $10 billion valuation seeking $300 million. Five days later, The Information reported that talks with Tencent and Alibaba had pushed the figure above $20 billion. The stated purpose of the round, in the words of an investor familiar with Liang’s thinking, was not cash. It was to give DeepSeek’s employee stock options a market price. Without an external valuation, the equity that retained engineers required a number to anchor against. The twenty-billion-dollar tag is, in this reading, what retention costs.
The picture this assembles is of a research-first organization being pulled into commercial-company shape by forces that R1’s success generated. Doubao surpassed DeepSeek to become China’s number-one consumer AI app in August 2025, reaching 331 million monthly active users by March 2026. DeepSeek experienced an eleven-hour outage in late March that trended on Chinese social media. Liang began paying attention to product refinement. DeepSeek’s HR began contacting Chinese-language students at Peking University to do humanities-domain data annotation. The April 8, 2026 redesign of the DeepSeek app introduced Expert Mode for complex reasoning and Fast Mode for simple tasks, mapping directly to V4-Pro and V4-Flash. The company spent the V3-era idealism, and the V4 release was the first product of the company DeepSeek became after spending it.
The Performance Picture
V4-Pro’s headline benchmark is SWE-bench Verified at 80.6 percent, within 0.2 percentage points of Claude Opus 4.6. DeepSeek’s tech report claims V4-Pro beats all open-weight models in agentic coding, beats Claude Sonnet 4.5 on internal agentic coding evaluation, and approaches Claude Opus 4.6 in non-thinking mode. On Codeforces competitive programming, V4-Pro scores 3,206, ranking 23rd among human competitors. On Humanity’s Last Exam, the score jumps from 7.7 in non-thinking mode to 37.7 in thinking mode.
The honest reading of these numbers requires distinguishing categories. V4 is open-weight. Claude Opus 4.6, GPT-5.4, and Gemini 3.1 Pro are closed API models. Same market, same maturity stage, but different commercial models and different distribution channels. V4 ships weights you can download, modify, and run locally. The closed competitors do not. For agentic coding workloads, the price differential is decisive: cache-hit V4-Flash input pricing at $0.028 per million tokens is roughly ninety times cheaper than equivalent Claude Sonnet output, while Vals AI’s Vibe Code Benchmark ranked V4 as the leading open-weight model.
Within open-weight competition, the picture is denser. A Zhihu evaluation found V4 not clearly superior to Zhipu’s GLM 5.1 or Kimi K2.6, both of which shipped while DeepSeek was silent. Zhipu and MiniMax explicitly accelerated their releases to avoid being overshadowed by V4’s timing. The day V4 launched, MiniMax stock fell 8 percent in Hong Kong, Zhipu fell 8 percent, and Manycore Tech fell 9 percent. Morningstar’s Ivan Su captured the implication: “DeepSeek’s latest positioning places other Chinese open-source models as direct competitors. This is a framing that didn’t exist with R1.”
The DeepSeek tech report is unusually candid on the gap. V4-Pro “falls marginally short of GPT-5.4 and Gemini 3.1 Pro, suggesting a developmental trajectory that trails state-of-the-art frontier models by approximately three to six months.” This is a sober acknowledgment that the frontier of intelligence is still set by closed Western labs, and that V4’s significance lies elsewhere.
The elsewhere is the stack itself.
The Stack Forks
Nvidia’s CUDA dominance has been the AI industry’s most durable infrastructure assumption since 2012. CUDA is what made Nvidia the operating system of AI training, more than the silicon underneath. Five million developers, the textbooks, the framework integrations, the implicit assumption in every AI research paper that the code will compile against CUDA: this is what Huang has spent a decade defending. The chip business is downstream of the software ecosystem.
CUDA has been challenged before. Google’s TPU runs through XLA. AMD has ROCm. Intel had oneAPI. None of these has broken CUDA’s grip on the frontier of training, because none has been the default optimization target for a frontier model that the rest of the industry then has to support. V4 changes the asymmetry. CANN Next, Huawei’s CUDA equivalent, now has a frontier model that was co-engineered for it. Adding a SIMT programming model that compiles CUDA-style code directly for Ascend lowers the migration barrier for developers already trained on CUDA. Huawei’s reported four million CANN developers still trail CUDA’s five million plus, but the trajectory matters more than the level. A frontier model that ships first on a non-Nvidia stack is the kind of event that pulls the developer base.
The market response in the days following V4’s launch revealed which actors believed this. SMIC, the Chinese chipmaker that fabricates Huawei’s Ascend processors, jumped 10 percent in Hong Kong trading. Cambricon’s stock continued a multi-month rally driven by ByteDance’s reported $22 billion 2026 AI infrastructure budget, of which Cambricon is the largest beneficiary among domestic chip vendors. Domestic chip share in China climbed to over 40 percent of the 2025 AI accelerator market by IDC’s measurement, with 1.65 million units shipped. Nvidia’s China share fell from over 70 percent at peak to roughly 55 percent. The Tencent-Alibaba-ByteDance bulk pre-orders for hundreds of thousands of 950PR units, the price increase of twenty percent in the weeks before V4 launch, and BAAI’s day-zero adaptation across eight chip families together describe a domestic ecosystem that is no longer waiting on Nvidia’s roadmap.
The harder question is whether this generalizes outside China. Three constraints make the answer uncertain.
The performance gap is real. The 950PR delivers 1 PFLOPS at FP8. Nvidia’s Blackwell B200 hits 4.5 PFLOPS. Huawei is closing the gap through architectural innovation and FP4 hardware, but raw compute still shows a generational lag. V4’s compressed attention architecture cuts inference compute to 27 percent of V3.2’s, which is what allows the 950PR to host a frontier model in the first place. A model designed for brute-force scaling rather than efficiency might not replicate this path on Ascend.
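The arithmetic behind that last sentence is worth making explicit. Taking the article’s peak-FLOPs figures and the 27 percent number at face value, a sketch of the effective gap:

```python
# Effective inference-throughput gap, taking the article's numbers at face value.
pr950_pflops = 1.0          # Ascend 950PR peak FP8
b200_pflops = 4.5           # Nvidia Blackwell B200 peak FP8
v4_compute_fraction = 0.27  # V4 needs ~27% of V3.2's per-token inference FLOPs

# A 950PR serving V4 does the work of this many PFLOPS serving a V3.2-class model:
effective_pflops = pr950_pflops / v4_compute_fraction
print(f"950PR serving V4 ~ {effective_pflops:.1f} PFLOPS serving V3.2")
print(f"raw chip gap: {b200_pflops / pr950_pflops:.1f}x; "
      f"effective gap vs V3.2-class serving: {b200_pflops / effective_pflops:.2f}x")
```

This is an illustration of model efficiency substituting for chip capability, not a hardware benchmark: a B200 running V4 keeps the full 4.5x raw advantage, and the comparison only shows why a 1 PFLOPS chip can host a frontier-class model at all.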
The training side remains Nvidia-dominant. V4 used parallel verification on both platforms during training. The 950DT ships in Q4 2026, but its single-chip performance stays at the 950 die’s 1 PFLOPS FP8 baseline. Huawei’s roadmap targets parity with Nvidia’s current generation only by 2028 with the Ascend 970. Until at least 2027, frontier training in China leans on the most advanced Nvidia hardware that export controls permit, supplemented but not replaced by Ascend at the bleeding edge. The dependency has shifted from total to partial, not from total to none, and the partial-to-none transition is a multi-year process.
The CUDA ecosystem has decades of compounding. CANN’s four million developers, the SIMT compatibility layer, the FlashAttention and PagedAttention native support: these reduce the migration cost but do not eliminate it. The library depth, the tooling maturity, the Stack Overflow corpus, the years of accumulated debugging knowledge are non-fungible. Migration from CUDA, even with the best compatibility layer, will be a multi-year undertaking for any large codebase.
These constraints argue against the strongest version of the “stack forks” claim. The weaker version, which the evidence supports, is that there is now a credible alternative training-and-inference path that did not exist twelve months ago, and that the path is robust enough to host frontier models. CSIS analysts framed the implication directly: if V4 achieves frontier performance on Ascend silicon, the premise that restricting Nvidia exports can slow Chinese AI development is no longer correct. The European Union Institute for Security Studies described DeepSeek’s emergence as “the beginning of AI’s multipolarization.” Both are true. Neither implies that the multipolar world is symmetric.
What This Pattern Reveals
DeepSeek R1 in January 2025 demonstrated that frontier capability did not require frontier compute. That was a pricing argument: clever architecture could substitute for raw scale. The implication was that AI capex assumptions priced into hundreds of billions of dollars of US infrastructure investment had a thinner moat than anyone acknowledged.
V4 demonstrates something different. The argument is no longer about capex. It is about the stack. Frontier capability does not require Nvidia compute. The Chinese alternative stack is now functional end-to-end at the inference layer and partial-but-rising at the training layer. Three implications follow.
For the US export-control framework. The strategy assumed Chinese AI development could be slowed by restricting Nvidia hardware. V4 makes this assumption visibly false at the inference layer and structurally weakening at the training layer. The policy options narrow to two: escalate controls to target Huawei silicon and CANN software directly, or rethink the framework. The first path is technically possible but politically and diplomatically expensive, since Huawei is not export-dependent on US technology in the way that earlier sanctioned firms were. The second path is what most non-US analysts now advocate, but it requires accepting that the strategy has not delivered.
For the open-weight ecosystem. The competitive structure within Chinese AI now resembles US open-weight competition more than US-vs-China competition. DeepSeek’s direct competitors as of April 2026 are Alibaba’s Qwen3, Zhipu’s GLM 5, MiniMax’s M2, Manycore’s Spatial Gen, and ByteDance’s Doubao. These are different categories of company. Doubao is a consumer-app-first product, Qwen is a hyperscaler open-weight family, MiniMax is an API-plus-Hailuo product, Zhipu is enterprise-first, DeepSeek is research-first. The convergence onto compatible Huawei Ascend deployment removes the underlying compute fragmentation that previously justified separate strategies. Within the next year, choosing between Chinese open-weight models will resemble choosing between Llama 4 and Mistral Large in the West: different fine-tunes, similar capabilities, different distribution channels. V4’s 1 yuan per million input tokens establishes a low-end price anchor that the rest of the cohort will have to respond to.
For Western open-weight strategy. Meta, Mistral, and Cohere are now competing not just against Chinese frontier capability but against Chinese frontier capability plus a deployment stack at roughly an order of magnitude cheaper inference pricing. The structural advantage of open-weight Western labs has historically been ecosystem maturity: PyTorch, Hugging Face, the developer community. That advantage compresses each year. Whether Western open-weight can hold the line depends on factors largely unrelated to model capability: what happens to Nvidia’s pricing as Chinese competition emerges, what happens to inference cloud costs as alternative silicon scales, what happens to enterprise procurement as deployment portability becomes a buyer requirement.
The Founder Who Stopped Saying No
The most underweighted element of the V4 story is what it cost Liang Wenfeng to make it happen.
The R1-era DeepSeek was a research lab that happened to ship. Liang ran it from High-Flyer’s profits, paid researchers without urgency, kept commercial pressure away from the bench. His public statements emphasized that VCs need returns, that capital corrupts research culture, that DeepSeek would not raise. The model worked because High-Flyer’s quantitative trading produced 56.6 percent returns in 2025, which generated enough cash to fund an AI lab without requiring it to ever justify itself to outside investors.
The V4-era DeepSeek is a company. Liang accepted external capital. He took meetings with Tencent and Alibaba for stakes that, even after refusing the largest single demand, will dilute High-Flyer’s near-total ownership. He let HR run open-door recruitment for product strategists, established internal product teams to explore agents, redesigned the consumer app, accepted that V4 would ship without multimodal capabilities he had wanted. The eleven-hour outage in March was followed by infrastructure spending. The talent exodus was followed by an equity-pricing fundraise. The pattern is clear: the company is becoming what Liang spent two years trying to avoid.
Whether this is loss or evolution depends on what you think DeepSeek is for. If you read the company as a research institution producing public goods through open-weight releases, the V4-era trajectory is a compromise of original purpose. If you read it as Liang’s own description, an attempt to develop AGI under an organizational structure that maximizes research freedom subject to survival constraints, the V4-era trajectory is the second move in a game where the first move has stopped being available. The first move was rejecting capital while High-Flyer’s returns covered the budget. A frontier training run now costs more than $500 million by some reports. High-Flyer’s hedge fund profits, large as they are, cannot absorb that on an annual basis without becoming a different kind of fund. The math forced the choice.
The V4 release is the first product of the choice. It is also a demonstration that the choice can produce results that match or exceed what the prior structure produced, which is the only argument that retroactively justifies the choice to anyone who liked the prior structure.
The Significance Is Where the Hype Isn’t
The headline coverage of V4 has emphasized three things: low price, Huawei silicon, the threat to Nvidia. Each is true. None is the most important thing.
The most important thing is that the Chinese AI stack now exists as a coherent alternative deployment path, end-to-end, at frontier capability. Not symmetric to the US stack. Not yet superior on raw training capability. But coherent in the sense that you can choose it, build on it, ship in it, and the loop closes without requiring any non-Chinese component except the lithography stack underneath the silicon. Even there, China’s domestic manufacturing is climbing the curve.
This is a structural change, not a moment. R1 was a moment. V4 is the first move of a stack that intends to keep moving. The next chapters will be written by 950DT closing some of the training gap in 2026-2027, by Ascend 960 and 970 closing more of it through 2028, by FlagOS adapting to next-generation models, by Cambricon and Hygon catching up to Huawei in their respective niches, by Chinese open-weight labs converging onto a shared deployment substrate that does not require Nvidia.
For Western enterprises, the practical question is which side of this they will be procuring on by 2028. For Western policymakers, the question is whether the framework that assumed Chinese AI could be slowed by hardware controls survives the demonstration that it cannot. For Chinese AI labs, the question is which of them can compete in the market that DeepSeek has just made denser, and at what margin.
Jensen Huang called it a horrible outcome for America. He chose the words carefully. The outcome he feared was not that V4 exists or that it runs on Huawei chips. It was that the moat he spent twenty years building turns out to be less durable than it looked, and that the demonstration of this came from a Chinese lab that fifteen months ago was a side project of a quantitative hedge fund.
The stack was the moat. Now it has a fork.
Sources
Launch and core specs: DeepSeek API documentation (official news260424); DeepSeek tech report on Hugging Face; CNBC (“China’s DeepSeek releases preview of long-awaited V4 model,” April 24, 2026); Fortune (“DeepSeek unveils V4 model, with rock-bottom prices,” April 24, 2026); Al Jazeera; Investing.com; ghacks Tech News.
Architecture: NVIDIA Developer Blog (“Build with DeepSeek V4 Using NVIDIA Blackwell”); kenhuangus Substack (“DeepSeek V4: The Next Frontier of Open-Source AI”); aitoolinsight (“DeepSeek Unveils V4 at Rock-Bottom Prices”); BigGo Finance technical report summary; remio.ai.
Huawei Ascend integration: South China Morning Post (“Huawei, DeepSeek strengthen China’s AI self-reliance”); Reuters (via Investing.com); Huawei Central; weijinresearch Substack on 950PR specifications and CANN Next; digitado.com.br.
Migration story and training failure: 36Kr investigative report (“DeepSeek V4 Released: Five Subjective Questions Remain Unanswered”); 36Kr “Jensen Huang Labels It a ‘Disaster’”; overnightai.substack.com summary of FlagOS day-zero adaptation across eight chip families.
Liang Wenfeng and fundraising: The Information (via Unite.AI, “DeepSeek Seeks First Outside Funding at $10 Billion Valuation,” April 17, 2026); Implicator.ai (“Tencent, Alibaba in Talks to Back DeepSeek at $20 Billion,” April 22, 2026); BigGo Finance financial logic analysis; Tech Startups; futunn.com summary of architectural targets and mid-2026 timeline.
Domestic chip ecosystem: IDC 2025 China AI accelerator market data via digitado; Counterpoint analyst Wei Sun via CNBC; ByteDance ¥160B 2026 infrastructure spend reporting via 36Kr.
Talent departures: Unite.AI; SCMP via Implicator.ai; BigGo Finance.
Performance benchmarks: DeepSeek tech report; Vals AI Vibe Code Benchmark; Zhihu evaluation summary via overnightai; Codeforces ranking via aitoolinsight.
Strategic context: CSIS analysis on export controls (cited in remio.ai); EUISS framing via remio.ai; Jensen Huang Dwarkesh podcast quote via 36Kr and digitado.
Classification: Architectural specifications and benchmark numbers from official tech report are Confirmed. Training migration details (mid-2025 failure, internal disagreements) are Reported per 36Kr’s “insiders” sourcing. Fundraising figures ($10B-$20B valuation range, $300M raise target, Tencent’s rejected 20% offer) are Reported per The Information sourcing. Talent compensation figures (Guo Daya $14M-equivalent package) are Reported and Denied; ByteDance confirmed equity inclusion but not the specific number. Multimodal capability postponement and consumer app product strategy are Reported per 36Kr Intelligent Emergence sourcing. Performance trajectory (“3-6 months behind GPT-5.4 and Gemini 3.1 Pro”) is DeepSeek’s self-reported framing in the tech report.


