The Beauty of Mathematics, Part 1: Math Is Code

The AI industry learned to generate. The next leap is learning to prove. A 25-year-old mathematician is betting $1.6 billion on the gap between the two.

Apr 21, 2026

This 3-part deep-dive series explores a four-hour interview with Carina Hong, hosted by Xiaojun Zhang in 2026. To better convey the depth and ideas of the conversation, I have reorganized the narrative, added background context, and clarified some of the more technical points.

On December 6, 2025, the Putnam Competition opened at 10 a.m. across North America. Over four thousand undergraduates sat down to attempt twelve problems over six hours. The median score has historically hovered near zero. Out of more than 150,000 attempts in 98 years, only five humans have ever scored perfectly.

At the same moment, in a conference room named after Poincaré in Palo Alto, an AI system called AxiomProver received the same exam. By 3:58 p.m., it had solved eight problems. Within days, it had solved all twelve. No AI system had ever scored perfectly before. No human did in 2025 either, with the top human score at 110 out of 120. AxiomProver’s 12/12 was the first perfect Putnam score not produced by a human.

The score is not the interesting part.

The interesting part is what the score means. Each of AxiomProver’s twelve solutions was a proof written in Lean, a programming language that doubles as a formal mathematical system. When AxiomProver finished, a program checked its work. Not a human grader. A compiler. The compiler either accepted the proof as a valid derivation from the axioms, or it rejected it. There was no partial credit, no interpretation, no subjective margin. Correctness was determined automatically, without appeal.

Two months later, the same system crossed into research mathematics. A mathematician named Dawei Chen had been carrying a problem for five years. With his collaborator Quentin Gendron, Chen had worked on a question in algebraic geometry that came down to a specific formula from number theory. They could not prove the formula. They published the work as a conjecture. For five years, the conjecture sat. At a math conference in Washington in early 2026, Chen mentioned the problem to Ken Ono, a mathematician who had quit a tenured chair at the University of Virginia to become Axiom’s Founding Mathematician. The next morning, Ono returned with a proof. AxiomProver had been given only the statement, with no hints about method or context. It had produced a reformulation in terms of Jacobi symbols, reduced the problem to a combinatorial identity, and verified the identity in Lean. The conjecture was no longer open.

This is the property at the center of Carina Hong’s bet. Axiom Math, the company she founded in July 2025 and for which she raised $200 million at a $1.6 billion post-money valuation in March 2026, is not primarily in the business of getting AI to match human performance on math competitions. That has been possible since DeepMind’s AlphaProof reached silver-medal level at the International Mathematical Olympiad in 2024. What Axiom is doing, and what competition scores cannot quite convey, is deploying a system whose outputs are checkable. Every proof the machine produces is either correct, verifiably, or it is thrown away. There are no hallucinations. There is no “mostly right.” The output is either a theorem or a rejection.

The rest of the AI industry does not yet have this property. What it means to extend this property beyond mathematics is the subject of everything else Hong’s company is building.

The Generation Trap

Hong’s diagnosis of the field is categorical. The AI industry, she argues, has converged on the wrong foundation. The paradigm of the past five years has been generation: train a model to predict the next token, the next pixel, the next action, and scale it until the predictions look correct. The paradigm has produced miracles. It has also produced systematic unreliability. Language models hallucinate. Video models violate physics. Code assistants ship bugs that compile. The pattern is not that the systems fail to generate. The pattern is that the systems generate outputs they cannot verify.

A large language model trained on internet text learns the distribution of human-produced language. It predicts what a plausible next token looks like, given the tokens that came before. This turns out to be enormously useful. It also turns out to have a ceiling. Models that are excellent at producing plausible text are not, by construction, good at distinguishing plausible text from correct text. The training objective does not make this distinction. The model that writes a convincing fake citation and the model that writes a correct real citation are, in the limit of their training, optimizing for the same thing.

Every frontier lab has spent the last two years trying to close this gap. The techniques have names: Reinforcement Learning from Human Feedback, Constitutional AI, process reward models, verifier models, best-of-N sampling, chain-of-thought, test-time compute. Each technique incrementally reduces the rate at which models produce wrong-but-plausible outputs. None of them eliminate it. Hong’s diagnosis is that this is not a tuning problem. It is a paradigm problem. As long as the foundation is generation, the verification layer will always be bolted on afterward, and it will always leak.

Formal mathematics has none of this ambiguity. A proof in a formal system is either accepted by the checker or not. There is no “mostly correct.” There is no confidence score. Either the proof type-checks and the theorem is established, or the proof does not type-check and nothing has been established. The verification is deterministic, machine-checkable, and absolute.

This property is what makes formal math different from almost every other domain AI has tackled. In coding, there is no ground truth about whether a particular function is “right” for a task that was specified in natural language. In language, there is no ground truth about whether a summary is “correct.” In video, there is no ground truth about whether a generated sequence obeys physics. But in formal math, there is a ground truth, and the ground truth can be checked mechanically by a deterministic program whose logic is precisely specified.

The question Hong’s company is asking is: what happens when you build the closed loop? When every output of the generation process is automatically checked, when incorrect outputs are thrown away at machine speed, when the training signal itself comes from the verifier rather than from human preference? The answer, she thinks, is that you get a qualitatively different kind of AI. One that cannot hallucinate, because a hallucination is a proof that does not type-check. One that can prove its own correctness, because the system literally knows when it is wrong.

And the answer is not confined to math.

A Theorem Is a Program

The reason this thesis is not merely aspirational is a correspondence discovered nearly a century ago.

In 1934, the logician Haskell Curry noticed something that would take decades to mature. In a computational system, every function carries a type that specifies its inputs and outputs. Curry observed that these types could be read as logical propositions. In 1969, William Howard made the correspondence precise: every mathematical proof, written in a sufficiently expressive formal system, is isomorphic to a computer program. Every program, conversely, is a proof of the proposition that its type encodes. The two are not analogies. They are the same object viewed from two different sides.

This is the Curry-Howard correspondence, and for decades it was mostly an elegant theoretical result. It said: if you want to verify a mathematical proof, you can compile a program and check that it has the right type. If you want to verify that a program does what its specification says, you can prove a theorem and check its proof. The first application was mostly of interest to logicians. The second was mostly of interest to designers of safety-critical systems.

Lean is a programming language designed around this correspondence. It was created by Leonardo de Moura at Microsoft Research in 2013 and is now developed primarily at Amazon Web Services. A Lean file reads like code: functions, types, definitions, tactics. But when the Lean compiler accepts a file, what it has verified is not merely that the code runs. It has verified that the theorems the code encodes are true, within the axioms of the system. Lean does not distinguish between mathematical objects and computational objects, because the correspondence says there is nothing to distinguish.

“Math is code and code is math,” Hong says. The claim is not a metaphor. It is a statement about what Lean is. When the Axiom team writes a proof in Lean, they are writing a program. When the Lean compiler accepts the program, a theorem has been proved. When the theorem describes a property of another program, that other program has been verified. The same machinery that confirms the Pythagorean theorem can confirm that a sorting algorithm terminates, or that a cryptographic protocol resists a particular attack, or that a piece of C code cannot leak memory.

This generalization is the reason Axiom’s bet is larger than mathematics. If the verification loop that works for math extends, via the Curry-Howard correspondence, to any formal specification, then the same techniques that proved the Putnam problems can eventually prove that a piece of AI-generated code satisfies its specification. This is what the company means when it describes its mission as “Verified AI.”

But the generalization is not free. It requires that the AI can actually produce these proofs, and formal proofs are hard.

Draft, Sketch, Prove

For most of the history of formal verification, the bottleneck was human labor. Writing a formal proof of a substantive theorem takes expert mathematicians months or years. The classic example is the Kepler conjecture, whose formal proof by Thomas Hales and his team required over a decade of work, using a combination of the HOL Light and Isabelle proof assistants. Until very recently, automating this process was out of reach. Automated theorem proving, the state of the art from the 1960s, used hand-crafted rules that could handle specific subdomains but broke down on anything resembling research mathematics. Interactive theorem proving, where a human mathematician works with the system step by step, was the only approach that scaled.

The current paradigm in AI theorem proving emerged from a 2022 paper called Draft, Sketch, and Prove. The idea is simple. First, an informal model reads the problem and produces a proof outline in natural language. Second, a formalization model translates the outline into Lean, leaving unfilled gaps marked with the keyword sorry. Third, a proving model fills in each sorry with formal steps until the Lean compiler accepts the whole file.

The three stages decompose the problem in a useful way. The informal model handles high-level strategy, which is what language models are good at. The formalization model handles the translation from mathematical English to Lean syntax, which is a structured mapping that can be trained. The proving model handles the detailed case analysis, which can be brute-forced by an ensemble of tactics and external tools.

Several of these external tools deserve mention because they reveal what formal math automation actually looks like. The Lean ecosystem includes a mechanism called Hammer, a deliberate naming choice: when a mathematician encounters a tedious subgoal, they can “hammer” it, invoking a collection of first-order logic automated provers that attempt to close it by exhaustive search. Hammer does not use any AI in the modern sense. It is a deterministic system built on decades of automated theorem proving research, and it solves many small subgoals instantly. A newer tool called Grind, released in 2025, extends this further: it can dispatch an entire class of problems that would previously have required AI, with no neural network in the loop.

Hong’s team has noticed something that most outside observers miss. Many demos from AI-for-math startups show impressive neural solutions to problems that Grind would solve deterministically in a second. The interesting design question is not “can the AI do this” but “what is the right division of labor between the cheap deterministic tools and the expensive probabilistic ones?” Axiom’s architecture leans hard on the deterministic layer whenever it can, and brings the language models in only for the parts the deterministic tools cannot reach. This is a design choice that reflects a specific worldview: not that AI is the solution, but that AI plus machine-checkable verification, used with discipline, is.

System, Not Model

If the architecture is a system of components working together, then what is Axiom’s bet, really? The model? The training? The data?

Hong’s answer is blunt. “I bet system, not model.”

The distinction matters. A model bet is a bet that a particular architecture, trained with a particular objective on particular data, will converge on a capability faster than competitors. Most frontier AI investments are model bets: bigger transformer, more tokens, better reinforcement learning. Axiom’s bet is different. It is a bet that the right abstraction level for mathematical reasoning is not a single model but an orchestrated collection of specialized components, each doing what it is best at, coordinated by a planning layer that decides which component to invoke when.

This architecture has a name in the AI for math literature: orchestrator-with-subagents. Anthropic has published work on subagent systems. David Silver and Richard Sutton have written about learning from experience, where each problem-solving trajectory, successful or not, becomes a training example for the orchestrator. Hong’s team has taken these ideas and applied them to formal proofs with what appear to be unusual results. In their internal measurements, a proof tree that used to require around 40 nodes to reach a solution now regularly extends to 4,000 nodes, explored in parallel by subagents that specialize in different proof strategies, each one contributing a partial result that the orchestrator assembles.

The system bet has a second dimension. The team has observed, to their surprise, that a strong theorem prover transfers to strong code verification. A system trained to produce correct Lean proofs for math problems performs well on Verina, a benchmark for verifying code properties, even though it was never trained on code. The reason, Hong speculates, is structural. A proof in Lean and a verified program in Lean are the same kind of artifact. If the model learns to produce one, it has, by the Curry-Howard correspondence, learned to produce the other. This is the transfer path from mathematics to software.

It is also the point at which Axiom’s addressable market starts to look different from traditional formal verification. The historical market for formal methods was tiny: a few billion dollars, concentrated in mission-critical domains like aerospace and cryptography, where the cost of verification was justified only for systems where failure was catastrophic. Twenty lines of formal proof for every line of code is not a cost that anyone pays voluntarily. If AI can automate that cost away, the market becomes something else entirely. The right-of-first-refusal to verify every line of AI-generated code. Every contract. Every protocol. Every piece of software where anyone anywhere cares whether it does what it claims.

Verified AI

This is what Axiom means by “Verified AI.” It is not a safety feature. It is not a guardrail bolted onto a generation model. It is a reorientation of what an AI system is for.

In Hong’s framing, the generation paradigm answers the question: what is the most plausible output? The verification paradigm answers a different question: what output can I prove is correct? The second question is harder to answer but, when answered, the answer means more. The first question’s answer is statistical. The second question’s answer is deductive.

A specification is a type in the Curry-Howard sense, a formal description of what should be true of an object. A proof that the specification holds is a program. A program is a proof. Definitions become executions, specifications become proofs, and the boundary between the two dissolves.

The consequence Hong draws is that this loop, once it closes, accelerates. A system that can verify its own outputs can also generate new conjectures that it then tries to prove. When the conjectures are difficult and interesting, and when the proving system is fast enough, the rate of mathematical discovery no longer depends on the rate at which human mathematicians can do formal work. It depends on compute. This is the meaning of the phrase “mathematical superintelligence.” It is not that the AI is smarter than individual mathematicians. It is that the bottleneck is no longer human attention.

Recursive Self-Improvement

The final claim in Hong’s thesis is the one that most resembles a venture capital slogan and, in her framing, is not a slogan at all.

“I completely believe recursive self-improvement is achievable in the near term,” she says. The reasoning is specific to the verification loop. Once a system can prove that a proposed improvement to itself preserves a formal specification, it can apply the improvement and trust the result. The classical obstacle to self-improving AI is that nobody can check the improvements. The classical obstacle dissolves when every improvement comes with a proof.

This is the point at which the Verified AI thesis stops being a thesis about a specific company and becomes a thesis about AI development as a whole. If Hong is right, then the path to the systems frontier labs have been describing for a decade does not go through bigger language models. It goes through the closure of the verification loop. And the closure of the verification loop is precisely what formal mathematics was designed for.

It may not work. Formal verification has been “nearly ready” for decades. Lean is still orders of magnitude slower than the engineering workflows it would need to replace. The informal-to-formal translation, which the Draft-Sketch-Prove pipeline depends on, remains unreliable for anything beyond clean textbook problems. Every member of Hong’s team knows these obstacles. Every investor who put money into Axiom’s Series A at $1.6 billion knows them too. The bet is that the obstacles yield to compute and clever system design in the same way previous obstacles in AI have yielded, and that the capability unlocked on the other side is worth the risk.

The alternative bets are visible. OpenAI is working on informal reasoning, driven by a senior researcher’s ambition for scientific discovery that language-model reasoning might eventually support. Google DeepMind has parallel teams working on formal and informal approaches, with AlphaProof as its flagship. Anthropic uses Lean data as a reinforcement learning reward signal but does not treat formal methods as a core strategic direction. Hong’s competitors in the startup space, including a company called Harmonic that has raised $295 million, share her thesis that verification matters, and differ on architectural details.

What distinguishes Axiom is the commitment to the full loop. Not generation with a verifier attached. Not verification as a QA step. But the conjecturing-proving cycle as the primary mechanism of intelligence, with formal mathematics as the domain in which the cycle is cleanest and the transfer to code and beyond as the commercial thesis.

The Putnam result that opened this piece is a demonstration, not a product. Axiom’s product, if it works, will be invisible in the way that a compiler is invisible: something in the background that turns human intent into verified machine outputs, with the guarantee that the output does what the intent specified. The bet that this machinery is worth $1.6 billion is a bet that the AI industry, after a decade of learning to generate, is ready to learn to prove.

Whether the bet pays off will be clearer in a few years. What can be said now is that the object at the center of it is not new. Curry and Howard noticed the correspondence. Hundreds of mathematicians have been writing Lean proofs for a decade. Automated theorem proving as a field is older than deep learning. The difference, in Hong’s claim, is that the ingredients are finally in place for the loop to close.

She is 25 years old. She told her seed-round investors she did not know what her business model would be. They funded her anyway. What they were funding, she thinks, was not the business model. It was the bet that this time the ingredients are ready.

Math is code, and code is math. If that old claim turns out to be operationally true, the implications extend well beyond the conference room in Palo Alto, beyond the mathematicians who spent five years on a conjecture that an AI closed overnight. It extends to every system that is asked, by anyone anywhere, to do a thing correctly. Which, increasingly, is every system.

Next: Part 2 explores the tension between brute force and elegance in mathematical proof, the arrival of Ken Ono, and the question of whether a machine can learn to see what Erdős called “proofs from The Book.”

Avik De

Apr 23

Thanks, I was meaning to find out more about this company. I don’t know Lean well but I always assumed that if it’s like any other programming language, the key is that it is generative. As in, you have primitives and operators and can start just combining them to generate new things. In lean’s case those may become new lemmas or theorems, which are potentially more useful than an LLM just generating a novel. Additionally, like I wrote about in the context of robotics, I’m a big believer in the pairing of language models with tools, so I’m looking forward to seeing what comes of this!

1 reply by Hugo

1 more comment...

Robonaissance

Discussion about this post

Ready for more?