Intention Is All You Need
When machines can perceive, decide, and act, what’s left for us?
“Automating the hard things felt like building a better tool. Automating the easy things feels like building a better us.”
In 2017, a team of eight researchers at Google published a paper with a title that became a meme: “Attention Is All You Need.” The paper introduced the Transformer architecture, which would go on to power ChatGPT, Gemini, Claude, and nearly every major AI system built since. The insight was elegant: instead of processing information sequentially, let the model attend to all parts of the input simultaneously, learning which parts matter most. Selective, learnable, scalable: attention was the mechanism that unlocked everything.
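To make the mechanism concrete, here is a minimal sketch of scaled dot-product attention, the paper's core operation, written in plain Python with NumPy. The names are illustrative; a production Transformer adds learned projections, multiple heads, and masking on top of this.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, per the 2017 paper.

    Q, K, V: arrays of shape (sequence_length, d_k). Every position attends
    to every other position simultaneously; the learned softmax weights are
    how the model decides which parts of the input matter most.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                          # pairwise relevance scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)  # numerically stable softmax
    return weights @ V                                       # weighted blend of the values

# Toy self-attention: four "tokens" of dimension eight attending to one another.
tokens = np.random.randn(4, 8)
output = scaled_dot_product_attention(tokens, tokens, tokens)
```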
Eight years later, attention has been extended from language into the physical world. Foundation models can now process camera feeds, proprioceptive data, and force-torque readings, then output motor commands that make a robot arm fold a towel or pour a cup of coffee. This is the perception-decision-action loop that defines embodied intelligence: the ability to see an environment, figure out what to do, and do it. That loop is being automated. Not fully. Not reliably across all conditions. But unmistakably.
Which raises a question the original paper never anticipated. If attention is all the machines need, what do the humans need?
I want to argue that the answer is intention.
The Reversal Nobody Expected
For decades, we automated the things humans find hard. Chess. Calculus. Protein structure prediction. Medical diagnosis from imaging scans. Each breakthrough was impressive, and each one left our sense of ourselves largely intact. No one looked at Deep Blue beating Kasparov in 1997 and wondered what it meant to be human. The things computers excelled at (logical reasoning, exhaustive search, pattern matching across vast datasets) were things we admired but did not define ourselves by. A person who cannot solve a differential equation does not feel diminished as a person.
Now we are automating the things humans find easy.
Walking. Picking things up. Navigating a cluttered room. Folding laundry. Pouring coffee. Carrying a box from one shelf to another. These are the things we do without thinking, the things so woven into daily existence that we barely notice them. A toddler masters most of them by age three.
They are also the things that evolution spent the longest building into us. The neural systems underlying movement and coordination are hundreds of millions of years old. Hand-eye coordination developed over tens of millions of years. The ability to read a physical situation and respond fluidly is so deeply embedded in our biology that it feels less like a skill and more like a part of who we are. Catching a tossed ball, stepping around an obstacle, steadying a wobbling cup before it falls: we do these things before we are even aware of doing them.
This is what the roboticist Hans Moravec noticed in the 1980s: the tasks we consider intellectually demanding are computationally easy, while the tasks we consider trivially simple are computationally staggering. A chess engine requires elegant algorithms. A robot that can walk across a room requires solving hundreds of equations per second just to stay upright. Moravec’s Paradox, as it came to be known, explained why AI conquered chess decades before it could cross a room.
But there is a corollary to Moravec’s Paradox that is rarely discussed, and it matters more than the original observation.
Automating the hard things felt like building a better tool. Automating the easy things feels like building a better us.
The difference is not economic. It is existential. When a machine does calculus, it performs a function. When a machine walks through a room, picks up an object, and places it somewhere useful, it performs a behavior, the kind of behavior that, for most of human history, was what most humans spent most of their time doing.
Tasks vs. Capabilities
Every previous wave of automation eliminated specific tasks. The spinning jenny eliminated hand spinning. The assembly line eliminated artisanal car production. The spreadsheet eliminated rooms full of human calculators. In each case, the displacement was painful and the transition was unevenly distributed. But the tasks that disappeared were, in the end, tasks: discrete activities that machines could perform faster, cheaper, or more consistently.
What is emerging now is categorically different.
The robot foundation models being built in 2025 and 2026 by companies like Physical Intelligence, Generalist AI, and the robotics divisions of NVIDIA and Amazon are not designed to automate a task. They are designed to automate a capability: the general ability to perceive a physical environment, decide what to do, and act on that decision, across any environment, with any objects, for any purpose.
Physical Intelligence’s π0.5, for instance, is a vision-language-action model that can control a mobile manipulator to clean kitchens and bedrooms it has never seen before, navigating unfamiliar layouts, recognizing objects in new positions, and completing multi-step tasks lasting ten to fifteen minutes. It was not programmed for any of these environments individually. It learned from data, the way a foundation model learns language: by exposure to enough examples that general patterns emerge.
In November 2025, a startup called Generalist AI published a scaling graph showing that their foundation model exhibited a phase transition (a sudden, nonlinear improvement in capability) at around seven billion parameters. The same phenomenon that made large language models suddenly coherent was appearing in physical manipulation. More data and bigger models produced not just incremental improvement but qualitative jumps in what the robot could do.
If this trajectory continues (and billions of dollars in venture capital are betting that it will), the result is not a better loom. It is an attempt to replicate general physical competence itself.
This distinction matters because of what it implies for the human fallback.
When mechanical looms replaced hand weavers in early nineteenth-century England, the displacement was devastating. In Nottinghamshire and Yorkshire, entire communities organized around cottage textile work were dismantled within a decade. But the weavers themselves retained their general physical capability. They could walk to a new factory, learn a new trade, use their hands in a thousand other ways. The machine could only weave. The human could adapt.
What happens when the machine can also adapt? When the thing being automated is not spinning or weaving or calculating but the general capacity to perceive, decide, and act in the physical world, the very capacity that allowed displaced workers to find new work?
The answer is not that humans become useless. It is that the basis of human value shifts, perhaps permanently, away from what we can do.
What Precedes the Loop
Toward what?
There is a clue in the structure of the technology itself.
Modern robot foundation models all operate on the same loop: perceive, decide, act. A single neural network takes in raw sensor data and outputs motor commands directly, end to end, dozens of times per second. The entire edifice of modern robotics research is devoted to making this loop faster, more accurate, and more generalizable.
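In pseudocode, the loop is almost embarrassingly simple. The sketch below is a hypothetical illustration, not any company's actual API: robot, policy, read_sensors, and send_motor_commands are invented placeholders, with policy standing in for the end-to-end network.

```python
import time

CONTROL_HZ = 50  # illustrative; real controllers emit commands tens of times per second

def control_loop(robot, policy):
    """The perceive-decide-act loop, reduced to its skeleton.

    `robot` and `policy` are hypothetical stand-ins: `policy` represents an
    end-to-end network mapping raw observations directly to motor commands.
    """
    period = 1.0 / CONTROL_HZ
    while True:
        observation = robot.read_sensors()     # perceive: cameras, joint angles, force-torque
        action = policy(observation)           # decide: one forward pass, end to end
        robot.send_motor_commands(action)      # act: joint torques or position targets
        time.sleep(period)                     # crude pacing; real systems schedule precisely
```

Everything interesting in robotics research lives inside policy. Everything this essay cares about lives outside the function entirely.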
But there is something the loop does not contain, something it cannot contain, something that must come from outside the system entirely.
The loop does not know why it is running.
A robot that folds laundry does not know why clean laundry matters. A robot that carries boxes does not understand supply chains, consumer demand, or the human desire for convenience. A robot that makes coffee does not understand the social ritual of sharing a drink. The perception-decision-action loop handles the how. The why comes from somewhere else.
That somewhere else is intention.
Intention is the thing that precedes the loop. It is the decision that a particular problem is worth solving, that a particular future is worth building toward, that a particular action has meaning beyond its physical execution. It is what Karol Hausman exercised when he left a secure research position at Google to build Physical Intelligence. Not because an algorithm told him to, but because he believed that general-purpose robot intelligence was possible and important and that he was the right person to build it. It is what Jensen Huang exercised when he pivoted NVIDIA from gaming chips to AI infrastructure years before the market validated the bet, standing on CES stages year after year, waiting for the rest of the industry to catch up with his conviction.
It is, if we are honest, what every significant advance described in the history of robotics was driven by. Not optimization. Not gradient descent. Not scaling laws. But a human being deciding that something mattered and organizing effort (years of effort, careers of effort, sometimes lifetimes of effort) around that decision.
Perception, decision, and action can be replicated in silicon and steel. Intention, the question of what is worth doing, cannot. Not yet, and perhaps not for a very long time. Because intention requires something that no current AI architecture possesses: a stake in the outcome. You have to care about the answer to ask the question.
The Thin Consolation Problem
I can hear the objection already. It is the same objection I would raise if someone else wrote this article.
“Intention” is cold comfort to a warehouse worker whose job is eliminated by a robot that perceives, decides, and acts well enough to replace him. Telling someone that humans still have a monopoly on “meaning” does not pay the rent. The economic displacement is real, the transition will be painful, and the benefits will be unevenly distributed, just as they were with every previous wave of automation, and probably more so, given the speed and breadth of what is coming.
This is a fair objection. I take it seriously. But I want to push back on the implicit assumption behind it: that the only important question about embodied intelligence is the economic one.
The economic question (who benefits, who is displaced, how we manage the transition) is urgent and deserves serious policy attention. But it is not the deepest question, and treating it as the only question impoverishes the conversation.
Here is why.
When mechanical looms displaced weavers, the weavers did not disappear. They adapted. Painfully, unevenly, and over generations, humans migrated to work that the machines could not do: teaching, engineering, design, management, care, art, persuasion, leadership. The range of things that only humans could do expanded even as specific tasks were automated, because automation freed human attention for work that required judgment, creativity, empathy, and social understanding.
This happened not because anyone planned it, but because humans have a remarkable capacity to discover new forms of value when old forms are disrupted. The source of that capacity is not physical strength or manual dexterity, qualities that machines have now matched or will soon match. It is the ability to look at a changed world and ask: What matters now? What is worth doing that was not worth doing before? What new problem has the new technology created that only a human can solve?
That ability is intention in action. And it is what will drive the next adaptation, just as it drove the last one.
The warehouse worker whose job is automated will not be saved by the abstract concept of “intention.” He will need concrete policies, retraining programs, and economic structures that give him the time and resources to find new work. But the kind of new work he finds, and the kind of value he creates, will be determined by something no policy can provide: his own sense of what matters.
What Leonardo Was Really Asking
The question of what separates human intention from mechanical execution is not new. It is, in fact, one of the oldest questions in the history of machines.
In 1495, Leonardo da Vinci sketched designs for a mechanical knight, a suit of armor animated by pulleys and gears, capable of sitting, standing, raising its visor, and moving its arms. We do not know if he ever built it. The sketches survived scattered across his notebooks, not fully understood until modern roboticists reconstructed them five centuries later.
Leonardo was also an anatomist. He dissected human cadavers, at least thirty, procured from hospitals and morgues in Florence and Rome, working by candlelight in rooms that reeked of decay, peeling back layers of muscle with a surgeon’s knife in one hand and a piece of red chalk in the other. He wanted to understand how muscles attached to bone, how tendons transmitted force, how the hand could grip and the arm could reach.
Then he turned around and designed machines that mimicked the motion.
He was not solving an engineering problem. Or rather, the engineering problem was in service of a deeper inquiry. Leonardo wanted to know: What is the difference between a living thing and a machine that moves like one?
Five hundred years later, the question has become urgent in a way he could not have imagined. His mechanical knight was a curiosity, a thought experiment made of wood and rope. The machines being built today can walk through warehouses, fold laundry, make coffee, and, tentatively, improvise when they encounter something they have never seen before. The gap between what a machine can do and what a human can do is narrowing.
And yet.
Leonardo’s knight could sit and stand and raise its visor. But it was Leonardo who wondered what it would be like to build one. He chose to spend years studying anatomy not because anyone required it, but because the question of how bodies work fascinated him. He chose to sketch the knight not because there was a market for mechanical soldiers, but because the boundary between the living and the mechanical haunted him.
The machines we are building are increasingly capable of doing. They are not capable of wondering why they do it.
Attention, Then Intention
Let me close the loop on the title.
“Attention Is All You Need” was a statement about machine architecture. It described the mechanism (learnable, scalable, parallelizable) that unlocked the current era of AI. Attention is the how: how a model decides which inputs matter, how it weighs information, how it allocates computational resources.
Intention is the why. It is what determines that a particular problem deserves attention in the first place. It is what makes attention meaningful rather than merely functional.
Machines have attention. They will increasingly have perception, decision-making, and the ability to act in the physical world. They will fold our laundry, carry our boxes, perhaps eventually tend our gardens and care for our elderly. The scaling laws suggest this. The investment guarantees the attempt.
What they will not have is the capacity to decide that any of this matters. Not in any architecture currently imagined. Not in any scaling curve currently plotted. They will not wonder whether the laundry should be folded or whether the garden should be tended. They will not ask whether the world they are building is the world worth building. They will not care.
We will.
That is not a small thing. In a world increasingly populated by machines that can do, the ability to decide what is worth doing becomes the scarcest and most valuable resource. Not scarce because it is difficult in the way that computation is difficult. Scarce because it requires something no current computational architecture provides: a point of view, a set of values, a stake in the future.
The age of embodied intelligence is arriving. The machines will perceive, decide, and act with increasing competence. The question for us, as professionals, as investors, as citizens, as humans, is not whether we can compete with them at doing.
It is whether we can get better at intending.
Note: The ideas in this article are explored in much greater depth in our book, A Brief History of Embodied Intelligence: From Da Vinci’s Mechanical Knight to Optimus.