Eyes
The Robot's ImageNet Moment
In October 2012, in a conference room in Florence, Italy, the results of the ImageNet challenge were announced. The room was packed with computer vision researchers, most of whom had spent years refining hand-crafted feature detectors—edge histograms, gradient orientations, carefully tuned filters.
When the numbers appeared on screen, the room went quiet.
A team from Toronto had achieved an error rate of 15.3 percent. The second-place system, using traditional methods, had managed 26.2 percent. The gap wasn’t a modest improvement. It was a chasm—the largest single-year improvement in the competition’s history.
The winning system was a neural network called AlexNet, an approach that most researchers in the room had written off years before.
The ImageNet Large Scale Visual Recognition Challenge was an annual competition where teams built systems to classify images into a thousand categories, many of them fine-grained—specific dog breeds, car models, bird species, food items. The best systems in previous years had achieved error rates around 26 percent. AlexNet had shattered that ceiling.
The winning team was led by Geoffrey Hinton, a University of Toronto professor who had spent thirty years advocating for neural networks when almost everyone else had given up on them. His graduate students, Alex Krizhevsky and Ilya Sutskever, had trained the network on two gaming GPUs—graphics cards designed for video games, repurposed for artificial intelligence. The entire setup cost a few thousand dollars.
Within months, every major tech company was scrambling to hire deep learning researchers. The term "deep learning" was itself a strategic rebranding—neural networks had accumulated so much skepticism over the decades that Hinton and his colleagues found it easier to give the field a new name. Within years, the techniques that powered AlexNet would transform not just image recognition but language, speech, and eventually robotics itself.
But for roboticists watching in 2012, the implications were both exciting and sobering. AlexNet had solved a problem that had plagued robot vision for decades: how do you make a machine recognize objects it has never seen before? And yet solving that problem only revealed how much further there was to go.
The Problem of the Long Tail
Traditional robot vision worked like this: an engineer would study an object—say, a coffee mug—and identify its distinctive features. Mugs have handles. They’re cylindrical. They have an opening at the top. The engineer would write code to detect these features, then combine them into a classifier that could recognize mugs.
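To make the recipe concrete, here is a minimal sketch of such a pipeline in Python, using OpenCV's histogram-of-oriented-gradients descriptor and a linear support vector machine. The specific feature and classifier are illustrative choices, not a reconstruction of any particular production system.

```python
# A hand-engineered recognition pipeline in miniature: compute a fixed,
# human-designed feature (gradient-orientation histograms), then train a
# conventional classifier on top of it.
import cv2
import numpy as np
from sklearn.svm import LinearSVC

hog = cv2.HOGDescriptor()  # default 64x128 detection window

def extract_features(image_bgr):
    resized = cv2.resize(image_bgr, (64, 128))           # match the HOG window size
    gray = cv2.cvtColor(resized, cv2.COLOR_BGR2GRAY)
    return hog.compute(gray).flatten()                   # one fixed-length feature vector

def train_mug_classifier(images, labels):
    """images: list of BGR arrays; labels: 1 for 'mug', 0 for anything else."""
    X = np.array([extract_features(img) for img in images])
    return LinearSVC().fit(X, labels)
```

Every step in this pipeline is designed by a person; the classifier only learns how to weight features someone already decided to compute.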
This approach worked, up to a point. A well-engineered system could reliably recognize the objects it was designed for. But the real world contains millions of different objects, and the distribution of those objects follows what statisticians call a long tail.
A few object categories are extremely common—chairs, tables, cups, phones. These sit in the “head” of the distribution, and you can train a robot to recognize them. But the vast majority of object categories are rare. The vintage lamp in your grandmother’s attic. The unusual kitchen gadget you bought on vacation. The child’s art project sitting on the counter. These sit in the “tail” of the distribution, and there are effectively infinite variations.
Traditional feature engineering couldn’t scale to the long tail. Each new object category required a human expert to study it, identify its distinctive features, and write new code. The process was slow, expensive, and fragile—a system trained on white coffee mugs might fail on black ones.
This was why, before 2012, robots operated almost exclusively in controlled environments. In a factory, you could guarantee that the robot would only encounter objects it had been programmed to recognize. In a home, you couldn’t guarantee anything.
The Promise of End-to-End Learning
AlexNet offered a different approach: don’t tell the machine what features to look for. Show it millions of examples and let it figure out what matters.
This is called end-to-end learning: in the context of vision, the network learns a mapping directly from raw pixels to a classification, with no hand-crafted features in between. (Later, roboticists would extend the idea further, building systems that learned from pixels all the way to motor commands.) Instead of a pipeline in which humans design features and then train a classifier on those features, you feed raw pixels directly into a neural network and let the network learn everything—which edges matter, which colors matter, which shapes matter, how to combine them into recognizable objects.
The key insight was that neural networks could discover features that humans would never think to look for. The early layers of AlexNet learned to detect edges and color gradients—features that human engineers had long used. But the deeper layers learned increasingly abstract patterns that had no obvious names. They weren’t “handles” or “cylinders.” They were complex combinations of visual elements that the network had discovered were useful for classification.
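In code, what is striking about the new approach is mostly what is absent. The sketch below is a toy convolutional classifier in PyTorch, far smaller than AlexNet's five convolutional and three fully connected layers, but structurally the same idea: pixels go in, class scores come out, and every filter in between is learned from data rather than designed by hand.

```python
import torch
import torch.nn as nn

# A toy end-to-end image classifier: raw pixels in, class scores out.
# No hand-crafted features anywhere; all the filters are learned during training.
class TinyConvNet(nn.Module):
    def __init__(self, num_classes=1000):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),  # early layers tend to learn edges and color blobs
            nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(32, 64, kernel_size=3, padding=1),           # deeper layers learn more abstract patterns
            nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, pixels):                    # pixels: (batch, 3, height, width)
        x = self.features(pixels).flatten(1)      # (batch, 64) learned representation
        return self.classifier(x)                 # (batch, num_classes) class scores

scores = TinyConvNet()(torch.randn(1, 3, 224, 224))   # one fake RGB image
```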
For robotics, this meant something profound. A robot no longer needed an engineer to tell it how to recognize every possible object. It just needed data—lots of data—and a powerful enough network to learn from that data.
The long tail hadn’t disappeared. But now you could address it by collecting more examples rather than writing more code.
From Recognition to Understanding
There was just one problem: recognizing what something is and understanding what to do with it are very different tasks.
AlexNet could look at an image and say “coffee mug” with high confidence. But a robot trying to pick up that mug needs to know much more. Where exactly is the mug? What’s its orientation? Which way is the handle facing? Is there liquid inside? Is the mug sitting on a stable surface or balanced precariously on the edge of a table?
Image classification—the task AlexNet solved—answers the question “what is this?” Robot manipulation requires answers to “where is it?”, “how is it oriented?”, “what’s around it?”, and “how should I approach it?”
This progression from recognition to understanding drove the next wave of computer vision research. Object detection systems learned to draw bounding boxes around objects, localizing them in space. Semantic segmentation systems learned to label every pixel in an image, distinguishing the mug from the table from the background. Pose estimation systems learned to infer the three-dimensional orientation of objects from two-dimensional images.
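These richer outputs are now available off the shelf. As an illustration (using today's torchvision library rather than the systems of that era), a pretrained Mask R-CNN returns a class label, a bounding box, and a per-pixel mask for every object it finds; estimating full three-dimensional pose would still require a further step not shown here.

```python
import torch
import torchvision

# Off-the-shelf object detection plus instance segmentation: for each object the
# model answers "what is it?" (label), "where is it?" (box), and "which pixels
# belong to it?" (mask). Requires a recent torchvision (0.13 or later).
model = torchvision.models.detection.maskrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)        # stand-in for a real RGB image, values in [0, 1]
with torch.no_grad():
    output = model([image])[0]         # the model takes a list of images

boxes  = output["boxes"]               # (N, 4) corner coordinates, one row per detection
labels = output["labels"]              # (N,) indices into the COCO category list
scores = output["scores"]              # (N,) confidence per detection
masks  = output["masks"]               # (N, 1, H, W) soft per-pixel masks
```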
Each of these advances brought vision systems closer to what robots actually needed. But there was still a fundamental gap: images are two-dimensional, and the world is three-dimensional. Depth can be inferred from 2D images, but in the early 2010s, these techniques were either computationally expensive or not yet reliable enough for real-time robot manipulation.
The Kinect Revolution
The solution came from an unexpected source: a video game accessory.
In November 2010, Microsoft released the Kinect, a motion-sensing device for the Xbox 360. The Kinect projected a pattern of infrared dots onto the room and measured how that pattern deformed to compute a depth map, allowing players to control games with their body movements. It sold eight million units in its first sixty days, making it the fastest-selling consumer electronics device in history at the time.
Microsoft was selling entertainment. Robotics researchers saw something else entirely: cheap, reliable depth sensing.
Before the Kinect, depth sensors were expensive research equipment—tens of thousands of dollars for a decent unit. The Kinect cost $150. Within months of its release, researchers had reverse-engineered the device and written open-source drivers that let them use it for robotics applications.
The impact was immediate. Suddenly, any lab could afford to give their robots depth perception. A graduate student with a laptop and a Kinect could experiment with three-dimensional vision systems that would have required specialized equipment and substantial funding just a year earlier.
The Kinect captured what’s called RGB-D data: regular color images (RGB) plus depth information (D) for every pixel. This combination turned out to be exactly what robots needed. The color image provided rich visual information for recognition. The depth map provided the spatial information needed for manipulation.
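The pairing is also straightforward to use. Given the camera's intrinsic parameters (focal lengths and optical center; the values below are rough, Kinect-like numbers, used here only for illustration), every depth pixel can be back-projected into a three-dimensional point in the camera's frame, which is the representation a grasp planner actually needs. A minimal sketch:

```python
import numpy as np

# Back-project a depth image into 3D points using the pinhole camera model.
# fx, fy are focal lengths in pixels; (cx, cy) is the optical center.
FX, FY, CX, CY = 525.0, 525.0, 319.5, 239.5   # illustrative values for a 640x480 sensor

def depth_to_points(depth, fx=FX, fy=FY, cx=CX, cy=CY):
    """depth: (H, W) array of depths in meters -> (H*W, 3) points in the camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))   # per-pixel image coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)
```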
Point Clouds and PointNet
Depth sensors don’t hand you an image in the usual sense of the word. Once the measurements are back-projected, what you work with is a point cloud—a collection of thousands or millions of points in three-dimensional space, each marking a spot where the sensor detected a surface.
Point clouds are messy. Unlike images, which have a regular grid structure, point clouds are unordered and irregular. The points are scattered through space wherever surfaces happen to be. Two scans of the same object from different angles will produce completely different point arrangements.
This messiness posed a problem for deep learning. Neural networks like AlexNet were designed for regular grids—they expected inputs arranged in neat rows and columns. Point clouds had no such structure.
In 2017, a Stanford research team led by Charles Qi introduced PointNet, a neural network architecture designed specifically for point clouds. The key insight was to process each point independently first, then aggregate information across all points. This made the network invariant to the order of points—it didn’t matter how the cloud was arranged, only what surfaces were present.
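The idea fits in a few lines. The sketch below is a stripped-down, PointNet-style classifier in PyTorch, omitting the learned alignment transforms of the published architecture: the same small network is applied to every point, and a max over points collapses them into a single feature vector that cannot depend on their order.

```python
import torch
import torch.nn as nn

# PointNet's core idea in miniature: a shared per-point network followed by a
# symmetric pooling operation (max), so the result is invariant to point order.
class MiniPointNet(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.per_point = nn.Sequential(          # identical weights applied to every point
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 256), nn.ReLU(),
        )
        self.head = nn.Linear(256, num_classes)

    def forward(self, points):                           # points: (batch, num_points, 3)
        features = self.per_point(points)                # (batch, num_points, 256)
        global_feature = features.max(dim=1).values      # pool across points
        return self.head(global_feature)                 # (batch, num_classes)

logits = MiniPointNet()(torch.randn(1, 2048, 3))         # one cloud of 2,048 xyz points
```

Shuffling those 2,048 points changes nothing about the output, which is exactly the property an unordered point cloud demands.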
PointNet and its successors finally brought the power of deep learning to three-dimensional perception. Robots could now learn to recognize objects, estimate poses, and segment scenes directly from depth data, without first converting the points into artificial image-like representations such as voxel grids or rendered views.
The Berkeley Experiments
While vision researchers were improving recognition, a group at UC Berkeley was asking a different question: could robots learn to see and act at the same time?
Pieter Abbeel had arrived at Berkeley in 2008 with a background in machine learning and a fascination with robotics. His early work focused on learning from demonstration—having robots watch humans perform tasks and then imitate them. But he increasingly believed that robots needed to learn from their own experience, not just human examples.
In 2015, Sergey Levine, then a postdoctoral researcher working with Abbeel, commandeered a corner of the Berkeley robotics lab and set up something that looked like an assembly line from a fever dream.
Fourteen identical robot arms stood in a row, each facing a plastic bin filled with random objects—rubber ducks, tape dispensers, foam blocks, kitchen utensils, anything Levine and his labmates could grab from around the building. Above each arm, a camera pointed down at the bin. The arms whirred and clicked continuously, day and night, attempting to pick things up.
Most attempts failed. Arms closed on empty air. Objects slipped from grippers. Bins got knocked over. The lab filled with the sound of plastic clattering and motors resetting. Visitors found it equal parts fascinating and unnerving—a room full of robots failing over and over, with no human intervention.
“The first few days, it was honestly depressing to watch,” Levine later recalled. “They couldn’t pick up anything. But we knew the data was accumulating.”
The robots had cameras but no hand-crafted vision systems. They had arms but no pre-programmed grasping strategies. They had to learn everything from scratch. Each grasp attempt, successful or not, fed into a neural network that was slowly learning which visual patterns predicted a successful grasp.
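In spirit, though not in the published architecture, the model being trained looked something like the sketch below: a network that takes the camera image and a candidate gripper motion and predicts the probability that the grasp will succeed, trained on the robots' own attempts, labeled one for success and zero for failure. The motion parameterization and layer sizes here are illustrative assumptions.

```python
import torch
import torch.nn as nn

# A grasp-outcome predictor in miniature: score a proposed gripper motion given
# what the camera currently sees. Training data comes from the robots' own
# trial-and-error attempts, each labeled by whether the grasp held.
class GraspSuccessPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        self.image_encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),        # -> (batch, 64)
        )
        self.scorer = nn.Sequential(
            nn.Linear(64 + 4, 64), nn.ReLU(),             # image features + proposed motion
            nn.Linear(64, 1),
        )

    def forward(self, image, motion):
        # image: (batch, 3, H, W) overhead camera view
        # motion: (batch, 4), e.g. a displacement in x, y, z plus a wrist rotation
        features = self.image_encoder(image)
        return torch.sigmoid(self.scorer(torch.cat([features, motion], dim=1)))
```

At run time the robot can propose many candidate motions, score them all, and execute the one the network believes is most likely to succeed.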
The experiment ran for two months. After 800,000 grasp attempts—the equivalent of several human lifetimes of practice—something remarkable emerged. The robots had learned to pick up novel objects with about 80 percent success. They could grasp things they had never seen before, approaching them from angles that made sense given their shape and position. No one had programmed these strategies. The robots had discovered them.
The experiment demonstrated something important: you didn’t need to separate vision and action. A single neural network could learn to connect what the robot saw directly to how it should move. This end-to-end approach—from pixels to motor commands—would become central to the next generation of robot learning.
Visuomotor Learning
The Berkeley grasping experiments exemplified what researchers call visuomotor learning: training systems that connect visual perception directly to motor control.
In traditional robotics, vision and action were separate modules. A vision system would analyze the scene and output a symbolic description: “there is a red mug at position X, Y, Z with orientation θ.” A planning system would take that description and compute a motion plan. A control system would execute the plan.
This modular approach had advantages—you could improve each component independently—but it also had a fundamental limitation. The symbolic description was a bottleneck. It could only include information that the engineer thought to extract. If the vision system didn’t report that the mug was wet, the planning system couldn’t account for the reduced friction.
Visuomotor learning eliminated the bottleneck. The neural network learned its own internal representation of the scene—whatever features it found useful for the task at hand. If wetness mattered for grasping, the network could learn to detect it, even if no human had ever told it to look for wetness.
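A minimal sketch makes the contrast with the modular pipeline visible: in a visuomotor policy there is simply no stage where a symbolic scene description exists to act as a bottleneck. The architecture below is an illustrative toy, not any particular published system.

```python
import torch
import torch.nn as nn

# A toy visuomotor policy: one network from pixels straight to motor commands.
# Whatever scene properties matter for the task live inside the learned features,
# not in a hand-specified list of object attributes.
class VisuomotorPolicy(nn.Module):
    def __init__(self, num_joints=7):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.policy = nn.Sequential(
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, num_joints),      # one command per joint, e.g. a velocity
        )

    def forward(self, image):                # image: (batch, 3, H, W) from the robot's camera
        return self.policy(self.encoder(image))

commands = VisuomotorPolicy()(torch.randn(1, 3, 240, 320))   # (1, 7) motor commands
```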
This didn’t mean the representations were better than hand-crafted features in every way. They were often harder to interpret, more data-hungry, and less predictable. But they were more flexible, more scalable, and—crucially—they could improve with more data rather than more engineering.
What Robots Could See by 2020
By the end of the decade, the combination of deep learning, depth sensing, and end-to-end training had transformed what robots could perceive.
Object recognition was essentially solved for common objects in good conditions. A well-trained network could identify thousands of object categories with superhuman accuracy—at least when the lighting was reasonable and the object was clearly visible.
Pose estimation had advanced to the point where robots could infer the three-dimensional position and orientation of objects well enough for many manipulation tasks. Systems could estimate how to grasp a mug even when they’d never seen that specific mug before.
Scene understanding had progressed from recognizing individual objects to understanding spatial relationships. Robots could parse a cluttered desk, distinguishing the laptop from the papers from the coffee cup and understanding roughly how they were arranged.
Depth sensing had become cheap and ubiquitous. The Kinect was discontinued in 2017, but by then dozens of alternatives existed. Intel’s RealSense cameras cost under $200 and offered better resolution than the original Kinect. Lidar sensors, once found mainly on self-driving cars, were becoming affordable for indoor robots.
And visuomotor learning had demonstrated that robots could learn to connect perception directly to action, acquiring manipulation skills through experience rather than programming.
The Limits of Seeing
But perception, however sophisticated, wasn’t enough.
Consider a robot tasked with clearing a table after dinner. Even with perfect vision—complete knowledge of every object’s position, orientation, and identity—the robot still faces profound challenges.
Which objects should be cleared? The empty plates, certainly. But what about the half-full water glass? The napkin the host might still be using? The centerpiece that’s supposed to stay?
In what order should objects be cleared? The tall candlestick might block access to the plate behind it. The sauce boat might spill if jostled. The dirty dishes should probably go in the sink, but the glasses should go somewhere else.
How should each object be handled? The fine china requires gentle treatment. The cast-iron pan can withstand more force. The sharp knife should be held by the handle, not the blade.
These questions can’t be answered by perception alone. They require understanding context, social norms, physical consequences, and the goals of the task. They require, in a word, intelligence.
The deep learning revolution had given robots eyes. Now they needed brains.
From Seeing to Understanding
The progression from ImageNet to robot manipulation followed a pattern that would repeat throughout the field.
First came a breakthrough in a narrow capability—image classification. Then came extensions to related capabilities—object detection, segmentation, pose estimation. Then came integration with other systems—connecting vision to motor control.
At each step, the community discovered that solving one problem revealed the next. Recognition without localization wasn’t enough. Localization without depth wasn’t enough. Depth without manipulation skills wasn’t enough. Manipulation without understanding wasn’t enough.
This cascade of “not enough” might seem discouraging. But it also represented genuine progress. Each step up the ladder was necessary groundwork for the next. You couldn’t tackle robot understanding without first having robot perception. The ImageNet moment didn’t give robots minds. But it gave them the foundation on which minds could eventually be built.
By 2020, robots could see better than they ever had before. The question now was whether they could learn to think.
Notes & Further Reading
On ImageNet and AlexNet: The original paper, “ImageNet Classification with Deep Convolutional Neural Networks” by Krizhevsky, Sutskever, and Hinton (2012), is accessible to readers with some technical background. For the broader story, Cade Metz’s Genius Makers (2021) provides excellent narrative coverage of Hinton and the deep learning revolution.
On the history of computer vision: Fei-Fei Li’s memoir The Worlds I See (2023) tells the story of ImageNet’s creation from its creator’s perspective. For technical history, Szeliski’s Computer Vision: Algorithms and Applications (2nd edition, 2022) provides comprehensive coverage.
On robot perception: The papers from the Berkeley robotics lab provide the best primary sources. “Learning Hand-Eye Coordination for Robotic Grasping with Deep Learning and Large-Scale Data Collection” by Levine et al. (2018) describes the fourteen-robot grasping experiment. For Pieter Abbeel’s research trajectory, his talks at NeurIPS and ICRA conferences, many available on YouTube, offer accessible overviews.
On PointNet and 3D deep learning: “PointNet: Deep Learning on Point Sets for 3D Classification and Segmentation” by Qi et al. (2017) introduced the architecture. For a gentler introduction, Charles Qi’s thesis work at Stanford is well-documented in blog posts and talks.
On depth sensing and the Kinect: The story of Kinect’s impact on robotics research is told in various IEEE Spectrum articles from 2011-2012. For technical details on RGB-D perception, the work by Dieter Fox’s group at University of Washington provides excellent examples.
On visuomotor learning: Sergey Levine’s research page maintains an extensive list of publications on end-to-end robot learning. Chelsea Finn’s thesis on meta-learning for robotics, available online, provides a comprehensive treatment of learning to learn in robotic systems.
On the limits of perception: Rodney Brooks’ essay “Intelligence Without Representation” (1991) anticipated many of the challenges that would emerge as perception improved. His critiques remain relevant for understanding what perception alone cannot achieve.