via TechCrunch AI
The Dirty, Unglamorous Work of Collecting Robot Training Data—and Why AI Labs Are Paying XDOF to Do It
Two weeks ago, OpenAI announced it would relaunch its robotics program, which it had shuttered in 2021—the latest sign that major AI labs are racing to teach machines to operate in the physical world. But building capable robots requires something the AI industry still lacks: training data comparable to the vast datasets used for language models.
This gap is creating a new kind of infrastructure business. Unlike large language models (LLMs) trained on a sea of publicly available text, robots need data that captures physical interaction—and that kind of data barely exists. YouTube videos and footage from gig workers are often low-fidelity and hard to reconcile with real-world tasks.
## XDOF: Building the Data Pipelines for Physical AI
XDOF (pronounced “ecks-doff”), emerging from stealth today, is betting that the next great bottleneck in AI isn’t models or chips—it’s the data feedback loop needed to teach robots how to interact with the physical world. The startup aims to build the data pipelines, collection tools, and annotation systems that frontier labs and robotics companies can’t easily build themselves. It has raised $70 million from Thrive Capital, Spark Capital, a16z, Lux, and WndrCo. Co-founder and CEO Philipp Wu says XDOF, which has about 60 employees, is already working with 20 customers, including several frontier AI labs (though he cannot name them).
“All of the top labs are trying to pursue robotics,” Wu said. “We’ve already seen some of the downfalls of falling a little bit behind in the language model race … you don’t want to be in this type of situation where you pursue this technology too late. Everyone is in this boat where physical AI is the next frontier.”
Wu ran into this problem as a PhD student at UC Berkeley, where his focus was on enabling robots to learn skills from large-scale datasets. “We didn’t have large-scale data to work with,” he told TechCrunch. “There was this chicken-and-egg problem—we first needed to actually collect data before we could even ask how to train a foundation model for robotics.”
## From GELLO to a Full Data Ecosystem
Wu and his future XDOF co-founder and CTO, Fred Shentu, previously worked on GELLO, a low-cost teleoperation system that lets a human operator control a robotic arm to generate training data. “It ended up becoming a very influential paper in robotics, because a lot of people had similar needs and bottlenecks, and many started leveraging this type of device for data collection,” Wu said.
Spotting the opportunity, Wu, Shentu, and third co-founder and Chief Operating Officer Nemo Jin launched XDOF in October 2024 to provide a data ecosystem for companies pursuing robotics models. Mindful that data provision alone can be a dead-end business, the company is also focused on data cleaning, tooling, and annotation—creating a self-reinforcing feedback loop for robot trainers.
## Releasing the ABC Dataset: A Milestone for Robotics Research
As a starting point, XDOF is partnering with UC Berkeley’s AI Research lab to release ABC, which it believes is the largest collection of high-quality robot training data ever assembled. The dataset includes 130,000 trajectories of robot manipulation data, 300 hours of simulation, and 100 hours of evaluations. Pre-training data at this scale has not previously been available to academia, and it could accelerate research into foundation models for robotics.
By 2026, as more AI labs push into physical-world applications, the demand for structured, high-quality robot training data is expected to surge. XDOF is positioning itself as a critical infrastructure layer in that ecosystem—doing the dirty, unglamorous work of collecting, cleaning, and annotating the data that makes physical AI possible.
