Most state-of-the-art robot learning datasets for manipulation rely on a limited set of modalities: task descriptions, joint states (observations and actions), and synchronized RGB images. While sufficient to train vision-language-action (VLA) models, these signals capture only a fraction of how humans perceive and interact with the world.
In contrast, robot navigation has long embraced richer representations, such as depth, point clouds, and maps, highlighting the importance of multimodal perception. Humans, similarly, do not rely on vision alone: we touch, hear, and continuously estimate distances and forces. Extending manipulation datasets to include such modalities (audio, tactile feedback, or depth) offers a path toward more robust, adaptive, and generalizable robot learning systems.
However, increasing modality diversity is not simply a matter of adding more data streams. It introduces significant challenges in data collection, synchronization, storage, and standardization. Moreover, scaling a dataset along a single axis can quickly lead to inefficiencies that hinder training and usability.
This talk presents the challenges encountered when introducing new modalities into the LeRobot dataset, along with the design decisions made to balance diversity with efficiency. It discusses practical solutions for integrating heterogeneous data while maintaining scalable and usable dataset structures.