This repository accompanies the research paper EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video.
EgoDex is a large-scale dataset and benchmark for egocentric dexterous manipulation, collected with ARKit on Apple Vision Pro. The dataset contains 829 hours of 30 Hz, 1080p egocentric video with paired 3D pose annotations for the head, upper body, and hands, as well as natural language annotations. It consists entirely of active tabletop manipulation across 194 diverse tasks.
The EgoDex dataset is split into a training set, a test set, and additional data collected after the 99/1 train/test split was frozen for all experimental results; these contain about 725 hours, 7 hours, and 97 hours, respectively. The training set is further divided into 5 zip files for portability. The test set is the smallest zip file, so it is the easiest to download for initial exploration. Alternatively, for users interested in robot deployment, the basic_pick_place task in training set Part 2 contains a large amount of highly diverse pick-and-place data with high-quality language annotations.
The data is available for download at the following URLs:
To download, simply click the links, or use curl from the command line:
curl "https://ml-site.cdn-apple.com/datasets/egodex/test.zip" -o test.zip
unzip test.zip
curl "https://ml-site.cdn-apple.com/datasets/egodex/part1.zip" -o part1.zip
unzip part1.zip
...
Within each zip file are folders named by task, and within each task folder is a set of paired HDF5 and MP4 files. Corresponding files have the same index (e.g., 0.hdf5 and 0.mp4). The pose annotations at each frame of the MP4 file are contained in the corresponding HDF5 file. The files are structured as follows:
part1
└──task1
└──0.hdf5
└──0.mp4
└──1.hdf5
└──1.mp4
...
└──task2
└──0.hdf5
└──0.mp4
...
...
test
└──task1
└──0.hdf5
└──0.mp4
└──task2
...
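As a minimal sketch, the layout above can be walked in Python as follows (assuming the archives have been unzipped into local directories; the iter_episodes helper and the "test" path are illustrative, not part of the repository):
from pathlib import Path

def iter_episodes(data_dir):
    """Yield (task_name, hdf5_path, mp4_path) for every paired episode in one split."""
    data_dir = Path(data_dir)
    for task_dir in sorted(p for p in data_dir.iterdir() if p.is_dir()):
        for h5_path in sorted(task_dir.glob("*.hdf5")):
            mp4_path = h5_path.with_suffix(".mp4")
            if mp4_path.exists():  # skip any unpaired files
                yield task_dir.name, h5_path, mp4_path

for task, h5_path, mp4_path in iter_episodes("test"):  # e.g., the unzipped test split
    print(task, h5_path.name, mp4_path.name)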
Each HDF5 file has the structure below, where N is the number of frames.
camera
└──intrinsic # 3 x 3 camera intrinsic matrix; identical in every file.
transforms # all joint transforms, all below have shape N x 4 x 4.
└──camera
└──leftHand
└──rightHand
└──leftIndexFingerTip
└──leftIndexFingerKnuckle
└──(64 more joints...)
confidences # (optional) scalar joint confidences, all below have shape N.
└──leftHand
└──rightHand
└──(66 more joints...)
If the corresponding MP4 file is T seconds long, then N = 30 * T. The first transform of each joint corresponds to the first frame of the video. The file contains skeletal SE(3) pose data for all joints, estimated using ARKit on visionOS. Note that all transforms (including the camera extrinsics, transforms/camera) are expressed in the ARKit origin frame: a stationary frame on the ground set at the beginning of a recording session. Since this depends on device initialization, this world frame is not necessarily consistent across episodes (though it is stationary during an episode).
Most (but not all!) HDF5 files also contain confidences: for each joint and frame, a scalar value between 0 and 1 indicating how confident the ARKit model is in its prediction. A confidence of zero indicates that the joint is fully occluded or otherwise not detected.
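To make this concrete, here is a minimal sketch (not part of the repository; the file path is hypothetical) that reads one episode with h5py and re-expresses a joint pose in the camera frame rather than the ARKit world frame:
import h5py
import numpy as np

with h5py.File("test/task1/0.hdf5", "r") as f:  # hypothetical path
    intrinsics = f["camera/intrinsic"][:]      # (3, 3), same in every file
    cam_world = f["transforms/camera"][:]      # (N, 4, 4) camera pose in the ARKit world frame
    wrist_world = f["transforms/leftHand"][:]  # (N, 4, 4) left wrist pose in the world frame
    N = cam_world.shape[0]                     # N = 30 * (video length in seconds)

    # Confidences are present in most, but not all, files.
    if "confidences" in f:
        wrist_conf = f["confidences/leftHand"][:]  # (N,); 0 means occluded / not detected
    else:
        wrist_conf = np.ones(N)

# Pose of the wrist relative to the camera: T_cam_joint = inv(T_world_cam) @ T_world_joint.
wrist_in_cam = np.linalg.inv(cam_world) @ wrist_world  # batched (N, 4, 4) matrix product
print(N, intrinsics.shape, wrist_in_cam[0, :3, 3], wrist_conf.min())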
Lastly, language metadata annotations can be accessed through the HDF5 file attributes. In Python, if f is the HDF5 file, you can access this with f.attrs['llm_description']. Reversible tasks also have f.attrs['llm_description2']; in this case, you can determine which description applies to a particular episode with f.attrs['which_llm_description'], which will be either 1 for llm_description or 2 for llm_description2. Note that there may be some errors in llm_description and which_llm_description, as they are auto-generated by an LLM and a VLM, respectively (specifically, GPT-4).
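For example (a minimal sketch with a hypothetical file path):
import h5py

with h5py.File("test/task1/0.hdf5", "r") as f:  # hypothetical path
    description = f.attrs["llm_description"]
    # Reversible tasks carry a second description; use the one flagged for this episode.
    if "llm_description2" in f.attrs and int(f.attrs["which_llm_description"]) == 2:
        description = f.attrs["llm_description2"]

print(description)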
The sample code provided in this repository consists of a few pedagogical examples for how to access and use the data in Python. The purpose of the code is to provide some intuition on how the data may be used rather than a comprehensive codebase for large-scale training. Feel free to adapt it for your use case as desired.
simple_dataset.py: A minimalist PyTorch dataset that loads MP4 files with torchcodec and HDF5 files with h5py. Gives some intuition on how EgoDex data may be loaded into PyTorch.
visualize_2d.py: A script for visualizing the 3D skeletal annotations by re-projecting them into the 2D images. It is an example of converting the pose data from the default ARKit origin frame into the camera frame (which may also be desirable during learning, as in EgoMimic); see the sketch after this list. NOTE: the 2D reprojections may not exactly match the hand joints in the video, but that does not necessarily mean the 3D annotation is wrong; a perspective mismatch is introduced by how the RGB video is synthesized from multiple cameras on Vision Pro.
visualize_3d.py: A script for visualizing the skeletal annotation data in 3D.
compute_metrics.py: A function for evaluating the "best-of-K" distance metrics from the EgoDex paper, which measure the quality of dexterous trajectory predictions and facilitate comparison with the benchmark scores in the paper.
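The core of the 2D reprojection in visualize_2d.py can be sketched as follows (a simplified illustration, not the repository's implementation; it assumes the ARKit camera convention of x-right, y-up, z-backward, so the axes are flipped to the usual computer-vision convention before the pinhole projection):
import numpy as np

# Assumed flip from ARKit camera axes (x right, y up, z backward)
# to the computer-vision convention (x right, y down, z forward).
ARKIT_TO_CV = np.diag([1.0, -1.0, -1.0, 1.0])

def project_joint(T_world_cam, T_world_joint, K):
    """Project one joint's 3D world position into pixel coordinates for a single frame."""
    T_cam_joint = np.linalg.inv(T_world_cam) @ T_world_joint  # joint pose in the camera frame
    p_cam = (ARKIT_TO_CV @ T_cam_joint)[:3, 3]                # 3D position in CV camera axes
    uvw = K @ p_cam                                           # pinhole projection
    return uvw[:2] / uvw[2]                                   # (u, v) in pixels

# Usage with the arrays from the earlier h5py sketch:
# uv = project_joint(cam_world[0], wrist_world[0], intrinsics)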
To run the code, simply set up a new Python environment and install the dependencies:
conda create --name egodex python==3.11
conda activate egodex
conda install -c conda-forge ffmpeg=7.1.1
pip install -r requirements.txt
python visualize_2d.py --data_dir [path to egodex data] # sample script, could also try visualize_3d.py
Here is a non-exhaustive list of third-party projects using EgoDex that you may find useful.
EgoDex Viewer: A live Gradio app hosted on HuggingFace Spaces for quick and easy visualization of the EgoDex test set data.
H-RDT: Human-to-Robotics Diffusion Transformer, a state-of-the-art robot foundation model that pretrains on EgoDex data. Includes open-source code for processing and training on EgoDex data.
Being-H0: a vision-language-action (VLA) model that pretrains on EgoDex and other human datasets by processing the annotations with the MANO hand model.
The code in this repository is released under the terms detailed in LICENSE. The dataset is available under CC BY-NC-ND terms.
If you find this code or data useful, please cite the EgoDex paper:
@misc{egodex,
title={EgoDex: Learning Dexterous Manipulation from Large-Scale Egocentric Video},
author={Ryan Hoque and Peide Huang and David J. Yoon and Mouli Sivapurapu and Jian Zhang},
year={2025},
eprint={2505.11709},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2505.11709},
}