ADL4D - 4D Human Activity Dataset
Towards a Contextually Rich Dataset for 4D Activities of Daily Living
Overview
ADL4D bridges the gap between isolated object interactions and complex, real-world activity planning.
While previous benchmarks focused on single “atomic” actions (like holding a mug), ADL4D captures the Action Plan: the messy, continuous flow where a user opens a fridge, moves items, pours milk, and hands it to a partner. This project introduces a large-scale dataset of two-subject, multi-object interactions and a novel machine vision pipeline to annotate them.
Technical Innovation: Automated ReID for Triangulation
Annotating heavily occluded scenes in an 8-15 camera setup regularly runs into “degenerate” geometric configurations.
The Problem: Ghost Hands
Standard triangulation relies on epipolar geometry: if you see a point in Camera A, its corresponding point in Camera B must lie on a specific line, the epipolar line.
- Failure Mode: Epipolar matching fails when a hand landmark in one view falls on the epipolar line generated by a different hand (the subject’s other hand or a second subject’s hand) seen in another view, especially when the target hand is occluded in the first view. This degenerate case recurs frequently in sparse multi-camera setups that try to cover a 360-degree scene, and it is aggravated by our focus on hands, multiple subjects, and complex inter-subject interactions.
- Result: Naive triangulation matches these unrelated points and creates “Ghost Hands”: 3D clusters that vanish or explode when the subjects move.
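The ambiguity can be reproduced with a toy two-camera rig. The sketch below is illustrative only and is not the ADL4D pipeline: the intrinsics `K`, the 0.5 m baseline, and the three hand positions are made-up values, and `epipolar_distance` is a hypothetical helper. Two hands that lie on the same epipolar plane both satisfy the epipolar constraint for a single detection in camera A, so geometry alone cannot tell which one is the true match.

```python
import numpy as np

def skew(t):
    """Cross-product matrix [t]_x."""
    return np.array([[0.0, -t[2], t[1]],
                     [t[2], 0.0, -t[0]],
                     [-t[1], t[0], 0.0]])

def project(P, X):
    """Project a 3D point with a 3x4 camera matrix, returning pixel coordinates."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def epipolar_distance(F, x_a, x_b):
    """Pixel distance from x_b (view B) to the epipolar line of x_a (view A)."""
    line = F @ np.append(x_a, 1.0)
    return abs(line @ np.append(x_b, 1.0)) / np.hypot(line[0], line[1])

# Assumed toy rig: identical intrinsics, camera B shifted 0.5 m along x.
K = np.array([[800.0, 0.0, 320.0], [0.0, 800.0, 240.0], [0.0, 0.0, 1.0]])
R = np.eye(3)
t = np.array([-0.5, 0.0, 0.0])
P_a = K @ np.hstack([np.eye(3), np.zeros((3, 1))])
P_b = K @ np.hstack([R, t[:, None]])
K_inv = np.linalg.inv(K)
F = K_inv.T @ skew(t) @ R @ K_inv  # fundamental matrix mapping view A to view B

hand_1 = np.array([0.2, 0.10, 2.0])  # target hand
hand_2 = np.array([0.6, 0.15, 3.0])  # different hand on the same epipolar plane
hand_3 = np.array([0.2, 0.40, 2.0])  # different hand off that plane

x_a = project(P_a, hand_1)           # detection of the target hand in camera A
dist_true = epipolar_distance(F, x_a, project(P_b, hand_1))
dist_ghost = epipolar_distance(F, x_a, project(P_b, hand_2))
dist_other = epipolar_distance(F, x_a, project(P_b, hand_3))
print(dist_true, dist_ghost, dist_other)
```

Both `dist_true` and `dist_ghost` come out at essentially zero pixels, while the off-plane hand is rejected. The rays of the ghost pairing still intersect, so triangulating it gives a 3D point with no reprojection error in these two views.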
The Solution: Robust 3D Hand Identification
We developed a Dynamic Matching algorithm that treats triangulation as a Re-Identification (ReID) problem rather than just a geometric one.
- Subspace Clustering: Instead of matching points directly, we generate all possible 3D candidates (including valid hands and ghost hands) and cluster them in 3D space.
- Temporal Consistency (Tracking Mode): We propagate the unique identity of a hand cluster from previous frames. If a hand is visible in only 2 cameras (normally insufficient for stable clustering), our algorithm uses the projected trajectory from the previous frame to “lock” the identity.
- Human-in-the-Loop: We built a custom GUI where a human validator can visually verify the “locked” tracks. Our entire test set is annotated with human supervision. The training and validation sets are annotated automatically, without human supervision, and the pipeline masks out frames where it was unable to cluster and triangulate the hand correctly (usually a very small number of frames).
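The clustering and tracking steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the candidate arrays stand in for 3D points triangulated from every cross-view pair of detections, the greedy radius clustering is a stand-in for whatever clustering the pipeline uses, and `cluster_candidates`, `lock_identity`, the 5 cm radius, the 10 cm jump limit, and the velocity value are all hypothetical.

```python
import numpy as np

def cluster_candidates(points, radius=0.05):
    """Greedy radius clustering of 3D candidates (metres).

    Returns a list of (centroid, support) tuples, largest support first.
    """
    points = np.asarray(points, dtype=float)
    unassigned = np.ones(len(points), dtype=bool)
    clusters = []
    while unassigned.any():
        idx = np.flatnonzero(unassigned)
        remaining = points[idx]
        dists = np.linalg.norm(remaining[:, None] - remaining[None, :], axis=-1)
        neighbours = dists < radius
        seed = neighbours.sum(axis=1).argmax()   # densest remaining candidate
        members = idx[neighbours[seed]]
        clusters.append((points[members].mean(axis=0), len(members)))
        unassigned[members] = False
    return sorted(clusters, key=lambda c: -c[1])

def lock_identity(clusters, predicted, max_jump=0.10):
    """Return the centroid closest to the predicted position, or None."""
    best, best_dist = None, max_jump
    for centroid, _support in clusters:
        dist = np.linalg.norm(centroid - predicted)
        if dist < best_dist:
            best, best_dist = centroid, dist
    return best

# Frame t: the hand is seen by 4 cameras, so 6 pairwise candidates agree;
# ghost candidates are isolated and end up in clusters with support 1.
hand = np.array([0.20, 0.10, 2.00])
offsets = 0.004 * np.vstack([np.eye(3), -np.eye(3)])
ghosts = np.array([[0.75, 0.375, 7.50], [-0.50, 0.20, 1.20]])
frame_t = np.vstack([hand + offsets, ghosts])
clusters_t = cluster_candidates(frame_t)
position_t = clusters_t[0][0]

# Frame t+1: only 2 cameras see the hand, so every candidate has support 1
# and cluster size alone cannot separate real hands from ghosts.
frame_t1 = np.array([[0.23, 0.10, 2.00], [0.60, 0.15, 3.00],
                     [0.41, 0.12, 2.40], [0.10, 0.02, 1.10]])
clusters_t1 = cluster_candidates(frame_t1)
velocity = np.array([0.02, 0.0, 0.0])  # assumed estimate from earlier frames
locked = lock_identity(clusters_t1, position_t + velocity)
print(clusters_t[0][1], locked)
```

In the well-observed frame the real hand wins on support alone. In the two-camera frame the prediction from the previous frame is what selects the candidate, which is the role the text assigns to Tracking Mode.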
Achievements & Impact
The primary contribution of ADL4D is the Quality and Variety of the data. By capturing long-form interactions, we generate poses that standard “atomic” datasets miss.
- Absolute Pose Accuracy: Validated on external datasets (H2O, DexYCB).
- Scale: 1.1 million frames of annotated RGB-D data.
- Diversity: Includes “in-between” actions—transitions, handovers, and idle adjustments—that are critical for training robust robots.
- Annotation Robustness: In a stress test using off-the-shelf MediaPipe detections, clustering without our tracking method skipped 1088 frames on ADL4D, whereas our robust tracking method reduced this to just 22 skipped frames. On the H2O dataset, the absolute reduction is even larger (6302 skipped frames vs. 213).
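For reference, the skipped-frame counts quoted above translate into relative reductions as follows. The snippet only does arithmetic on the four numbers from the text; the variable names are ours.

```python
# (skipped frames without tracking, skipped frames with tracking)
skipped = {"ADL4D": (1088, 22), "H2O": (6302, 213)}

reductions = {}
for name, (before, after) in skipped.items():
    reductions[name] = 100.0 * (before - after) / before
    print(f"{name}: {before} -> {after} skipped frames "
          f"({reductions[name]:.1f}% fewer)")
```

This gives roughly 98.0% fewer skipped frames on ADL4D and 96.6% fewer on H2O.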
Pose Variety
Our dataset covers a significantly wider distribution of hand poses than existing benchmarks (H2O, DexYCB).
Absolute Pose Accuracy
We validated the robustness of our annotation pipeline by running it on external datasets (H2O, DexYCB), where it achieves state-of-the-art accuracy.
Metrics
| Dataset | abs MPJPE (mm) | AUC |
|---|---|---|
| H2O | 5.36 | 0.8930 |
| DexYCB | 8.56 | 0.8651 |
Our “Tracking Mode” with the Reprojection (Repr) criterion achieves the lowest error.
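For readers unfamiliar with the two columns, the sketch below shows one common way to compute them. It is an assumption-laden illustration rather than the evaluation code behind the table: absolute MPJPE is taken as the mean Euclidean joint error without root alignment, and AUC as the normalised area under the PCK curve over thresholds from 0 to 50 mm. The threshold range is our assumption, since the text does not state it, and `abs_mpjpe` and `pck_auc` are hypothetical names.

```python
import numpy as np

def abs_mpjpe(pred, gt):
    """Mean per-joint position error in mm, without root alignment.

    pred, gt: arrays of shape (frames, joints, 3) in absolute coordinates.
    """
    return np.linalg.norm(pred - gt, axis=-1).mean()

def pck_auc(pred, gt, max_threshold=50.0, steps=100):
    """Area under the PCK curve, normalised to [0, 1].

    The 0..max_threshold range (mm) is an assumption, not from the source.
    """
    errors = np.linalg.norm(pred - gt, axis=-1).ravel()
    thresholds = np.linspace(0.0, max_threshold, steps)
    pck = np.array([(errors <= th).mean() for th in thresholds])
    # Trapezoidal rule over the threshold axis.
    area = np.sum((pck[1:] + pck[:-1]) / 2.0 * np.diff(thresholds))
    return area / max_threshold

# Toy data: 2 frames, 21 joints, every joint off by exactly 5 mm.
gt = np.zeros((2, 21, 3))
pred = gt + np.array([3.0, 4.0, 0.0])
mpjpe = abs_mpjpe(pred, gt)
auc = pck_auc(pred, gt)
print(mpjpe, auc)
```

With a constant 5 mm error the MPJPE is 5 mm and the AUC is close to 0.9, because the PCK curve jumps from 0 to 1 at one tenth of the assumed threshold range.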
Qualitative
Our automated pipeline generates annotations that closely match ground truth, even in challenging dynamic scenarios.
Downstream Tasks
We benchmarked the dataset on three critical computer vision tasks.
1. Hand Mesh Recovery (HMR)
HMR Model Quality
We achieved high-quality hand reconstruction results using ADL4D.
Cross-Dataset Generalization
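Models trained on ADL4D generalize better to unseen datasets: when tested on H2O, training on ADL4D instead of DexYCB lowers MPJPE from 44.96 mm to 32.76 mm, a reduction of about 27%.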
| Train Set | Test Set | MPJPE (mm) |
|---|---|---|
| DexYCB | H2O | 44.96 |
| ADL4D | H2O | 32.76 |
2. Hand Action Segmentation
Using the precise 3D pose history from ADL4D improves action segmentation. Pose features reach the highest frame accuracy (57.15%) and lead on every other metric in the table, as they remain invariant to the dynamic background motion inherent in multi-view capture systems, whereas standard video features (I3D, X3D) struggle.
| Features | Acc. | Edit | F1@10 | F1@25 | F1@50 |
|---|---|---|---|---|---|
| I3D | 32.77 | 41.66 | 24.59 | 18.21 | 7.12 |
| X3D | 28.99 | 34.15 | 28.50 | 19.27 | 6.85 |
| SF | 45.02 | 40.78 | 36.73 | 28.86 | 16.94 |
| Pose (ADL4D) | 57.15 | 53.19 | 56.77 | 50.89 | 35.81 |
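The columns in this table are the usual temporal action segmentation metrics. The sketch below follows their commonly used definitions (frame accuracy, segmental edit score based on Levenshtein distance over segment labels, and segmental F1 at an IoU threshold); it is not the evaluation script used for the table, and the helper names and toy label sequences are ours.

```python
def segments(labels):
    """Collapse frame-wise labels into (label, start, end) runs, end exclusive."""
    segs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            segs.append((labels[start], start, i))
            start = i
    return segs

def frame_accuracy(pred, gt):
    return 100.0 * sum(p == g for p, g in zip(pred, gt)) / len(gt)

def edit_score(pred, gt):
    """100 * (1 - normalised Levenshtein distance between segment label sequences)."""
    p = [s[0] for s in segments(pred)]
    g = [s[0] for s in segments(gt)]
    D = [[0] * (len(g) + 1) for _ in range(len(p) + 1)]
    for i in range(len(p) + 1):
        D[i][0] = i
    for j in range(len(g) + 1):
        D[0][j] = j
    for i in range(1, len(p) + 1):
        for j in range(1, len(g) + 1):
            cost = 0 if p[i - 1] == g[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1, D[i][j - 1] + 1, D[i - 1][j - 1] + cost)
    return 100.0 * (1.0 - D[-1][-1] / max(len(p), len(g)))

def f1_at_k(pred, gt, overlap):
    """Segmental F1: a predicted segment is a true positive if its IoU with an
    unmatched ground-truth segment of the same label reaches `overlap`."""
    p_segs, g_segs = segments(pred), segments(gt)
    matched = [False] * len(g_segs)
    tp = fp = 0
    for label, ps, pe in p_segs:
        best_iou, best_j = 0.0, -1
        for j, (g_label, gs, ge) in enumerate(g_segs):
            if g_label != label:
                continue
            inter = max(0, min(pe, ge) - max(ps, gs))
            union = max(pe, ge) - min(ps, gs)
            if inter / union > best_iou:
                best_iou, best_j = inter / union, j
        if best_j >= 0 and best_iou >= overlap and not matched[best_j]:
            tp, matched[best_j] = tp + 1, True
        else:
            fp += 1
    fn = len(g_segs) - tp
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 100.0 * 2 * precision * recall / (precision + recall) if tp else 0.0

gt = list("aaaabbbbcccc")
shifted = list("aaabbbbbcccc")        # one boundary off by a frame
oversegmented = list("aabaabbbbccc")  # spurious short segments
for name, pred in [("shifted", shifted), ("oversegmented", oversegmented)]:
    print(name, frame_accuracy(pred, gt), edit_score(pred, gt), f1_at_k(pred, gt, 0.5))
```

Frame accuracy alone does not penalise over-segmentation strongly, which is why the table reports Edit and F1@{10, 25, 50} alongside it: in the toy example the over-segmented prediction keeps a high frame accuracy but scores much lower on Edit and F1@50.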
3. Zero-Shot Object Pose Tracking
We evaluated zero-shot object pose tracking methods on ADL4D test sequences.
| Model | ADD | ADD-S |
|---|---|---|
| FoundationPose | 0.47 | 0.64 |
| ICG+ | 0.53 | 0.74 |
Qualitative analysis suggests ICG+ provides smoother predictions than FoundationPose during severe hand-object occlusions.
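As background for the two columns, the sketch below computes the raw ADD and ADD-S distances under their standard definitions: ADD averages the distance between corresponding model points under the predicted and ground-truth poses, while ADD-S averages the distance to the nearest predicted point and is therefore tolerant of object symmetries. The table does not state how these distances are turned into the reported scores (for example an AUC or a recall at a threshold), so that step is left out; the cube model and function names are hypothetical.

```python
import numpy as np

def transform(points, R, t):
    """Apply a rigid transform to an (N, 3) array of model points."""
    return points @ R.T + t

def add_distance(points, R_pred, t_pred, R_gt, t_gt):
    """ADD: mean distance between corresponding transformed model points."""
    diff = transform(points, R_pred, t_pred) - transform(points, R_gt, t_gt)
    return np.linalg.norm(diff, axis=1).mean()

def add_s_distance(points, R_pred, t_pred, R_gt, t_gt):
    """ADD-S: mean distance from each ground-truth point to its nearest predicted point."""
    pred = transform(points, R_pred, t_pred)
    gt = transform(points, R_gt, t_gt)
    dists = np.linalg.norm(gt[:, None, :] - pred[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

# Toy symmetric object: the 8 corners of a 10 cm cube (metres).
corners = np.array([[x, y, z] for x in (-0.05, 0.05)
                              for y in (-0.05, 0.05)
                              for z in (-0.05, 0.05)])
R_gt = np.eye(3)
R_flip = np.diag([-1.0, -1.0, 1.0])  # 180 degree rotation about z
t = np.array([0.0, 0.0, 0.5])

add_value = add_distance(corners, R_flip, t, R_gt, t)
add_s_value = add_s_distance(corners, R_flip, t, R_gt, t)
print(add_value, add_s_value)
```

For this symmetric cube the flipped pose is visually identical to the ground truth, so ADD-S is zero while ADD is about 14 cm. Since ADD-S never exceeds ADD, scores derived from it are at least as favourable, which matches the ordering of the two columns in the table.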
Demos
ADL4D Sequences
DexYCB Sequences
Sequence 4
Sequence 6
H2O
Conclusion & Future Improvements
ADL4D demonstrates that context is the missing link in Human-Object Interaction. By capturing the full “Action Plan”—preparation, interaction, transition, and conclusion—we provide a benchmark that forces models to learn the temporal logic of activity.
Looking forward, we identify several key areas for evolution:
- Egocentric Perspectives: While our studio setup supports egocentric capture, this dataset focused on third-person views. Integrating egocentric cameras from the subjects’ point of view would provide a critical “user-eye” signal for imitation learning.
- Dense Landmark Models: The current pose estimation pipeline could be significantly enhanced by adopting emerging dense landmark models, offering finer granularity than sparse keypoints.
- Expanded Data Variety: Future iterations should expand beyond kitchen-focused scenarios to cover a broader range of daily living environments and interaction types.
The release of this dataset serves as a foundational step for the next generation of “context-aware” robotic assistants.