NeurIPS 2026 Datasets & Benchmarks track (submission)
Yonder: A 4.65M-Frame Drone Navigation Dataset and the Cross-Simulator Generalization Gap
TL;DR
- Yonder is a 4.65M-frame drone-perspective indoor navigation dataset with stereo RGB, depth, IR, LiDAR-style, and semantic data.
- Offline detection gains do not reliably transfer to closed-loop navigation when training and evaluation simulators disagree geometrically.
- Released publicly on Hugging Face under CC-BY-NC-4.0.
| Field | Value |
|---|---|
| Frames | 4.65M |
| Sensing | stereo RGB, depth, IR, LiDAR-style, segmentation, pose |
| License | CC-BY-NC-4.0 |
| Venue | NeurIPS 2026 D&B (submission) |
Abstract
This paper introduces Yonder, a multi-million-frame drone-perspective indoor dataset with rich sensing, and shows why offline detection gains can fail to translate to closed-loop navigation when the training and evaluation simulators disagree geometrically.
Benchmark datasets for drone autonomy usually measure one thing: whether a detector trained on dataset A performs well on dataset A's test split. That is a reasonable starting point, but it answers the wrong question for deployed autonomy. The question that matters is whether a perception stack trained on data from one environment flies correctly through a different one. Yonder is built to answer that question.
Why we built it
The proximate cause was a frustrating pattern in our own lab: we would fine-tune a detector, watch offline mAP climb, run a closed-loop flight trial in Isaac Sim, and see no improvement — or sometimes a regression. After several iterations we stopped assuming this was a detector problem and started instrumenting the full loop. The binding constraint was not detection accuracy. It was a geometric disagreement between the simulator used for training data and the simulator used for evaluation. Yonder is the infrastructure we built to make that gap measurable and reproducible.
What's in the dataset
Yonder contains 4.65 million frames of drone-perspective indoor navigation footage with synchronized sensing across six modalities:
- Stereo RGB at full navigation resolution
- Registered depth maps
- Infrared (IR) frames for low-light conditions
- LiDAR-style range data for metric grounding
- Semantic segmentation labels
- 6-DoF pose ground truth at every frame
The environments span corridors, open rooms, doorways, and multi-level indoor spaces — the geometry that shows up in real inspection, security, and logistics missions.
The cross-simulator generalization gap
The paper's central finding is not about the dataset itself. It is about what happens when you use a dataset collected in one simulation environment to train a model, then evaluate that model in a geometrically different simulation environment.
We quantify this gap across several detector architectures and show that offline detection metrics (mAP being the usual candidate) can rise sharply while closed-loop navigation success stays flat or falls. The two metrics measure different things. Offline mAP rewards detection accuracy on a fixed test distribution. Closed-loop success rewards the entire perception-planning-control loop under the distribution shift introduced by the new simulator's geometry, lighting model, and object placement.
The implication for practitioners: if you benchmark only offline, you can do everything right and still ship a drone that does not navigate. The detailed engineering record of this failure pattern is in our companion paper, Engineering the Separation Principle.
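One way to make the offline-versus-closed-loop comparison concrete is to report both deltas side by side for every detector change. The sketch below is a minimal illustration, not the paper's evaluation code: the `EvalResult` record, the `transfer_report` helper, and the numbers in the usage lines are all hypothetical placeholders, and closed-loop success is assumed to be a simple fraction of episodes that reach the goal.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalResult:
    """Hypothetical record pairing an offline score with closed-loop outcomes."""
    name: str
    offline_map: float  # offline mAP on a fixed test split, in [0, 1]
    successes: int      # closed-loop episodes that reached the goal
    episodes: int       # closed-loop episodes attempted

    @property
    def success_rate(self) -> float:
        if self.episodes <= 0:
            raise ValueError("episodes must be positive")
        return self.successes / self.episodes


def transfer_report(baseline: EvalResult, candidate: EvalResult) -> dict:
    """Compare a candidate detector against a baseline on both metrics."""
    delta_map = candidate.offline_map - baseline.offline_map
    delta_success = candidate.success_rate - baseline.success_rate
    return {
        "delta_map": delta_map,
        "delta_success": delta_success,
        # True when offline mAP improved but closed-loop success did not.
        "offline_gain_without_transfer": delta_map > 0 and delta_success <= 0,
    }


# Placeholder numbers for illustration only; they are not results from the paper.
baseline = EvalResult("baseline", offline_map=0.50, successes=30, episodes=50)
finetuned = EvalResult("finetuned", offline_map=0.62, successes=29, episodes=50)
report = transfer_report(baseline, finetuned)
```

Reporting the flag alongside the raw deltas keeps a rising mAP from being read as progress when the closed-loop number has not moved.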
How to use Yonder
The full dataset is on Hugging Face at astralhf/yonder. A 500 MB sample (astralhf/yonder-sample) is available for quick evaluation. The dataset card documents the collection protocol, coordinate frames, label schema, and known edge cases.
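As a starting point, the sample can be fetched with the `huggingface_hub` client and inspected before committing to the full download. This is a sketch under assumptions: it uses the repository ID given above, the `download_sample` and `count_files_by_extension` helpers are illustrative names, and nothing here assumes a particular file layout (the dataset card is the authority on that).

```python
from collections import Counter
from pathlib import Path

SAMPLE_REPO = "astralhf/yonder-sample"  # repository ID as given in the text


def download_sample(local_dir: str = "yonder-sample") -> Path:
    """Download the sample repository and return its local path."""
    # Imported lazily so the inspection helper below works without the package.
    from huggingface_hub import snapshot_download

    path = snapshot_download(
        repo_id=SAMPLE_REPO,
        repo_type="dataset",
        local_dir=local_dir,
    )
    return Path(path)


def count_files_by_extension(root: Path) -> dict:
    """Count files under root, grouped by lowercased file extension."""
    counts = Counter()
    for path in root.rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "<none>"] += 1
    return dict(counts)
```

Calling `count_files_by_extension(download_sample())` gives a quick inventory of what the sample contains, which can then be checked against the label schema and coordinate frames documented in the dataset card.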
For a more detailed discussion of the dataset structure and the offline-vs-closed-loop failure pattern, see the companion blog post: What Yonder contains and why offline mAP lies to you.
License
Yonder is released under CC-BY-NC-4.0. Research and non-commercial use are permitted with attribution. Contact us for commercial licensing.
