Yonder: A Large-Scale Drone Navigation Dataset and Why Offline mAP Lies to You

Yonder is a large drone-perspective dataset for indoor navigation research. It is built to support perception training (detection, depth, semantics) and to expose a specific evaluation failure mode: offline metrics computed in one simulator can mis-rank models for closed-loop flight in another.

What is inside

The public release includes 4.65 million frames across many indoor environments, with synchronized sensing per waypoint (stereo RGB, depth, IR, LiDAR-style sweeps, semantic segmentation, pose). Full details and layout are on the Hugging Face dataset card; start with the 500 MB sample subset astralhf/yonder-sample if you want a small download before committing to large transfers.

What Yonder is for (and not for)

Great for: training and studying drone-perspective perception, and diagnosing cross-simulator generalization when paired with a closed-loop evaluator.
Not for: end-to-end policy training from expert trajectories. Yonder is not packaged as behavior-cloning data with full closed-loop rollouts.

Why this matters for AI drone software

The field has a habit of celebrating offline detection gains. Yonder includes the ingredients to show when those gains are real for flight and when they are an artifact of simulator-specific geometry and rendering conventions. If you care about trustworthy autonomy, publish both: offline metrics and closed-loop outcomes.

Dataset hub: https://huggingface.co/datasets/astralhf/yonder

Frames	4.65M
Sensing	stereo RGB, depth, IR, LiDAR-style, segmentation, pose
License	CC-BY-NC-4.0
Host	Hugging Face

Benchmark datasets for drone autonomy usually measure one thing: whether a detector trained on dataset A performs well on dataset A's test split. That is a reasonable starting point, but it answers the wrong question for deployed autonomy. The question that matters is whether a perception stack trained on data from one environment flies correctly through a different one. Yonder is built to answer that.

Why we built it

The proximate cause was a frustrating pattern in our own lab: we would fine-tune a detector, watch offline mAP climb, run a closed-loop flight trial in Isaac Sim, and see no improvement — or sometimes a regression. After several iterations we stopped assuming this was a detector problem and started instrumenting the full loop. The binding constraint was not detection accuracy. It was a geometric disagreement between the simulator used for training data and the simulator used for evaluation. Yonder is the infrastructure we built to make that gap measurable and reproducible.

What's in the dataset

Yonder contains 4.65 million frames of drone-perspective indoor navigation footage with synchronized sensing across six modalities:

Stereo RGB at full navigation resolution
Registered depth maps
Infrared (IR) frames for low-light conditions
LiDAR-style range data for metric grounding
Semantic segmentation labels
6-DoF pose ground truth at every frame

The environments span corridors, open rooms, doorways, and multi-level indoor spaces — the geometry that shows up in real inspection, security, and logistics missions.

The cross-simulator generalization gap

The paper's central finding is not about the dataset itself. It is about what happens when you use a dataset collected in one simulation environment to train a model, then evaluate that model in a geometrically different simulation environment.

We quantify this gap across several detector architectures and show that offline detection metrics (mAP being the usual candidate) can rise sharply while closed-loop navigation success stays flat or falls. The two metrics measure different things. Offline mAP rewards detection accuracy on a fixed test distribution. Closed-loop success rewards the entire perception–planning–control loop under a distribution shift introduced by the new simulator's geometry, lighting model, and object placement.
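One way to make the disagreement between the two metrics concrete is to check whether they rank the same models in the same order. The sketch below computes Kendall's tau between offline mAP and closed-loop success rate. The detector names and all numbers are placeholders invented for illustration; they are not results from Yonder or the companion paper, and the choice of Kendall's tau is our assumption rather than the paper's stated method.

```python
from itertools import combinations

# Placeholder values for illustration only. They are NOT results from Yonder
# or the companion paper. Each entry: (offline mAP, closed-loop success rate).
results = {
    "detector_a": (0.52, 0.61),
    "detector_b": (0.58, 0.63),
    "detector_c": (0.66, 0.55),
    "detector_d": (0.71, 0.48),
}


def kendall_tau(scores):
    """Kendall's tau-a between the two metrics over all model pairs."""
    concordant = discordant = 0
    for (map_1, success_1), (map_2, success_2) in combinations(scores.values(), 2):
        product = (map_1 - map_2) * (success_1 - success_2)
        if product > 0:
            concordant += 1  # both metrics agree on which model is better
        elif product < 0:
            discordant += 1  # the metrics disagree on which model is better
    pairs = len(scores) * (len(scores) - 1) // 2
    return (concordant - discordant) / pairs


def rank_by(scores, index):
    """Model names sorted best-first by the metric at position `index`."""
    return sorted(scores, key=lambda name: scores[name][index], reverse=True)


tau = kendall_tau(results)
offline_ranking = rank_by(results, 0)
closed_loop_ranking = rank_by(results, 1)

print(f"Kendall tau: {tau:.2f}")
print("Best by offline mAP:   ", offline_ranking)
print("Best by closed-loop SR:", closed_loop_ranking)
```

A tau near +1 means the offline metric is a usable proxy for flight outcomes; a tau near zero or below means that picking a model by offline mAP alone can select the wrong one. With the placeholder values above, the two rankings are close to reversed.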

The implication for practitioners: if you benchmark only offline, you can do everything right and still ship a drone that does not navigate. The detailed engineering record of this failure pattern is in our companion paper, Engineering the Separation Principle.

How to use Yonder

The full dataset is on Hugging Face at astralhf/yonder. A 500 MB sample (astralhf/yonder-sample) is available for quick evaluation. The dataset card documents the collection protocol, coordinate frames, label schema, and known edge cases.
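A minimal sketch of fetching the sample and taking a first look at its contents, assuming the huggingface_hub package is installed. The helper names download_sample and summarize_files are hypothetical, and nothing here assumes a particular file layout; the dataset card remains the authoritative description of the structure.

```python
from collections import Counter
from pathlib import Path

REPO_ID = "astralhf/yonder-sample"  # sample repository named in this post


def download_sample(local_dir="yonder-sample"):
    """Download the sample repository and return the local path."""
    # Imported here so the rest of the module works without the dependency.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, repo_type="dataset", local_dir=local_dir)


def summarize_files(root):
    """Count files under `root` by extension, skipping hidden directories."""
    root = Path(root)
    counts = Counter()
    for path in root.rglob("*"):
        relative_parts = path.relative_to(root).parts
        if path.is_file() and not any(part.startswith(".") for part in relative_parts):
            counts[path.suffix or "<none>"] += 1
    return dict(counts)
```

Calling summarize_files(download_sample()) gives a per-extension file count, which is a quick sanity check before reading the dataset card's layout section or starting the full transfer.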

For a more detailed discussion of the dataset structure and the offline-vs-closed-loop failure pattern, see the companion blog post: What Yonder contains and why offline mAP lies to you.

License

Yonder is released under CC-BY-NC-4.0. Research and non-commercial use are permitted with attribution. Contact us for commercial licensing.