18 Iterations to Beat Hover: What We Learned Engineering Drone Autonomy

We trained a drone navigation model on 6.7 million frames of drone-perspective footage, fine-tuned two state-of-the-art detectors, and achieved a 9.7× improvement in detection accuracy (mAP). Then we deployed it, and closed-loop performance was statistically indistinguishable from the zero-shot baseline. Every single time.

That result, replicated across four separate fine-tuning attempts, is the most important thing we learned in eighteen iterations of autonomous drone navigation research. This post documents the full arc — what we built, what broke, what the data actually said, and what we would do differently.

Where We Started

The initial modular pipeline result: 9.98 m aggregate error. Hover baseline: 9.50 m. We had built a complex system that performed worse than doing nothing. The pipeline had to beat hover. Until it did, nothing else mattered.
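To make the hover comparison concrete, here is a minimal sketch of how such a baseline could be scored. It assumes, since the post does not define it, that aggregate error is the mean Euclidean distance between the drone's final position and the goal. The function names and sample episodes are hypothetical; only the 9.98 m, 9.50 m, and 8.15 m figures come from the text.

```python
import math

def final_error(final_pos, goal_pos):
    """Euclidean distance in metres between the final position and the goal."""
    return math.dist(final_pos, goal_pos)

def aggregate_error(episodes, policy_final_pos):
    """Mean final error over all episodes for a given policy (assumed metric)."""
    errors = [final_error(policy_final_pos(ep), ep["goal"]) for ep in episodes]
    return sum(errors) / len(errors)

def hover_policy(ep):
    """Hover never leaves the spawn point, so its final position is the spawn."""
    return ep["spawn"]

def beats_hover(pipeline_error_m, hover_error_m):
    """A pipeline is only useful if its error is strictly below hover's."""
    return pipeline_error_m < hover_error_m

# Hypothetical episodes, for illustration only.
episodes = [
    {"spawn": (0.0, 0.0, 1.0), "goal": (3.0, 4.0, 1.0)},
    {"spawn": (1.0, 1.0, 1.0), "goal": (1.0, 7.0, 1.0)},
]
hover_error = aggregate_error(episodes, hover_policy)

# Figures reported in the post.
print(beats_hover(9.98, 9.50))  # initial pipeline vs hover
print(beats_hover(8.15, 9.50))  # pipeline after the Phase I fixes vs hover
```

Scoring hover with the same function as every other policy keeps the baseline and the pipeline on identical episodes and an identical metric.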

Phase I: Pipeline Engineering (Iterations 3–11)

Six targeted fixes moved aggregate error from 9.98 m to 8.15 m with >90% collision-free flight: instruction decomposition, domain-specific detection priming, failure detection, spatial semantic memory, active yaw-based perception, and environment-specific safety profiles.

One diagnostic finding from this phase: mock evaluation overestimated performance by 6.5× compared to closed-loop results. If you are benchmarking drone navigation against replayed data rather than live closed-loop runs, your numbers are not real.

The Yonder Dataset

We built Yonder: 6.7 million frames from drone-perspective viewpoints across 275 indoor environments, with 32 million COCO-format bounding box annotations. The dataset is available at astral.us/datasets/yonder. Fine-tuning on Yonder produced a 9.7× detection mAP improvement: 4.8% → 46.7%. By any offline measure, this solved detection.

Except it did not.

The Domain Gap Trap

Four fine-tuning runs. Four different hyperparameter configurations. One result each time: 23–25% navigation success rate, statistically indistinguishable from the zero-shot baseline. The 9.7× mAP improvement produced exactly zero improvement in closed-loop navigation.
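One way to check a claim like "statistically indistinguishable" is an exact test on success counts. The sketch below implements a two-sided Fisher exact test with the standard library only. The post does not say which test was used or how many trials were run, so the test choice and the counts (24/100 and 25/100) are assumptions for illustration.

```python
from math import comb

def fisher_exact_two_sided(success_a, n_a, success_b, n_b):
    """Two-sided Fisher exact p-value for a 2x2 success/failure table."""
    total_success = success_a + success_b
    denom = comb(n_a + n_b, total_success)

    def prob(k):
        # Hypergeometric probability that group A has exactly k successes
        # when the row and column totals are held fixed.
        return comb(n_a, k) * comb(n_b, total_success - k) / denom

    observed = prob(success_a)
    k_min = max(0, total_success - n_b)
    k_max = min(n_a, total_success)
    # Sum the probabilities of every table at least as unlikely as the observed one.
    tail = sum(
        p for p in (prob(k) for k in range(k_min, k_max + 1))
        if p <= observed * (1 + 1e-9)
    )
    return min(1.0, tail)

# Hypothetical counts: zero-shot 24/100 vs fine-tuned 25/100.
p_value = fisher_exact_two_sided(24, 100, 25, 100)
print(f"p = {p_value:.3f}")
```

A large p-value here means the data give no evidence of a difference. It does not prove the two success rates are equal, which is why the trial count matters as much as the test.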

Root cause: cross-simulator domain gap. Yonder was generated in Habitat-Sim. Evaluation runs in Isaac Sim. These simulators render the same scenes differently, and detectors learned the Habitat-Sim distribution, not the underlying world. The diagnostic failure mode: the drone produced negative-Z goal predictions — interpreting its target as below the floor — and immediately pitched into the ground. This happened consistently across all four fine-tuned checkpoints. Zero-shot models never did this.
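A failure mode like this is cheap to count per checkpoint. The sketch below measures how often predicted goals fall below the floor. It assumes a z-up world frame with the floor at z = 0, which the post does not specify. The function name and sample predictions are hypothetical.

```python
def below_floor_rate(predicted_goals, floor_z=0.0):
    """Fraction of predicted (x, y, z) goals whose z lies below the floor.

    Assumes a z-up world frame; floor_z is the floor height in metres.
    """
    if not predicted_goals:
        raise ValueError("no predictions to score")
    below = sum(1 for (_, _, z) in predicted_goals if z < floor_z)
    return below / len(predicted_goals)

# Hypothetical goal predictions, for illustration only.
fine_tuned_goals = [(1.0, 2.0, -0.4), (0.5, 1.0, -1.2), (2.0, 0.0, 0.9), (1.0, 1.0, -0.1)]
zero_shot_goals = [(1.0, 2.0, 0.8), (0.5, 1.0, 1.1), (2.0, 0.0, 0.9), (1.0, 1.0, 1.3)]

print(below_floor_rate(fine_tuned_goals))
print(below_floor_rate(zero_shot_goals))
```

Logging this rate alongside mAP for each checkpoint would surface a geometry problem before any closed-loop flight.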

The implication: in our experiments, training data generated in one simulation platform did not transfer to a different simulation platform for closed-loop navigation, even when scenes and categories were identical. Offline mAP measured in a different renderer is not a proxy for navigation performance.

Phase III: Exploration Was the Real Bottleneck

75% of target objects were not visible from the drone's spawn position. Detection quality was irrelevant until visibility was established. A learned exploration policy achieved 6.9% validation accuracy versus 8.3% random chance: it performed below chance, so it had learned the wrong thing. A depth-based heuristic produced a +5.2 pp improvement (p=0.168), which is not statistically significant.
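The visibility figure implies a simple ceiling on success. The sketch below is a simplified model, not the authors' analysis: it assumes a target must be seen before it can be reached, and that hidden targets can only be found through exploration. The function names, the episode field, and both rate parameters are hypothetical.

```python
def visible_from_spawn_fraction(episodes):
    """Share of episodes whose target is visible from the spawn position."""
    return sum(1 for ep in episodes if ep["target_visible_at_spawn"]) / len(episodes)

def success_ceiling(visible_fraction, detect_and_reach_rate=1.0, explore_find_rate=0.0):
    """Upper bound on success under the simplified model.

    Visible targets succeed at detect_and_reach_rate. Hidden targets must
    first be found by exploration (explore_find_rate) and then reached.
    """
    hidden_fraction = 1.0 - visible_fraction
    return detect_and_reach_rate * (visible_fraction + hidden_fraction * explore_find_rate)

# Hypothetical benchmark profile matching the 75% hidden figure.
episodes = [
    {"target_visible_at_spawn": True},
    {"target_visible_at_spawn": False},
    {"target_visible_at_spawn": False},
    {"target_visible_at_spawn": False},
]
visible = visible_from_spawn_fraction(episodes)

# Perfect detection with no effective exploration.
print(success_ceiling(visible))
# Perfect detection with exploration that finds half of the hidden targets.
print(success_ceiling(visible, explore_find_rate=0.5))
```

Under these assumptions, a perfect detector with no effective exploration tops out at the visible fraction, so the only lever that raises the ceiling is the exploration term.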

Across all configurations, success rate converged to 23–25%. The tasks that produced 0% success shared one property: they required reasoning, not perception. Spatial relational queries, negation, multi-step planning, and occluded target inference are architectural limitations, not data limitations.

What Actually Worked

With n=4 drones operating cooperatively, the modular pipeline achieved 70–72.5% success versus 5% for the zero-shot VLM baseline (p=0.002). Multi-agent coverage, not shared perception, is the source of that gain. That result is robust and does not depend on the single-agent ceiling being solved.
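A back-of-the-envelope model shows why coverage alone can produce a gain of this size. The sketch below computes the probability that at least one of n drones succeeds, assuming each attempt is independent with the same single-agent success rate. Independence is an assumption made here for illustration; the post does not describe its multi-agent setup in these terms, and real drones sharing a scene are unlikely to be fully independent.

```python
def at_least_one_success(p_single, n_drones):
    """P(at least one of n independent attempts succeeds) = 1 - (1 - p)^n."""
    if not 0.0 <= p_single <= 1.0:
        raise ValueError("p_single must be a probability")
    if n_drones < 1:
        raise ValueError("need at least one drone")
    return 1.0 - (1.0 - p_single) ** n_drones

# Single-agent range reported in the post (23-25%), with n = 4 drones.
for p in (0.23, 0.25):
    print(f"p_single={p:.2f} -> team success {at_least_one_success(p, 4):.3f}")
```

With the reported single-agent range this toy model gives roughly 65–68%, the same order as the measured 70–72.5%. That is at least consistent with coverage being the dominant effect.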

What We Would Do Differently

Validate synthetic data in the evaluation environment, not a different one.
Run closed-loop evaluation early and often — it is the only number that is real.
Profile your benchmark before optimizing perception: if 75% of targets are invisible from spawn, improving detector mAP by 10× will not move your success rate.

Full technical detail: research documentation. Yonder dataset: astral.us/datasets/yonder.

Iterations	18
Training frames	6.7M
Detection mAP	4.8% → 46.7% (9.7×)

This paper is an engineering log, not a polished result. It records 18 iterations of a modular drone autonomy stack, including most of the failures, in detail. We publish it because the failures are more instructive than the successes.

The central finding

Fine-tuning a YOLOv8n detector on 6.7 million synthetic frames from Habitat-Sim improved detection mAP 9.7×, from 4.8% to 46.7%. This is a large, unambiguous gain on the offline metric. Closed-loop navigation success did not improve. On the same set of navigation trials, success rates before and after fine-tuning were statistically indistinguishable.

That result forces a conclusion: detection accuracy was not the bottleneck. Something else was holding navigation back. We spent several iterations diagnosing it before isolating a cross-simulator localization gap. The training simulator and the evaluation simulator disagree on enough geometric and rendering details (object scale, floor reflectance, lighting model, corridor dimensions) that depth estimates computed in the evaluation environment are systematically off relative to the depth distribution the planner was trained to expect.
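One simple probe for a depth distribution shift of this kind is to compare depth samples from the two simulators directly. The sketch below computes the two-sample Kolmogorov-Smirnov statistic, the largest gap between the two empirical CDFs, using only the standard library. It illustrates the idea and is not the diagnostic the authors ran; the depth values are hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: the maximum gap between empirical CDFs.

    Returns 0.0 for identical samples and 1.0 for non-overlapping ones.
    """
    a, b = sorted(sample_a), sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking
    # every observed value is enough to find the maximum gap.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap

# Hypothetical depth samples in metres for the same scene viewpoints.
train_sim_depths = [1.0, 1.5, 2.0, 2.5, 3.0]
eval_sim_depths = [1.3, 1.95, 2.6, 3.25, 3.9]  # e.g. a systematic scale offset

print(ks_statistic(train_sim_depths, eval_sim_depths))
```

A statistic near zero suggests the planner sees the depth distribution it was trained on. A large value points to a systematic offset worth checking before any further detector work.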

Iteration structure

The 18 iterations span three phases:

Baseline establishment (iterations 1–4): standing up the full perception–planning–control loop, verifying that commands reach the flight controller, establishing the hover baseline as the comparison point.
Detector scaling (iterations 5–12): data collection pipeline, synthetic frame generation at scale, fine-tuning protocol, offline evaluation, and the closed-loop non-result.
Gap diagnosis (iterations 13–18): systematic probes of the localization gap, partial mitigations, and characterization of exploration and planning as the next bottlenecks now that detection is no longer binding.

Why publish failure logs

Drone autonomy papers almost universally report final results under favorable conditions. The engineering decisions that were tried and discarded, and especially the diagnostic work that preceded those decisions, rarely appear in print. That creates a literature where every paper shows an improvement, and a practitioner reading it has no idea how many expensive dead ends were omitted.

We think the iteration log format serves the community better. If you are working on a similar stack and your fine-tuning isn't transferring to closed loop, this paper tells you what we checked, in what order, and what the localization gap diagnosis looks like.

The dataset used to generate training frames is Yonder. The benchmark that produced the 25-VLM results referenced here is documented in Closing the Metric Gap.