18 Iterations to Beat Hover: What We Learned Engineering Drone Autonomy
June 9, 2026
We trained on 6.7 million frames of drone-perspective footage, fine-tuned two state-of-the-art detectors, and achieved a 9.7× improvement in detection mAP. Then we deployed the fine-tuned detectors, and closed-loop navigation performance was statistically indistinguishable from the zero-shot baseline. Every single time.
That result, replicated across four separate fine-tuning attempts, is the most important thing we learned in eighteen iterations of autonomous drone navigation research. This post documents the full arc — what we built, what broke, what the data actually said, and what we would do differently.
Where We Started
The initial modular pipeline result: 9.98 m aggregate error. Hover baseline: 9.50 m. We had built a complex system that performed worse than doing nothing. The pipeline had to beat hover. Until it did, nothing else mattered.
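The hover comparison can be sketched in a few lines. This is a minimal illustration, assuming "aggregate error" means the mean Euclidean distance between the drone's final position and the goal across episodes (the post does not define the metric). The `Episode` type and both function names are hypothetical.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class Episode:
    spawn: Vec3  # where the drone starts
    goal: Vec3   # where the target is
    final: Vec3  # where the evaluated policy ended the episode


def aggregate_error(episodes: list[Episode], use_hover: bool = False) -> float:
    """Mean final distance to goal (assumed metric, see lead-in).

    With use_hover=True the drone is treated as never leaving its spawn
    point, which is the "do nothing" baseline.
    """
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for ep in episodes:
        end = ep.spawn if use_hover else ep.final
        total += math.dist(end, ep.goal)
    return total / len(episodes)


def beats_hover(episodes: list[Episode]) -> bool:
    """True only if the policy ends closer to the goal, on average, than hovering."""
    return aggregate_error(episodes) < aggregate_error(episodes, use_hover=True)
```

Under this reading, a pipeline at 9.98 m against a hover baseline of 9.50 m fails the `beats_hover` gate, which is the condition the rest of the work had to clear first.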
Phase I: Pipeline Engineering (Iterations 3–11)
Six targeted fixes moved aggregate error from 9.98 m to 8.15 m with >90% collision-free flight: instruction decomposition, domain-specific detection priming, failure detection, spatial semantic memory, active yaw-based perception, and environment-specific safety profiles.
One diagnostic finding from this phase: mock evaluation overestimated performance by 6.5× compared to closed-loop results. If you are benchmarking drone navigation against replayed data rather than live closed-loop runs, your numbers are not real.
The ZeroClaw Dataset
We built ZeroClaw: 6.7 million frames from drone-perspective viewpoints across 275 indoor environments, with 32 million COCO-format bounding box annotations. The dataset is available at astral.us/datasets/yonder. Fine-tuning on ZeroClaw produced a 9.7× improvement in detection mAP: 4.8% → 46.7%. By any offline measure, this solved detection.
Except it did not.
The Domain Gap Trap
Four fine-tuning runs. Four different hyperparameter configurations. One result each time: 23–25% navigation success rate, statistically indistinguishable from the zero-shot baseline. The 9.7× mAP improvement produced no measurable improvement in closed-loop navigation.
Root cause: cross-simulator domain gap. ZeroClaw was generated in Habitat-Sim. Evaluation runs in Isaac Sim. These simulators render the same scenes differently, and detectors learned the Habitat-Sim distribution, not the underlying world. The diagnostic failure mode: the drone produced negative-Z goal predictions — interpreting its target as below the floor — and immediately pitched into the ground. This happened consistently across all four fine-tuned checkpoints. Zero-shot models never did this.
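A failure this consistent is cheap to catch before the drone acts on it. The following is a minimal sketch of a goal sanity check, not the pipeline's actual failure detector. It assumes a Z-up world frame with the floor at z = 0 and a known ceiling height; `check_goal`, `GoalCheck`, and the default heights are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoalCheck:
    ok: bool
    reason: str


def check_goal(
    goal_xyz: tuple[float, float, float],
    floor_z: float = 0.0,    # assumed floor height in the world frame
    ceiling_z: float = 3.0,  # assumed ceiling height; tune per environment
) -> GoalCheck:
    """Reject goal predictions that lie outside the room's vertical extent."""
    z = goal_xyz[2]
    if z < floor_z:
        return GoalCheck(False, f"goal z={z:.2f} m is below the floor")
    if z > ceiling_z:
        return GoalCheck(False, f"goal z={z:.2f} m is above the ceiling")
    return GoalCheck(True, "ok")
```

A rejected goal could trigger a hold or a re-detection instead of a descent. A check like this would only mask the symptom, though: the detector would still be operating outside its training distribution.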
The implication: in our experiments, training data generated in one simulation platform did not transfer to a different simulation platform for closed-loop navigation, even when scenes and categories were identical. Offline mAP measured in a different renderer is not a proxy for navigation performance.
Phase III: Exploration Was the Real Bottleneck
75% of target objects were not visible from the drone's spawn position. Detection quality was irrelevant until visibility was established. A learned exploration policy achieved 6.9% validation accuracy versus 8.3% for random chance. It scored below chance, so it learned the wrong thing. A depth-based heuristic produced a +5.2 pp improvement (p=0.168), which is not statistically significant.
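To show how a gain of about five percentage points can fail to reach significance, here is a sketch of a pooled two-proportion z-test using only the standard library. The post does not say which test produced p=0.168, so this is a stand-in technique, and the episode counts in the usage lines are hypothetical rather than the study's data.

```python
import math


def two_proportion_p_value(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """Two-sided p-value for a pooled two-proportion z-test."""
    if n_a <= 0 or n_b <= 0:
        raise ValueError("sample sizes must be positive")
    p_a = success_a / n_a
    p_b = success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    if se == 0.0:
        # Both arms are all-success or all-failure: no evidence of a difference.
        return 1.0
    z = (p_b - p_a) / se
    # Two-sided tail probability of a standard normal: 2 * (1 - Phi(|z|)).
    return math.erfc(abs(z) / math.sqrt(2.0))


# Hypothetical counts: 24/100 baseline successes vs 29/100 with a heuristic.
p = two_proportion_p_value(24, 100, 29, 100)
print(f"p = {p:.3f}, significant at 0.05: {p < 0.05}")
```

With samples of this size, a five-point difference is well inside the noise, which is why the effect size has to be read together with the episode count.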
Across all configurations, success rate converged to 23–25%. The tasks that produced 0% success shared one property: they required reasoning, not perception. Spatial relational queries, negation, multi-step planning, and occluded target inference are architectural limitations, not data limitations.
What Actually Worked
With four drones operating cooperatively (n=4), the modular pipeline achieved 70–72.5% success versus 5% for the zero-shot VLM baseline (p=0.002). Multi-agent coverage, not shared perception, is the source of that gain. That result is robust and does not depend on the single-agent ceiling being solved.
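A rough back-of-envelope check supports the coverage explanation. The sketch below assumes each drone succeeds independently at the single-agent rate of 23–25% and that the team succeeds if any one drone does. Both are simplifying assumptions not stated in the post, and the single-agent and four-drone numbers may come from different task sets. `team_success` is a hypothetical helper.

```python
def team_success(single_rate: float, n_agents: int) -> float:
    """Probability that at least one of n independent agents succeeds."""
    if not 0.0 <= single_rate <= 1.0:
        raise ValueError("single_rate must be in [0, 1]")
    if n_agents < 1:
        raise ValueError("n_agents must be at least 1")
    return 1.0 - (1.0 - single_rate) ** n_agents


for rate in (0.23, 0.25):
    print(f"{rate:.0%} single-agent -> {team_success(rate, 4):.1%} with 4 drones")
```

Under those assumptions the expected team success is roughly 65–68%, in the same range as the reported 70–72.5%. Independent coverage alone could therefore account for most of the gain without any improvement in per-drone perception.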
What We Would Do Differently
- Validate synthetic data in the evaluation environment, not a different one.
- Run closed-loop evaluation early and often — it is the only number that is real.
- Profile your benchmark before optimizing perception: if 75% of targets are invisible from spawn, improving detector mAP by 10× will not move your success rate.
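The last point can be turned into a quick pre-flight check on any benchmark. This sketch assumes you can label each episode with whether the target is visible from the spawn pose, and that a policy with no effective exploration can only succeed on those episodes. `visibility_ceiling` and the sample flags are hypothetical.

```python
from collections.abc import Iterable


def visibility_ceiling(visible_at_spawn: Iterable[bool]) -> float:
    """Fraction of episodes whose target is visible from the spawn pose.

    Without exploration, this is an upper bound on success rate, however
    good the detector is.
    """
    flags = list(visible_at_spawn)
    if not flags:
        raise ValueError("need at least one episode")
    return sum(1 for visible in flags if visible) / len(flags)


# Hypothetical 8-episode benchmark in which 2 targets are visible at spawn.
flags = [True, False, False, False, True, False, False, False]
print(f"detection-only success ceiling: {visibility_ceiling(flags):.0%}")
```

With 75% of targets invisible from spawn, this ceiling is 25%, which sits right at the 23–25% plateau reported above.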
Full technical detail: research documentation. ZeroClaw dataset: astral.us/datasets/yonder.
