Research · 14 min read
Why Every Drone AI We Tested Lost to Doing Nothing (And What Fixed It)
June 9, 2026
We ran 10,200 closed-loop flight trials across 25 vision-language model architectures, and the result was uncomfortable: in aggregate, no end-to-end 7–8B parameter VLM reliably outperforms a drone that simply hovers in place. Not occasionally loses. Reliably loses. The hover baseline — zero motion, zero intelligence — sits at 9.50 m mean error. Our best end-to-end VLM result was 8.70 m. That is not a success story.
This post explains why that happened, what the data shows about the root cause, and how a change in architecture, not model scale, brings mean error down to 1.04 m on operational commands.
The Benchmark
The evaluation covered 153 distinct task trials organized into tiers by difficulty: stationary targets, semantic identification, visual grounding, occluded objects, and multi-step reasoning. Each trial ran closed-loop — the model received camera frames, issued waypoints, and the flight controller executed them in simulation. We measured final position error, step-1 prediction error, directional accuracy, and collision rate.
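As a rough illustration of how per-trial results roll up into per-tier figures, the sketch below aggregates two of the four metrics (final position error and collision rate). The record fields, tier names, and values are hypothetical and chosen only for this example; they are not benchmark data or the benchmark's actual schema.

```python
from collections import defaultdict

def summarize_by_tier(trials):
    """Mean final error and collision rate per tier from per-trial records."""
    grouped = defaultdict(list)
    for trial in trials:
        grouped[trial["tier"]].append(trial)
    summary = {}
    for tier, rows in grouped.items():
        summary[tier] = {
            "trials": len(rows),
            "mean_final_error_m": sum(r["final_error_m"] for r in rows) / len(rows),
            # True counts as 1 and False as 0, so this is the collided fraction.
            "collision_rate": sum(r["collided"] for r in rows) / len(rows),
        }
    return summary

# Hypothetical records, for illustration only (not benchmark data).
example_trials = [
    {"tier": "stationary", "final_error_m": 2.0, "collided": False},
    {"tier": "stationary", "final_error_m": 4.0, "collided": False},
    {"tier": "occluded", "final_error_m": 12.0, "collided": True},
    {"tier": "occluded", "final_error_m": 8.0, "collided": False},
]

summary = summarize_by_tier(example_trials)
for tier, stats in summary.items():
    print(tier, stats)
```

Reporting per tier rather than as one pooled mean keeps a strong result on easy tiers from hiding a weak one on hard tiers.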
The hover baseline exists because it represents the correct null hypothesis for drone AI. If your model cannot beat standing still, it is not ready for deployment. A hovering drone risks nothing, damages nothing, and stays at a known position. Any system that moves must justify that motion with accuracy gains. Across our 25 tested architectures, most could not.
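The comparison can be sketched in a few lines. This is a minimal toy, assuming final position error is the straight-line distance between the drone's final position and the target. The trial coordinates, the function names, and the overshooting controller are all invented for illustration, and the policies are handed the true target only to keep the example short.

```python
import math

def mean_final_error(trials, policy):
    """Mean distance in meters between where the policy ends up and the target."""
    errors = [math.dist(policy(start, target), target) for start, target in trials]
    return sum(errors) / len(errors)

def hover_policy(start, target):
    # Null hypothesis: never move, so the final position is the start position.
    return start

def overshoot_policy(start, target, scale=2.2):
    # Stand-in for a controller with the right direction but a range
    # estimate that is `scale` times too long.
    return tuple(s + scale * (t - s) for s, t in zip(start, target))

# Hypothetical (start, target) pairs in meters, not benchmark data.
trials = [
    ((0.0, 0.0, 2.0), (6.0, 0.0, 2.0)),
    ((0.0, 0.0, 2.0), (0.0, 8.0, 2.0)),
    ((0.0, 0.0, 2.0), (3.0, 4.0, 2.0)),
]

hover_error = mean_final_error(trials, hover_policy)
moving_error = mean_final_error(trials, overshoot_policy)
print(f"hover {hover_error:.2f} m, moving {moving_error:.2f} m")
```

In this toy the moving controller points exactly at the target every time and still finishes farther away than the drone that never left its start position, because it overshoots by more than the original distance.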
The models tested spanned the current frontier: Gemini 3 Flash, Qwen 2.5 VL, Gemma 4, LLaVA variants, InternVL, and others in the 2B–8B active-parameter range. All were evaluated as direct image-plus-text-to-waypoint controllers: the standard end-to-end framing in which the VLM receives a frame and a natural-language goal, then outputs a 3D target position for the flight controller to reach.
The Metric Gap
The failure has a specific name and a specific cause. We call it the metric gap, and understanding it requires separating two distinct capabilities: directional reasoning and distance estimation.
VLMs are surprisingly good at direction. Across the benchmark, models achieved 0.83–0.91 directional cosine accuracy. Read as the cosine between the commanded heading and the true direction to the target, that means that when asked to fly toward a forklift, the heading they pick is typically within roughly 24–34 degrees of the correct one. That is genuine capability built from internet-scale visual pretraining.
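For readers who want to relate a cosine score to an angle, here is a minimal sketch. It assumes the score is the cosine between the commanded displacement and the true displacement to the target; the function names and the sample vectors are illustrative, not the benchmark's code.

```python
import math

def directional_cosine(predicted_move, true_move):
    """Cosine of the angle between the commanded and the ideal displacement."""
    dot = sum(p * t for p, t in zip(predicted_move, true_move))
    norm = math.hypot(*predicted_move) * math.hypot(*true_move)
    return dot / norm

def cosine_to_degrees(cosine):
    """Angular error in degrees implied by a cosine score."""
    return math.degrees(math.acos(cosine))

# Illustrative vectors: the command drifts half a meter sideways per meter forward.
example_cosine = directional_cosine((1.0, 0.5, 0.0), (1.0, 0.0, 0.0))
low_angle = cosine_to_degrees(0.91)
high_angle = cosine_to_degrees(0.83)
print(f"cosine {example_cosine:.3f}")
print(f"0.91 -> {low_angle:.1f} deg, 0.83 -> {high_angle:.1f} deg")
```

Under that reading, scores of 0.91 and 0.83 correspond to angular errors of about 24.5 and 33.9 degrees.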
VLMs are catastrophically bad at distance. The same models produce 6–10 m distance errors on targets that are 4–12 m away. The drone understands it needs to go left and forward, but has no reliable sense of whether the target is 3 m away or 15 m away. It guesses, and the guess is often wrong by a factor of two or three.
The root cause is training data. VLMs are trained on internet images, which contain no metric depth supervision. Models learn relative scale but not absolute metric mapping from pixel geometry to real-world distances. When asked to produce a waypoint in meters, they are extrapolating from a signal they never directly learned.
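To see why a range error dominates even when direction is good, consider the geometry: the true target, the commanded waypoint, and the drone form a triangle, so the law of cosines gives the resulting miss distance. The sketch below is a worked example under that assumption; the 8 m target range and the function name are chosen for illustration.

```python
import math

def waypoint_error(true_range_m, predicted_range_m, direction_cosine):
    """Distance between the true target and the commanded waypoint (law of cosines)."""
    squared = (true_range_m ** 2 + predicted_range_m ** 2
               - 2 * true_range_m * predicted_range_m * direction_cosine)
    # Guard against tiny negative values from floating-point rounding.
    return math.sqrt(max(squared, 0.0))

good_range = waypoint_error(8.0, 8.0, 0.9)      # right distance, cosine 0.9
doubled_range = waypoint_error(8.0, 16.0, 0.9)  # distance guessed 2x too long
perfect_aim = waypoint_error(8.0, 16.0, 1.0)    # perfect direction, 2x range
print(f"{good_range:.2f} m, {doubled_range:.2f} m, {perfect_aim:.2f} m")
```

With a correct range and a cosine of 0.9, the miss is about 3.58 m. Doubling the range estimate pushes it to about 9.47 m, and even a perfectly aimed command with doubled range misses by the full 8 m.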
What the Numbers Look Like
Hover baseline: 9.50 m mean error, 100% collision-free. Best end-to-end VLM: Gemini 3 Flash at 8.70 m, clearing hover by only 0.80 m. Across the frontier models, mean error ranges from 8.70 m to 13.69 m, and several are meaningfully worse than hovering.
Closed-loop degradation compounds the problem. Qwen 2.5 VL's step-1 prediction error is 8.11 m: poor, but not catastrophic. By the end of a closed-loop trial, cumulative error reaches 38.64 m, roughly 4.8 times the first-step figure. The model anchors on its initial spatial estimate and compounds the error with each replanning step.
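One simple way error can grow under replanning is shown in the toy model below. It is an assumption made for illustration, not the mechanism measured in the benchmark: a one-dimensional drone whose controller always aims in the correct direction but believes the target is a fixed multiple of its true distance away, and flies that full distance each step.

```python
def closed_loop_errors(initial_range_m, range_scale, steps):
    """Remaining distance to target after each replanning step (1D toy model).

    Each step the controller believes the target is `range_scale` times
    farther than it really is and flies that full distance along the
    correct direction.
    """
    remaining = initial_range_m  # signed: negative means the drone overshot
    errors = []
    for _ in range(steps):
        commanded = range_scale * remaining
        remaining -= commanded
        errors.append(abs(remaining))
    return errors

converging = closed_loop_errors(8.0, 1.5, 4)  # 1.5x range overestimate
diverging = closed_loop_errors(8.0, 2.5, 4)   # 2.5x range overestimate
print(converging)
print(diverging)
```

With a 1.5x overestimate the error halves each step (4.0, 2.0, 1.0, 0.5 m), so replanning helps. With a 2.5x overestimate it grows by half each step (12.0, 18.0, 27.0, 40.5 m). In this toy, any consistent range bias above 2x turns replanning from a correction into an amplifier.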
Gemma 4 Confirms the Pattern
Run as an end-to-end controller, Gemma 4 E2B produces 47.78 m final error, 17.8 s/step latency, and a 67% collision rate. The same Gemma 4 weights, used inside the modular pipeline as a target selector, reach 9.35 m, beating hover by 0.15 m. The gap between end-to-end and modular use of the identical model is 38.4 m. The model is not the problem. The architecture is. Full results at research: Gemma 4 pilot.
What Fixed It: The Separation Principle
The solution is not a better VLM. It is a different question asked of the VLM. The modular Track A architecture separates responsibilities explicitly:
- VLM handles semantics only. Given a natural language goal, the VLM outputs a target label or bounding box selection — no coordinates, no distances.
- Depth Anything V2 handles geometry. A dedicated monocular depth model estimates per-pixel depth in real-world meters, supplying the metric signal the VLM never learned.
- Geometric planning handles navigation. Given the semantic selection and the metric depth estimate, a classical geometry solver computes the safe waypoint.
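The three responsibilities above can be sketched as a short pipeline. This is a minimal sketch under stated assumptions, not the Track A implementation: the bounding box is hard-coded where a VLM selection would go, a nested list stands in for a metric depth map (no real depth-model API is called), the camera is an ideal pinhole with z pointing forward, and the standoff distance is an invented safety margin. All function names and values are hypothetical.

```python
import math
from statistics import median

def box_depth_m(depth_map, box):
    """Median metric depth inside a bounding box (x0, y0, x1, y1), in meters."""
    x0, y0, x1, y1 = box
    values = [depth_map[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return median(values)

def back_project(u, v, depth_m, fx, fy, cx, cy):
    """Pinhole back-projection of pixel (u, v) at a given depth to camera-frame XYZ."""
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return (x, y, depth_m)

def plan_waypoint(target_xyz, standoff_m):
    """Waypoint on the line to the target, stopping `standoff_m` short of it."""
    distance = math.hypot(*target_xyz)
    if distance <= standoff_m:
        return (0.0, 0.0, 0.0)  # already inside the standoff radius: hold position
    scale = (distance - standoff_m) / distance
    return tuple(c * scale for c in target_xyz)

# Toy 4x6 depth map in meters: background at 10 m, an object at 5 m.
depth_map = [[10.0] * 6 for _ in range(4)]
for row in (1, 2):
    for col in (3, 4):
        depth_map[row][col] = 5.0

box = (3, 1, 5, 3)  # stand-in for the box a VLM would select (semantics only)
depth = box_depth_m(depth_map, box)  # geometry: metric depth of the selection
u, v = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
target = back_project(u, v, depth, fx=100.0, fy=100.0, cx=3.0, cy=2.0)
waypoint = plan_waypoint(target, standoff_m=1.5)  # navigation: stop short
print(target, waypoint)
```

The point of the split is visible in the data flow: the only thing the language model contributes is which box to use, and every number expressed in meters comes from the depth map and the camera geometry.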
Track A achieves 1.04 m mean error on operational commands. Across the full 153-trial benchmark it scores 9.98 m against hover's 9.50 m, a deliberate 0.48 m tradeoff to maintain 100% collision-free flight. It runs on a Jetson Orin Nano with a sensor cost of about $1,350.
What This Means for Operators
Ask any vendor for closed-loop trial data, not demo videos. Ask for collision rates on the full test set. Ask whether the architecture separates semantics from geometry. Ask what the system does on unsolved task types — no current system handles occluded target search or multi-step reasoning reliably, and any system claiming otherwise should provide per-tier data.
Full benchmark methodology, per-tier results, and architecture specifications are at research: metric gap.
