Why Every Drone AI We Tested Lost to Doing Nothing (And What Fixed It)

We ran 10,200 closed-loop flight trials across 25 vision-language model architectures, and the result was uncomfortable: in aggregate, no end-to-end 7–8B parameter VLM reliably outperforms a drone that simply hovers in place. The failure is not occasional; it is the norm. The hover baseline (zero motion, zero intelligence) sits at 9.50 m mean error. Our best end-to-end VLM result was 8.70 m, a margin of just 0.80 m, and several models were worse than hovering. That is not a success story.

This post explains why that happened, what the data shows about the root cause, and how a change in architecture, not model scale, brings mean error down to 1.04 m on operational commands.

The Benchmark

The evaluation covered 153 distinct task trials organized into tiers by difficulty: stationary targets, semantic identification, visual grounding, occluded objects, and multi-step reasoning. Each trial ran closed-loop — the model received camera frames, issued waypoints, and the flight controller executed them in simulation. We measured final position error, step-1 prediction error, directional accuracy, and collision rate.
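The four measurements above can be sketched in a few lines. This is a minimal illustration rather than the benchmark's actual scoring code: the `Trial` record, its field names, and the choice to score direction as the cosine between the first predicted displacement and the true displacement are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Trial:
    # Hypothetical record layout; positions are (x, y, z) in meters.
    start: Sequence[float]           # drone position when the goal was issued
    target: Sequence[float]          # ground-truth goal position
    first_waypoint: Sequence[float]  # waypoint the model issued on step 1
    final: Sequence[float]           # drone position when the trial ended
    collided: bool


def direction_cosine(start, predicted, target):
    """Cosine of the angle between the predicted and true displacement."""
    p = [a - s for a, s in zip(predicted, start)]
    t = [a - s for a, s in zip(target, start)]
    norm = math.hypot(*p) * math.hypot(*t)
    if norm == 0.0:
        return 0.0  # a zero-length move has no direction; score it as 0
    return sum(a * b for a, b in zip(p, t)) / norm


def summarize(trials):
    """Average the four per-trial measurements over a list of trials."""
    n = len(trials)
    return {
        "final_error_m": sum(math.dist(t.final, t.target) for t in trials) / n,
        "step1_error_m": sum(math.dist(t.first_waypoint, t.target) for t in trials) / n,
        "direction_cosine": sum(
            direction_cosine(t.start, t.first_waypoint, t.target) for t in trials
        ) / n,
        "collision_rate": sum(t.collided for t in trials) / n,
    }
```

Under this layout, a hovering drone would have `final` equal to `start`, so its final error is simply the initial distance to the target.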

The hover baseline exists because it represents the correct null hypothesis for drone AI. If your model cannot beat standing still, it is not ready for deployment. A hovering drone risks nothing, damages nothing, and stays at a known position. Any system that moves must justify that motion with accuracy gains. Across our 25 tested architectures, most could not.

The models tested spanned the current frontier: Gemini 3 Flash, Qwen 2.5 VL, LLaVA variants, InternVL, and others in the 2B–8B active-parameter range, with Gemma 4 added afterwards as a separate pilot. All were evaluated as direct image-plus-text-to-waypoint controllers: the standard end-to-end framing where the VLM receives a frame and a natural language goal, then outputs a 3D target position for the flight controller to reach.

The Metric Gap

The failure has a specific name and a specific cause. We call it the metric gap, and understanding it requires separating two distinct capabilities: directional reasoning and distance estimation.

VLMs are surprisingly good at direction. Across the benchmark, models achieved a directional cosine of 0.83–0.91 between predicted and true headings. A cosine of 0.83 corresponds to an angular error of about 34 degrees, and 0.91 to about 24 degrees, so when asked to fly toward a forklift, the models point in roughly the right direction. That is genuine capability built from internet-scale visual pretraining.

VLMs are catastrophically bad at distance. The same models produce 6–10 m distance errors on targets that are 4–12 m away. The drone understands it needs to go left and forward, but has no reliable sense of whether the target is 3 m away or 15 m away. It guesses, and the guess is often wrong by a factor of two or three.
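To see how a good direction and a bad range combine, the law of cosines gives the waypoint error directly from the two quantities. The sketch below is illustrative only: the function name is hypothetical, and the 8 m and 16 m ranges are made-up inputs chosen to match the "factor of two" case described above, not benchmark measurements.

```python
import math


def position_error(true_range_m, predicted_range_m, direction_cosine):
    """Distance between the predicted and true waypoint (law of cosines).

    Both waypoints are measured from the drone; direction_cosine is the
    cosine of the angle between the two displacement vectors.
    """
    squared = (
        true_range_m ** 2
        + predicted_range_m ** 2
        - 2.0 * true_range_m * predicted_range_m * direction_cosine
    )
    return math.sqrt(max(squared, 0.0))  # clamp tiny negative rounding errors


# Hypothetical target 8 m away, direction cosine 0.9 in both cases.
right_range = position_error(8.0, 8.0, 0.9)    # direction error only
double_range = position_error(8.0, 16.0, 0.9)  # same direction, range doubled
```

With these inputs, the direction error alone costs about 3.6 m, while doubling the range estimate pushes the error to about 9.5 m. The range term dominates even when the heading is good.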

The root cause is training data. VLMs are trained on internet images, which contain no metric depth supervision. Models learn relative scale but not absolute metric mapping from pixel geometry to real-world distances. When asked to produce a waypoint in meters, they are extrapolating from a signal they never directly learned.

What the Numbers Look Like

Hover baseline: 9.50 m mean error, 100% collision-free. Best end-to-end VLM: Gemini 3 Flash at 8.70 m, clearing hover by only 0.80 m. Across the frontier models, mean error ranges from 8.70 m to 13.69 m, and several are meaningfully worse than hovering.

Closed-loop degradation compounds the problem. Qwen 2.5 VL's first prediction error is 8.11 m — marginal, but not catastrophic. By the end of a closed-loop trial, cumulative error reaches 38.64 m. The model anchors on its initial spatial estimate and compounds the error with each replanning step.

Gemma 4 Confirms the Pattern

Gemma 4 E2B as end-to-end controller: 47.78 m final error, 17.8 s/step latency, 67% collision rate. The same Gemma 4 weights inside the modular pipeline as a target-selector: 9.35 m — beating hover. The end-to-end versus modular gap for the identical model is 38.4 m. The model is not the problem. The architecture is. Full results at research: Gemma 4 pilot.

What Fixed It: The Separation Principle

The solution is not a better VLM. It is a different question asked of the VLM. The modular Track A architecture separates responsibilities explicitly:

VLM handles semantics only. Given a natural language goal, the VLM outputs a target label or bounding box selection — no coordinates, no distances.
Depth Anything V2 handles geometry. A dedicated monocular depth model maps pixel distances to real-world meters — the supervision the VLM never received.
Geometric planning handles navigation. Given the semantic selection and the metric depth estimate, a classical geometry solver computes the safe waypoint.
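The three responsibilities above can be sketched as a short geometric routine. This is a minimal sketch under stated assumptions, not the Track A implementation: it assumes the VLM's selection arrives as a pixel bounding box, the depth model's output is already a per-pixel depth map in meters, the camera follows a pinhole model (x right, y down, z forward), and "safe waypoint" means stopping a fixed standoff short of the target. The names `Intrinsics`, `Box`, and `standoff_m` are hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass
class Intrinsics:
    fx: float  # focal lengths in pixels
    fy: float
    cx: float  # principal point in pixels
    cy: float


@dataclass
class Box:
    label: str  # semantic choice made by the VLM
    x_min: int  # pixel bounds, max exclusive
    y_min: int
    x_max: int
    y_max: int


def median_depth(depth_m, box):
    """Median of the metric depth values inside the box (robust to outliers)."""
    values = sorted(
        depth_m[v][u]
        for v in range(box.y_min, box.y_max)
        for u in range(box.x_min, box.x_max)
    )
    return values[len(values) // 2]


def backproject(u, v, z, k):
    """Pinhole back-projection of pixel (u, v) at depth z into the camera frame."""
    return ((u - k.cx) * z / k.fx, (v - k.cy) * z / k.fy, z)


def waypoint_for(box, depth_m, k, standoff_m=1.5):
    """Waypoint on the line to the target, stopping standoff_m short of it."""
    z = median_depth(depth_m, box)
    u = (box.x_min + box.x_max) / 2.0
    v = (box.y_min + box.y_max) / 2.0
    target = backproject(u, v, z, k)
    rng = math.hypot(*target)
    if rng <= standoff_m:
        return (0.0, 0.0, 0.0)  # already inside the standoff: hold position
    scale = (rng - standoff_m) / rng
    return tuple(c * scale for c in target)
```

The point of the split is visible in the signatures: the VLM contributes only `box.label` and the box itself, and every number in meters comes from the depth map and the camera intrinsics.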

Track A achieves 1.04 m mean error on operational commands. Across the full 153-trial benchmark it scores 9.98 m against hover's 9.50 m, a deliberate 0.48 m tradeoff to maintain 100% collision-free flight. It runs on a Jetson Orin Nano at ~$1,350 sensor cost.

What This Means for Operators

Ask any vendor for closed-loop trial data, not demo videos. Ask for collision rates on the full test set. Ask whether the architecture separates semantics from geometry. Ask what the system does on unsolved task types — no current system handles occluded target search or multi-step reasoning reliably, and any system claiming otherwise should provide per-tier data.
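If a vendor hands over raw trial logs, a per-tier breakdown is straightforward to compute yourself. The sketch below assumes a hypothetical log format of `(tier, model_error_m, hover_error_m, collided)` tuples; the function name and the field layout are illustrative, not a standard.

```python
from collections import defaultdict


def per_tier_report(records):
    """Group trial records by tier and compare each tier against hover.

    records: iterable of (tier, model_error_m, hover_error_m, collided).
    """
    grouped = defaultdict(list)
    for tier, model_err, hover_err, collided in records:
        grouped[tier].append((model_err, hover_err, collided))

    report = {}
    for tier, rows in grouped.items():
        n = len(rows)
        model_mean = sum(r[0] for r in rows) / n
        hover_mean = sum(r[1] for r in rows) / n
        report[tier] = {
            "trials": n,
            "model_error_m": model_mean,
            "hover_error_m": hover_mean,
            "beats_hover": model_mean < hover_mean,
            "collision_rate": sum(r[2] for r in rows) / n,
        }
    return report
```

Reporting each tier separately keeps a strong result on stationary targets from masking a failure on occluded or multi-step tasks.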

Full benchmark methodology, per-tier results, and architecture specifications are at research: metric gap.

Model	Gemma 4 E2B
Benchmark	Isaac Sim closed-loop
Lineup	25 VLMs

The 25-VLM benchmark in Closing the Metric Gap was run before Gemma 4 was available. When Google released Gemma 4 E2B, we added it to the same Isaac Sim closed-loop benchmark as a self-contained pilot trial. This note documents the results and draws out the architectural lesson.

Two modes on the same weights

We tested Gemma 4 E2B in two configurations:

End-to-end (E2E): the model receives an egocentric RGB frame and a natural-language goal, and outputs a navigation command directly. This is the configuration every model in the 25-VLM lineup used.
Modular (semantic selector): the model receives the same frame and goal, but outputs only a target object identification — which object in the scene is the goal. A separate metric depth module converts that identification to a 3D coordinate, and a classical planner executes the motion.
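One way to enforce the modular contract is to let the model answer only with the index of a detected candidate and to reject everything else. This is a sketch of that idea, assuming the prompt lists numbered candidates. The function name and the index-only reply format are hypothetical, and treating an invalid reply as "no selection" (so the drone can keep hovering) is an assumption rather than a documented behavior of the pipeline.

```python
import re
from typing import Optional


def parse_selection(reply: str, num_candidates: int) -> Optional[int]:
    """Return the chosen candidate index, or None if the reply is not valid.

    Only a bare non-negative integer is accepted, so any reply containing
    coordinates, distances, or free text is rejected.
    """
    match = re.fullmatch(r"\s*([0-9]+)\s*", reply)
    if match is None:
        return None
    index = int(match.group(1))
    return index if index < num_candidates else None
```

For example, `parse_selection("2", 3)` selects the third candidate, while a reply such as `"fly 4.2 m left"` yields `None`, leaving all metric quantities to the depth module and planner.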

Identical weights. Different architectural role. The difference in navigation outcomes is the most direct illustration of the separation principle we have.

Results

In the end-to-end configuration, Gemma 4 E2B follows the pattern seen across the lineup and underperforms the hover baseline: 47.78 m final error, 17.8 s/step latency, and a 67% collision rate. In the modular configuration, the same weights reach 9.35 m, just ahead of hover's 9.50 m. The model's semantic understanding is not the problem; the metric grounding task is.

We chose Gemma 4 for this comparison because it is a capable, openly available model that many practitioners will have evaluated for their own applications. The finding is not specific to Gemma 4 (we expect the same architectural leverage to hold on any model with decent object recognition), but Gemma 4's wide familiarity makes it a useful reference point.

What this note adds to the full benchmark

The 25-VLM paper characterizes the failure mode and proposes the modular fix. This note shows the fix working on a single specific model in direct side-by-side comparison. The E2E vs. modular comparison on the same weights is a more controlled experiment than comparing across models with different architectures and training histories.

If you want the fuller quantitative picture, read Closing the Metric Gap. If you want the engineering record of how the modular architecture was developed over 18 iterations, read Engineering the Separation Principle.