Astral

Technical report

Closing the Metric Gap: From Diagnosis to Solution in Vision-Language Drone Navigation

TL;DR

  • In a large-scale closed-loop benchmark of 25 vision-language models (VLMs), every model underperformed a hovering baseline.
  • Failures decompose into semantic understanding vs. metric spatial grounding; the metric gap dominates.
  • A modular architecture separating semantics from geometry closes the gap on operational commands.
VLMs tested: 25
Closed-loop flight trials: 10,200
Baseline that won: hover (do nothing)

Abstract

We present a large-scale closed-loop benchmark across 25 VLMs, decompose the observed failures into semantic understanding versus metric spatial grounding, and describe a modular architecture that closes the gap on operational commands while prioritizing collision-free flight.

The most counterintuitive result in this paper is also the most reproducible: across 25 vision-language models and 10,200 closed-loop flight trials, every model we tested scored worse than a drone that simply hovered in place. That is not a criticism of the models. It is a diagnosis of an architectural mismatch.

The benchmark setup

We ran each of 25 VLMs as an end-to-end navigation controller in Isaac Sim. The drone is given a natural-language goal ("fly to the red crate"), shown an egocentric RGB frame, and asked to output a navigation command. We measure whether it reaches the target without collision. The hover baseline does nothing: it outputs zero velocity at every step. Hovering earns nothing for reaching the target, since it never does, but under a collision-penalized scoring scheme it outscored every model we tested because it never crashes.
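To make the hover result concrete, here is a minimal sketch of one possible collision-penalized score. The report does not give its scoring formula in this section, so the weights, the `Trial` type, and the example trial counts below are all hypothetical and chosen only to illustrate the mechanism.

```python
from dataclasses import dataclass

# Hypothetical weights: assumptions for illustration, not the report's values.
SUCCESS_REWARD = 1.0
COLLISION_PENALTY = 1.0


@dataclass(frozen=True)
class Trial:
    reached_target: bool
    collided: bool


def trial_score(trial: Trial) -> float:
    """Reward reaching the target, subtract a penalty for any collision."""
    score = 0.0
    if trial.reached_target:
        score += SUCCESS_REWARD
    if trial.collided:
        score -= COLLISION_PENALTY
    return score


def mean_score(trials: list[Trial]) -> float:
    """Average score over a non-empty list of trials."""
    return sum(trial_score(t) for t in trials) / len(trials)


# Hover outputs zero velocity: it never reaches the target and never collides.
hover_trials = [Trial(reached_target=False, collided=False)] * 10

# Made-up controller outcomes: 2 successes, 5 crashes, 3 timeouts.
controller_trials = (
    [Trial(True, False)] * 2
    + [Trial(False, True)] * 5
    + [Trial(False, False)] * 3
)

hover_mean = mean_score(hover_trials)
controller_mean = mean_score(controller_trials)
print(f"hover: {hover_mean:+.2f}  controller: {controller_mean:+.2f}")
```

Under any scheme of this shape, a controller beats hover only when its reward from successes exceeds its total collision penalty, which is why a do-nothing policy can come out ahead.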

10,200 trials is not a large number by machine-learning standards. It is, however, enough to establish statistical significance across 25 models on a binary outcome. The point is not scale for its own sake. It is that the result held across every model we tried, including the strongest commercial VLMs available at the time of writing.
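As a rough check on that sample size, the sketch below computes a 95% Wilson score interval for a per-model success rate. It assumes the 10,200 trials are split evenly across the 25 models (408 each), which the report does not state here, and the success count of 41 is a made-up example.

```python
import math

TOTAL_TRIALS = 10_200
NUM_MODELS = 25
# Assumption: an even split of trials across models (not stated in the report).
TRIALS_PER_MODEL = TOTAL_TRIALS // NUM_MODELS  # 408


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half_width = (z / denom) * math.sqrt(
        p_hat * (1.0 - p_hat) / n + z * z / (4 * n * n)
    )
    return centre - half_width, centre + half_width


# Hypothetical example: 41 successes out of 408 trials for one model.
low, high = wilson_interval(successes=41, n=TRIALS_PER_MODEL)
print(f"success rate 95% interval: [{low:.3f}, {high:.3f}]")
```

With 408 trials per model, the interval half-width is just under five percentage points in the worst case (a true rate of 50%), so differences between models larger than that are resolvable on a binary outcome.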

Decomposing the failures

When we instrument the failure modes, they split into two categories:

  • Semantic failures: the model misidentifies the target object, misreads the instruction, or generates a plausible but wrong goal. These are the failures VLM researchers usually study and benchmark.
  • Metric grounding failures: the model correctly identifies the target but outputs a navigation command with the wrong scale or direction in metric space, for example a heading off by 30°, or a displacement of 2 meters instead of 0.5 meters. These are smaller errors, but they compound. A 10° heading error per step, accumulating in the same direction, leaves you 180° off course after 18 steps.
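The compounding claim can be checked with a short dead-reckoning sketch. It reads the per-step heading error as 10 degrees (10° x 18 steps = 180°) and assumes the error is a systematic bias that always has the same sign. The 1 m step length and the function names are hypothetical.

```python
import math


def accumulated_heading_error_deg(per_step_error_deg: float, steps: int) -> float:
    """Total heading error when a constant bias is added at every step."""
    return per_step_error_deg * steps


def fly(per_step_error_deg: float, steps: int, step_length_m: float = 1.0):
    """Dead-reckon a drone that intends to fly straight along +x
    but picks up a constant heading bias before each step."""
    x = y = 0.0
    heading_rad = 0.0
    for _ in range(steps):
        heading_rad += math.radians(per_step_error_deg)
        x += step_length_m * math.cos(heading_rad)
        y += step_length_m * math.sin(heading_rad)
    return x, y


error_after_18 = accumulated_heading_error_deg(10.0, 18)
x, y = fly(10.0, 18)
print(f"heading error: {error_after_18:.0f} deg, position: ({x:.2f}, {y:.2f}) m")
```

The intended endpoint is 18 m straight ahead. With the bias, the drone traces a half circle and ends up slightly behind its starting line and more than 11 m to the side, even though no single step looked badly wrong.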

The metric gap is the second category. We call it a gap because it is the distance between what the model knows — rich, accurate semantic understanding of the visual scene — and what the drone needs — a precise metric displacement vector. General-purpose VLMs are not trained to output metric coordinates. They are trained on internet text and images where spatial precision is rarely required.

The modular fix

The architecture that closes the gap separates the two jobs. The VLM answers the semantic question: which object is the target? A metric depth module — RealSense or any depth sensor — answers the metric question: where is that object in 3D space? The planner and safety layers handle motion execution. The VLM never has to output a coordinate; it only has to identify an object in the image.
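The division of labor can be sketched as follows. This is an illustration under stated assumptions, not the report's implementation: the VLM is assumed to return a 2D bounding box, the depth image is assumed to be aligned to the RGB frame, and the camera is modeled as an ideal pinhole with no lens distortion. `Intrinsics`, `deproject`, `locate_target`, and the intrinsic values are hypothetical names and numbers.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical container)."""
    fx: float
    fy: float
    cx: float
    cy: float


def deproject(u: int, v: int, depth_m: float, k: Intrinsics) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) with metric depth into the camera frame
    (x right, y down, z forward), using the pinhole model."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def locate_target(
    bbox: Tuple[int, int, int, int],
    depth_image: Sequence[Sequence[float]],
    k: Intrinsics,
) -> Optional[Tuple[float, float, float]]:
    """Geometry step: turn the VLM's box (u_min, v_min, u_max, v_max)
    into a 3D point by reading depth at the box centre."""
    u = (bbox[0] + bbox[2]) // 2
    v = (bbox[1] + bbox[3]) // 2
    depth_m = depth_image[v][u]  # row index is v, column index is u
    if depth_m <= 0.0:
        return None  # treat non-positive depth as an invalid reading
    return deproject(u, v, depth_m, k)


# Hypothetical 640x480 camera and a flat scene 2 m away.
intrinsics = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
depth = [[2.0] * 640 for _ in range(480)]

# Semantic step (stubbed): pretend the VLM boxed "the red crate" here.
vlm_bbox = (400, 200, 440, 240)

target_xyz = locate_target(vlm_bbox, depth, intrinsics)
print(target_xyz)
```

The VLM contributes only the bounding box. Every metric quantity comes from the depth reading and the camera intrinsics, so an error in the model's sense of scale cannot leak into the displacement handed to the planner. A real system would likely use a more robust depth estimate than a single centre pixel, such as a median over the box.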

In the modular configuration, the same VLM weights that failed as an end-to-end controller achieve competitive navigation success. The Gemma 4 pilot repeats this comparison on a single model and shows the leverage clearly.

The full architecture and its iterative development are documented in Engineering the Separation Principle.