The Metric Gap in Vision-Language Drone Navigation: What Actually Breaks

Vision-language models are excellent at parsing operator intent: object names, relations, negation, and visually grounded phrases. Drone navigation, however, requires something VLMs are not trained to deliver reliably: metric spatial grounding from a moving camera under physical dynamics.

What we measured

In our benchmark work ("Closing the Metric Gap"), we ran thousands of closed-loop quadrotor trials in NVIDIA Isaac Sim across many VLM architectures. The headline empirical pattern is a metric gap: the predicted direction to the target can look good while the predicted distance is wrong by meters. Under replanning, those errors grow instead of canceling out.
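One way to make the metric gap concrete is to split a predicted goal's error into a bearing part and a range part. The sketch below is illustrative only: the function name `goal_error`, the 2D simplification, and the sample coordinates are assumptions for this example, not the benchmark's actual metric code.

```python
import math

def goal_error(drone_xy, true_xy, pred_xy):
    """Split a predicted goal's error into bearing (degrees) and range (meters).

    Hypothetical helper for illustration; works in a flat 2D plane.
    """
    # Vectors from the drone to the true and predicted targets.
    tx, ty = true_xy[0] - drone_xy[0], true_xy[1] - drone_xy[1]
    px, py = pred_xy[0] - drone_xy[0], pred_xy[1] - drone_xy[1]
    # Signed angle from the true direction to the predicted direction.
    bearing_err_deg = math.degrees(math.atan2(tx * py - ty * px, tx * px + ty * py))
    # Positive means the prediction overshoots, negative means it undershoots.
    range_err_m = math.hypot(px, py) - math.hypot(tx, ty)
    return bearing_err_deg, range_err_m

# Made-up numbers: the prediction points exactly at the target but stops 6 m short.
bearing, rng = goal_error((0.0, 0.0), (10.0, 0.0), (4.0, 0.0))
print(f"bearing error: {bearing:.1f} deg, range error: {rng:.1f} m")
```

A prediction like this would score perfectly on any direction-only metric while still leaving the drone meters away from the target, which is the pattern described above.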

Why "end-to-end VLM pilot" fails first

Internet-scale image-text training does not give a model stable pixel-to-meter mapping for a drone camera at operational ranges. That is not a moral failure of VLMs; it is a supervision mismatch. Treating a VLM as a direct coordinate generator therefore stacks the hardest problem on the least appropriate module.
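The pixel-to-meter problem can be seen in the standard pinhole camera model: a pixel fixes only a ray, and a metric position appears only once a depth value is supplied. The following is a minimal sketch assuming an ideal pinhole camera with no lens distortion, depth measured along the optical axis, and made-up intrinsics; the names `Intrinsics` and `backproject` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Intrinsics:
    """Pinhole intrinsics in pixels (illustrative values, not a real calibration)."""
    fx: float
    fy: float
    cx: float
    cy: float

def backproject(u, v, depth_m, K):
    """Map a pixel plus depth along the optical axis to a camera-frame point in meters."""
    x = (u - K.cx) * depth_m / K.fx
    y = (v - K.cy) * depth_m / K.fy
    return (x, y, depth_m)

K = Intrinsics(fx=600.0, fy=600.0, cx=320.0, cy=240.0)

# The same pixel corresponds to very different metric positions at different depths.
near = backproject(380, 240, 5.0, K)
far = backproject(380, 240, 20.0, K)
print(near, far)
```

Nothing in the pixel coordinates alone distinguishes the two results; the depth has to come from somewhere, and image-text pretraining does not supervise it.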

The separation principle

A better engineering contract is: VLMs for semantic target selection, dedicated depth and detection for metric localization, and classical motion planning for feasible, collision-aware motion. The follow-on engineering paper documents how far that modular stack can be pushed, and where the next bottlenecks appear (exploration, multi-step plans, occlusions).
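The contract can be sketched as three narrow interfaces, where the language model only ever returns a label and never a coordinate. Everything below is a hypothetical illustration: the `Pipeline` class and the stub components are assumptions made for this example, the selector is a keyword match standing in for a VLM, the localizer is a lookup table standing in for detection plus depth, and the planner is plain linear interpolation that is not collision-aware.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Point3 = Tuple[float, float, float]

@dataclass
class Pipeline:
    """Hypothetical modular stack: semantics, metric localization, planning."""
    select_target: Callable[[str, List[str]], str]   # VLM role: instruction + labels -> label
    localize: Callable[[str], Point3]                # depth + detection role: label -> meters
    plan: Callable[[Point3, Point3], List[Point3]]   # planner role: start, goal -> waypoints

    def run(self, instruction: str, labels: List[str], start: Point3):
        label = self.select_target(instruction, labels)  # no coordinates at this stage
        goal = self.localize(label)                      # metric grounding happens here
        return label, self.plan(start, goal)

def keyword_selector(instruction: str, labels: List[str]) -> str:
    # Stand-in for a VLM: pick the first candidate label mentioned in the instruction.
    for label in labels:
        if label in instruction.lower():
            return label
    raise ValueError("no candidate label matches the instruction")

def make_table_localizer(positions: Dict[str, Point3]) -> Callable[[str], Point3]:
    # Stand-in for detection + depth: look the label up in a fixed table.
    return lambda label: positions[label]

def straight_line_plan(start: Point3, goal: Point3, steps: int = 4) -> List[Point3]:
    # Placeholder planner: linear interpolation only, NOT collision-aware.
    return [tuple(s + (g - s) * i / steps for s, g in zip(start, goal))
            for i in range(1, steps + 1)]

pipeline = Pipeline(
    select_target=keyword_selector,
    localize=make_table_localizer({"red car": (8.0, 0.0, 2.0), "tree": (3.0, 5.0, 2.0)}),
    plan=straight_line_plan,
)
label, path = pipeline.run("Fly to the red car", ["tree", "red car"], (0.0, 0.0, 2.0))
print(label, path[-1])
```

Because each stage is an ordinary callable, any one of them can be swapped or evaluated in isolation, which is what makes it possible to tell which module is responsible for a failed trial.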

A clean single-model illustration

Our Gemma 4 pilot report deploys the same weights two ways: as an end-to-end goal predictor and, in the modular stack, as a target identifier. The gap between the two is enormous, because the interface between "language" and "geometry" is doing real work.

Read the full paper summaries on Research, and see the runnable simulation setup on Simulation.