Closed-Loop Drone AI Navigation Benchmark
We ran 10,200 closed-loop flight trials across 25 vision-language models in Isaac Sim. Almost every end-to-end model lost to a drone that just hovered, and the best one beat it by only 0.80 m. We published the results, the methodology, the dataset, and the models.
We want to be beaten. Submit your architecture. If you outperform our modular stack, we will put your result at the top of this table and write about it. The goal is an honest leaderboard for the whole community — not a trophy case.
Results
Ranked by mean position error (lower is better). Hover baseline is 9.50 m — the null hypothesis every system must beat to justify moving. Full methodology below.
| # | System | Type | Mean error | Collision rate |
|---|---|---|---|---|
| 1 | Astral Track A (modular) | Modular stack | 1.04 m | 0% |
| 2 | Astral Track A (full benchmark) | Modular stack | 9.98 m | 0% |
| 3 | Hover baseline | Null baseline | 9.50 m | 0% |
| 4 | Gemma 4 E2B (modular) | VLM, modular | 9.35 m | — |
| 5 | Gemini 3 Flash (E2E) | VLM, end-to-end | 8.70 m | — |
| 6 | Qwen 2.5 VL (E2E) | VLM, end-to-end | 38.64 m* | — |
| 7 | Gemma 4 E2B (E2E) | VLM, end-to-end | 47.78 m | 67% |
| 8 | Frontier VLMs (E2E, median) | VLM, end-to-end | ~11–14 m | — |
— = not measured or not applicable for this system. * = closed-loop cumulative error (see notes). All trials run in Isaac Sim unless noted in submission.
Methodology
What we measure
Mean position error (m) — Euclidean distance between the drone's final position and the target object centroid at trial end. Primary metric. Lower is better.
Collision rate (%) — Fraction of trials ending in a physics collision with any environment object.
Directional accuracy — Cosine similarity between predicted heading and ground-truth heading to target on the first step. Separates semantic understanding from metric grounding.
Step-1 prediction error — Position error on the first waypoint output only, before closed-loop compounding.
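The first three metrics above can be sketched in a few lines of Python. This is an illustrative reading of the definitions, not the Astral SDK's scoring code: the `TrialResult` record and the function names are assumptions made for this example, as is the choice to return 0.0 when a heading vector has zero length.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class TrialResult:
    # Hypothetical per-trial record; the SDK's real schema may differ.
    final_pos: Vec3    # drone position at trial end (m)
    target_pos: Vec3   # target object centroid (m)
    collided: bool     # True if the trial ended in a physics collision


def mean_position_error(trials: list[TrialResult]) -> float:
    """Mean Euclidean distance from final position to target centroid."""
    return sum(math.dist(t.final_pos, t.target_pos) for t in trials) / len(trials)


def collision_rate(trials: list[TrialResult]) -> float:
    """Percentage of trials that ended in a collision."""
    return 100.0 * sum(t.collided for t in trials) / len(trials)


def directional_accuracy(predicted: Vec3, ground_truth: Vec3) -> float:
    """Cosine similarity between two heading vectors (assumed 0.0 if either is zero)."""
    norm = math.hypot(*predicted) * math.hypot(*ground_truth)
    if norm == 0.0:
        return 0.0
    dot = sum(p * g for p, g in zip(predicted, ground_truth))
    return dot / norm
```

Because cosine similarity ignores vector length, a model can score well on directional accuracy while still being far off in metres, which is why the two are reported separately.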
Trial structure
The benchmark contains 153 distinct tasks, each repeated across evaluation runs. Tasks are organized into five tiers by difficulty:
- Tier 1 — Stationary targets, unambiguous category name
- Tier 2 — Semantic identification (color, size, type)
- Tier 3 — Visual grounding (relative position)
- Tier 4 — Occluded objects
- Tier 5 — Multi-step reasoning and negation
75% of target objects are not visible from the drone's spawn position, so exploration is required. The environment is an indoor warehouse in Isaac Sim. The drone platform is simulated under a Jetson Orin Nano compute budget.
The hover baseline
The hover baseline outputs zero velocity at every step. It achieves 9.50 m mean error (the average distance from spawn to target across all trials) and 0% collision rate. It is the correct null hypothesis: any system that moves must justify that motion with a reduction in position error. Beating hover is the minimum bar for deployment readiness. In our 25-VLM lineup, the best end-to-end result beat hover by 0.80 m. Most did not beat it at all.
We report hover as rank 3 rather than rank 1 because the Astral modular stack and the Gemma 4 modular result both beat it. Hover is not the goal — it is the floor.
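The arithmetic behind the hover baseline is simple enough to sketch. A hovering drone never leaves its spawn point, so its error on each trial is the spawn-to-target distance. The function names below are hypothetical, and the only number taken from this page is the 9.50 m hover figure.

```python
import math

Vec3 = tuple[float, float, float]

HOVER_MEAN_ERROR_M = 9.50  # hover baseline from the results table


def hover_mean_error(spawn_target_pairs: list[tuple[Vec3, Vec3]]) -> float:
    """Hover error per trial is the spawn-to-target distance; average over trials."""
    return sum(math.dist(s, t) for s, t in spawn_target_pairs) / len(spawn_target_pairs)


def margin_over_hover(mean_error_m: float) -> float:
    """Positive when a system beats hover, negative when moving made things worse."""
    return HOVER_MEAN_ERROR_M - mean_error_m
```

For the best end-to-end result in the table (8.70 m), `margin_over_hover(8.70)` comes out at roughly 0.80 m, the margin quoted above.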
Run it yourself
The benchmark is fully reproducible. You need Isaac Sim, the Astral SDK, and the Yonder evaluation split. Everything else is open source.
1. Prerequisites
- Isaac Sim 4.x — the evaluation environment. Free for research. Download from NVIDIA.
- Astral SDK — the evaluation harness, task definitions, and scoring scripts. github.com/astral-us/astral-sdk.
- Yonder evaluation split — the held-out evaluation scenes. astralhf/yonder on Hugging Face.
- Python 3.10+, CUDA 12.x, 16 GB VRAM minimum (for running VLMs locally).
2. Install
3. Run the baseline
Reproduce our hover baseline and Track A results first to verify your setup matches ours:
4. Plug in your own model
Implement the DronePolicy interface and pass it to the runner. The interface is intentionally minimal — receive a frame and goal string, return a 3D waypoint:
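The SDK's actual signature is not reproduced on this page, so the following is only a guess at its shape based on the description above. The method name `act`, the assumption that a waypoint is an offset in metres relative to the current position, and the toy `rollout` loop (which stands in for the Isaac Sim runner and renders no frames) are all hypothetical.

```python
import math
from typing import Any, Protocol

Vec3 = tuple[float, float, float]


class DronePolicy(Protocol):
    """Hypothetical shape of the interface: frame and goal in, waypoint out."""

    def act(self, frame: Any, goal: str) -> Vec3:
        ...


class HoverPolicy:
    """Null baseline: always returns a zero offset, so the drone stays put."""

    def act(self, frame: Any, goal: str) -> Vec3:
        return (0.0, 0.0, 0.0)


def rollout(policy: DronePolicy, start: Vec3, target: Vec3,
            goal: str, steps: int = 20) -> float:
    """Toy closed loop: apply each waypoint offset, return final position error."""
    pos = start
    for _ in range(steps):
        # No rendering in this sketch, so the policy receives frame=None.
        dx, dy, dz = policy.act(frame=None, goal=goal)
        pos = (pos[0] + dx, pos[1] + dy, pos[2] + dz)
    return math.dist(pos, target)
```

In this toy loop each waypoint is applied on top of the previous one, so any per-step error carries forward. That is the closed-loop compounding that the step-1 metric is meant to separate out.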
5. Compare against published results
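One way to make this comparison is to check your mean position error against the figures in the results table. The sketch below copies those published values into a dictionary; the function names are made up for this example and are not part of the SDK. The "Frontier VLMs (E2E, median)" row is left out because it is reported as a range.

```python
# Mean position error in metres, copied from the results table above.
PUBLISHED_MEAN_ERROR_M = {
    "Astral Track A (modular)": 1.04,
    "Astral Track A (full benchmark)": 9.98,
    "Hover baseline": 9.50,
    "Gemma 4 E2B (modular)": 9.35,
    "Gemini 3 Flash (E2E)": 8.70,
    "Qwen 2.5 VL (E2E)": 38.64,
    "Gemma 4 E2B (E2E)": 47.78,
}


def beats_hover(mean_error_m: float) -> bool:
    """True if the system clears the null baseline."""
    return mean_error_m < PUBLISHED_MEAN_ERROR_M["Hover baseline"]


def systems_beaten(mean_error_m: float) -> list[str]:
    """Published systems with a strictly higher mean error, best first."""
    beaten = [name for name, err in PUBLISHED_MEAN_ERROR_M.items() if mean_error_m < err]
    return sorted(beaten, key=PUBLISHED_MEAN_ERROR_M.get)
```

A mean error alone is not a full comparison. Collision rate and the per-tier breakdown matter as well, and a low error with a high collision rate would not count as a win.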
Known limitations to disclose in submissions
- All trials are in Isaac Sim. Real-world transfer performance is not measured here, and the sim-to-real gap varies by architecture.
- Mock evaluation overstates performance by ~6.5× vs. closed-loop (validated in engineering iteration log). Submissions based on replayed data will not be accepted.
- Tiers 4 and 5 (occluded, multi-step) are unsolved by all current systems including ours. Per-tier breakdown is required in submissions.
Submit your result
We genuinely want to be beaten. If your architecture outperforms ours, we will put your result at the top of the table, link to your paper or repo, and write about what you did differently. The point of this benchmark is to find out what actually works on closed-loop drone navigation — not to defend our own numbers.
Required in your submission
- Results file from benchmark/run.py (JSON)
- Mean position error, collision rate, per-tier breakdown
- System description: architecture type (E2E / modular / other), model(s) used, compute budget
- Confirmation that trials ran closed-loop in Isaac Sim, not replayed data
- Link to code, paper, or write-up (preprint is fine)
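The overall numbers and the per-tier breakdown can be derived from per-trial records along these lines. The JSON schema written by benchmark/run.py is not documented on this page, so the field names here (`tier`, `error_m`, `collided`) and the output layout are assumptions for illustration only.

```python
import json
from collections import defaultdict
from statistics import mean


def _stats(group: list[dict]) -> dict:
    """Aggregate one group of trials into the two headline metrics."""
    return {
        "trials": len(group),
        "mean_position_error_m": round(mean(t["error_m"] for t in group), 2),
        "collision_rate_pct": round(100.0 * sum(t["collided"] for t in group) / len(group), 1),
    }


def summarize(trials: list[dict]) -> dict:
    """Overall metrics plus a per-tier breakdown (hypothetical layout)."""
    by_tier: dict[int, list[dict]] = defaultdict(list)
    for t in trials:
        by_tier[t["tier"]].append(t)
    return {
        "overall": _stats(trials),
        "per_tier": {f"tier_{k}": _stats(v) for k, v in sorted(by_tier.items())},
    }


if __name__ == "__main__":
    # Made-up records, only to show the shape of the summary.
    demo = [
        {"tier": 1, "error_m": 1.0, "collided": False},
        {"tier": 4, "error_m": 10.0, "collided": True},
    ]
    print(json.dumps(summarize(demo), indent=2))
```

Computing the summary from the raw trial records, rather than typing the numbers in by hand, keeps the reported figures consistent with the results file.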
What we do with it
- Verify the results file is consistent with the reported numbers
- Add your result to the table above, credited to your org/team
- If you beat our best result, we write a post about it
- We do not gatekeep negative results — if you tried something and it failed, that is as useful as a win
You keep all rights to your work. We ask only for permission to list your result on this page with a link back to you.
