Open benchmark

Closed-Loop Drone AI Navigation Benchmark

Name: Astral Closed-Loop Drone AI Navigation Benchmark
Creator: Astral
Published: 2025-03-07
License: https://creativecommons.org/licenses/by/4.0/

We ran 10,200 closed-loop flight trials across 25 vision-language models in Isaac Sim. Most end-to-end models couldn't beat a drone that just hovered, and the best end-to-end model beat it by only 0.80 m. We published the results, the methodology, the dataset, and the models.

An open, reproducible standard for closed-loop drone navigation. The methodology, the Yonder dataset, and the scoring code are all public. Reproduce our numbers, extend the task tiers, or run your own architecture through the same harness. We add verified results to the table as they come in.


10,200 closed-loop flight trials

25 VLM architectures tested

1.04 m best result (modular, operational commands)

Results

Systems are scored on mean position error (lower is better). The hover baseline is 9.50 m, the null hypothesis every system must beat to justify moving. Full methodology below.

#	System	Type	Mean error	Collision rate	Notes
1	Astral Track A (modular) [Astral]	Modular stack	1.04 m	0%	Operational commands only. VLM as semantic selector + depth model for metric grounding.
2	Astral Track A (full benchmark) [Astral]	Modular stack	9.98 m	0%	All 153 trials including unsolved task types (occluded, multi-step). Tradeoff vs. hover to maintain 0% collisions.
3	Hover baseline	Null baseline	9.50 m	0%	Zero motion, zero intelligence. Correct null hypothesis: any system that moves must justify it with accuracy gains.
4	Gemma 4 E2B (modular)	VLM — modular	9.35 m	—	Same Gemma 4 weights as semantic target selector inside the modular pipeline. Beats hover.
5	Gemini 3 Flash (E2E)	VLM — end-to-end	8.70 m	—	Best end-to-end result in the 25-VLM lineup. Beats hover by 0.80 m — marginal.
6	Qwen 2.5 VL (E2E)	VLM — end-to-end	38.64 m*	—	*Closed-loop final error. Step-1 error is 8.11 m, but cumulative error reaches 38.64 m — model anchors on initial spatial estimate and compounds.
7	Gemma 4 E2B (E2E)	VLM — end-to-end	47.78 m	67%	17.8 s/step latency. Same weights as rank 4 above — architecture is everything.
8	Frontier VLMs (E2E, median)	VLM — end-to-end	~11–14 m	—	Median result across remaining 22 models in the 25-VLM lineup. All underperform hover on the full benchmark.

— = not measured or not applicable for this system. * = closed-loop cumulative error (see notes). All trials run in Isaac Sim unless noted in submission.
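The gap between a step-1 error and a closed-loop final error can be illustrated with a toy rollout. This is a minimal sketch, not the benchmark harness and not a model of any specific VLM: the rollout function, the 2 m step cap, and both policies are assumptions made up for illustration. The "anchored" policy reuses its first relative estimate at every step, so it flies past the target and its error grows again, while the "tracking" policy re-estimates each step.

```python
import numpy as np


def rollout(policy, start, target, steps, max_step_m=2.0):
    """Closed-loop rollout: each waypoint is relative to the current position."""
    pos = np.asarray(start, dtype=float)
    target = np.asarray(target, dtype=float)
    errors = []
    for step in range(steps):
        waypoint = np.asarray(policy(step, pos, target), dtype=float)
        dist = np.linalg.norm(waypoint)
        if dist > max_step_m:  # assumed cap on distance flown per step
            waypoint = waypoint * (max_step_m / dist)
        pos = pos + waypoint
        errors.append(float(np.linalg.norm(target - pos)))
    return errors


def make_anchored_policy():
    """Toy failure mode: keep returning the step-0 relative estimate."""
    memory = {}

    def policy(step, pos, target):
        if "first" not in memory:
            memory["first"] = target - pos
        return memory["first"]

    return policy


def tracking_policy(step, pos, target):
    """Reference behaviour: re-estimate the relative target every step."""
    return target - pos


# Made-up geometry: target 8 m straight ahead of the spawn point.
start, target = [0.0, 0.0, 1.0], [8.0, 0.0, 1.0]
anchored = rollout(make_anchored_policy(), start, target, steps=10)
tracking = rollout(tracking_policy, start, target, steps=10)
print(anchored[0], anchored[-1])  # 6.0 12.0
print(tracking[0], tracking[-1])  # 6.0 0.0
```

Both policies look identical after one step; only the closed-loop rollout separates them, which is why the table reports final error rather than step-1 error.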

Methodology

What we measure

Mean position error (m) — Euclidean distance between the drone's final position and the target object centroid at trial end. Primary metric. Lower is better.

Collision rate (%) — Fraction of trials ending in a physics collision with any environment object.

Directional accuracy — Cosine similarity between predicted heading and ground-truth heading to target on the first step. Separates semantic understanding from metric grounding.

Step-1 prediction error — Position error on the first waypoint output only, before closed-loop compounding.
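The metric definitions above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the SDK's scoring code: the function names and the per-trial dictionary keys (final_pos, target_pos, collided) are assumptions, and scoring a zero-length heading as 0 is a choice made here.

```python
import numpy as np


def position_error(final_pos, target_pos):
    """Euclidean distance (m) between final drone position and target centroid."""
    diff = np.asarray(final_pos, dtype=float) - np.asarray(target_pos, dtype=float)
    return float(np.linalg.norm(diff))


def directional_accuracy(predicted_heading, true_heading):
    """Cosine similarity between two heading vectors (1.0 = same direction)."""
    p = np.asarray(predicted_heading, dtype=float)
    t = np.asarray(true_heading, dtype=float)
    denom = np.linalg.norm(p) * np.linalg.norm(t)
    if denom == 0.0:
        return 0.0  # assumption: a zero-length heading scores 0
    return float(np.dot(p, t) / denom)


def summarize(trials):
    """trials: list of dicts with hypothetical keys final_pos, target_pos, collided."""
    errors = [position_error(t["final_pos"], t["target_pos"]) for t in trials]
    collisions = [1.0 if t["collided"] else 0.0 for t in trials]
    return {
        "mean_error_m": float(np.mean(errors)),
        "collision_rate_pct": 100.0 * float(np.mean(collisions)),
    }


# Made-up trials for illustration only, not benchmark data.
trials = [
    {"final_pos": [0, 0, 1], "target_pos": [3, 4, 1], "collided": False},
    {"final_pos": [2, 0, 1], "target_pos": [2, 0, 1], "collided": True},
]
summary = summarize(trials)
print(summary)
```

Step-1 prediction error reuses position_error, applied to the position implied by the first waypoint instead of the final position.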

Trial structure

153 distinct task trials, each repeated across evaluation runs. Tasks are organized into five tiers by difficulty:

Tier 1 — Stationary targets, unambiguous category name
Tier 2 — Semantic identification (color, size, type)
Tier 3 — Visual grounding (relative position)
Tier 4 — Occluded objects
Tier 5 — Multi-step reasoning and negation

75% of target objects are not visible from the drone's spawn position — exploration is required. The environment is an indoor warehouse in Isaac Sim. Drone platform: simulated Jetson Orin Nano compute budget.

The hover baseline

The hover baseline outputs zero velocity at every step. It achieves 9.50 m mean error (the average distance from spawn to target across all trials) and 0% collision rate. It is the correct null hypothesis: any system that moves must justify that motion with a reduction in position error. Beating hover is the minimum bar for deployment readiness. In our 25-VLM lineup, the best end-to-end result beat hover by 0.80 m. Most did not beat it at all.
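Because a hovering drone ends where it spawned, its error on each trial is simply the spawn-to-target distance. A minimal sketch of that arithmetic follows; the function names are hypothetical and the coordinates are made up, while the 9.50 m, 8.70 m, and 47.78 m figures are taken from the results table.

```python
import numpy as np


def hover_mean_error(spawn_positions, target_positions):
    """Mean error of a zero-motion policy: the drone ends where it spawned."""
    spawn = np.asarray(spawn_positions, dtype=float)
    target = np.asarray(target_positions, dtype=float)
    return float(np.linalg.norm(target - spawn, axis=1).mean())


def margin_over_hover(system_error_m, hover_error_m):
    """Positive when the system beats hover, negative when it does worse."""
    return hover_error_m - system_error_m


# Made-up coordinates for illustration only, not benchmark data.
spawns = [[0, 0, 1], [0, 0, 1]]
targets = [[6, 8, 1], [0, 5, 1]]
print(hover_mean_error(spawns, targets))         # 7.5

# Margins computed from the published table values.
print(round(margin_over_hover(8.70, 9.50), 2))   # 0.8
print(round(margin_over_hover(47.78, 9.50), 2))  # -38.28
```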

We report hover as rank 3 rather than rank 1 because the Astral modular stack and the Gemma 4 modular result both beat it. Hover is not the goal — it is the floor.


Run it yourself

The benchmark is fully reproducible. You need Isaac Sim, the Astral SDK, and the Yonder evaluation split. Everything else is open source.

1. Prerequisites

Isaac Sim 4.x — the evaluation environment. Free for research. Download from NVIDIA.
Astral SDK — the evaluation harness, task definitions, and scoring scripts. github.com/astral-us/astral-sdk.
Yonder evaluation split — the held-out evaluation scenes. astralhf/yonder on Hugging Face.
Python 3.10+, CUDA 12.x, 16 GB VRAM minimum (for running VLMs locally).

2. Install

# Clone the SDK
git clone https://github.com/astral-us/astral-sdk.git && cd astral-sdk

# Install dependencies
pip install -e ".[benchmark]"

# Download Yonder eval split (~2 GB)
python scripts/download_yonder.py --split eval

3. Run the baseline

Reproduce our hover baseline and Track A results first to verify your setup matches ours:

# Hover baseline (should give ~9.50 m mean error)
python benchmark/run.py --policy hover --trials 153 --output results/hover.json

# Astral Track A modular stack
python benchmark/run.py --policy astral_track_a --trials 153 --output results/track_a.json

# Score and compare
python benchmark/score.py results/hover.json results/track_a.json

4. Plug in your own model

Implement the DronePolicy interface and pass it to the runner. The interface is intentionally minimal — receive a frame and goal string, return a 3D waypoint:

# benchmark/policies/my_policy.py
import numpy as np

from astral.benchmark import DronePolicy, Observation


class MyPolicy(DronePolicy):
    def predict(self, obs: Observation) -> np.ndarray:
        # obs.image: (H, W, 3) uint8 RGB
        # obs.goal:  str ("fly to the red crate")
        # obs.depth: (H, W) float32, metres (if available)
        # return:    [x, y, z] waypoint in drone frame, metres
        # your_model is a placeholder for your own inference call
        waypoint = your_model(obs.image, obs.goal)
        return waypoint

# Run your policy through the harness, writing results for scoring in step 5
python benchmark/run.py --policy my_policy.MyPolicy --trials 153 --output results/my_policy.json
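The results table describes the modular stack as a VLM acting as semantic selector plus a depth model for metric grounding. One way that split could look behind the predict interface is sketched below. This is an assumption-laden illustration, not Astral's Track A implementation: the Observation stand-in only mirrors the fields documented above, select_target_pixel is a hypothetical placeholder for a VLM call, the camera intrinsics are invented, and the camera-to-drone-frame transform is left out because the SDK's frame convention is not specified here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class Observation:
    """Stand-in with the documented fields; the real class lives in the SDK."""
    image: np.ndarray            # (H, W, 3) uint8 RGB
    goal: str
    depth: Optional[np.ndarray]  # (H, W) float32, metres


def select_target_pixel(image: np.ndarray, goal: str) -> Tuple[int, int]:
    """Hypothetical semantic selector. Replace with a VLM call returning (u, v)."""
    h, w = image.shape[:2]
    return w // 2, h // 2  # placeholder: always picks the image centre


def backproject(u: int, v: int, depth_m: float,
                fx: float, fy: float, cx: float, cy: float) -> np.ndarray:
    """Pinhole back-projection of a pixel to a 3D point in the camera frame.

    Assumed camera convention: x right, y down, z forward.
    """
    x = (u - cx) * depth_m / fx
    y = (v - cy) * depth_m / fy
    return np.array([x, y, depth_m], dtype=float)


class ModularPolicy:
    """Would subclass DronePolicy when used with the SDK."""

    def __init__(self, fx: float, fy: float, cx: float, cy: float):
        self.intrinsics = (fx, fy, cx, cy)

    def predict(self, obs: Observation) -> np.ndarray:
        if obs.depth is None:
            return np.zeros(3)  # no metric grounding available: hover
        u, v = select_target_pixel(obs.image, obs.goal)
        return backproject(u, v, float(obs.depth[v, u]), *self.intrinsics)


# Synthetic frame: flat depth of 4 m everywhere, invented intrinsics.
obs = Observation(
    image=np.zeros((480, 640, 3), dtype=np.uint8),
    goal="fly to the red crate",
    depth=np.full((480, 640), 4.0, dtype=np.float32),
)
policy = ModularPolicy(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
waypoint = policy.predict(obs)
print(waypoint)
```

The point of the split is that the language model only has to answer "which pixel", while the metric distance comes from the depth map rather than from the model's own spatial estimate.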

5. Compare against published results

# Download published result files for comparison
python benchmark/score.py results/my_policy.json --compare published

# This prints: mean error, collision rate, per-tier breakdown,
# directional accuracy, and comparison against hover + Track A.

Known limitations to disclose in submissions

All trials are in Isaac Sim. Real-world transfer performance is not measured here, and the sim-to-real gap varies by architecture.
Mock evaluation overstates performance by ~6.5× vs. closed-loop (validated in engineering iteration log). Submissions based on replayed data will not be accepted.
Tiers 4 and 5 (occluded, multi-step) are unsolved by all current systems including ours. Per-tier breakdown is required in submissions.

Submit a result

Run your architecture through the same harness and send us the results file. We verify submissions against the scoring code and add them to the table above, credited to your org or team. Negative results are welcome — if you tried something and it failed, that is as useful as a win.

Required in your submission

Results file from benchmark/run.py (JSON)
Mean position error, collision rate, per-tier breakdown
System description: architecture type (E2E / modular / other), model(s) used, compute budget
Confirmation that trials ran closed-loop in Isaac Sim, not replayed data
Link to code, paper, or write-up (preprint is fine)
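The per-tier breakdown asked for above could be assembled along these lines. This is a hedged sketch only: the layout of the results file written by benchmark/run.py is not documented here, so the per-trial keys (tier, error_m, collided) are hypothetical, and the records are made up rather than real results.

```python
import json
from collections import defaultdict
from statistics import mean


def per_tier_breakdown(trials):
    """Group per-trial records by tier; report mean error and collision rate."""
    by_tier = defaultdict(list)
    for trial in trials:
        by_tier[trial["tier"]].append(trial)
    return {
        tier: {
            "n_trials": len(rows),
            "mean_error_m": mean(r["error_m"] for r in rows),
            "collision_rate_pct": 100.0 * mean(
                1.0 if r["collided"] else 0.0 for r in rows
            ),
        }
        for tier, rows in sorted(by_tier.items())
    }


# Illustrative records only; these are not benchmark results.
trials = [
    {"tier": 1, "error_m": 1.0, "collided": False},
    {"tier": 1, "error_m": 2.0, "collided": False},
    {"tier": 4, "error_m": 9.0, "collided": True},
]
breakdown = per_tier_breakdown(trials)
print(json.dumps(breakdown, indent=2))
```

Reporting tiers separately keeps an unsolved tier from being hidden inside a single averaged number.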

What we do with it

Verify the results file is consistent with the reported numbers
Add your result to the table above, credited to your org/team
List negative results too; we do not gatekeep them

Email hello@astral.us or open an issue on GitHub.

You keep all rights to your work. We ask only for permission to list your result on this page with a link back to you.

Cite this work

If you use the benchmark or the Yonder dataset in your research, please cite:

Benchmark

@misc{astral2025benchmark,
  title  = {Closing the Metric Gap: A Closed-Loop Benchmark for
             Vision-Language Drone Navigation},
  author = {Astral},
  year   = {2025},
  url    = {https://astral.us/benchmark},
  note   = {25 VLMs, 10{,}200 closed-loop Isaac Sim flight trials}
}

Yonder dataset

@dataset{astral2026yonder,
  title     = {Yonder: A 4.65M-Frame Drone-Perspective Indoor
               Navigation Dataset},
  author    = {Astral},
  year      = {2026},
  publisher = {Hugging Face},
  url       = {https://huggingface.co/datasets/astralhf/yonder},
  license   = {CC-BY-NC-4.0}
}

Full results write-up

Why every VLM lost to hovering and the architecture that fixed it

Yonder dataset

4.65M-frame drone navigation dataset used for evaluation

18-iteration engineering log

How the Track A architecture was built, iteration by iteration

Code & SDK

Benchmark runner, SDK, and simulation code on GitHub