Astral
Open benchmark

Closed-Loop Drone AI
Navigation Benchmark

We ran 10,200 closed-loop flight trials across 25 vision-language models in Isaac Sim. Most end-to-end models lost to a drone that just hovered, and the best one beat it by only 0.80 m. We published the results, the methodology, the dataset, and the models.

We want to be beaten. Submit your architecture. If you outperform our modular stack, we will put your result at the top of this table and write about it. The goal is an honest leaderboard for the whole community — not a trophy case.

10,200 closed-loop flight trials
25 VLM architectures tested
1.04 m best result (modular, operational commands)

Results

Ranked by mean position error (lower is better). Hover baseline is 9.50 m — the null hypothesis every system must beat to justify moving. Full methodology below.

#  System                                    Type               Mean error  Collision rate
1  Astral Track A (modular), Astral          Modular stack      1.04 m      0%
2  Astral Track A (full benchmark), Astral   Modular stack      9.98 m      0%
3  Hover baseline                            Null baseline      9.50 m      0%
4  Gemma 4 E2B (modular)                     VLM (modular)      9.35 m      —
5  Gemini 3 Flash (E2E)                      VLM (end-to-end)   8.70 m      —
6  Qwen 2.5 VL (E2E)                         VLM (end-to-end)   38.64 m*    —
7  Gemma 4 E2B (E2E)                         VLM (end-to-end)   47.78 m     67%
8  Frontier VLMs (E2E, median)               VLM (end-to-end)   ~11–14 m    —

— = not measured or not applicable for this system. * = closed-loop cumulative error (see notes). All trials run in Isaac Sim unless noted in submission.
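
One way to read the table is to subtract each system's mean error from the 9.50 m hover baseline. The sketch below does that for the single-valued rows only; the function and variable names are illustrative and are not part of the Astral SDK.

```python
# Mean error in metres, copied from the results table above.
HOVER_M = 9.50
results_m = {
    "Astral Track A (modular)": 1.04,
    "Astral Track A (full benchmark)": 9.98,
    "Gemma 4 E2B (modular)": 9.35,
    "Gemini 3 Flash (E2E)": 8.70,
    "Qwen 2.5 VL (E2E)": 38.64,
    "Gemma 4 E2B (E2E)": 47.78,
}


def margin_over_hover(mean_error_m, hover_m=HOVER_M):
    """Positive means the system ends closer to the target than hovering."""
    return round(hover_m - mean_error_m, 2)


margins = {name: margin_over_hover(err) for name, err in results_m.items()}
beats_hover = sorted(name for name, m in margins.items() if m > 0)

# Print systems from largest to smallest margin over hover.
for name, m in sorted(margins.items(), key=lambda kv: -kv[1]):
    print(f"{name:<34} {m:+.2f} m vs hover")
```

A positive margin means the system justified moving; a negative one means it would have done better by staying put.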

Methodology

What we measure

Mean position error (m) — Euclidean distance between the drone's final position and the target object centroid at trial end. Primary metric. Lower is better.

Collision rate (%) — Fraction of trials ending in a physics collision with any environment object.

Directional accuracy — Cosine similarity between predicted heading and ground-truth heading to target on the first step. Separates semantic understanding from metric grounding.

Step-1 prediction error — Position error on the first waypoint output only, before closed-loop compounding.
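
The definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the scoring code in benchmark/score.py: the function names are hypothetical, positions and headings are plain (x, y, z) tuples, and treating a zero-length heading as a score of 0 is an assumption. Step-1 prediction error would presumably reuse the same distance function on the first waypoint only.

```python
import math


def position_error(final_pos, target_centroid):
    """Euclidean distance in metres between final position and target centroid."""
    return math.dist(final_pos, target_centroid)


def collision_rate(collided_flags):
    """Percentage of trials that ended in a collision."""
    flags = list(collided_flags)
    return 100.0 * sum(bool(f) for f in flags) / len(flags)


def directional_accuracy(predicted_heading, true_heading):
    """Cosine similarity between two heading vectors (1.0 = same direction)."""
    dot = sum(p * t for p, t in zip(predicted_heading, true_heading))
    denom = math.hypot(*predicted_heading) * math.hypot(*true_heading)
    # Assumption: a zero-length heading carries no direction and scores 0.
    return dot / denom if denom > 0 else 0.0


# Two toy trials with made-up coordinates.
finals = [(3.0, 4.0, 0.0), (0.0, 0.0, 1.0)]
targets = [(0.0, 0.0, 0.0), (0.0, 0.0, 3.0)]
errors = [position_error(f, t) for f, t in zip(finals, targets)]
mean_error = sum(errors) / len(errors)
```

Directional accuracy depends only on the direction of the heading, not its length, which is why it can be high even when the metric error is large.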

Trial structure

The benchmark has 153 distinct task trials, each repeated across evaluation runs. Tasks are organized into five tiers by difficulty:

  • Tier 1 — Stationary targets, unambiguous category name
  • Tier 2 — Semantic identification (color, size, type)
  • Tier 3 — Visual grounding (relative position)
  • Tier 4 — Occluded objects
  • Tier 5 — Multi-step reasoning and negation

75% of target objects are not visible from the drone's spawn position, so exploration is required. The environment is an indoor warehouse in Isaac Sim, and the drone runs under a simulated Jetson Orin Nano compute budget.

The hover baseline

The hover baseline outputs zero velocity at every step. It achieves 9.50 m mean error (the average distance from spawn to target across all trials) and 0% collision rate. It is the correct null hypothesis: any system that moves must justify that motion with a reduction in position error. Beating hover is the minimum bar for deployment readiness. In our 25-VLM lineup, the best end-to-end result beat hover by 0.80 m. Most did not beat it at all.

We report hover as rank 3 rather than rank 1 because the Astral modular stack and the Gemma 4 modular result both beat it. Hover is not the goal — it is the floor.
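
To see why hover's error equals the spawn-to-target distance, consider a toy rollout. This sketch uses simple Euler integration with no physics and made-up coordinates; `hover_policy` and `rollout` are illustrative names, not the SDK's hover policy.

```python
import math


def hover_policy(_observation=None):
    """Zero velocity command at every step."""
    return (0.0, 0.0, 0.0)


def rollout(spawn, policy, steps=50, dt=0.1):
    """Integrate velocity commands with Euler steps (toy kinematics only)."""
    pos = list(spawn)
    for _ in range(steps):
        velocity = policy(None)
        pos = [p + dt * v for p, v in zip(pos, velocity)]
    return tuple(pos)


# (spawn, target) pairs in metres; values are invented for illustration.
trials = [
    ((0.0, 0.0, 1.0), (6.0, 8.0, 1.0)),
    ((2.0, 0.0, 1.0), (2.0, 5.0, 1.0)),
]

hover_errors = [math.dist(rollout(spawn, hover_policy), target) for spawn, target in trials]
hover_mean = sum(hover_errors) / len(hover_errors)
```

Because the drone never leaves its spawn point, each trial's error is exactly the initial distance to the target, and the mean of those distances is the bar any moving policy has to clear.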

Run it yourself

The benchmark is fully reproducible. You need Isaac Sim, the Astral SDK, and the Yonder evaluation split. Everything else is open source.

1. Prerequisites

2. Install

# Clone the SDK
git clone https://github.com/astral-us/astral-sdk.git && cd astral-sdk
# Install dependencies
pip install -e ".[benchmark]"
# Download Yonder eval split (~2 GB)
python scripts/download_yonder.py --split eval

3. Run the baseline

Reproduce our hover baseline and Track A results first to verify your setup matches ours:

# Hover baseline (should give ~9.50 m mean error)
python benchmark/run.py --policy hover --trials 153 --output results/hover.json
# Astral Track A modular stack
python benchmark/run.py --policy astral_track_a --trials 153 --output results/track_a.json
# Score and compare
python benchmark/score.py results/hover.json results/track_a.json

4. Plug in your own model

Implement the DronePolicy interface and pass it to the runner. The interface is intentionally minimal — receive a frame and goal string, return a 3D waypoint:

# benchmark/policies/my_policy.py
import numpy as np

from astral.benchmark import DronePolicy, Observation


class MyPolicy(DronePolicy):
    def predict(self, obs: Observation) -> np.ndarray:
        # obs.image: (H, W, 3) uint8 RGB
        # obs.goal: str ("fly to the red crate")
        # obs.depth: (H, W) float32, metres (if available)
        # return: [x, y, z] waypoint in drone frame, metres
        # your_model is a placeholder for your own inference call
        waypoint = your_model(obs.image, obs.goal)
        return waypoint

# Run your policy and write the results file used in the next step
python benchmark/run.py --policy my_policy.MyPolicy --trials 153 --output results/my_policy.json

5. Compare against published results

# Download published result files for comparison
python benchmark/score.py results/my_policy.json --compare published
# This prints: mean error, collision rate, per-tier breakdown,
# directional accuracy, and comparison against hover + Track A.

Known limitations to disclose in submissions

  • All trials are in Isaac Sim. Real-world transfer performance is not measured here and sim-to-real gap varies by architecture.
  • Mock evaluation overstates performance by ~6.5× vs. closed-loop (validated in engineering iteration log). Submissions based on replayed data will not be accepted.
  • Tiers 4 and 5 (occluded, multi-step) are unsolved by all current systems including ours. Per-tier breakdown is required in submissions.
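
A per-tier breakdown can be computed by grouping per-trial records by tier. The sketch below is illustrative only: the `tier`, `error_m`, and `collided` field names are assumptions, and the actual JSON written by benchmark/run.py may be structured differently.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-trial records with invented values.
trials = [
    {"tier": 1, "error_m": 1.0, "collided": False},
    {"tier": 1, "error_m": 3.0, "collided": False},
    {"tier": 4, "error_m": 12.0, "collided": True},
    {"tier": 5, "error_m": 10.0, "collided": False},
]


def per_tier_breakdown(records):
    """Group trials by tier and report count, mean error, and collision rate."""
    by_tier = defaultdict(list)
    for record in records:
        by_tier[record["tier"]].append(record)
    return {
        tier: {
            "n": len(rs),
            "mean_error_m": mean(r["error_m"] for r in rs),
            "collision_rate_pct": 100.0 * sum(r["collided"] for r in rs) / len(rs),
        }
        for tier, rs in sorted(by_tier.items())
    }


breakdown = per_tier_breakdown(trials)
for tier, stats in breakdown.items():
    print(f"Tier {tier}: {stats}")
```

Reporting tiers separately keeps a strong Tier 1 result from hiding failures on the occluded and multi-step tiers.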

Submit your result

We genuinely want to be beaten. If your architecture outperforms ours, we will put your result at the top of the table, link to your paper or repo, and write about what you did differently. The point of this benchmark is to find out what actually works on closed-loop drone navigation — not to defend our own numbers.

Required in your submission

  • Results file from benchmark/run.py (JSON)
  • Mean position error, collision rate, per-tier breakdown
  • System description: architecture type (E2E / modular / other), model(s) used, compute budget
  • Confirmation that trials ran closed-loop in Isaac Sim, not replayed data
  • Link to code, paper, or write-up (preprint is fine)

What we do with it

  • Verify the results file is consistent with the reported numbers
  • Add your result to the table above, credited to your org/team
  • If you beat our best result, we write a post about it
  • We do not gatekeep negative results — if you tried something and it failed, that is as useful as a win
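
Before submitting, it may help to confirm that your reported numbers match your own results file. A check of this kind could be as simple as recomputing the headline metrics from the per-trial records. The sketch below is illustrative only: the `trials`, `error_m`, and `collided` field names and the 0.01 tolerance are assumptions, not the schema or procedure used on this page.

```python
import json
from statistics import mean


def summarize(results_json):
    """Recompute headline metrics from per-trial records (hypothetical schema)."""
    trials = json.loads(results_json)["trials"]
    return {
        "mean_error_m": mean(t["error_m"] for t in trials),
        "collision_rate_pct": 100.0 * sum(t["collided"] for t in trials) / len(trials),
    }


def is_consistent(results_json, reported, tol=0.01):
    """True if every reported metric is within tol of the recomputed value."""
    recomputed = summarize(results_json)
    return all(abs(recomputed[key] - reported[key]) <= tol for key in reported)


# Invented two-trial results file for illustration.
example = json.dumps({"trials": [
    {"error_m": 2.0, "collided": False},
    {"error_m": 4.0, "collided": True},
]})

ok = is_consistent(example, {"mean_error_m": 3.0, "collision_rate_pct": 50.0})
bad = is_consistent(example, {"mean_error_m": 1.0, "collision_rate_pct": 50.0})
```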

You keep all rights to your work. We ask only for permission to list your result on this page with a link back to you.