Astral

Technical note

Gemma 4 E2B as an End-to-End Drone Navigation Controller: A Pilot Trial in the 25-VLM Lineup

TL;DR

  • Adds Gemma 4 E2B to the 25-VLM closed-loop Isaac Sim benchmark.
  • Compares end-to-end goal prediction against modular deployment of the same weights as a semantic target selector.
  • Illustrates the leverage of the separation principle on a single model.
Model: Gemma 4 E2B
Benchmark: Isaac Sim closed-loop
Lineup: 25 VLMs

Abstract

This note adds Gemma 4 E2B to the Isaac Sim closed-loop benchmark used for the 25-VLM lineup. It compares end-to-end goal prediction against modular deployment of the same weights as a semantic target selector, illustrating the leverage of the separation principle.

The 25-VLM benchmark in Closing the Metric Gap was run before Gemma 4 was available. When Google released Gemma 4 E2B, we added it to the same Isaac Sim closed-loop benchmark as a self-contained pilot trial. This note documents the results and draws out the architectural lesson.

Two modes on the same weights

We tested Gemma 4 E2B in two configurations:

  • End-to-end (E2E): the model receives an egocentric RGB frame and a natural-language goal, and outputs a navigation command directly. This is the configuration every model in the 25-VLM lineup used.
  • Modular (semantic selector): the model receives the same frame and goal, but outputs only a target identification, that is, which object in the scene is the goal. A separate metric depth module converts that identification to a 3D coordinate, and a classical planner executes the motion.

Identical weights. Different architectural role. The difference in navigation outcomes is the most direct illustration of the separation principle we have.
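The modular role described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the pipeline used in the benchmark: the note does not specify how the selector reports its target or how the depth module is queried, so the pixel-coordinate output, the `select_target` and `depth_at` callables, the `Intrinsics` type, and the pinhole camera model are all hypothetical choices made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Tuple

Pixel = Tuple[float, float]
Point3D = Tuple[float, float, float]


@dataclass(frozen=True)
class Intrinsics:
    """Pinhole camera intrinsics in pixels (hypothetical type for this sketch)."""
    fx: float
    fy: float
    cx: float
    cy: float


def backproject(u: float, v: float, depth_m: float, k: Intrinsics) -> Point3D:
    """Standard pinhole back-projection of a pixel plus metric depth
    into a 3D point in the camera frame (x right, y down, z forward)."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def modular_goal(
    select_target: Callable[[Any, str], Pixel],    # VLM role: semantics only
    depth_at: Callable[[Any, float, float], float],  # separate metric depth module
    frame: Any,
    goal_text: str,
    k: Intrinsics,
) -> Point3D:
    """Assumed modular flow: the VLM picks the target, the depth module
    supplies the metric distance, and geometry produces the 3D goal
    that a classical planner would then consume."""
    u, v = select_target(frame, goal_text)
    z = depth_at(frame, u, v)
    return backproject(u, v, z, k)


# Usage with stand-in components (no real model or depth sensor involved).
k = Intrinsics(fx=500.0, fy=500.0, cx=320.0, cy=240.0)
goal = modular_goal(
    select_target=lambda frame, text: (420.0, 240.0),  # stub: target pixel
    depth_at=lambda frame, u, v: 5.0,                   # stub: 5 m away
    frame=None,
    goal_text="fly to the red car",
    k=k,
)
print(goal)  # (1.0, 0.0, 5.0)
```

In this sketch the VLM never emits a metric quantity. In the end-to-end configuration the same model would instead have to produce the navigation command itself, with no depth module or geometry step to carry the metric grounding.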

Results

In the end-to-end configuration, Gemma 4 E2B follows the pattern of every other model in the lineup — it underperforms the hover baseline. In the modular configuration, the same weights achieve competitive navigation success. The model's semantic understanding is not the problem; the metric grounding task is.

We chose Gemma 4 for this comparison because it is a capable, openly available model that many practitioners will have evaluated for their own applications. The finding is not specific to Gemma 4: we expect the same architectural leverage to hold on any model with decent object recognition. Gemma 4's wide familiarity simply makes it a useful reference point.

What this note adds to the full benchmark

The 25-VLM paper characterizes the failure mode and proposes the modular fix. This note shows the fix working on a single model in direct side-by-side comparison. Because the weights are held fixed and only the architectural role changes, the E2E vs. modular comparison is a more controlled experiment than comparing across models with different architectures and training histories.

If you want the fuller quantitative picture, read Closing the Metric Gap. If you want the engineering record of how the modular architecture was developed over 18 iterations, read Engineering the Separation Principle.