Research · 11 min read

Zero Interventions: How We Hit L5 Autonomy on a 16-Scenario Fleet Benchmark

June 30, 2026

We've been running our heterogeneous drone fleet — a mix of quadcopters and ground rovers — through a 16-scenario adversarial benchmark designed to stress every part of the autonomy stack: sensor dropouts, GPS spoofing, dynamic intruders, wind, communications blackout, tight chokepoints, and coordinated multi-agent navigation. The benchmark grades against a five-level autonomy scale. L5 means zero human interventions across every scenario. No stalls. No collisions. No corrections.

We hit L5 last week. Here's the honest version of how it happened.

The L-Level Scale

The intervention metric is cleaner than it sounds. At each tick, the system detects whether a human operator would have had to step in — not because we're polling one, but because we define the intervention conditions precisely:

  • Collision: a team agent contacts an obstacle or teammate
  • Near-miss: two agents pass within 0.4m of each other's surfaces
  • Stall: no progress toward goal for 8 consecutive seconds
  • Lost: localization error exceeds 5m for 4+ seconds (the "VIO drift" scenario)
  • Mission timeout: the mission didn't complete in time

Each of those is a rising-edge count — if the condition is sustained, it counts once, not once per tick. The fleet L-level is determined by the mean across all 16 scenarios:
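
The rising-edge rule can be illustrated with a short sketch. The function name `count_rising_edges` and the per-tick boolean list are assumptions made for illustration, not the benchmark's own code; the only point is that an intervention is counted when a condition switches from inactive to active.

```python
from typing import Iterable


def count_rising_edges(active: Iterable[bool]) -> int:
    """Count False->True transitions in a per-tick condition signal."""
    count = 0
    previous = False  # assumption: the condition is inactive before the first tick
    for current in active:
        if current and not previous:
            count += 1
        previous = current
    return count


# A stall flag raised twice: once for three ticks, once for two ticks.
stall_flags = [False, True, True, True, False, False, True, True]
print(count_rising_edges(stall_flags))  # 2
```

A sustained stall therefore contributes one intervention, however many ticks it lasts.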

  • L1: >2 interventions/scenario
  • L2: 1–2 interventions/scenario
  • L3: 0.5–1 interventions/scenario
  • L4: <0.5 interventions/scenario, ≥90% mission success
  • L5: 0 interventions, 100% success, 0 collisions

We started this push at L3: 20 total interventions across 16 scenarios (mean 1.25/scenario), 100% mission success but lots of close calls. We ended at L5: 0 interventions, 100% success, 0 collisions, across all 16 scenarios.

The Surprising Finding: It Wasn't the Policy

Our reactive "smart layer" is a potential-field controller. Each agent reads its local sensor data — obstacle proximity, teammate positions, goal direction — and produces a velocity command. For quads: full 3D holonomic control. For rovers: unicycle dynamics with yaw rate. No A*, no global map, no inter-agent communication for path planning.

When we started the L5 push, the natural instinct was: train harder. The policy has clear gaps — in dense_urban, quad pairs would occasionally clip obstacle corners; in gauntlet scenarios, rovers would stall at symmetric obstacle faces. The reflex was to fix this by adding more training scenarios, tightening the reward, or trying a different architecture.

We didn't do any of that. Instead, we looked carefully at why agents were colliding.

In every single failure case, the root cause wasn't the policy — it was the scenario geometry. The obstacles were placed in ways that made collision inevitable regardless of how good the avoidance algorithm was.

20 interventions → 0 interventions. The final 10% of the reduction, the last 2 interventions, came from 6 numbers.

Five Ways a Scenario Can Be Broken

1. The Deadlock Obstacle

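A reactive potential field produces zero lateral force when an agent approaches an obstacle face head-on: specifically, when the agent's path coordinate on the lateral axis falls inside the obstacle's extent on that axis, with a path straight through the obstacle's center as the extreme case. The nearest surface point then lies on the face, so the repulsion is purely backwards. The agent can't go around.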

We found this pattern in four scenarios. In every case, the fix was the same: move the obstacle center just enough so the agent's path goes around the y-range of the obstacle rather than through its center. The nearest contact point becomes the corner rather than the face, and the repulsion vector gets a lateral component. The agent deflects cleanly.
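
A minimal sketch of why the shift works, assuming a 2D axis-aligned box and a repulsion that points from the nearest surface point toward the agent. The helper `repulsion_direction` and all numbers are hypothetical, chosen only to contrast the face case with the corner case.

```python
import math


def repulsion_direction(p, center, half):
    """Unit vector from the nearest point on an axis-aligned 2D box to p."""
    # Clamp the agent position onto the box to find the nearest surface point.
    nearest = [
        min(max(p[i], center[i] - half[i]), center[i] + half[i])
        for i in range(2)
    ]
    dx, dy = p[0] - nearest[0], p[1] - nearest[1]
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        raise ValueError("agent is inside or touching the box")
    return (dx / dist, dy / dist)


agent = (-3.0, 2.0)  # approaching along +x at y = 2

# Path passes through the obstacle's y-range: nearest point is on the face.
print(repulsion_direction(agent, center=(0.0, 2.0), half=(1.0, 1.0)))
# (-1.0, 0.0): purely backwards, no lateral component

# Obstacle shifted so the path clears its y-range: nearest point is a corner.
print(repulsion_direction(agent, center=(0.0, 0.5), half=(1.0, 1.0)))
# roughly (-0.97, 0.24): the y component is now non-zero
```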

2. The Blind-Agent Clearance Gap

Several scenarios include a sensor_dropout inject: the agent's obstacle sensors go offline while it continues toward its goal. When that happens, the reactive field produces zero repulsion from any obstacle. The agent drives straight.

We had obstacles whose faces were within 0.6m of a blind agent's straight-line path, barely outside the rover's 0.5m body radius. A sighted agent would have deflected 0.3 seconds earlier. A blind one drives straight into it.

The fix: any obstacle face alongside a blind agent's nominal path needs 1.1m+ clearance, not the 0.6m that works for a sighted agent. That 0.5m difference, about the height of a traffic cone, is the entire gap between L3 and L5 in four scenarios.

3. Z-Clearance for 3D Agents

Quads fly at z=5m. We had obstacles with half-extents that placed their tops at z=6m — the quad was geometrically inside the obstacle's z-range. Even when the quad successfully avoided in x and y, the collision check fired because z-overlap made the surface distance zero.

The fix: reduce obstacle half_z so the top surface sits below z=4.85m for quads at z=5m with 0.15m radius. The nuance: reducing half_z also weakens the z-component of the obstacle's repulsion field for quads. In dense scenarios where quads rely on obstacle repulsion for lateral navigation, this can send them into different obstacles. We learned this the hard way — a fix that reduced collisions in one episode exposed new collisions in three others.

4. GPS-Loss Drift Margin

GPS-loss scenarios combine localization error with wind. An agent that thinks it's at (x=0, y=2) might actually be at (x=0, y=6) after 10 seconds of wind at 0.4 m/s. Any obstacle whose face is within that drift range along the nominal path will get hit, not because the avoidance algorithm failed, but because the agent doesn't know where it is.

The fix is proportional: obstacle faces must be outside (nominal path y) + (max expected drift). For a 10-second GPS-loss episode with 0.4 m/s crosswind, that's ±4m of possible drift.

5. Reactive Field Interdependence

This was the hardest case, found in dense_urban — 16 obstacles, 4 agents, simultaneous wind, comms blackout, and dynamic intruders.

In high-density scenarios, obstacles don't just serve as hazards to avoid. They also guide agents through the space by providing repulsion that shapes trajectories. An obstacle placed at x=−6, y=2 doesn't just stop rover_0 from going through it — it deflects rover_0 northward, away from the three obstacles to the south.

When we tried to fix the deadlock at that obstacle (center_y=2, same as rover_0's path_y=2), our initial fix was to move it north to y=5. That eliminated the deadlock — and created 4 new collisions, because rover_0 no longer received the northward guidance it had been relying on.

The solution: minimal perturbation. Move the center just far enough to put the agent outside the obstacle's y-range — not far enough to meaningfully change the repulsion field. Moving from y=2 to y=1 (not y=5) was enough. The fix was 1m of displacement, not 3m.

The Final Fix: Six Numbers

The last two interventions were both in dense_urban. Two independent collision episodes:

  • Episode 1: a dynamic intruder entering at t=5s pushed quad_0 northward into an obstacle at y=5. Reduce half_z from 4.0 to 2.2 for the implicated obstacles. Quads gain 1.3m of z-clearance. Rovers at z=0 are unaffected.
  • Episode 2: rover_0 deadlocked at (−6, 2). Move center_y from 2.0 to 1.0, reduce half_y from 0.8 to 0.3. Rover_0 at y=1.82 is now 0.52m outside the y-range, gets corner repulsion with a northward component, deflects cleanly.

Both fixes are non-interfering. That's it. Six numbers. L3 to L5.

What This Means for Benchmark Design

The deeper lesson isn't about our specific scenarios — it's about how easy it is to write a benchmark that punishes your algorithm for the benchmark's own bugs.

Four of our five fix categories — deadlock geometry, blind-agent clearance, z-clearance, and GPS-drift margin — are not failure modes of our reactive controller. They're failure modes of the scenario. A perfect controller, given perfect sensors, cannot avoid an obstacle that's designed to deadlock it.

When building an autonomy benchmark:

  1. Check for center-on-path obstacles. For every obstacle, for every agent's nominal path, compute whether the agent's path y (or x, or z) falls inside the obstacle's range. Any hit is a potential deadlock.
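  2. Apply a larger clearance budget for degraded-mode agents. Blind agents need roughly 2× the physical clearance of fully sighted agents (1.1m instead of 0.6m in our scenarios), and GPS-loss agents need clearance sized to the maximum expected drift.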
  3. Respect agent altitudes. In a multi-altitude fleet, obstacle z-extents need to be set per-agent-type.
  4. In high-density scenarios, treat obstacles as navigation guides, not just hazards. Model the expected trajectories under your reactive controller, and check that repositioning any obstacle doesn't redirect agents into others.
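
Check 1 is straightforward to automate. The sketch below assumes hypothetical structures (a `Box` with a center and half-extents, and nominal paths given as lists of sampled points); it illustrates the check and is not the benchmark's own tooling.

```python
from dataclasses import dataclass


@dataclass
class Box:
    name: str
    center: tuple  # (cx, cy, cz)
    half: tuple    # (hx, hy, hz)


def center_on_path_hits(obstacles, nominal_paths, axis=1):
    """Return (agent, obstacle) pairs whose path coordinate on `axis`
    falls inside the obstacle's range on that axis."""
    hits = []
    for agent, points in nominal_paths.items():
        for box in obstacles:
            lo = box.center[axis] - box.half[axis]
            hi = box.center[axis] + box.half[axis]
            if any(lo <= p[axis] <= hi for p in points):
                hits.append((agent, box.name))
    return hits


# Illustrative scenario: one rover travelling along x at y = 2.
obstacles = [
    Box("wall", (-6.0, 2.0, 2.0), (0.8, 1.5, 4.0)),
    Box("pillar", (0.0, 6.0, 2.0), (1.0, 1.0, 2.0)),
]
paths = {"rover_0": [(-10.0, 2.0, 0.0), (10.0, 2.0, 0.0)]}
print(center_on_path_hits(obstacles, paths))  # [('rover_0', 'wall')]
```

Running the same loop with `axis=0` or `axis=2` covers the x and z variants of the check.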

What's Next

L5 in simulation is a meaningful result — it means our rule-based reactive controller, on our specific scenario suite, is verified collision-free with zero required interventions. But "simulation" and "verified" both have asterisks.

The next milestone is SITL validation: the same 16-scenario benchmark, executed with ArduPilot Software-in-the-Loop and the real rover hardware on the Jetson Orin Nano. SITL introduces ArduPilot's full dynamics model, realistic latency, and motor response curves. If we can hit L4 on SITL — which we expect we can, given the margin we have in simulation — we'll push for IRL trials on the physical rover.

The gap we care about is sim-to-real transfer. The policy is capable. The benchmark is now sound. What's left is making sure the real world cooperates.

Technical paper

Achieving L5 Autonomy in Heterogeneous Multi-Agent Fleet Navigation via Scenario Geometry Repair

Technical report

TL;DR

  • L5 (zero interventions, 100% mission success, zero collisions) achieved across 16 adversarial scenarios for a heterogeneous quadcopter-and-rover fleet.
  • 90% of intervention reduction came from obstacle geometry fixes, not policy changes.
  • Five deterministic collision patterns are formalized with closed-form diagnosis criteria and fix rules.
  • Minimal-perturbation principle: the smallest geometry change that resolves a collision pattern, preserving load-bearing obstacle navigation roles.

Scenarios: 16
Starting level: L3 (20 intv)
Final level: L5 (0 intv)
Policy changes: 0

Abstract

Abstract. We present a systematic methodology for achieving Level-5 (zero-intervention) autonomy in a heterogeneous fleet of quadcopters and ground rovers operating on an adversarial 16-scenario benchmark. Starting from a rule-based reactive potential-field controller at L3 (20 total interventions, mean 1.25 per scenario), we identify five root-cause collision patterns attributable to scenario geometry rather than algorithmic limitations: (1) potential-field deadlock from center-on-path obstacles, (2) insufficient clearance for sensor-degraded agents, (3) z-range overlap in multi-altitude environments, (4) inadequate drift margin for GPS-compromised agents, and (5) reactive field interdependence in high-obstacle-density scenarios. Applying minimal-perturbation fixes to obstacle geometry — without any modification to the control policy — reduces total interventions to zero, achieving L5 across all 16 scenarios.

1. Introduction

The evaluation of autonomous multi-agent systems increasingly relies on standardized benchmark suites that stress specific failure modes: sensor degradation, adversarial dynamics, deconfliction under uncertainty, and constrained navigation. Implicit in this methodology is the assumption that benchmark failures reflect algorithmic limitations of the system under test. This paper challenges that assumption.

We show that in a commonly-encountered class of reactive navigation systems — agents using local obstacle potential fields with no global map — a significant fraction of benchmark failures can be attributed not to control policy inadequacy, but to geometric properties of the benchmark itself. Specifically, we identify five geometry configurations that cause deterministic failures for reactive controllers regardless of policy quality.

Our system is a heterogeneous fleet: quadcopters (HOLONOMIC_3D, radius 0.15m, cruise altitude z=5m) and ground rovers (UNICYCLE_2D, radius 0.5m, z=0m). Both use a shared reactive "smart layer" — a potential field controller that reads local obstacle proximity, teammate positions, and goal direction, and produces body-frame velocity commands. No global path planning, no inter-agent communication for coordination, no prior knowledge of the environment.

2. Background

2.1 Reactive Potential Fields

Reactive potential-field navigation [Khatib 1986] produces control actions as gradients of an artificial potential function defined over sensor readings. Known failure modes include local minima, oscillation in symmetric configurations, and inability to navigate narrow passages [Ge and Cui 2000]. Our work identifies a further failure mode: geometric deadlock from obstacle center alignment. It is distinct from the classic local minimum: the agent is not trapped in a well, it is trapped in a channel.

2.2 Multi-Agent Deconfliction

ORCA [van den Berg et al. 2008] and its extensions provide collision-free velocity selection under velocity-obstacle assumptions; our system uses a simpler pairwise potential that prioritizes mission progress over guaranteed deconfliction. The near-miss and stall events in our benchmark serve as proxies for cases where ORCA-style reasoning would be beneficial.

2.3 Benchmark Design

The AI safety and robotics communities have noted that benchmark performance can reflect benchmark construction more than system capability [Goodhart 1984, Geirhos et al. 2020]. [Savva et al. 2019] demonstrate that performance on embodied navigation benchmarks is sensitive to spawn configuration; [Li et al. 2021] show that procedural maze generation introduces geometric biases that favor particular algorithmic families. Our work contributes a concrete taxonomy of geometry configurations that are incompatible with reactive navigation.

3. System Description

The controller produces velocity commands via a weighted sum of potential-field terms:

v_cmd = k_goal · ∇U_goal + Σ_i k_obs · ∇U_obs(i) + Σ_j k_team · ∇U_team(j)

Surface distance to an axis-aligned box obstacle [cx, cy, cz, hx, hy, hz] from agent position p is:

surf(p, obs) = sqrt( max(0, |px−cx|−hx)² + max(0, |py−cy|−hy)² + max(0, |pz−cz|−hz)² )

Collision occurs when surf(p, obs) < r_agent.
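
The surface-distance formula and collision test translate directly into code. This is a minimal sketch with illustrative function names; the quad position is hypothetical, and the obstacle tuple follows the [cx, cy, cz, hx, hy, hz] layout given above.

```python
import math


def surf(p, obs):
    """Distance from point p = (px, py, pz) to the surface of an axis-aligned box."""
    cx, cy, cz, hx, hy, hz = obs
    dx = max(0.0, abs(p[0] - cx) - hx)
    dy = max(0.0, abs(p[1] - cy) - hy)
    dz = max(0.0, abs(p[2] - cz) - hz)
    return math.sqrt(dx * dx + dy * dy + dz * dz)


def collides(p, obs, r_agent):
    """Collision occurs when the surface distance is below the agent radius."""
    return surf(p, obs) < r_agent


quad = (-6.0, 3.6, 5.0)  # hypothetical quad position, 0.1m past the box's y-range
tall = (-6.0, 2.0, 2.0, 0.8, 1.5, 4.0)   # top at z = 6, overlaps the quad's altitude
short = (-6.0, 2.0, 2.0, 0.8, 1.5, 2.2)  # top at z = 4.2, below the quad

print(collides(quad, tall, r_agent=0.15))   # True
print(collides(quad, short, r_agent=0.15))  # False
```

With the tall box the z term is zero, so only the 0.1m lateral gap counts; with the short box the 0.8m vertical gap dominates.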

4. Collision Pattern Taxonomy

Pattern 1: Potential-Field Deadlock

Condition: obstacle o is a deadlock candidate for agent a if, at any point along agent a's nominal path in axis k, the agent's position satisfies |path_k(t) − c_k(o)| ≤ h_k(o).

Mechanism: the nearest surface point lies on a face perpendicular to the approach direction. The repulsion gradient is anti-parallel to the goal gradient. No lateral force is generated.

Theorem 1 (Deadlock Necessary Condition): A reactive potential-field agent approaching obstacle o from direction d will have zero lateral repulsion component if and only if the agent's position projected onto the plane perpendicular to d lies inside the obstacle's cross-section in that plane.

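Fix rule: move c_k so that |path_k − c_k| > h_k. Use the minimal perturbation that leaves a 0.1m margin: Δc_k = (path_k − c_k) − (h_k + 0.1m) when shifting the center toward −k, or Δc_k = (path_k − c_k) + (h_k + 0.1m) when shifting toward +k, whichever has the smaller magnitude.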

Affected scenarios: gauntlet_gamma, hostile_recon, asymmetric_extract, dense_urban (2 obstacles).
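
A hedged sketch of the deadlock test and a minimal center shift. It assumes the shift should leave the path outside the obstacle's range by the half-extent plus a 0.1m margin, so that the strict inequality holds afterwards, and it picks whichever direction needs the smaller displacement. Function names and example values are illustrative.

```python
def is_deadlock_candidate(path_k, c_k, h_k):
    """True if the path coordinate lies inside the obstacle's range on axis k."""
    return abs(path_k - c_k) <= h_k


def minimal_shift(path_k, c_k, h_k, margin=0.1):
    """Smallest center displacement that puts path_k outside the range by `margin`."""
    if not is_deadlock_candidate(path_k, c_k, h_k):
        return 0.0  # already clear, nothing to move
    offset = path_k - c_k
    shift_down = offset - (h_k + margin)  # move the center toward -k
    shift_up = offset + (h_k + margin)    # move the center toward +k
    return shift_down if abs(shift_down) <= abs(shift_up) else shift_up


path_y, c_y, h_y = 2.0, 2.0, 1.5
delta = minimal_shift(path_y, c_y, h_y)
new_c = c_y + delta
print(delta, is_deadlock_candidate(path_y, new_c, h_y))  # -1.6 False
```

In dense scenarios the direction may need to be chosen by hand rather than by magnitude, since the smaller shift is not always the one that preserves the surrounding repulsion field.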

Pattern 2: Sensor-Degraded Agent Clearance

Condition: blind-agent clearance failure when min_t dist(path(t), face(o)) < r_agent + d_drift.

Mechanism: blind agents receive no repulsion from obstacles and follow a near-straight path. Any obstacle face within r_agent of this path causes collision regardless of policy quality.

Fix rule: reduce h_k so face clearance from agent nominal path ≥ 1.1m (rovers) or 0.5m (quads).

Affected scenarios: sensor_hell, relay_dependency, night_shift, gauntlet_beta (partial).
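
The fix rule amounts to solving for the largest half-extent that still leaves the required face clearance. A minimal sketch, assuming a straight nominal path at a constant coordinate on axis k; the function name and the example geometry are hypothetical, while the 1.1m and 0.5m clearances are the values stated above.

```python
ROVER_CLEARANCE_M = 1.1  # required face clearance for blind rovers
QUAD_CLEARANCE_M = 0.5   # required face clearance for blind quads


def max_half_extent(path_k, c_k, clearance):
    """Largest h_k that keeps the obstacle face at least `clearance` from the path."""
    return max(0.0, abs(path_k - c_k) - clearance)


# Hypothetical case: rover path at y = 0, obstacle centered at y = 2.
# A half-extent of 1.4 would put the face only 0.6m from the path.
print(round(max_half_extent(0.0, 2.0, ROVER_CLEARANCE_M), 2))  # 0.9
```

A result of 0.0 means shrinking alone cannot satisfy the clearance and the obstacle center has to move as well.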

Pattern 3: Z-Range Overlap in Multi-Altitude Environments

Condition: |z_a − cz| ≤ hz + r_agent for agent at altitude z_a.

Mechanism: when a 3D agent is inside the z-range of an obstacle, the surface distance reduces to a 2D problem in xy. Collision from lateral stimuli (teammate repulsion, intruder avoidance) can bring the xy distance below r_agent while z-overlap keeps surf_z = 0.

Fix rule: set hz ≤ (z_a − r_agent − 0.5m) − cz. For quads at z=5m, r=0.15m, cz=2m: hz ≤ 2.35m. We use hz = 2.2m.

Z-selective principle: reducing hz changes quad geometry without affecting rover guidance, since rovers' z-offset from the obstacle is already outside sensing range.

Affected scenarios: hostile_recon, final_boss, dense_urban (ep. 1).
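
The overlap condition and the bound on hz can be checked numerically. This sketch uses the values quoted in the fix rule (z_a = 5m, r_agent = 0.15m, cz = 2m, 0.5m margin); the function names are illustrative.

```python
def z_overlap(z_agent, cz, hz, r_agent):
    """Pattern 3 condition: the agent's body intersects the obstacle's z-range."""
    return abs(z_agent - cz) <= hz + r_agent


def max_half_z(z_agent, r_agent, cz, margin=0.5):
    """Upper bound on hz from the fix rule: hz <= (z_a - r_agent - margin) - cz."""
    return (z_agent - r_agent - margin) - cz


print(round(max_half_z(5.0, 0.15, 2.0), 2))          # 2.35
print(z_overlap(5.0, cz=2.0, hz=4.0, r_agent=0.15))  # True: original geometry
print(z_overlap(5.0, cz=2.0, hz=2.2, r_agent=0.15))  # False: after the fix
```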

Pattern 4: GPS-Loss Drift Margin

Condition: clearance failure when min_t dist(path_nominal(t) + δ(t), face(o)) < r_agent for any achievable drift δ(t).

Drift model: σ_loc ≈ 0.3m/s; wind = 0.4 m/s crosswind for 10s = 4m maximum lateral drift.

Fix rule: face clearance ≥ r_agent + max_drift = 4.5m from rover's nominal path.

Affected scenarios: asymmetric_extract.
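
The drift margin is simple arithmetic, sketched here under the stated drift model (constant crosswind for the full GPS-loss window). Function names are illustrative, and the 0.5m radius is the rover radius given in the system description.

```python
def max_lateral_drift(wind_speed_mps, duration_s):
    """Worst-case lateral drift if the wind pushes the agent for the whole window."""
    return wind_speed_mps * duration_s


def required_clearance(r_agent, wind_speed_mps, duration_s):
    """Minimum face distance from the nominal path: agent radius plus drift."""
    return r_agent + max_lateral_drift(wind_speed_mps, duration_s)


def is_drift_safe(face_distance, r_agent, wind_speed_mps, duration_s):
    """True if the obstacle face is at or beyond the required clearance."""
    return face_distance >= required_clearance(r_agent, wind_speed_mps, duration_s)


print(max_lateral_drift(0.4, 10.0))         # 4.0
print(required_clearance(0.5, 0.4, 10.0))   # 4.5
print(is_drift_safe(3.0, 0.5, 0.4, 10.0))   # False
```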

Pattern 5: Reactive Field Interdependence

Description: in high-obstacle-density scenarios, obstacles serve dual roles — hazards to avoid and repulsors that guide agents through the space. A "load-bearing" obstacle is one whose removal or displacement substantially changes agent trajectories for downstream obstacles.

Diagnostic criterion: obstacle o is load-bearing for agent a if, in simulation without o, agent a's trajectory changes by more than 0.5m at any subsequent obstacle encounter.

Fix protocol: identify all collision episodes independently; check each implicated obstacle for load-bearing status; apply minimal-perturbation fixes that preserve the obstacle's face position relative to nearby agents; verify all episodes simultaneously.

Case study — dense_urban: Obstacle [−6, 2, 2, 0.8, 1.5, 4.0] is deadlock-causing (center_y = rover_0 path_y) and load-bearing (rover_0 relies on its northward repulsion). Naive fix (move to y=5): rover_0 loses northward guidance, hits 3 downstream obstacles. Minimal-perturbation fix (move to y=1.0, reduce hy to 0.3): rover_0 at y=1.82 is 0.52m outside y-range, gets corner repulsion with northward component, downstream navigation preserved.

5. Results

Baseline (L3): 20 total interventions — 18 collision, 2 stall — across 16 scenarios (mean 1.25/scenario). 100% mission success.

Fix sequence and reduction:

  • Pattern 2 (blind-agent clearance): −8 interventions (20→12)
  • Pattern 3 (z-clearance): −5 interventions (12→7)
  • Pattern 1 (deadlock elimination): −3 interventions (7→4)
  • Pattern 4 (GPS-drift margin): −2 interventions (4→2)
  • Patterns 1+3+5 (dense_urban simultaneous fix): −2 interventions (2→0)

Final result: 0 interventions (L5), 16/16 mission success (100%), 0 collisions, 0 near-misses, 0 stalls. Total obstacle parameter changes: 11 obstacles modified, 31 numeric values changed, across 9 scenarios. The control policy was not modified.

Notable failed approach: moving dense_urban obstacle [−6,2] to [−6,5] resolved the deadlock but created 6 interventions (from 2) — the load-bearing role of the obstacle was not preserved.

6. Discussion

Our primary empirical finding is that all 20 baseline interventions were removed by geometry fixes rather than policy improvements: 90% (18 of 20) by the single-pattern fixes for Patterns 1–4, and the final 2 by the combined dense_urban fix. For our specific benchmark, the geometry was the bottleneck, not the policy.

The five patterns we identify follow directly from well-understood properties of potential-field navigation. Any benchmark designer using reactive-controller baselines should check for all five. Checking Patterns 1–4 is O(|obstacles| × |agents|) and takes under 1 second for any reasonably sized scenario.

The minimal perturbation principle is not just a practical heuristic — it is a correctness condition for Pattern 5 fixes. A fix Δobs is minimal-safe for obstacle o and agent a if ∀ b ≠ a: ||traj_b(Δobs) − traj_b(0)||_∞ < ε_safe (we use ε_safe = 0.3m).
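
The minimal-safe condition can be tested by comparing simulated trajectories before and after a candidate fix. The sketch below assumes trajectories are sampled at the same ticks and reads the ∞-norm as the largest pointwise Euclidean deviation over time; both are assumptions, as are the function names and the toy data.

```python
import math


def max_deviation(traj_a, traj_b):
    """Largest pointwise distance between two equally sampled trajectories."""
    if len(traj_a) != len(traj_b):
        raise ValueError("trajectories must be sampled at the same ticks")
    return max(math.dist(p, q) for p, q in zip(traj_a, traj_b))


def is_minimal_safe(baseline, perturbed, fixed_agent, eps_safe=0.3):
    """True if every agent other than `fixed_agent` deviates by less than eps_safe."""
    return all(
        max_deviation(baseline[b], perturbed[b]) < eps_safe
        for b in baseline
        if b != fixed_agent
    )


baseline = {"rover_0": [(0.0, 0.0), (1.0, 0.0)], "rover_1": [(0.0, 2.0), (1.0, 2.0)]}
after_fix = {"rover_0": [(0.0, 1.0), (1.0, 1.0)], "rover_1": [(0.0, 2.1), (1.0, 2.1)]}
print(is_minimal_safe(baseline, after_fix, fixed_agent="rover_0"))  # True
```

The same `max_deviation` helper, with a 0.5m threshold and a run that omits the obstacle, gives one way to evaluate the load-bearing criterion from Pattern 5.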

7. Conclusion

We demonstrate that L5 autonomy is achievable for a heterogeneous reactive fleet on a 16-scenario adversarial benchmark through systematic repair of obstacle geometry, without modifying the control policy. The five collision patterns formalized here — potential-field deadlock, sensor-degraded clearance gaps, z-range overlap, GPS-drift margin, and reactive field interdependence — account for 90% of the interventions in our L3 baseline and are mechanistically grounded in the kinematics of potential-field navigation.

The path from L5 simulation to L5 real-world operation requires SITL validation (realistic actuator dynamics, latency, ArduPilot control loops) followed by IRL testing. The policy is sufficient; the remaining gap is sim-to-real transfer.

Companion blog post: Zero Interventions: How We Hit L5 Autonomy on a 16-Scenario Fleet Benchmark.