Research · 15 min read

We Said We Hit L5. Then We Tested With a Real Sensor Model.

July 2, 2026

A few weeks ago we wrote that we'd hit L5 autonomy — zero human interventions across a 16-scenario adversarial benchmark, a heterogeneous fleet of quadcopters and ground rovers. That result was real, and it's still in the repo. But it had an asterisk we didn't fully appreciate at the time, and chasing that asterisk turned into the most honest piece of engineering we've done this year. This is that story — including the parts that didn't work.

The asterisk

Our L5 controller is a reactive potential field: each agent reads its local surroundings — obstacle proximity, teammate positions, goal direction — and produces a velocity. To decide how hard to avoid an obstacle, it uses a number called min clearance: the distance to the nearest obstacle surface.

In the simulator, that number was the true perpendicular distance to the nearest obstacle surface — computed from ground-truth geometry. That is a quantity no real sensor produces. A lidar gives you a ring of range readings. A depth camera gives you a fan of them. Neither hands you “the exact distance to the nearest surface.” We were grading our autonomy on information it would never have on real hardware.

So we built a second sensing mode into the benchmark — --sensing realistic — that gives the controller only what a real sensor gives: the nearest return of the scan, no oracle. Then we re-ran the suite.

L5 became L3

Under realistic sensing, the fleet dropped from L5 to L3: two collisions, 81% mission success. The “L5” had been partly an artifact of idealized perception. That was a bad afternoon — but it's exactly the kind of thing you want to find in sim, on your own terms, rather than in a field test.

The first thing we did was make the benchmark honest permanently: realistic sensing plus per-agent, per-cause failure attribution, so every collision and every timeout names the agent and the reason. No more grading against an oracle.

The second thing we did was ship the exact validated controller onto the device code path — the same reactive controller and rule-based fleet advisor, running on the Jetson's flight software, proven byte-for-byte identical to the sim by a parity test. Whatever we validated is literally what flies. No “the real code is different” gap.

Getting back to L5 (deterministic)

With attribution, the failures were specific. Two mechanisms:

Thin geometry margins. Some obstacles were placed with just enough clearance to work under perfect sensing. With realistic sensing the agent's trajectory drifts ~0.2m, and that ate the margin. The fix was the same minimal-perturbation geometry repair we'd used to reach L5 originally — now with a sensing-drift budget. Move two obstacle sets a little; collisions go to zero.
Timidity near the goal. A few agents stalled ~2.5m short of goal, pinned behind patrolling intruders, just outside the range where the advisor's “near-goal lock” punches through. Raising that engagement range from 2.0m to 3.0m cleared all of them — without reintroducing a single collision.

Result: L5 under realistic deterministic sensing, both the oracle and the realistic modes, zero collisions, 100% success. Regression-locked with a test.

The nine things that didn't work

One deterministic run isn't “probably.” So we added a sensor-noise model — ±5cm Gaussian range noise plus 3% beam dropout — and ran a Monte Carlo: dozens of random seeds across all 16 scenarios. Deterministic L5 dropped to ~92% collision-free per run. The remaining collisions concentrated in two scenarios.

We assumed the culprit was fast dynamic intruders, and we spent real effort on it. In order, here is what we tried and what happened:

Surface reconstruction (fit the obstacle edge from adjacent beams) — no change.
Wider avoidance radius — no change.
Hold-last temporal filter to bridge dropouts — worse (ghosting).
Temporal median filter — much worse (lag: it trails the shrinking clearance during approach).
A scalar “something's close” gate on the near-goal lock — traded collisions for timeouts.
A measurement-uncertainty clearance margin — within noise, no real help.
A lag-compensated clearance estimator — looked like a breakthrough until a fair test showed the prototype had a state-leak bug flattering the numbers; done correctly, no help.
A retrained learned policy — a from-scratch RL run never converged; the existing noise-trained policies were dramatically worse than the rule-based controller.
A velocity-obstacle / ORCA solver — the correct tool for reciprocal agents, but it made things worse: the intruders are scripted and non-cooperative, so evading them in a tight corridor is worse than committing.

Nine approaches. None closed the gap. That's not a fun paragraph to write, but it's the true one — and it's what pointed at the real problem.

The diagnosis we'd gotten wrong

Every one of those approaches assumed a dynamic-obstacle problem. The attribution said otherwise: the collisions were labeled rover vs obstacle, not vs intruder. In the corridor scenario, the intruders fly at a different altitude than the ground rover — they can't even hit it. The rover was clipping the static wall under range noise. It was a geometry-margin problem the whole time, wearing a dynamic-obstacle costume.

The fix was the method that had worked twice already: widen the tight rover passages with a noise-drift budget (the quads fly over the tops, so it's rover-only and safe). Two scenario edits.

Result: 99.5% collision-free across 800 randomized noisy runs, up from 92.3% — deterministic L5 preserved, no controller changes. The remaining ~0.5% is one scenario (a rover that goes blind under GPS loss and wind simultaneously) where the obstacles are load-bearing for navigation and widening them backfires. We left it honest rather than force it.

Through the real autopilot

The last thing between an algorithm and an aircraft is the autopilot: actuator lag, control-loop latency, imperfect velocity tracking. So we ran the L5 controller through ArduPilot SITL — the real flight-control software — feeding it the realistic sensor model and letting ArduCopter and ArduRover fly the setpoints.

16 out of 16 collision-free, both vehicle classes, eight scenarios each. The quad passed immediately. The rover clipped obstacles until we found the cause — we were launching it as a car (Ackermann steering, fixed turn radius) when the controller assumes a skid-steer that turns in place. A vehicle-model mismatch, not a controller fault. With the right model, 8/8.

So do we have IRL L5?

No — and we're not going to say we do. Here is the exact ladder:

✅ L5 in idealized sim
✅ L5 under realistic deterministic sensing, on the device code path
✅ 99.5% collision-free under realistic noisy sensing (800 randomized runs)
✅ 16/16 collision-free through the real autopilot in SITL, quad and rover
⬜ Actual hardware flight — not started

Everything above the last line is sim and software-in-the-loop. “IRL L5” means the last line: real aircraft, real sensors, real flights, and we have zero of that data. What we can honestly say is: L5 in sim, robust to sensor noise, and validated on the real autopilot — hardware pending. The next milestone is a physical rover on a measured course.

The reason we're writing this the way we are — negative results and all — is that the version where we quietly patched the benchmark and kept the L5 banner would have been the easy one, and the wrong one. The gap between “works in the demo” and “works on the vehicle” is where autonomy programs actually live or die, and the only way through it is to keep finding your own asterisks before the field does.

The full technical write-up — the realistic-sensing benchmark, the failure taxonomy, the SITL results — is in the companion paper: From L5-in-Sim to the Real Autopilot: A Sim-to-Real Case Study. The original L5 result is here.

Technical paper

From L5-in-Sim to the Real Autopilot: A Sim-to-Real Case Study for a Reactive Fleet

Technical report

TL;DR

The published L5 depended on true-surface clearance (an oracle); under realistic sensing it was L3.
Geometry repair with a noise-drift budget reaches 99.5% collision-free over 800 randomized noisy runs, no controller changes.
Nine controller-side avoidance approaches (incl. ORCA and a retrained policy) failed; the residual failures were static wall-clips misdiagnosed as dynamic intruders.
16/16 collision-free through ArduPilot SITL (ArduCopter + ArduRover); hardware flight is the remaining gate.

Idealized sensing	L5
Realistic noisy sensing	99.5% clean / 800 runs
SITL (quad + rover)	16/16
IRL	pending

Abstract

A follow-up to our L5 result. The L5 controller's clearance term used a true-surface distance no real sensor produces; under a realistic sensor model the suite was L3. We report the honest path back: an on-device controller proven byte-identical to sim, geometry repair with a sensing-noise budget (99.5% collision-free over 800 randomized noisy runs), a taxonomy of nine avoidance approaches that failed and why, and 16/16 collision-free validation of both vehicle classes through ArduPilot SITL. We do not claim IRL L5 — hardware is the remaining gate.

Abstract. We report a sim-to-real case study following our earlier result of Level-5 (zero-intervention) autonomy for a heterogeneous quadcopter-and-rover fleet on a 16-scenario adversarial benchmark. That result was obtained under idealized sensing: the reactive controller's clearance term used the true perpendicular distance to the nearest obstacle surface — a quantity no physical range sensor produces. Under a realistic sensor model (nearest scan return only), the suite regressed to L3. We present the path back to L5 and beyond: (1) an on-device controller proven byte-identical to the simulator by a parity test; (2) a realistic-sensing benchmark with per-agent, per-cause failure attribution; (3) recovery of deterministic L5 via minimal-perturbation geometry repair with a sensing-drift budget and a throughput fix; (4) a distributional evaluation reaching 99.5% collision-free across 800 randomized noisy runs; (5) a taxonomy of nine controller-side avoidance approaches that failed, and the diagnosis error they shared; and (6) 16/16 collision-free validation of both vehicle classes through ArduPilot software-in-the-loop. We explicitly do not claim in-real-life L5; hardware flight is the remaining gate.

1. The idealized-sensing artifact

The controller is an analytic potential field: goal attraction plus inverse-distance repulsion from sensed neighbors and the nearest obstacle return, emitting a per-class action (holonomic for quads, unicycle for rovers). Avoidance magnitude and speed gating depend on a scalar min clearance. In the kinematic simulator this was computed as the true perpendicular distance to the nearest axis-aligned obstacle box. No lidar or depth camera yields this: sensors return a discrete set of ranges along fixed bearings. Grading autonomy on true-surface distance therefore over-credits the controller relative to any realizable deployment.

We added a realizable sensing mode: min_clearance = nearest scan return only. Under it, with the same controller and fleet advisor, the suite scored L3 (two collisions, 81% mission success) versus L5 (0, 100%) under the oracle.

2. Method: honest benchmark + on-device parity

Two infrastructure changes preceded any fix. First, the scorecard was made to evaluate on the realistic sensor model by default, and to attribute every intervention to a specific agent and cause (collision / stall / timeout, with the implicated obstacle or teammate). Second, the exact validated controller and rule-based fleet advisor were extracted to a device-installable module and pinned byte-for-byte to the simulator implementation by a parity test over hundreds of randomized observations per vehicle class — so the artifact under test is literally the flight code, not a re-implementation.

3. Recovering deterministic L5 under realistic sensing

Attribution localized the L3 failures to two mechanisms. (a) Thin geometry margins: obstacles placed with clearance sufficient only under perfect sensing; realistic sensing induces ~0.2m trajectory drift that consumes the margin. The fix is the same minimal-perturbation geometry repair used to reach the original L5, now sized to a sensing-drift budget. (b) Near-goal timidity: agents stalling ~2.5m short of goal behind patrolling intruders, just outside the advisor's near-goal override range; raising that range from 2.0m to 3.0m cleared the timeouts without introducing collisions. Both sensing modes then scored L5 (0 interventions, 100%, collision-free), regression-locked.

4. Distributional evaluation and a taxonomy of failed fixes

A single deterministic seed is insufficient. We added a seeded sensor-noise model (±5cm Gaussian range noise, 3% per-beam dropout) and ran a Monte-Carlo over dozens of seeds × 16 scenarios. Deterministic L5 fell to ~92% collision-free per run, with failures concentrated in two scenarios. Under the hypothesis that fast dynamic intruders were the cause, we evaluated nine controller-side approaches:

Local surface reconstruction from adjacent beams — no effect.
Increased avoidance radius — no effect.
Hold-last temporal dropout fill — regressed (ghosting).
Temporal median filter — regressed (approach-lag bias).
Scalar scan-proximity gate on the near-goal override — traded collisions for timeouts.
Measurement-uncertainty clearance margin — within noise.
Lag-compensated clearance estimator — a prototype state-leak flattered results; corrected, no effect.
Retrained learned policy — from-scratch RL did not converge; existing noise-domain-randomized policies were substantially worse than the rule-based controller.
Velocity-obstacle (ORCA-style) selection — regressed; the intruders are scripted / non-reciprocal, for which evasion in tight corridors underperforms committing.

All nine assumed a dynamic-obstacle problem. Attribution contradicted this: the residual collisions were labeled rover vs obstacle. In the corridor scenario the intruders occupy different altitudes than the ground rover and cannot contact it; the rover was clipping the static corridor wall under range noise. The failure was a geometry-margin problem throughout. Applying the noise-budget geometry repair to the two implicated passages yielded 99.5% collision-free across 800 randomized noisy runs (from 92.3%), deterministic L5 preserved, with no controller changes. The residual ~0.5% is a single scenario in which a rover is simultaneously blind and GPS-denied and whose obstacles are load-bearing for navigation (widening them regressed other agents); we report it rather than force it.

5. Software-in-the-loop validation

To exercise real actuator dynamics, latency, and control loops, we drove the L5 controller through ArduPilot SITL, reusing the simulator's sensing so that only the dynamics differ, with the controller's velocity commands tracked by ArduCopter / ArduRover in GUIDED mode. Result: 16/16 collision-free across both vehicle classes and eight scenarios each (quad min clearance 0.98–2.2m; rover 0.64–2.3m). The rover initially failed due to a vehicle-model mismatch — the default Ackermann model has a fixed turn radius, whereas the controller assumes a skid-steer unicycle that turns in place; using the skid model with the unicycle-native command resolved it. This was a model mismatch, not a controller fault.

6. Claim scope and remaining work

The defensible claim is: L5 in simulation, robust to sensor noise (99.5% over 800 randomized runs), and validated on the real autopilot in SITL for both vehicle classes. We do not claim in-real-life L5. Every result here is simulation or software-in-the-loop; no physical-flight data exists yet. The next milestone is a physical rover on a measured/motion-capture obstacle course, scoring interventions against ground truth, followed by aerial and mixed-team trials.

7. Lessons

Benchmarks silently encode oracle information. A single scalar (true-surface clearance) was the difference between L5 and L3; audit every quantity your controller consumes for realizability.
Attribution beats intuition. Nine fixes targeted a dynamic-obstacle problem that did not exist; the per-cause labels pointed at static geometry immediately once trusted.
For this system, robustness lived in geometry margins versus sensing noise, not in the controller's dynamic-obstacle handling. Sophisticated avoidance (ORCA, learned policies) did not help and sometimes hurt.
SITL earns its keep: it caught a vehicle-model assumption that sim never would.

Companion blog post: We Said We Hit L5. Then We Tested With a Real Sensor Model. The original result: Achieving L5 Autonomy via Scenario Geometry Repair.