From L5-in-Sim to the Real Autopilot: A Sim-to-Real Case Study for a Reactive Fleet

A follow-up to our L5 result. The L5 controller's clearance term used a true-surface distance that no real sensor produces; under a realistic sensor model the suite scored L3. We report the honest path back: an on-device controller verified byte-identical to sim, geometry repair with a sensing-noise budget (99.5% collision-free over 800 randomized noisy runs), a taxonomy of nine avoidance approaches that failed and why, and 16/16 collision-free validation across both vehicle classes through ArduPilot SITL. We do not claim IRL L5; hardware is the remaining gate.

Evaluation condition	Result
Idealized sensing	L5
Realistic noisy sensing	99.5% clean / 800 runs
SITL (quad + rover)	16/16
IRL	pending

Abstract. We report a sim-to-real case study following our earlier result of Level-5 (zero-intervention) autonomy for a heterogeneous quadcopter-and-rover fleet on a 16-scenario adversarial benchmark. That result was obtained under idealized sensing: the reactive controller's clearance term used the true perpendicular distance to the nearest obstacle surface, a quantity no physical range sensor produces. Under a realistic sensor model (nearest scan return only), the suite regressed to L3. We present the path back to L5 and the validation that followed: (1) an on-device controller verified byte-identical to the simulator by a parity test; (2) a realistic-sensing benchmark with per-agent, per-cause failure attribution; (3) recovery of deterministic L5 via minimal-perturbation geometry repair with a sensing-drift budget and a throughput fix; (4) a distributional evaluation reaching 99.5% collision-free across 800 randomized noisy runs; (5) a taxonomy of nine controller-side avoidance approaches that failed, and the diagnosis error they shared; and (6) 16/16 collision-free validation across both vehicle classes through ArduPilot software-in-the-loop. We explicitly do not claim in-real-life L5; hardware flight is the remaining gate.

1. The idealized-sensing artifact

The controller is an analytic potential field: goal attraction plus inverse-distance repulsion from sensed neighbors and the nearest obstacle return, emitting a per-class action (holonomic for quads, unicycle for rovers). Avoidance magnitude and speed gating depend on a scalar min clearance. In the kinematic simulator this was computed as the true perpendicular distance to the nearest axis-aligned obstacle box. No lidar or depth camera yields this: sensors return a discrete set of ranges along fixed bearings. Grading autonomy on true-surface distance therefore over-credits the controller relative to any realizable deployment.
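A minimal sketch of the controller structure described above, assuming 2D positions. The gains (`k_goal`, `k_rep`), the influence radius, the linear speed gate, and the unicycle mapping are all hypothetical choices made for illustration; the source does not give the actual tuning or gating rule.

```python
import math

def potential_field_velocity(pos, goal, returns, k_goal=1.0, k_rep=0.5,
                             influence=3.0, v_max=1.0):
    """Holonomic command from goal attraction plus inverse-distance repulsion.

    `returns` holds sensed points (neighbors and the nearest obstacle return).
    All gains are illustrative assumptions, not values from the source.
    """
    # Attraction: unit vector toward the goal, scaled by k_goal.
    gx, gy = goal[0] - pos[0], goal[1] - pos[1]
    g_norm = math.hypot(gx, gy) or 1e-9
    fx, fy = k_goal * gx / g_norm, k_goal * gy / g_norm

    min_clearance = float("inf")
    for px, py in returns:
        dx, dy = pos[0] - px, pos[1] - py
        d = math.hypot(dx, dy) or 1e-9
        min_clearance = min(min_clearance, d)
        if d < influence:
            # Repulsion magnitude k_rep / d, directed away from the sensed point.
            fx += (k_rep / d) * (dx / d)
            fy += (k_rep / d) * (dy / d)

    # Speed gating on the scalar min clearance (assumed linear ramp).
    gate = min(1.0, min_clearance / influence)
    speed = math.hypot(fx, fy) or 1e-9
    scale = v_max * gate / speed
    return fx * scale, fy * scale, min_clearance

def to_unicycle(vx, vy, heading, k_turn=2.0):
    """Map a holonomic command to (forward speed, yaw rate) for a rover."""
    desired = math.atan2(vy, vx)
    err = math.atan2(math.sin(desired - heading), math.cos(desired - heading))
    # Drive forward only to the extent the rover already faces the command.
    forward = math.hypot(vx, vy) * max(0.0, math.cos(err))
    return forward, k_turn * err

free = potential_field_velocity((0.0, 0.0), (10.0, 0.0), [])
blocked = potential_field_velocity((0.0, 0.0), (10.0, 0.0), [(1.0, 0.0)])
```

In this sketch `min_clearance` is the only quantity through which sensing affects speed, which is why swapping its definition (oracle distance versus nearest scan return) can change the outcome of a whole scenario.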

We added a realizable sensing mode: min_clearance = nearest scan return only. Under it, with the same controller and fleet advisor, the suite scored L3 (two collisions, 81% mission success), versus L5 (zero collisions, 100% mission success) under the oracle.
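The gap between the two clearance definitions can be made concrete with a small sketch. It assumes a 2D point sensor, a single axis-aligned box, and evenly spaced bearings; the function names, beam counts, and `max_range` are illustrative and not taken from the benchmark.

```python
import math

def true_surface_distance(p, box):
    """Oracle clearance: exact distance from p to the box (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = box
    dx = max(xmin - p[0], 0.0, p[0] - xmax)
    dy = max(ymin - p[1], 0.0, p[1] - ymax)
    return math.hypot(dx, dy)

def ray_box_range(p, bearing, box, max_range):
    """Range along one bearing to the box (slab method), or max_range on a miss."""
    xmin, ymin, xmax, ymax = box
    direction = (math.cos(bearing), math.sin(bearing))
    t_near, t_far = 0.0, max_range
    slabs = ((p[0], direction[0], xmin, xmax), (p[1], direction[1], ymin, ymax))
    for origin, d, lo, hi in slabs:
        if abs(d) < 1e-12:
            if origin < lo or origin > hi:
                return max_range  # ray is parallel to this slab and outside it
            continue
        t0, t1 = (lo - origin) / d, (hi - origin) / d
        t_near = max(t_near, min(t0, t1))
        t_far = min(t_far, max(t0, t1))
    return t_near if t_near <= t_far else max_range

def scan_clearance(p, box, n_beams, max_range=10.0):
    """Realizable clearance: the nearest return over a fixed set of bearings."""
    bearings = (2.0 * math.pi * i / n_beams for i in range(n_beams))
    return min(ray_box_range(p, b, box, max_range) for b in bearings)

box = (2.0, 0.5, 3.0, 1.5)
oracle = true_surface_distance((0.0, 0.0), box)      # distance to the near corner
coarse = scan_clearance((0.0, 0.0), box, n_beams=8)  # every beam misses the box
fine = scan_clearance((0.0, 0.0), box, n_beams=16)   # one beam hits, past the corner
```

Because every scan return lies on the obstacle surface, the nearest return can never be smaller than the true surface distance; it is equal only when a beam happens to point at the closest surface point. In this example the 8-beam scan sees nothing at all, while the oracle reports about 2.06.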

2. Method: honest benchmark + on-device parity

Two infrastructure changes preceded any fix. First, the scorecard was made to evaluate on the realistic sensor model by default, and to attribute every intervention to a specific agent and cause (collision / stall / timeout, with the implicated obstacle or teammate). Second, the exact validated controller and rule-based fleet advisor were extracted to a device-installable module and pinned byte-for-byte to the simulator implementation by a parity test over hundreds of randomized observations per vehicle class — so the artifact under test is literally the flight code, not a re-implementation.
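One way such a parity test could look, assuming that byte-identity refers to the controller's serialized outputs on identical observations. The three controllers below are hypothetical stand-ins, and the observation layout `(x, y, goal_x, goal_y)` is an assumption for illustration only.

```python
import random
import struct

def controller_sim(obs):
    """Hypothetical stand-in for the simulator's controller."""
    x, y, gx, gy = obs
    return (gx - x, gy - y)

def controller_device(obs):
    """Hypothetical stand-in for the extracted on-device module."""
    x, y, gx, gy = obs
    return (gx - x, gy - y)

def controller_drifted(obs):
    """A re-implementation with a tiny numerical drift, for contrast."""
    vx, vy = controller_sim(obs)
    return (vx + 1e-12, vy)

def action_bytes(action):
    # Serialize as little-endian doubles so the comparison is exact, not approximate.
    return struct.pack("<2d", *action)

def parity_mismatches(impl_a, impl_b, n_obs=500, seed=0):
    """Return the indices of randomized observations on which the outputs differ."""
    rng = random.Random(seed)
    mismatches = []
    for index in range(n_obs):
        obs = tuple(rng.uniform(-10.0, 10.0) for _ in range(4))
        if action_bytes(impl_a(obs)) != action_bytes(impl_b(obs)):
            mismatches.append(index)
    return mismatches

same = parity_mismatches(controller_sim, controller_device)
drift = parity_mismatches(controller_sim, controller_drifted)
```

Comparing packed bytes rather than using a floating-point tolerance is the design choice that matters here: the drifted variant would pass a check with a tolerance of 1e-9 but fails the byte comparison on every observation.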

3. Recovering deterministic L5 under realistic sensing

Attribution localized the L3 failures to two mechanisms. (a) Thin geometry margins: some obstacles were placed with clearance sufficient only under perfect sensing, and realistic sensing induces ~0.2m of trajectory drift that consumes the margin. The fix is the same minimal-perturbation geometry repair used to reach the original L5, now sized to a sensing-drift budget. (b) Near-goal timidity: agents stalled ~2.5m short of the goal behind patrolling intruders, just outside the advisor's near-goal override range. Raising that range from 2.0m to 3.0m cleared the timeouts without introducing collisions. Both sensing modes then scored L5 (0 interventions, 100% mission success, collision-free), regression-locked.
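Both fixes reduce to simple threshold checks, sketched below. The 0.2 m drift, the 2.5 m stall distance, and the 2.0 m and 3.0 m ranges come from the text; the passage width, the vehicle radius, and the symmetric-passage margin model are hypothetical, chosen only to illustrate how a repair could be sized to a drift budget.

```python
def required_widening(passage_width, vehicle_radius, drift_budget):
    """Smallest widening (m) that leaves at least drift_budget of margin per side.

    Assumes the vehicle tracks the centerline of a straight passage, so the
    per-side margin is half the width minus the vehicle radius.
    """
    margin = passage_width / 2.0 - vehicle_radius
    return max(0.0, 2.0 * (drift_budget - margin))

def near_goal_override(dist_to_goal, override_range):
    """True when the advisor's near-goal override applies."""
    return dist_to_goal <= override_range

# Hypothetical passage: 0.8 m wide, 0.3 m vehicle radius, 0.2 m drift budget.
widen_tight = required_widening(0.8, 0.3, 0.2)
widen_ok = required_widening(1.2, 0.3, 0.2)

# An agent stalled 2.5 m from the goal, before and after the range change.
before = near_goal_override(2.5, 2.0)
after = near_goal_override(2.5, 3.0)
```

The tight passage needs roughly 0.2 m of widening while the wider one needs none, which is the sense in which the repair is a minimal perturbation. The override check shows why a 2.5 m stall sat just outside the old 2.0 m range and inside the new 3.0 m range.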

4. Distributional evaluation and a taxonomy of failed fixes

A single deterministic seed is insufficient. We added a seeded sensor-noise model (±5cm Gaussian range noise, 3% per-beam dropout) and ran a Monte Carlo evaluation over dozens of seeds × 16 scenarios. Deterministic L5 fell to ~92% collision-free per run, with failures concentrated in two scenarios. Under the hypothesis that fast dynamic intruders were the cause, we evaluated nine controller-side approaches:
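A seeded noise model of this kind might be sketched as follows. It reads "±5cm" as a Gaussian standard deviation of 0.05 m and represents a dropped beam as `None`; both are assumptions, since the source does not specify either detail.

```python
import random

def noisy_scan(ranges, rng, sigma=0.05, dropout=0.03, max_range=10.0):
    """Apply per-beam Gaussian range noise and independent per-beam dropout.

    sigma=0.05 interprets the stated 5 cm as a standard deviation (assumption).
    A dropped beam is reported as None, meaning no return on that bearing.
    """
    out = []
    for r in ranges:
        if rng.random() < dropout:
            out.append(None)
        else:
            noisy = r + rng.gauss(0.0, sigma)
            out.append(min(max_range, max(0.0, noisy)))
    return out

def nearest_return(scan, max_range=10.0):
    """Min clearance under realistic sensing: nearest valid return, else max_range."""
    valid = [r for r in scan if r is not None]
    return min(valid) if valid else max_range

# Seeding the generator makes every Monte Carlo run reproducible.
scan_a = noisy_scan([2.0, 3.0, 4.0], random.Random(7))
scan_b = noisy_scan([2.0, 3.0, 4.0], random.Random(7))
```

Passing an explicit `random.Random(seed)` instead of using the module-level generator keeps each run reproducible from its seed alone, so a failing seed can be replayed exactly.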

Local surface reconstruction from adjacent beams — no effect.
Increased avoidance radius — no effect.
Hold-last temporal dropout fill — regressed (ghosting).
Temporal median filter — regressed (approach-lag bias).
Scalar scan-proximity gate on the near-goal override — traded collisions for timeouts.
Measurement-uncertainty clearance margin — within noise.
Lag-compensated clearance estimator — a prototype state-leak flattered results; corrected, no effect.
Retrained learned policy — from-scratch RL did not converge; existing noise-domain-randomized policies were substantially worse than the rule-based controller.
Velocity-obstacle (ORCA-style) selection — regressed; the intruders are scripted / non-reciprocal, for which evasion in tight corridors underperforms committing.
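The two temporal-filtering failures in the list above can be illustrated with a toy example. This is one plausible reading of "ghosting" and "approach-lag bias", not the authors' filters; the window length and range values are made up.

```python
from statistics import median

def median_filtered(history, window=5):
    """Temporal median over the last `window` range readings of one beam."""
    return median(history[-window:])

def hold_last_fill(scan, previous):
    """Replace dropped beams (None) with the previous scan's value."""
    return [p if r is None else r for r, p in zip(scan, previous)]

# Approach-lag bias: while closing on a surface, the median trails the latest
# reading, so the filtered clearance is larger than the current one.
closing = [2.0, 1.8, 1.6, 1.4, 1.2]
lagged = median_filtered(closing)
latest = closing[-1]

# Ghosting: an obstacle seen at 0.8 m on the middle beam has since moved away,
# but that beam dropped out, so the stale 0.8 m return persists.
previous_scan = [5.0, 0.8, 5.0]
current_scan = [5.0, None, 5.0]
filled = hold_last_fill(current_scan, previous_scan)
```

Here the median reports 1.6 m when the latest reading is 1.2 m, which overstates clearance exactly when it is shrinking. The hold-last fill keeps reporting an obstacle at 0.8 m with no current evidence for it.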

All nine assumed a dynamic-obstacle problem. Attribution contradicted this: the residual collisions were labeled rover vs obstacle. In the corridor scenario the intruders occupy different altitudes than the ground rover and cannot contact it; the rover was clipping the static corridor wall under range noise. The failure was a geometry-margin problem throughout. Applying the noise-budget geometry repair to the two implicated passages yielded 99.5% collision-free across 800 randomized noisy runs (from 92.3%), deterministic L5 preserved, with no controller changes. The residual ~0.5% is a single scenario in which a rover is simultaneously blind and GPS-denied and whose obstacles are load-bearing for navigation (widening them regressed other agents); we report it rather than force it.
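To put the headline number in perspective: if the 99.5% is taken over all 800 runs, it corresponds to 796 clean runs and 4 runs with a collision. The sketch below computes a Wilson score interval for that proportion. The interval is an added illustration, not a statistic reported by the source, and it treats runs as independent trials, which is only an approximation when failures concentrate in a single scenario.

```python
import math

def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half

clean_runs = round(0.995 * 800)  # 796, assuming the rate is over all 800 runs
low, high = wilson_interval(clean_runs, 800)
```

Under these assumptions the interval spans roughly 98.7% to 99.8%, so 800 runs separate the repaired geometry from the earlier ~92% rate comfortably, but they do not pin down the residual failure rate to better than a fraction of a percent.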

5. Software-in-the-loop validation

To exercise real actuator dynamics, latency, and control loops, we drove the L5 controller through ArduPilot SITL, reusing the simulator's sensing so that only the dynamics differ, with the controller's velocity commands tracked by ArduCopter / ArduRover in GUIDED mode. Result: 16/16 collision-free runs, eight scenarios for each of the two vehicle classes (quad min clearance 0.98–2.2m; rover 0.64–2.3m). The rover initially failed because of a vehicle-model mismatch: the default Ackermann model cannot turn tighter than a nonzero minimum radius, whereas the controller assumes a skid-steer unicycle that turns in place. Switching to the skid model with the unicycle-native command resolved it. This was a model mismatch, not a controller fault.
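The mismatch can be seen in the kinematics alone. The sketch below uses the standard bicycle approximation for an Ackermann vehicle and the standard differential-drive mapping for a skid-steer one. The wheelbase, steering limit, and track width are hypothetical values, not parameters of the ArduPilot SITL models.

```python
import math

def ackermann_min_turn_radius(wheelbase, max_steer):
    """Bicycle-model minimum turn radius: R_min = L / tan(delta_max)."""
    return wheelbase / math.tan(max_steer)

def ackermann_max_yaw_rate(speed, wheelbase, max_steer):
    """Largest achievable yaw rate at a given speed: |omega| <= |v| / R_min."""
    return abs(speed) / ackermann_min_turn_radius(wheelbase, max_steer)

def skid_steer_wheel_speeds(v, omega, track_width):
    """Map a unicycle command (v, omega) to left and right wheel speeds."""
    half = omega * track_width / 2.0
    return v - half, v + half

# Hypothetical rover: 0.5 m wheelbase, 30 degree steering limit, 0.4 m track.
r_min = ackermann_min_turn_radius(0.5, math.radians(30.0))
yaw_at_rest = ackermann_max_yaw_rate(0.0, 0.5, math.radians(30.0))
left, right = skid_steer_wheel_speeds(0.0, 1.0, 0.4)
```

A unicycle command with zero forward speed and nonzero yaw rate is feasible for the skid-steer mapping (the wheels counter-rotate) but not for the Ackermann model, whose achievable yaw rate is zero at rest. A controller that issues turn-in-place commands will therefore be tracked poorly by an Ackermann vehicle even if the controller itself is correct.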

6. Claim scope and remaining work

The defensible claim is: L5 in simulation, robust to sensor noise (99.5% over 800 randomized runs), and validated on the real autopilot in SITL for both vehicle classes. We do not claim in-real-life L5. Every result here is simulation or software-in-the-loop; no physical-flight data exists yet. The next milestone is a physical rover on a measured/motion-capture obstacle course, scoring interventions against ground truth, followed by aerial and mixed-team trials.

7. Lessons

Benchmarks silently encode oracle information. A single scalar (true-surface clearance) was the difference between L5 and L3; audit every quantity your controller consumes for realizability.
Attribution beats intuition. Nine fixes targeted a dynamic-obstacle problem that did not exist; the per-cause labels pointed at static geometry immediately once trusted.
For this system, robustness lived in geometry margins versus sensing noise, not in the controller's dynamic-obstacle handling. Sophisticated avoidance (ORCA, learned policies) did not help and sometimes hurt.
SITL earns its keep: it caught a vehicle-model assumption that sim never would.

Companion blog post: "We Said We Hit L5. Then We Tested With a Real Sensor Model."
The original result: "Achieving L5 Autonomy via Scenario Geometry Repair."