Domain-Specific Object Detection for Aerial Autonomy: Sim Data, VisDrone, and the Class Imbalance Problem

Abstract. We present a domain-specific object detector for aerial autonomy applications, trained on a combined corpus of simulation-generated frames and the VisDrone2019-DET public dataset. Starting from a YOLOv8n base, we define a 9-class schema suited to autonomous drone operations (drone, person_aerial, vehicle, bicycle_motorcycle, landing_pad, powerline_pole, person, animal, boat) and evaluate three successive training configurations. The sim-only baseline (v1, 18K frames) reaches mAP50 = 0.471 on a held-out simulation val set. Merging VisDrone real-world footage (v2, 21K images, 364K boxes) improves real-domain person_aerial AP50 from 0.000 to 0.377 but causes near-complete collapse of the drone class (AP50 0.047 → 0.010), because drone boxes fall to 3.5% of all training boxes. A 4× oversampling of drone-containing images (v3, 49K images) recovers drone AP50 to 0.087 while maintaining real-world gains: mAP50 = 0.384, person_aerial = 0.360, vehicle = 0.752, bicycle_motorcycle = 0.339. The primary finding is that class imbalance, rather than architecture or augmentation, is the dominant failure mode for rare aerial classes in mixed-corpus training.

1. Introduction

Standard object detection models trained on COCO-80 perform well on common ground-level objects but fail systematically in the aerial autonomy domain for three reasons: (1) the label set has no drone, person_aerial, landing_pad, or powerline class; (2) COCO training images are primarily ground-level with standard aspect ratios, while drone cameras produce oblique and overhead views at varying altitudes; and (3) the objects of primary interest in aerial operations — distant targets, thin structures, other aircraft — are consistently underrepresented in internet-scale training corpora.

We address this by fine-tuning YOLOv8n on a domain-specific corpus assembled from two sources: (i) Isaac Sim procedural simulation with camera-model ground truth projection, providing perfect labels at low cost, and (ii) VisDrone2019-DET, a large-scale public real-world aerial dataset providing aerial-perspective vehicle and pedestrian boxes. The goal is a detector that runs at 55+ FPS on an RTX-class GPU and, after TensorRT INT8 export, at viable frame rates on the Jetson Orin Nano edge platform.

2. Related work

Aerial object detection. VisDrone (Zhu et al., 2018) established a large-scale benchmark for drone-captured footage and demonstrated that ImageNet-pretrained detectors transfer poorly to the aerial regime. DOTA and xView extend coverage to satellite and high-altitude platforms. Anti-UAV and DroneVehicle datasets target the drone-as-target detection problem specifically.

Sim-to-real for perception. NVIDIA Isaac Sim and similar photorealistic simulators have been used for generating labeled training data when real-world annotation is expensive. Domain randomization (DR) over lighting, texture, and object placement is standard practice. Our setup uses a fixed scene set without DR in all three configurations (v1–v3); the resulting domain gap between sim and VisDrone is an expected cost.

Class imbalance in detection. Class imbalance is a known problem in detection training. Focal loss, class-balanced sampling, and copy-paste augmentation are common mitigations. We apply the simplest effective remedy — image-level oversampling — and find it sufficient to recover a collapsed class.

3. Class schema

The 9-class schema and VisDrone mapping:

0 — drone: unmanned aircraft in frame (sim only; absent from VisDrone)
1 — person_aerial: person seen from above (VisDrone: pedestrian, people)
2 — vehicle: car, van, truck, bus (VisDrone: car, van, truck, bus, tricycle)
3 — bicycle_motorcycle: two-wheelers (VisDrone: bicycle, motor)
4 — landing_pad: ground marker (sim only; no public label source)
5 — powerline_pole: thin vertical / linear structure (sim only)
6 — person: person at ground level / normal angle (sim only)
7 — animal: generic animal (no training data in v1–v3)
8 — boat: watercraft (no training data in v1–v3)

Classes 4–8 have no VisDrone source. Of these, landing_pad, powerline_pole, animal, and boat have limited or zero sim coverage; they are present in the schema for forward compatibility but produce no meaningful AP in current models.

4. Datasets

4.1 Simulation data (v1 corpus)

We drove NVIDIA Isaac Sim headlessly over three scene types (office, warehouse, hospital) with scripted randomized drone trajectories. Per-frame ground truth was produced by projecting known world-space object poses through the camera intrinsic model to COCO-format bounding boxes, giving perfect labels at zero human annotation cost.

Train: 16,200 frames, 10,800 boxes
Val: 1,800 frames, 1,200 boxes
Box distribution: vehicle (class 2) ~60%, drone (class 0) ~40%
Person / person_aerial: near-zero (scenes lacked overhead person views)
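
The label-generation step described above can be sketched as a pinhole projection of a 3D box into the image. This is a minimal illustration, not the pipeline's actual code: it assumes the object pose has already been transformed into the camera frame (z pointing forward), that the object is described by an axis-aligned box, and that the intrinsics are a simple pinhole model without distortion. The function name project_box and all parameter names are hypothetical.

```python
from itertools import product


def project_box(center, size, fx, fy, cx, cy, img_w, img_h):
    """Project an axis-aligned 3D box (camera frame, z forward) to a COCO bbox.

    Returns [x, y, w, h] in pixels, or None if any corner lies behind the
    camera or the projected box falls entirely outside the image.
    """
    xs, ys = [], []
    # Visit the 8 corners of the box: center +/- half the extent on each axis.
    for sx, sy, sz in product((-0.5, 0.5), repeat=3):
        X = center[0] + sx * size[0]
        Y = center[1] + sy * size[1]
        Z = center[2] + sz * size[2]
        if Z <= 0:
            return None  # corner behind the camera; skip this object
        # Pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy
        xs.append(fx * X / Z + cx)
        ys.append(fy * Y / Z + cy)
    # Clip the enclosing rectangle to the image bounds.
    x0, x1 = max(min(xs), 0.0), min(max(xs), float(img_w))
    y0, y1 = max(min(ys), 0.0), min(max(ys), float(img_h))
    if x1 <= x0 or y1 <= y0:
        return None
    return [x0, y0, x1 - x0, y1 - y0]
```

A full pipeline would first apply the world-to-camera transform to each object pose and might also filter boxes that are occluded or too small to be useful; neither step is shown here.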

4.2 VisDrone2019-DET

VisDrone is a large-scale benchmark of drone-captured real-world footage over urban areas. We use the DET (detection) split:

Train: 6,471 images, 343,204 boxes
Val: 548 images, 38,759 boxes
Classes mapped to schema: pedestrian → person_aerial (1), people → person_aerial (1), car/van/truck/bus/tricycle → vehicle (2), bicycle → bicycle_motorcycle (3), motor → bicycle_motorcycle (3)
Ignored: awning-tricycle, others (no schema mapping)

VisDrone contains no drone-as-target boxes. All drone class data in the merged corpus comes from simulation.
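
The class mapping above could be applied while converting VisDrone annotations to YOLO-format labels, roughly as follows. The sketch assumes the standard VisDrone-DET annotation layout (bbox_left, bbox_top, bbox_width, bbox_height, score, object_category, truncation, occlusion) and its usual category ids; the function name convert_line is hypothetical and the actual conversion script may differ.

```python
# VisDrone category id -> schema class id.
# Ids 0 (ignored regions), 8 (awning-tricycle) and 11 (others) have no mapping.
VISDRONE_TO_SCHEMA = {
    1: 1,   # pedestrian -> person_aerial
    2: 1,   # people     -> person_aerial
    3: 3,   # bicycle    -> bicycle_motorcycle
    4: 2,   # car        -> vehicle
    5: 2,   # van        -> vehicle
    6: 2,   # truck      -> vehicle
    7: 2,   # tricycle   -> vehicle
    9: 2,   # bus        -> vehicle
    10: 3,  # motor      -> bicycle_motorcycle
}


def convert_line(line, img_w, img_h):
    """Convert one VisDrone annotation line to a YOLO label line.

    Returns "cls xc yc w h" with coordinates normalized to [0, 1],
    or None when the category is unmapped or the box is degenerate.
    """
    # Some annotation files end each line with a trailing comma.
    fields = line.strip().rstrip(",").split(",")
    left, top, w, h = (float(v) for v in fields[:4])
    cls = VISDRONE_TO_SCHEMA.get(int(fields[5]))
    if cls is None or w <= 0 or h <= 0:
        return None
    xc = (left + w / 2) / img_w
    yc = (top + h / 2) / img_h
    return f"{cls} {xc:.6f} {yc:.6f} {w / img_w:.6f} {h / img_h:.6f}"
```

For example, a car annotated as 100,200,50,40,1,4,0,0 in a 1000×800 image becomes the label line "2 0.125000 0.275000 0.050000 0.050000".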

5. Training configuration

All runs: YOLOv8n base, Ultralytics 8.4.66, batch=64, 50 epochs, imgsz=640, AdamW optimizer, 2× RTX 5090 (hoopoe GPU server). No domain randomization, no mosaic augmentation for rare classes. ONNX export: 12 MB, FP32, opset 11.

V1: sim-only corpus (18K images). Val: sim val set.
V2: sim + VisDrone (21,471 images, 364K boxes). Val: VisDrone val set.
V3: sim + VisDrone + 4× oversampled drone images (48,717 images). Val: VisDrone val set.
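
For orientation, the settings listed above map onto the Ultralytics Python API approximately as shown below. This is a sketch under assumptions: the dataset file name aerial9.yaml and the helper names are hypothetical, the device list assumes the two GPUs are visible as indices 0 and 1, and any setting not stated above is left at the library default.

```python
def train_kwargs(data_yaml):
    """Training settings taken from the configuration described above."""
    return dict(
        data=data_yaml,      # dataset YAML listing train/val paths and the 9 class names
        epochs=50,
        imgsz=640,
        batch=64,
        optimizer="AdamW",
        device=[0, 1],       # assumption: both GPUs used for one run
    )


def run_training(data_yaml="aerial9.yaml"):
    # Imported lazily so train_kwargs() can be inspected without the package installed.
    from ultralytics import YOLO

    model = YOLO("yolov8n.pt")            # pretrained YOLOv8n base
    model.train(**train_kwargs(data_yaml))
    # Exports the model object as it stands after training; loading the saved
    # best checkpoint explicitly before exporting may be preferable.
    model.export(format="onnx", opset=11)
```

Calling run_training() starts a run; swapping the dataset YAML is the only change needed between the v1, v2, and v3 corpora in this sketch.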

6. Results

6.1 V1 — simulation baseline

mAP50 = 0.471, mAP50-95 = 0.341
Precision = 0.79, Recall = 0.41
Vehicle AP50 = 0.895
Drone AP50 = 0.047
Person AP50 = 0.894
Person_aerial AP50 = 0.000 (no aerial person views in sim)
Inference: 18 ms / 55 FPS on RTX 5090

The high per-class AP on vehicle and person reflects easy sim geometry. The 0.471 headline is an artifact of evaluating on the same simulator distribution used for training.

6.2 V2 — VisDrone merge

mAP50 = 0.376 (VisDrone val), mAP50-95 = 0.215
Precision = 0.54, Recall = 0.36
Person_aerial AP50 = 0.377 (+0.377 vs v1)
Vehicle AP50 = 0.755 (−0.140 vs v1)
Bicycle_motorcycle AP50 = 0.365 (first non-zero result)
Drone AP50 = 0.010 (−0.037 vs v1; near-collapse)

The mAP50 drop from 0.471 to 0.376 is not a regression; it reflects a harder, real-world val set. The drone collapse is a genuine failure: adding 343K VisDrone boxes diluted drone boxes to 3.5% of all training boxes.

6.3 V3 — 4× drone oversampling

mAP50 = 0.384 (VisDrone val), mAP50-95 = 0.219
Precision = 0.55, Recall = 0.37
Person_aerial AP50 = 0.360 (−0.017 vs v2; stable)
Vehicle AP50 = 0.752 (−0.003 vs v2; stable)
Bicycle_motorcycle AP50 = 0.339 (−0.026 vs v2; small regression)
Drone AP50 = 0.087 (+0.077 vs v2; recovered)

The drone box fraction in the training set rose from ~3.5% to ~10% after 4× oversampling. The class recovered without materially harming the VisDrone classes. Because VisDrone val contains no drone targets, drone AP50 is evaluated against the sim val set; it should therefore be read as an indication of how reliably the model fires on simulated drones, not as recall across a large real-world drone val set.
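
As a sanity check on these figures, the effect of image-level oversampling on a class fraction can be worked out directly. The derivation below reads the ~3.5% and ~10% values as box fractions (an assumption) and introduces two symbols that do not appear elsewhere in this paper: k for the repeat factor and c for the fraction of all boxes that are non-drone boxes sharing an image with a drone.

```latex
% f  : drone box fraction before oversampling
% k  : image-level repeat factor for drone-containing images
% c  : fraction of all boxes that are non-drone boxes in drone-containing images
\[
f' = \frac{k f}{k\,(f + c) + (1 - f - c)}
\]
% Upper bound, assuming drone images contain only drone boxes (c = 0),
% with f = 0.035 and k = 4:
\[
f'_{\max} = \frac{4 \times 0.035}{4 \times 0.035 + 0.965}
          = \frac{0.14}{1.105} \approx 0.127
\]
```

The reported ~10% sits below this ~12.7% bound, which is what one would expect if the duplicated images also carry some non-drone boxes.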

7. Discussion

7.1 Sim-to-real gap in mAP reporting

V1's mAP50 = 0.471 on a sim val set is not comparable to the v2/v3 mAP50 on a real-world val set. Treating the v1 number as a baseline and the v2 number as a regression would be a measurement error. The correct interpretation is that v2 and v3 provide the first meaningful real-world evaluation: they are scored on VisDrone val, which contains real footage that the model has not seen during training.

7.2 Class imbalance as the dominant failure mode

The drone collapse in v2 was caused entirely by class imbalance. At 3.5% of total box count, the gradient signal for the drone class was insufficient to maintain the v1 representations acquired from sim. With collapse observed at 3.5% and recovery at ~10%, we place the threshold at approximately 5%: below that, a class in a mixed-corpus training run is at risk of collapse without explicit rebalancing.

The fix — 4× image-level oversampling of drone-containing images — is effective and simple. More principled alternatives (copy-paste augmentation, class-weighted focal loss, synthetic drone compositing) remain for future work.
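
Image-level oversampling of this kind can be sketched in a few lines. The example is illustrative only: it represents each training image as a (image_id, class_ids) pair, which is an assumed in-memory form rather than the project's actual data structure, and the function names are hypothetical.

```python
from collections import Counter

DRONE_CLASS = 0  # schema id for the drone class


def oversample(samples, target_class=DRONE_CLASS, factor=4):
    """Repeat every image that contains target_class `factor` times.

    samples: list of (image_id, [class_id, ...]) pairs, one per image.
    Images without the target class are kept once.
    """
    out = []
    for image_id, classes in samples:
        repeats = factor if target_class in classes else 1
        out.extend([(image_id, classes)] * repeats)
    return out


def box_fraction(samples, target_class=DRONE_CLASS):
    """Fraction of all boxes in `samples` that belong to target_class."""
    counts = Counter(c for _, classes in samples for c in classes)
    total = sum(counts.values())
    return counts[target_class] / total if total else 0.0
```

Because whole images are repeated, any non-drone boxes in a drone-containing image are duplicated as well, so the resulting class fraction rises by less than the nominal factor. Checking box_fraction before and after oversampling is a cheap way to confirm a rare class has cleared the intended share.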

7.3 Missing classes

Landing_pad, powerline_pole, animal, and boat have zero to near-zero training data across all three versions. These classes require either additional sim generation (for landing_pad and powerline_pole) or public dataset sourcing (for animal and boat at altitude). They are included in the schema now so inference code does not need to change when data becomes available.

8. Deployment

The v3 ONNX (12 MB, FP32) is uploaded to S3 and downloaded by the drone setup script. The perception layer auto-selects the domain detector over the COCO model when it is present, using the same ONNX inference path. TensorRT int8 export targeting the Jetson Orin Nano is the next step; preliminary benchmarks suggest 8–12 FPS at int8 on the Nano's integrated GPU, which is within budget for the perception pipeline.
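
The auto-selection behaviour described above amounts to a file-existence check before the ONNX session is created. The following sketch shows one way it could look; the function name and both file names are hypothetical placeholders, not the names used by the actual setup script.

```python
from pathlib import Path


def select_detector(model_dir,
                    domain_name="aerial9_v3.onnx",   # hypothetical file name
                    coco_name="yolov8n_coco.onnx"):  # hypothetical file name
    """Return the domain detector path if present, else the COCO fallback."""
    domain = Path(model_dir) / domain_name
    if domain.is_file():
        return domain
    return Path(model_dir) / coco_name
```

Since both models share the same ONNX inference path, the caller only needs the returned path; the number of output classes (9 versus 80) would be the main difference to handle downstream.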

9. Conclusion

We trained a 9-class domain object detector for aerial autonomy across three rounds of training. The main result is not the final mAP number — it is the failure mode: class imbalance at the 3–5% box-fraction level causes rare classes to collapse in mixed-corpus training. Simple image-level oversampling recovers the class. The next phase is expanding rare-class coverage (landing_pad, powerline_pole) via sim data generation and running the model on the Jetson Orin Nano in hardware-in-the-loop evaluation.

See the blog post for an accessible walkthrough: We Trained a Domain Detector for Drones. One Class Collapsed to Zero.