[Paper Review] Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation
This paper benchmarks five open-vocabulary detectors on aerial imagery (LAE-80C) under zero-shot settings, showing severe domain transfer gaps dominated by semantic confusion rather than localization. Prompt engineering helps little, whereas shrinking the vocabulary from 80 to 3.2 classes improves F1 by about 15x.
Open-vocabulary object detection (OVD) enables zero-shot recognition of novel categories through vision-language models, achieving strong performance on natural images. However, transferability to aerial imagery remains unexplored. We present the first systematic benchmark evaluating five state-of-the-art OVD models on the LAE-80C aerial dataset (3,592 images, 80 categories) under strict zero-shot conditions. Our experimental protocol isolates semantic confusion from visual localization through Global, Oracle, and Single-Category inference modes. Results reveal severe domain transfer failure: the best model (OWLv2) achieves only 27.6% F1-score with 69% false positive rate. Critically, reducing vocabulary size from 80 to 3.2 classes yields a 15x improvement, demonstrating that semantic confusion is the primary bottleneck. Prompt engineering strategies, such as domain-specific prefixing and synonym expansion, fail to provide meaningful performance gains. Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), exposing brittleness to imaging conditions. These findings establish baseline expectations and highlight the need for domain-adaptive approaches in aerial OVD.
Motivation and objectives
- Assess zero-shot transfer of open-vocabulary detectors trained on ground-level images to aerial imagery.
- Investigate how vocabulary size and prompting strategies affect aerial OVD performance.
- Isolate semantic confusion from visual localization to identify the primary bottleneck in aerial OVD.
- Provide baseline expectations for UAV deployment and guide domain-adaptive future work.
Proposed method
- Evaluate five OVD models (Grounding DINO, OWLv2, YOLO-World, YOLOE, LLMDet) on LAE-80C in zero-shot mode, without any aerial fine-tuning.
- Use three inference modes (Global, Oracle, Single-Category) to separate semantic confusion from localization.
- Apply prompt engineering (the prefix template "Aerial view of {category}") and synonym expansion to mitigate lexical gaps.
- Post-process predictions with box and text thresholds, followed by class-wise NMS and a final score threshold.
- Report Precision, Recall, F1, TP, FP, and FN, matching predictions to ground truth at an IoA threshold of 0.7 rather than IoU to account for scale variance in aerial imagery.
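The matching step behind the TP/FP/FN counts can be sketched as follows. This is a minimal illustration, not the paper's evaluation code. It assumes IoA means the intersection area divided by the ground-truth box area, and it assumes greedy, score-ordered, class-wise one-to-one matching; the review does not specify either detail. The names `ioa` and `match_counts` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(b: Box) -> float:
    """Area of an axis-aligned box; degenerate boxes have area 0."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def ioa(pred: Box, gt: Box) -> float:
    """Intersection over ground-truth area (ASSUMED definition of IoA)."""
    iw = min(pred[2], gt[2]) - max(pred[0], gt[0])
    ih = min(pred[3], gt[3]) - max(pred[1], gt[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    gt_area = area(gt)
    return (iw * ih) / gt_area if gt_area > 0 else 0.0


def match_counts(
    preds: List[Tuple[str, float, Box]],  # (label, score, box)
    gts: List[Tuple[str, Box]],           # (label, box)
    thr: float = 0.7,
) -> Tuple[int, int, int]:
    """Greedy class-wise matching (assumed protocol). Returns (TP, FP, FN)."""
    used = set()  # indices of ground-truth boxes already matched
    tp = fp = 0
    # Visit predictions from highest to lowest score.
    for label, _score, box in sorted(preds, key=lambda p: -p[1]):
        best_j, best_v = -1, thr
        for j, (gt_label, gt_box) in enumerate(gts):
            if j in used or gt_label != label:
                continue
            v = ioa(box, gt_box)
            if v >= best_v:
                best_j, best_v = j, v
        if best_j >= 0:
            used.add(best_j)
            tp += 1
        else:
            fp += 1
    fn = len(gts) - len(used)
    return tp, fp, fn


preds = [
    ("car", 0.9, (0, 0, 10, 10)),         # covers the car ground truth
    ("car", 0.8, (100, 100, 110, 110)),   # no ground truth nearby
    ("ship", 0.7, (0, 0, 10, 10)),        # right class name, wrong place
]
gts = [("car", (1, 1, 9, 9)), ("ship", (50, 50, 60, 60))]
print(match_counts(preds, gts))
```

Under this assumed definition, a prediction that fully contains a small ground-truth box scores IoA = 1 however loose it is. That makes the criterion more forgiving than IoU for the small objects common in aerial scenes.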

Experimental results
Research questions
- RQ1: How well do general-purpose open-vocabulary detectors transfer to aerial imagery without fine-tuning?
- RQ2: To what extent does semantic confusion (vocabulary overlap) limit performance versus visual localization?
- RQ3: Do prompt engineering techniques meaningfully improve aerial OVD performance?
- RQ4: How does model performance vary across different aerial datasets with varying scale and density?
- RQ5: What baseline operating characteristics emerge for potential real-world UAV deployment?
Key findings
| Model | Precision | Recall | F1 Score | TP | FP | FN |
|---|---|---|---|---|---|---|
| OWLv2 | 0.313 | 0.247 | 0.276 | 21408 | 47058 | 65150 |
| LLMDet | 0.441 | 0.073 | 0.125 | 6308 | 8009 | 80250 |
| DINO-Separate | 0.604 | 0.039 | 0.074 | 3409 | 2233 | 83149 |
| DINO-Batch | 0.697 | 0.031 | 0.059 | 2650 | 1151 | 83908 |
| DINO-Synonyms | 0.628 | 0.026 | 0.050 | 2244 | 1327 | 84314 |
| DINO-AerialPhrase | 0.926 | 0.023 | 0.045 | 2014 | 160 | 84544 |
| YOLOE | 0.367 | 0.020 | 0.039 | 1767 | 3045 | 84791 |
| YOLO-World | 0.421 | 0.015 | 0.028 | 1269 | 1744 | 85289 |
| DINO-AllClasses | 0.111 | 0.003 | 0.005 | 218 | 1748 | 86340 |
- OWLv2 is the best-performing model in zero-shot aerial transfer (F1 = 27.6%), but its precision is low: 69% of its detections are false positives.
- Semantic confusion dominates; reducing vocabulary from 80 to 3.2 classes yields a 15× F1-score improvement.
- Prompt engineering (Aerial prefix, synonyms) provides minimal or negative gains.
- Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), showing brittleness to imaging conditions.
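- Conservative detectors achieve high precision but very low recall (e.g. DINO-AerialPhrase: precision 0.926, recall 0.023); the most aggressive detector, OWLv2, reaches the highest recall (0.247) at the cost of many false positives.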
- No model achieves F1 above 28%, indicating current open-vocabulary approaches are not yet suitable for autonomous aerial deployment.
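The table's Precision, Recall, and F1 columns follow directly from the TP, FP, and FN counts, so they can be re-derived as a quick consistency check. The sketch below uses the standard definitions and three rows copied from the table; the helper name `prf` is hypothetical.

```python
def prf(tp: int, fp: int, fn: int):
    """Standard precision, recall, and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
    return precision, recall, f1


# (TP, FP, FN) copied from the results table above.
counts = {
    "OWLv2": (21408, 47058, 65150),
    "LLMDet": (6308, 8009, 80250),
    "DINO-AerialPhrase": (2014, 160, 84544),
}

for name, (tp, fp, fn) in counts.items():
    p, r, f1 = prf(tp, fp, fn)
    # 1 - precision is the share of detections that are false positives.
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f} FP share={1 - p:.3f}")
```

For OWLv2 this reproduces P = 0.313, R = 0.247, and F1 = 0.276. Its false-positive share of detections is about 0.687, which corresponds to the 69% figure quoted in the findings. TP + FN equals 86,558 in every row, so all models are scored against the same set of ground-truth instances.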

This review was generated by AI and checked by a human editor.