QUICK REVIEW

[Paper Review] Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation

Christos Tsourveloudis|arXiv (Cornell University)|Jan 13, 2026

Domain Adaptation and Few-Shot Learning|Citations: 0

One-Sentence Summary

This paper benchmarks five open-vocabulary detectors on aerial imagery (LAE-80C) under zero-shot settings, showing a severe domain transfer gap that is dominated by semantic confusion rather than localization. Prompt engineering helps little, whereas shrinking the vocabulary from 80 to 3.2 classes improves F1 roughly 15-fold.

ABSTRACT

Open-vocabulary object detection (OVD) enables zero-shot recognition of novel categories through vision-language models, achieving strong performance on natural images. However, transferability to aerial imagery remains unexplored. We present the first systematic benchmark evaluating five state-of-the-art OVD models on the LAE-80C aerial dataset (3,592 images, 80 categories) under strict zero-shot conditions. Our experimental protocol isolates semantic confusion from visual localization through Global, Oracle, and Single-Category inference modes. Results reveal severe domain transfer failure: the best model (OWLv2) achieves only 27.6% F1-score with 69% false positive rate. Critically, reducing vocabulary size from 80 to 3.2 classes yields 15x improvement, demonstrating that semantic confusion is the primary bottleneck. Prompt engineering strategies, such as domain-specific prefixing and synonym expansion, fail to provide meaningful performance gains. Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), exposing brittleness to imaging conditions. These findings establish baseline expectations and highlight the need for domain-adaptive approaches in aerial OVD.

Research Motivation and Objectives

Assess zero-shot transfer of open-vocabulary detectors trained on ground-level images to aerial imagery.
Investigate how vocabulary size and prompting strategies affect aerial OVD performance.
Isolate semantic confusion from visual localization to identify the primary bottleneck in aerial OVD.
Provide baseline expectations for UAV deployment and guide domain-adaptive future work.

Proposed Method

Evaluate five OVD models (Grounding DINO, OWLv2, YOLO-World, YOLOE, LLMDet) on LAE-80C in zero-shot mode, with no aerial fine-tuning.
Use three inference modes (Global, Oracle, Single-Category) to separate semantic confusion from localization.
Apply prompt engineering (the prefix template "Aerial view of {category}") and synonym expansion to mitigate lexical gaps.
Post-process predictions with box and text thresholds, followed by class-wise NMS and a final score threshold.
Report Precision, Recall, F1, and TP/FP/FN counts, matching predictions to ground truth at an IoA threshold of 0.7 rather than IoU to account for scale variance in aerial imagery.
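The post-processing and matching steps listed above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the review does not say which box area forms the IoA denominator, so the sketch assumes the predicted box's area, and the threshold defaults and all function names (`ioa`, `classwise_nms`, `postprocess`) are hypothetical placeholders.

```python
from collections import defaultdict


def area(box):
    """Area of an (x1, y1, x2, y2) box; degenerate boxes count as 0."""
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)


def intersection(a, b):
    """Overlap area of two (x1, y1, x2, y2) boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)


def iou(a, b):
    """Standard intersection over union."""
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def ioa(pred, gt):
    """Intersection over area.

    ASSUMPTION: the denominator is the predicted box's area; the review
    does not specify which box the paper uses.
    """
    a = area(pred)
    return intersection(pred, gt) / a if a > 0 else 0.0


def classwise_nms(dets, iou_thresh=0.5):
    """Greedy NMS run separately per label.

    dets: list of (box, score, label) tuples.
    """
    by_label = defaultdict(list)
    for det in dets:
        by_label[det[2]].append(det)
    kept = []
    for group in by_label.values():
        group.sort(key=lambda d: d[1], reverse=True)
        survivors = []
        for det in group:
            # Keep a box only if it does not overlap a higher-scoring survivor.
            if all(iou(det[0], s[0]) <= iou_thresh for s in survivors):
                survivors.append(det)
        kept.extend(survivors)
    return kept


def postprocess(dets, score_thresh=0.3, iou_thresh=0.5):
    """Class-wise NMS first, then the final score threshold (placeholder values)."""
    return [d for d in classwise_nms(dets, iou_thresh) if d[1] >= score_thresh]
```

Under this assumed definition, a 10x10 prediction lying fully inside a 20x20 ground-truth box has an IoU of 0.25 but an IoA of 1.0, which gives one plausible reading of why an area-normalized criterion is more forgiving when object scale varies widely.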

Figure 2: Comparison of Precision, Recall, and F1 Score for all evaluated models.

Experimental Results

Research Questions

RQ1: How well do general-purpose open-vocabulary detectors transfer to aerial imagery without fine-tuning?
RQ2: To what extent does semantic confusion (vocabulary overlap) limit performance versus visual localization?
RQ3: Do prompt engineering techniques meaningfully improve aerial OVD performance?
RQ4: How does model performance vary across different aerial datasets with varying scale and density?
RQ5: What baseline operating characteristics emerge for potential real-world UAV deployment?

Key Findings

Model	Precision	Recall	F1 Score	TP	FP	FN
OWLv2	0.313	0.247	0.276	21408	47058	65150
LLMDet	0.441	0.073	0.125	6308	8009	80250
DINO-Separate	0.604	0.039	0.074	3409	2233	83149
DINO-Batch	0.697	0.031	0.059	2650	1151	83908
DINO-Synonyms	0.628	0.026	0.050	2244	1327	84314
DINO-AerialPhrase	0.926	0.023	0.045	2014	160	84544
YOLOE	0.367	0.020	0.039	1767	3045	84791
YOLO-World	0.421	0.015	0.028	1269	1744	85289
DINO-AllClasses	0.111	0.003	0.005	218	1748	86340
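The Precision, Recall, and F1 columns follow directly from the TP/FP/FN counts. As a quick consistency check, the short sketch below recomputes them for three rows of the table using the standard definitions; the helper name `prf` is ours, not the paper's.

```python
def prf(tp, fp, fn):
    """Precision, recall, and F1 from raw detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    return precision, recall, f1


# (TP, FP, FN) copied from the table above.
ROWS = {
    "OWLv2": (21408, 47058, 65150),
    "LLMDet": (6308, 8009, 80250),
    "DINO-AerialPhrase": (2014, 160, 84544),
}

for name, (tp, fp, fn) in ROWS.items():
    p, r, f1 = prf(tp, fp, fn)
    # tp + fn is the number of ground-truth instances.
    print(f"{name}: P={p:.3f} R={r:.3f} F1={f1:.3f} GT={tp + fn}")
```

In every row of the table TP + FN sums to 86,558, so all models are scored against the same set of ground-truth instances. The 69% figure quoted for OWLv2 is FP / (TP + FP) = 47058 / 68466, the share of its detections that are wrong, which is 1 minus precision.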

OWLv2 is the best-performing model in zero-shot aerial transfer (F1 = 27.6%), but its precision is only 0.313, meaning 69% of its detections are false positives.
Semantic confusion dominates; reducing vocabulary from 80 to 3.2 classes yields a 15× F1-score improvement.
Prompt engineering (Aerial prefix, synonyms) provides minimal or negative gains.
Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), showing brittleness to imaging conditions.
Conservative detectors (e.g., DINO-AerialPhrase, precision 0.926) achieve high precision but very low recall; the most aggressive detector (OWLv2) reaches the highest recall, still only 0.247, at the cost of many false positives.
No model achieves F1 above 28%, indicating current open-vocabulary approaches are not yet suitable for autonomous aerial deployment.
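For readers who want to reproduce the prompt variants compared above, a minimal sketch of how such prompts might be built is given below. Only the template string "Aerial view of {category}" comes from the review; the function names, the example categories, and the synonym dictionary are hypothetical and are not taken from the LAE-80C label set or the paper's code.

```python
AERIAL_TEMPLATE = "Aerial view of {category}"  # template quoted in the review


def aerial_prompts(categories):
    """Prefix each category name with the aerial-view template."""
    return [AERIAL_TEMPLATE.format(category=c) for c in categories]


def synonym_prompts(categories, synonyms):
    """Expand each category into its own name plus any listed synonyms."""
    return {c: [c] + list(synonyms.get(c, [])) for c in categories}


def phrase_to_category(expanded):
    """Invert the expansion so a detection on any phrase maps back to its category."""
    return {phrase: cat for cat, phrases in expanded.items() for phrase in phrases}


# Hypothetical inputs, for illustration only.
categories = ["airplane", "ship"]
synonyms = {"airplane": ["aircraft", "plane"]}

print(aerial_prompts(categories))
expanded = synonym_prompts(categories, synonyms)
print(phrase_to_category(expanded))
```

Mapping every synonym back to a single category keeps the evaluation vocabulary fixed, so any change in F1 can be attributed to the prompt wording rather than to a change in the label set.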

Figure 3: True Positive (TP), False Positive (FP), and False Negative (FN) counts across models.

This review was generated by AI and checked by a human editor.