QUICK REVIEW

[Paper Review] Do Open-Vocabulary Detectors Transfer to Aerial Imagery? A Comparative Evaluation

Christos Tsourveloudis | arXiv (Cornell University) | Jan 13, 2026
Domain Adaptation and Few-Shot Learning | Citations: 0
One-Line Summary

This paper benchmarks five open-vocabulary detectors on aerial imagery (LAE-80C) under zero-shot settings, showing severe domain transfer gaps dominated by semantic confusion rather than localization. Prompting and vocabulary reductions help little.

ABSTRACT

Open-vocabulary object detection (OVD) enables zero-shot recognition of novel categories through vision-language models, achieving strong performance on natural images. However, transferability to aerial imagery remains unexplored. We present the first systematic benchmark evaluating five state-of-the-art OVD models on the LAE-80C aerial dataset (3,592 images, 80 categories) under strict zero-shot conditions. Our experimental protocol isolates semantic confusion from visual localization through Global, Oracle, and Single-Category inference modes. Results reveal severe domain transfer failure: the best model (OWLv2) achieves only 27.6% F1-score with 69% false positive rate. Critically, reducing vocabulary size from 80 to 3.2 classes yields 15x improvement, demonstrating that semantic confusion is the primary bottleneck. Prompt engineering strategies, such as domain-specific prefixing and synonym expansion, fail to provide meaningful performance gains. Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), exposing brittleness to imaging conditions. These findings establish baseline expectations and highlight the need for domain-adaptive approaches in aerial OVD.

Motivation and Objectives

  • Assess zero-shot transfer of open-vocabulary detectors trained on ground-level images to aerial imagery.
  • Investigate how vocabulary size and prompting strategies affect aerial OVD performance.
  • Isolate semantic confusion from visual localization to identify the primary bottlenecks in aerial OVD.
  • Provide baseline expectations for UAV deployment and guide domain-adaptive future work.

Proposed Method

  • Evaluate five OVD models (Grounding DINO, OWLv2, YOLO-World, YOLO-E, LLMDet) on LAE-80C in zero-shot mode without aerial fine-tuning.
  • Use three inference modes (Global, Oracle, Single-Category) to separate semantic confusion from localization.
  • Apply prompt engineering (prefixing each category as "Aerial view of {category}") and synonym expansion to mitigate lexical gaps.
  • Post-process predictions with box and text thresholds, then class-wise NMS and final score threshold.
  • Evaluate with Precision, Recall, and F1, along with TP, FP, and FN counts, matching predictions to ground truth at IoA=0.7 rather than IoU to account for scale variance in aerial imagery.
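The post-processing and matching steps listed above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the paper's code: the box format (x1, y1, x2, y2), every threshold value other than IoA=0.7, the greedy one-to-one matching, and the definition of IoA as intersection area divided by the predicted box's area are all assumptions, and the function names are hypothetical. The text threshold is left out because it operates on model-specific token scores.

```python
from collections import defaultdict

def area(box):
    x1, y1, x2, y2 = box
    return max(0.0, x2 - x1) * max(0.0, y2 - y1)

def intersection(a, b):
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def iou(a, b):
    inter = intersection(a, b)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def ioa(pred, gt):
    # Assumption: IoA = intersection / area of the predicted box.
    pred_area = area(pred)
    return intersection(pred, gt) / pred_area if pred_area > 0 else 0.0

def postprocess(dets, box_thr=0.25, nms_iou=0.5, final_thr=0.3):
    """dets: list of (box, score, label). Thresholds are illustrative values."""
    by_label = defaultdict(list)
    for det in dets:
        if det[1] >= box_thr:              # initial box-score filter
            by_label[det[2]].append(det)
    kept = []
    for label_dets in by_label.values():   # NMS runs separately per class
        label_dets.sort(key=lambda d: d[1], reverse=True)
        survivors = []
        for det in label_dets:
            # Keep a box only if no higher-scoring same-class box overlaps it.
            if all(iou(det[0], s[0]) <= nms_iou for s in survivors):
                survivors.append(det)
        kept.extend(survivors)
    return [d for d in kept if d[1] >= final_thr]  # final score threshold

def count_matches(dets, gts, ioa_thr=0.7):
    """gts: list of (box, label). Greedy one-to-one matching by descending score."""
    used, tp = set(), 0
    for box, _score, label in sorted(dets, key=lambda d: d[1], reverse=True):
        best, best_val = None, ioa_thr
        for i, (gt_box, gt_label) in enumerate(gts):
            if i in used or gt_label != label:
                continue
            val = ioa(box, gt_box)
            if val >= best_val:
                best, best_val = i, val
        if best is not None:
            used.add(best)
            tp += 1
    return tp, len(dets) - tp, len(gts) - tp   # (TP, FP, FN)

# Toy example with made-up boxes.
dets = [((0, 0, 10, 10), 0.9, "car"), ((1, 1, 10, 10), 0.8, "car"),
        ((0, 0, 10, 10), 0.7, "ship"), ((50, 50, 60, 60), 0.2, "car")]
gts = [((0, 0, 10, 10), "car"), ((20, 20, 30, 30), "plane")]
kept = postprocess(dets)
counts = count_matches(kept, gts)
```

In the toy example, the second "car" box is suppressed by NMS while the "ship" box on the same location survives because suppression is per class. That surviving box then counts as a false positive, which is the semantic-confusion failure the review describes.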
Figure 2: Comparison of Precision, Recall, and F1 Score for all evaluated models.
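The prompt variants named in the method (the aerial prefix and synonym expansion) could be built along the following lines. This is a hedged sketch: the synonym table is invented for illustration, and `build_prompts` and `prompt_to_category` are hypothetical helpers rather than anything taken from the paper.

```python
# Hypothetical synonym table; the paper's actual synonym lists are not given here.
SYNONYMS = {"airplane": ["aircraft", "plane"], "ship": ["boat", "vessel"]}

def build_prompts(category, use_synonyms=False, aerial_prefix=False):
    """Return the text queries sent to the detector for one category."""
    terms = [category]
    if use_synonyms:
        terms += SYNONYMS.get(category, [])
    if aerial_prefix:
        terms = [f"Aerial view of {t}" for t in terms]
    return terms

def prompt_to_category(categories, **kwargs):
    """Map every generated prompt back to its canonical category so that
    detections triggered by a synonym are scored under the original class."""
    return {p: c for c in categories for p in build_prompts(c, **kwargs)}

plain = build_prompts("airplane")
prefixed = build_prompts("airplane", aerial_prefix=True)
mapping = prompt_to_category(["airplane", "ship"], use_synonyms=True)
```

Mapping prompts back to canonical categories matters because synonym expansion enlarges the query vocabulary, so a detection labeled "vessel" still has to be evaluated as "ship".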

Experimental Results

Research Questions

  • RQ1: How well do general-purpose open-vocabulary detectors transfer to aerial imagery without fine-tuning?
  • RQ2: To what extent does semantic confusion (vocabulary overlap) limit performance versus visual localization?
  • RQ3: Do prompt engineering techniques meaningfully improve aerial OVD performance?
  • RQ4: How does model performance vary across different aerial datasets with varying scale and density?
  • RQ5: What baseline operating characteristics emerge for potential real-world UAV deployment?

Key Findings

Model              Precision  Recall  F1 Score     TP     FP     FN
OWLv2                  0.313   0.247     0.276  21408  47058  65150
LLMDet                 0.441   0.073     0.125   6308   8009  80250
DINO-Separate          0.604   0.039     0.074   3409   2233  83149
DINO-Batch             0.697   0.031     0.059   2650   1151  83908
DINO-Synonyms          0.628   0.026     0.050   2244   1327  84314
DINO-AerialPhrase      0.926   0.023     0.045   2014    160  84544
YOLOE                  0.367   0.020     0.039   1767   3045  84791
YOLO-World             0.421   0.015     0.028   1269   1744  85289
DINO-AllClasses        0.111   0.003     0.005    218   1748  86340
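The reported metrics follow from the raw counts by the standard definitions, which a short check can confirm. The sketch below uses the OWLv2 row (TP=21408, FP=47058, FN=65150); the function name `prf` is a hypothetical helper, not part of the paper.

```python
def prf(tp, fp, fn):
    """Precision, recall and F1 from detection counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Counts from the OWLv2 row of the results table.
p, r, f1 = prf(21408, 47058, 65150)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.313 0.247 0.276

# Share of OWLv2 detections that are false positives (the "69%" figure).
fp_share = 47058 / (21408 + 47058)
print(round(fp_share, 2))  # 0.69
```

TP + FN sums to 86,558 in every row, consistent with all models being scored against the same set of ground-truth objects.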
  • OWLv2 is the best-performing model in zero-shot aerial transfer, yet it reaches only F1=27.6% at a precision of 0.313, meaning 69% of its detections are false positives.
  • Semantic confusion dominates; reducing vocabulary from 80 to 3.2 classes yields a 15× F1-score improvement.
  • Prompt engineering (Aerial prefix, synonyms) provides minimal or negative gains.
  • Performance varies dramatically across datasets (F1: 0.53 on DIOR, 0.12 on FAIR1M), showing brittleness to imaging conditions.
  • Conservative detectors achieve high precision but very low recall (e.g., DINO-AerialPhrase: precision 0.926, recall 0.023); the more aggressive OWLv2 reaches the highest recall (0.247) at the cost of many false positives.
  • No model achieves F1 above 28%, indicating current open-vocabulary approaches are not yet suitable for autonomous aerial deployment.
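The vocabulary-reduction finding rests on the three inference modes (Global, Oracle, Single-Category). The review does not spell out their definitions, so the sketch below rests on assumptions: Global queries all categories at once, Oracle queries only the categories present in the image's ground truth (which would account for the drop from 80 to an average of 3.2 classes), and Single-Category issues one query per present category. The function name and mode strings are hypothetical.

```python
def query_vocabularies(all_categories, gt_labels, mode):
    """Return a list of vocabularies; the detector is run once per vocabulary.

    Assumed semantics (not confirmed by the source):
      global -> one pass with every category
      oracle -> one pass with only the categories present in the image
      single -> one pass per present category, each with a 1-word vocabulary
    """
    if mode == "global":
        return [list(all_categories)]
    present = sorted(set(gt_labels))
    if mode == "oracle":
        return [present]
    if mode == "single":
        return [[c] for c in present]
    raise ValueError(f"unknown mode: {mode}")

# Toy image whose ground truth contains two of three dataset categories.
categories = ["airplane", "ship", "vehicle"]
labels = ["ship", "ship", "airplane"]
global_vocab = query_vocabularies(categories, labels, "global")
oracle_vocab = query_vocabularies(categories, labels, "oracle")
single_vocab = query_vocabularies(categories, labels, "single")
```

Under these assumptions, the gap between Global and Oracle results isolates semantic confusion: localization ability is unchanged, and only the number of competing class names differs.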
Figure 3: True Positive (TP), False Positive (FP), and False Negative (FN) counts across models.

This review was generated by AI and checked by a human editor.