QUICK REVIEW

[Paper Review] Open-Text Aerial Detection: A Unified Framework For Aerial Visual Grounding And Detection

Guoting Wei, Xia Yuan|arXiv (Cornell University)|Feb 8, 2026

Multimodal Machine Learning Applications | Cited by 0

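One-Sentence Summary

OTA-Det unifies open-vocabulary aerial detection (OVAD) and remote-sensing visual grounding (RSVG) in a single real-time framework that accepts text input at multiple granularities and detects multiple targets, trained with dense supervision and attribute-level alignment.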

ABSTRACT

Open-Vocabulary Aerial Detection (OVAD) and Remote Sensing Visual Grounding (RSVG) have emerged as two key paradigms for aerial scene understanding. However, each paradigm suffers from inherent limitations when operating in isolation: OVAD is restricted to coarse category-level semantics, while RSVG is structurally limited to single-target localization. These limitations prevent existing methods from simultaneously supporting rich semantic understanding and multi-target detection. To address this, we propose OTA-Det, the first unified framework that bridges both paradigms into a cohesive architecture. Specifically, we introduce a task reformulation strategy that unifies task objectives and supervision mechanisms, enabling joint training across datasets from both paradigms with dense supervision signals. Furthermore, we propose a dense semantic alignment strategy that establishes explicit correspondence at multiple granularities, from holistic expressions to individual attributes, enabling fine-grained semantic understanding. To ensure real-time efficiency, OTA-Det builds upon the RT-DETR architecture, extending it from closed-set detection to open-text detection by introducing several highly efficient modules, achieving state-of-the-art performance on six benchmarks spanning both OVAD and RSVG tasks while maintaining real-time inference at 34 FPS.

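Motivation and Objectives

Bridge OVAD and RSVG in one unified framework that delivers multi-granularity semantic understanding and multi-target detection on aerial imagery.
Reformulate the tasks so that the objectives and supervision density of OVAD and RSVG line up under joint training.
Introduce dense semantic alignment that links holistic expressions to individual attributes.
Develop an efficient RT-DETR-based architecture that supports open-text detection at 34 FPS.
Demonstrate state-of-the-art performance on six OVAD and RSVG benchmarks while keeping inference real-time.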

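Proposed Method

Task reformulation: recast RSVG from pure localization into joint classification and localization, and aggregate annotations at the image level to create dense supervision for both OVAD and RSVG datasets.
Dense semantic alignment: use a large language model to decompose each referring expression into a set of attributes, then build a unified correspondence matrix that provides vision-language supervision at multiple granularities.
Attribute-level data decomposition: extract target-centric attributes (category, color, spatial relation) from each expression as verbatim substrings and classify them.
Unified correspondence matrix: maintain an Object-Query matrix (Q), an Object-Attribute matrix (A), and a mapping M_map, which together enable multi-label, one-to-many grounding and hierarchical attribute aggregation.
OTA-Det architecture: a multimodal backbone (an image encoder plus a text encoder for queries and attributes) feeds decoupled multi-granularity heads; a contrastive head V(T)S computes similarity logits separately for holistic queries and for attributes.
Multi-task loss: combine the localization loss with a semantic alignment loss, using MAL (Matchability-Aware Loss) to align vision-language signals with IoU as the soft target.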
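
One way to picture the correspondence matrices and the alignment objective listed above is the NumPy sketch below. It is an illustration under stated assumptions, not the authors' implementation: the function names, the toy shapes, the temperature of 0.07, and the rule that an object inherits every attribute of a query it matches are all hypothetical. The loss is a plain binary cross-entropy with IoU-scaled soft targets, used as a simple stand-in for MAL rather than MAL itself.

```python
import numpy as np

def build_attribute_matrix(Q, M_map):
    """Object-Attribute matrix implied by Q and M_map.

    Assumption (not from the paper): an object that matches a query
    carries every attribute that the query decomposes into.
    Q:     (num_objects, num_queries)    binary Object-Query matrix
    M_map: (num_queries, num_attributes) binary query-to-attribute map
    """
    return (Q @ M_map > 0).astype(np.float32)

def similarity_logits(visual, text, temperature=0.07):
    """Cosine-similarity logits between object and text embeddings."""
    v = visual / np.linalg.norm(visual, axis=1, keepdims=True)
    t = text / np.linalg.norm(text, axis=1, keepdims=True)
    return v @ t.T / temperature

def soft_target_bce(logits, labels, ious):
    """Binary cross-entropy whose positive targets are scaled by IoU.

    This is a stand-in for MAL, not the MAL formula itself.
    """
    targets = labels * ious[:, None]
    # Numerically stable form of BCE-with-logits.
    loss = (np.maximum(logits, 0) - logits * targets
            + np.log1p(np.exp(-np.abs(logits))))
    return float(loss.mean())

# Toy scene: 3 objects, 2 referring expressions, 3 attributes.
Q = np.array([[1, 0],      # object 0 matches query 0
              [1, 0],      # object 1 also matches query 0 (one-to-many)
              [0, 1]])     # object 2 matches query 1
M_map = np.array([[1, 1, 0],   # query 0 -> attributes 0 and 1
                  [1, 0, 1]])  # query 1 -> attributes 0 and 2
A = build_attribute_matrix(Q, M_map)

# Random embeddings stand in for image-encoder and text-encoder outputs.
rng = np.random.default_rng(0)
visual = rng.normal(size=(3, 8))
query_text = rng.normal(size=(2, 8))
attr_text = rng.normal(size=(3, 8))

# Decoupled heads: one set of logits per granularity.
query_logits = similarity_logits(visual, query_text)
attr_logits = similarity_logits(visual, attr_text)

ious = np.array([0.9, 0.6, 0.8])  # IoU of each matched predicted box
loss = (soft_target_bce(query_logits, Q.astype(np.float32), ious)
        + soft_target_bce(attr_logits, A, ious))
```

In this toy case objects 0 and 1 both match query 0, so a single expression grounds two boxes, and the attribute labels are obtained by aggregating over M_map instead of being annotated separately.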

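Experimental Results

Research Questions

RQ1: Can OVAD and RSVG be effectively unified into one framework that supports both multi-target detection and complex referring expressions in aerial scenes?
RQ2: Does dense multi-granularity semantic alignment improve localization accuracy and reduce spurious semantic alignment compared with holistic sentence-level methods?
RQ3: Does joint training on OVAD and RSVG data provide dense supervision that benefits open-vocabulary, multi-target detection while preserving real-time performance?
RQ4: How do the decoupled multi-granularity heads and attribute-level supervision affect the understanding of fine-grained attributes and compositional queries?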

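Key Findings

OTA-Det achieves state-of-the-art performance on six benchmarks covering both OVAD and RSVG tasks.
Joint training on the OTA-Mix dataset yields strong results on OVAD metrics (AP@50, mAP) and the RSVG metric (Acc@0.5), improving over the baselines.
OTA-Det sustains real-time inference at 34 FPS.
Ablations show that image-level annotation aggregation and attribute-level alignment substantially reduce spurious semantic alignment and raise the Attr-Align score.
Decoupled multi-granularity heads improve alignment by avoiding interference between holistic and attribute signals.
Single-task (task-specific) training can score slightly higher on RSVG, but the jointly trained unified model remains competitive on every task.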
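
For readers unfamiliar with the Acc@0.5 metric cited above, the sketch below shows how it is commonly computed in visual grounding: a prediction counts as correct when its IoU with the ground-truth box is at least 0.5. It assumes horizontal boxes in (x1, y1, x2, y2) form and one predicted box per expression; the function names are illustrative, and the paper's exact evaluation protocol may differ.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def acc_at_iou(preds, gts, threshold=0.5):
    """Fraction of predictions whose IoU with the ground truth meets the threshold."""
    hits = sum(box_iou(p, g) >= threshold for p, g in zip(preds, gts))
    return hits / len(gts)

preds = [(0, 0, 10, 10), (0, 0, 10, 10), (20, 20, 30, 30), (0, 0, 10, 10)]
gts = [(0, 0, 10, 10), (5, 0, 15, 10), (0, 0, 10, 10), (0, 0, 10, 8)]
acc = acc_at_iou(preds, gts)  # IoUs are 1.0, 1/3, 0.0, 0.8, so two of four count
```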

This review was generated by AI and reviewed by a human editor.