QUICK REVIEW

[Paper Review] TaskCLIP: Extend Large Vision-Language Model for Task Oriented Object Detection

Hanning Chen, Wenjun Huang | arXiv (Cornell University) | Mar 12, 2024

Advanced Image and Video Retrieval Techniques | Cited by 5

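One-Sentence Summary

TaskCLIP introduces a two-stage, VLM-based framework for task-oriented object detection that aligns visual embeddings with adjective-based text embeddings through a transformer aligner and a group-wise selection mechanism, achieving state-of-the-art results on COCO-Tasks with a single RTX 4090.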

ABSTRACT

Task-oriented object detection aims to find objects suitable for accomplishing specific tasks. As a challenging task, it requires simultaneous visual data processing and reasoning under ambiguous semantics. Recent solutions are mainly all-in-one models. However, the object detection backbones are pre-trained without text supervision. Thus, to incorporate task requirements, their intricate models undergo extensive learning on a highly imbalanced and scarce dataset, resulting in capped performance, laborious training, and poor generalizability. In contrast, we propose TaskCLIP, a more natural two-stage design composed of general object detection and task-guided object selection. Particularly for the latter, we resort to the recently successful large Vision-Language Models (VLMs) as our backbone, which provides rich semantic knowledge and a uniform embedding space for images and texts. Nevertheless, the naive application of VLMs leads to sub-optimal quality, due to the misalignment between embeddings of object images and their visual attributes, which are mainly adjective phrases. To this end, we design a transformer-based aligner after the pre-trained VLMs to re-calibrate both embeddings. Finally, we employ a trainable score function to post-process the VLM matching results for object selection. Experimental results demonstrate that our TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by 3.5% and only requires a single NVIDIA RTX 4090 for both training and inference.

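Motivation and Objectives

Task-oriented object detection requires both visual processing and task-driven reasoning, and it must cope with scarce, imbalanced training data.
Propose a two-stage framework that uses a pre-trained vision-language model (VLM) for robust, generalizable detection.
Connect visual attributes (adjectives) to image embeddings through a fine-grained alignment module.
Reduce training cost and improve generalization by keeping the VLM frozen and re-calibrating its embeddings with a transformer aligner.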

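Proposed Method

Use a large language model to extract task-relevant visual attributes for each task.
Use a general object detector to generate bounding boxes, then crop the image patches for the VLM to process.
Apply a transformer-based aligner to re-calibrate the visual and text embeddings, so that object image embeddings align with the adjective phrases describing their visual attributes.
Compute an affinity matrix by multiplying the re-calibrated text and visual embeddings, scoring each bounding box-attribute pair.
Use a trainable score function with self-attention to produce a task-suitability score for each bounding box.
Apply a group-wise selection mechanism that mitigates false negatives by propagating high-confidence predictions within the same COCO class.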
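
The scoring and group-wise selection steps above can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the embeddings are taken to be already re-calibrated by the aligner, the trainable self-attention score function is replaced by a simple stand-in (a sigmoid of the mean affinity per box), and the names `affinity_matrix`, `score_boxes`, `group_select`, and the threshold `high` are hypothetical.

```python
import math


def affinity_matrix(box_embs, attr_embs):
    """Return A where A[i][j] is the dot product of box i and attribute j."""
    return [[sum(b * a for b, a in zip(box, attr)) for attr in attr_embs]
            for box in box_embs]


def score_boxes(affinity):
    """Stand-in for the trainable score function (assumption):
    sigmoid of the mean affinity over all attributes of a box."""
    scores = []
    for row in affinity:
        mean = sum(row) / len(row)
        scores.append(1.0 / (1.0 + math.exp(-mean)))
    return scores


def group_select(scores, classes, high=0.7):
    """Select every box whose class contains at least one box scoring >= high.

    This is one possible reading of "propagating high-confidence
    predictions within the same class"; the threshold is hypothetical.
    """
    confident_classes = {c for s, c in zip(scores, classes) if s >= high}
    return [c in confident_classes for c in classes]


# Toy data: 3 boxes and 2 attributes in a 2-dimensional embedding space.
box_embs = [[2.0, 0.0], [0.5, 0.5], [-1.0, 0.0]]
attr_embs = [[1.0, 0.0], [1.0, 1.0]]
classes = ["cup", "cup", "fork"]

A = affinity_matrix(box_embs, attr_embs)
scores = score_boxes(A)
selected = group_select(scores, classes)
print(selected)
```

In this toy example the second "cup" box scores below the threshold on its own, but it is still selected because another box of the same class is confident. That is the kind of false negative the group-wise step is meant to recover.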

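Experimental Results

Research Questions

RQ1: On COCO-Tasks, can a two-stage framework built on a frozen large vision-language model outperform DETR-based task-oriented detectors?
RQ2: How can adjective (visual attribute) embeddings be aligned with object visual features to improve task-oriented selection?
RQ3: Does the group-wise selection mechanism mitigate class imbalance and reduce false negatives on COCO-Tasks?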

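Key Findings

TaskCLIP outperforms the state-of-the-art DETR-based model TOIST by 3.5% mAP@0.5 on COCO-Tasks.
TaskCLIP trains and runs inference on a single RTX 4090, making it more efficient than heavier DETR-based models.
Adding the transformer aligner markedly improves the alignment between object visuals and adjective attributes, yielding a gain of about 20% mAP@0.5 over the baseline.
The group-wise selection mechanism reduces false negatives under data imbalance and raises mean AP@0.5.
TaskCLIP reaches a mean AP@0.5 of 45.5% on COCO-Tasks, and TaskCLIP* (with optimization) reaches 47.4%, as shown in Table 3.
The method keeps training efficient by avoiding end-to-end fine-tuning of the VLM and the object detector.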
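
For readers unfamiliar with the metric, the "@0.5" in AP@0.5 refers to an intersection-over-union (IoU) threshold: a predicted box counts as matching a ground-truth box only when their IoU is at least 0.5. The sketch below shows that check, assuming boxes in `(x1, y1, x2, y2)` corner format. The function name `iou` and the toy boxes are illustrative, and this does not reproduce the COCO-Tasks evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap; zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


pred = (0.0, 0.0, 10.0, 10.0)   # hypothetical predicted box
gt = (2.0, 0.0, 10.0, 10.0)     # hypothetical ground-truth box
is_match = iou(pred, gt) >= 0.5
print(is_match)
```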


This review was generated by AI and checked by a human editor.