QUICK REVIEW

[Paper Review] SimVG: A Simple Framework for Visual Grounding with Decoupled Multi-modal Fusion

Ming Dai, Lingfeng Yang|arXiv (Cornell University)|Sep 26, 2024

Advanced Image and Video Retrieval Techniques | Cited by 6

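One-Sentence Summary

SimVG decouples multimodal fusion from the downstream grounding task: it uses a unified multimodal encoder with a lightweight token branch and dynamic weight-balance distillation to achieve efficient, state-of-the-art visual grounding.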

ABSTRACT

Visual grounding is a common vision task that involves grounding descriptive sentences to the corresponding regions of an image. Most existing methods use independent image-text encoding and apply complex hand-crafted modules or encoder-decoder architectures for modal interaction and query reasoning. However, their performance significantly drops when dealing with complex textual expressions. This is because the former paradigm only utilizes limited downstream data to fit the multi-modal feature fusion. Therefore, it is only effective when the textual expressions are relatively simple. In contrast, given the wide diversity of textual expressions and the uniqueness of downstream training data, the existing fusion module, which extracts multimodal content from a visual-linguistic context, has not been fully investigated. In this paper, we present a simple yet robust transformer-based framework, SimVG, for visual grounding. Specifically, we decouple visual-linguistic feature fusion from downstream tasks by leveraging existing multimodal pre-trained models and incorporating additional object tokens to facilitate deep integration of downstream and pre-training tasks. Furthermore, we design a dynamic weight-balance distillation method in the multi-branch synchronous learning process to enhance the representation capability of the simpler branch. This branch only consists of a lightweight MLP, which simplifies the structure and improves reasoning speed. Experiments on six widely used VG datasets, i.e., RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO, demonstrate the superiority of SimVG. Finally, the proposed method not only achieves improvements in efficiency and convergence speed but also attains new state-of-the-art performance on these benchmarks. Codes and models will be available at https://github.com/Dmmm1997/SimVG.

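Motivation and Goals

Improve visual grounding performance by decoupling multimodal fusion from the downstream task.
Use pre-trained multimodal models to strengthen cross-modal interaction without heavy downstream fusion modules.
Introduce a lightweight token-based branch and a distillation mechanism to improve efficiency and inference speed.
Develop a text-guided query generation module that injects text priors into the object queries.
Demonstrate state-of-the-art performance on six VG datasets, with improved data efficiency and convergence speed.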

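Proposed Method

Uses a multimodal encoder based on a BEiT-3-style architecture to jointly encode the image, the text, and a learnable object token.
Adopts a dual-branch decoder: a standard Transformer-based decoder branch and a lightweight token branch that uses an MLP.
Introduces dynamic weight-balance distillation (DWBD), which balances the guidance from ground-truth labels and from decoder predictions over the course of synchronous learning.
Introduces a text-guided query generation (TQG) module that injects text priors into the object queries.
Trains with a distillation head that combines a DETR-style Hungarian matching loss with the DWBD loss.
Optionally, the token branch and the decoder branch can each be used on their own for faster inference (SimVG-TB / SimVG-DB).
Shows better convergence speed and data efficiency when fusion is decoupled from task-specific grounding.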
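
To make the DWBD idea concrete, the following is a minimal sketch of how a token-branch loss could be balanced between ground truth and decoder predictions. It is not the paper's formulation: this summary does not give the weighting rule, so the sketch assumes the weight on decoder guidance equals the decoder's IoU with the ground-truth box, and it uses a plain L1 box loss. The names `iou`, `l1`, and `balanced_distill_loss`, and the `(x1, y1, x2, y2)` box format, are hypothetical.

```python
from typing import Sequence

Box = Sequence[float]  # hypothetical box format: (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def l1(a: Box, b: Box) -> float:
    """Mean absolute coordinate difference between two boxes."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def balanced_distill_loss(token_box: Box, decoder_box: Box, gt_box: Box) -> float:
    """Blend ground-truth and decoder guidance for the token branch.

    Assumption (not from the source): the decoder is trusted in proportion
    to how well its own prediction overlaps the ground truth.
    """
    w_decoder = iou(decoder_box, gt_box)  # in [0, 1]
    w_gt = 1.0 - w_decoder
    return w_gt * l1(token_box, gt_box) + w_decoder * l1(token_box, decoder_box)
```

Under this assumed rule, an inaccurate decoder early in training contributes little and the token branch is supervised mostly by the labels; as the decoder's predictions improve, guidance shifts toward them, which matches the qualitative behavior described above.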

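Experimental Results

Research Questions

RQ1: Does decoupling multimodal fusion from downstream grounding improve performance on complex textual expressions?
RQ2: Guided by a strong decoder, can the lightweight token branch achieve competitive or even superior grounding at lower computational cost?
RQ3: How effective is dynamic weight-balance distillation (DWBD) at aligning the token branch with the decoder branch during synchronous training?
RQ4: Does text-guided query generation (TQG) improve grounding on extended GREC-style queries?
RQ5: How much does SimVG improve data efficiency and convergence on standard VG benchmarks?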

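Key Findings

SimVG achieves state-of-the-art performance on six VG datasets: RefCOCO/+/g, ReferIt, Flickr30K, and GRefCOCO.
With DWBD, the lightweight token branch can match or exceed the decoder branch, especially with larger encoders.
DWBD dynamically shifts guidance from ground-truth labels to decoder predictions during training, which improves learning in the token branch.
TQG brings a measurable gain by injecting text priors into the object queries (about 0.8 points on average across the RefCOCO validation and test splits).
SimVG converges faster and is more data-efficient, reaching strong results with relatively little pre-training data and modest compute (for example, about 12 hours with ViT-B/32 on the RefCOCO variants using an RTX 3090).
The SimVG-TB and SimVG-DB variants offer efficient inference with competitive accuracy, which highlights the benefit for practical deployment.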
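
As an illustration of what "injecting text priors into the object queries" can mean, here is a minimal sketch. This summary does not say how TQG combines the two, so the sketch assumes the simplest option: mean-pool the non-padding text token embeddings and add the result to each learned query. The function names and the list-of-floats representation are hypothetical, and the actual module may well use a different mechanism such as attention.

```python
from typing import List

Vector = List[float]


def mean_pool(token_embeddings: List[Vector], mask: List[int]) -> Vector:
    """Average the embeddings of real (non-padding) text tokens."""
    dim = len(token_embeddings[0])
    kept = [emb for emb, keep in zip(token_embeddings, mask) if keep]
    if not kept:
        return [0.0] * dim
    return [sum(emb[i] for emb in kept) / len(kept) for i in range(dim)]


def text_guided_queries(
    learned_queries: List[Vector],
    token_embeddings: List[Vector],
    mask: List[int],
) -> List[Vector]:
    """Add a pooled text prior to every learned object query.

    Assumption (not from the source): pooling plus addition stands in for
    whatever fusion the TQG module actually performs.
    """
    prior = mean_pool(token_embeddings, mask)
    return [[q + p for q, p in zip(query, prior)] for query in learned_queries]


# Usage: two 2-d queries, three text tokens of which the last is padding.
queries = text_guided_queries(
    [[0.0, 0.0], [1.0, 1.0]],
    [[2.0, 4.0], [4.0, 8.0], [100.0, 100.0]],
    [1, 1, 0],
)
```

The point of the sketch is only that the queries passed to the decoder start from a text-conditioned state rather than from text-agnostic learned vectors.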

This review was generated by AI and checked by a human editor.