QUICK REVIEW

[Paper Review] VIPA: Visual Informative Part Attention for Referring Image Segmentation

Yubin Cho, Hyunwoo Yu | arXiv (Cornell University) | Feb 16, 2026

Multimodal Machine Learning Applications | Cited by 0

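One-Sentence Summary

VIPA uses a Visual Expression as the attention key-value set for RIS: in a Transformer-based decoder, the Visual Expression Generator retrieves and refines informative visual tokens to guide fine-grained segmentation, yielding better alignment and segmentation quality.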

ABSTRACT

Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by leveraging the vision information into the language tokens. To more effectively exploit visual contexts for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which can effectively provide the structural and semantic visual target information to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in an attention mechanism of the referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens for reducing noise information and sharing informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture semantic visual contexts of informative regions. In this way, our framework enables the network's attention to robustly align with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. Our VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.

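Motivation and Objectives

Improve cross-modal alignment for RIS by exploiting informative visual contexts rather than projecting visual information into language tokens.
Introduce Visual Informative Part Attention (VIPA), which supplies the segmentation decoder with semantic and structural information about the visual target.
Develop the Visual Expression Generator (VEG), which uses local-global linguistic cues to retrieve and refine informative visual tokens.
Demonstrate that VIPA achieves better attention consistency and segmentation accuracy on four public RIS benchmarks.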

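Proposed Method

Propose VIPA, in which informative visual parts (the Visual Expression) serve as the key-value set for the vision queries in a Transformer-based segmentation decoder.
Introduce the Visual Expression Generator (VEG), which has two steps: (i) Visual Informative Token Retrieval, which selects informative visual tokens using local-global linguistic cues via cosine similarity and differentiable sampling; (ii) Visual Context Refinement via dynamic masked cross-attention, which reduces noise and shares attributes across tokens.
The masked multi-head cross-attention in the segmentation decoder uses the retrieved Visual Expression tokens as its key-value set, steering attention toward fine-grained regions.
Train segmentation with a combination of binary cross-entropy loss and Dice loss, and use a pixel contrastive loss to supervise the relevance maps of the retrieved tokens.
Demonstrate that VIPA is agnostic to encoder type by showing gains across vision-language encoders with different fusion strategies.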
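
The pipeline above (retrieve informative visual tokens, attend over them as keys and values, train with BCE plus Dice) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation. All function names are hypothetical, and the sketch makes several simplifying assumptions: hard top-k selection stands in for the paper's differentiable sampling, a single language cue vector stands in for the local-global cues, the attention is single-head with no learned projections or masking, the refinement step and the pixel contrastive loss are omitted, and the two loss terms are weighted equally.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb + 1e-8)

def retrieve_tokens(visual_tokens, lang_cue, k):
    """Indices of the k visual tokens most similar to the language cue.

    Assumption: hard top-k replaces the differentiable sampling
    described in the review.
    """
    scores = [cosine(v, lang_cue) for v in visual_tokens]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k])

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention.

    Assumption: the same tokens act as both keys and values,
    with no learned projections and no attention mask.
    """
    scale = math.sqrt(len(keys_values[0]))
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, kv)) / scale for kv in keys_values]
        w = softmax(logits)
        out.append([sum(wi * kv[d] for wi, kv in zip(w, keys_values))
                    for d in range(len(q))])
    return out

def bce_dice_loss(probs, targets, eps=1e-6):
    """Binary cross-entropy plus Dice loss (equal weights assumed)."""
    clipped = [min(max(p, eps), 1.0 - eps) for p in probs]
    bce = -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
               for p, t in zip(clipped, targets)) / len(targets)
    inter = sum(p * t for p, t in zip(probs, targets))
    dice = 1.0 - (2.0 * inter + eps) / (sum(probs) + sum(targets) + eps)
    return bce + dice

# Toy example: four 2-D visual tokens and one language cue vector.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
lang_cue = [1.0, 0.0]

idx = retrieve_tokens(visual_tokens, lang_cue, k=2)
visual_expression = [visual_tokens[i] for i in idx]   # retrieved key-value set
attended = cross_attention(visual_tokens, visual_expression)
print(idx)  # [0, 2]
```

Because every attended output is a convex combination of the retrieved tokens only, the vision queries in this toy setup can draw information solely from the regions judged relevant to the expression, which is the intuition behind using a visual rather than linguistic key-value set.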

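Experimental Results

Research Questions

RQ1: In RIS segmentation, what constitutes an effective key-value set for guiding the vision queries?
RQ2: Compared with language-based keys and values, do informative visual context tokens (the Visual Expression) improve cross-modal alignment and fine-grained segmentation?
RQ3: Can the Visual Expression Generator use local-global linguistic cues to effectively retrieve and refine informative visual tokens that guide attention?
RQ4: Is VIPA robust across different encoders and fusion strategies, and does it generalize to unseen targets?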

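Key Findings

VIPA outperforms existing RIS methods on four public benchmarks.
The Visual Expression provides key-value representations aligned in the visual feature space, which reduces cross-modal projection entropy relative to language keys.
The Visual Expression Generator (VEG) improves the retrieval and refinement of informative tokens, yielding notable gains on challenging datasets, RefCOCOg in particular.
VIPA is agnostic to encoder type and is effective in early-fusion, late-fusion, and no-fusion configurations.
Ablation studies show that performance drops when either the retrieval or the refinement step is removed, and that retrieval benefits from using local-global linguistic cues.
Compared with RIS methods based on large language models, VIPA is competitive in accuracy while substantially reducing computational cost and speeding up inference.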


This review was generated by AI and reviewed by a human editor.