QUICK REVIEW

[Paper Review] VIPA: Visual Informative Part Attention for Referring Image Segmentation

Yubin Cho, Hyunwoo Yu|arXiv (Cornell University)|Feb 16, 2026

Multimodal Machine Learning Applications | Citations: 0

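One-line summary

VIPA introduces Visual Informative Part Attention, which uses a visual expression as the key-value set of a Transformer-based RIS decoder to guide fine-grained segmentation. Its Visual Expression Generator retrieves and refines informative visual tokens using local-global linguistic cues.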

ABSTRACT

Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by leveraging the vision information into the language tokens. To more effectively exploit visual contexts for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which can effectively provide the structural and semantic visual target information to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in an attention mechanism of the referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens for reducing noise information and sharing informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture semantic visual contexts of informative regions. In this way, our framework enables the network's attention to robustly align with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. Our VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.

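Motivation and objectives

Improve cross-modal alignment in RIS by exploiting informative visual contexts, rather than by projecting visual information into the language tokens.
Introduce Visual Informative Part Attention (VIPA) to provide semantic and structural visual target information to the segmentation decoder.
Develop a Visual Expression Generator (VEG) that retrieves and refines informative visual tokens using local-global linguistic cues.
Show that VIPA improves attention alignment and segmentation accuracy on four public RIS benchmarks.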

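Proposed method

Propose VIPA, in which informative visual parts (the visual expression) serve as the key-value set for the vision queries in a Transformer-based segmentation decoder.
Build a two-stage Visual Expression Generator (VEG): (i) Visual Informative Token Retrieval, which selects informative visual tokens using local-global linguistic cues, cosine similarity, and differentiable sampling; (ii) Visual Context Refinement, which uses dynamic masked cross-attention to suppress noise and share attributes across tokens.
Use the retrieved visual expression tokens as the key-value set of the masked multi-head cross-attention in the segmentation decoder, guiding attention toward fine-grained regions.
Train the model with a combination of binary cross-entropy and Dice losses for segmentation, plus a pixel contrastive loss that supervises the relevance maps of the retrieved tokens.
Show that VIPA is agnostic to the encoder type and improves performance across various vision-language encoder integration strategies.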
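The retrieval and attention steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the max-over-words local score, and the equal-weight fusion of local and global similarity are assumptions made for illustration. Hard top-k selection stands in for the differentiable sampling mentioned in the review, and the attention is single-head, without learned projections or masking.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_visual_expression(visual_tokens, word_feats, sentence_feat, k):
    """Select the k visual tokens most similar to the language cues.

    visual_tokens: (N, D) flattened image features
    word_feats:    (T, D) per-word (local) language features
    sentence_feat: (D,)   pooled (global) language feature
    """
    v = l2_normalize(visual_tokens)
    # Local cue: best cosine match of each visual token against any word.
    local_sim = (v @ l2_normalize(word_feats).T).max(axis=1)   # (N,)
    # Global cue: cosine similarity to the sentence-level feature.
    global_sim = v @ l2_normalize(sentence_feat)               # (N,)
    relevance = 0.5 * (local_sim + global_sim)  # assumed fusion rule
    # Assumption: hard top-k replaces the paper's differentiable sampling.
    idx = np.argsort(-relevance)[:k]
    return visual_tokens[idx], idx, relevance

def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention (no projections, no mask)."""
    scale = 1.0 / np.sqrt(queries.shape[-1])
    weights = softmax(queries @ keys.T * scale, axis=-1)       # (N, k)
    return weights @ values, weights

# Random features stand in for real encoder outputs.
rng = np.random.default_rng(0)
N, T, D, k = 64, 5, 16, 8
visual_tokens = rng.normal(size=(N, D))
word_feats = rng.normal(size=(T, D))
sentence_feat = word_feats.mean(axis=0)

visual_expression, idx, relevance = retrieve_visual_expression(
    visual_tokens, word_feats, sentence_feat, k
)
# Vision queries attend to the retrieved visual tokens, not to language tokens.
out, weights = cross_attention(visual_tokens, visual_expression, visual_expression)
print(out.shape, weights.shape)  # (64, 16) (64, 8)
```

Because keys and values come from the same visual feature space as the queries, no cross-modal projection is needed inside this attention step, which is the design choice the review highlights.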
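The segmentation part of the training objective listed above (binary cross-entropy plus Dice) can be sketched as follows. The loss weights `w_bce` and `w_dice` and the smoothing constant are hypothetical, since the review does not state them. The pixel contrastive term is left out because its exact form is not given here.

```python
import numpy as np

def bce_loss(probs, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    p = np.clip(probs, eps, 1.0 - eps)  # avoid log(0)
    return float(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p)).mean())

def dice_loss(probs, target, smooth=1.0):
    """One minus the soft Dice coefficient; smooth is an assumed constant."""
    inter = (probs * target).sum()
    denom = probs.sum() + target.sum()
    return float(1.0 - (2.0 * inter + smooth) / (denom + smooth))

def segmentation_loss(probs, target, w_bce=1.0, w_dice=1.0):
    """Weighted sum of the two terms; the weights are hypothetical."""
    return w_bce * bce_loss(probs, target) + w_dice * dice_loss(probs, target)

# Toy 8x8 mask with a 4x4 foreground square.
target = np.zeros((8, 8))
target[2:6, 2:6] = 1.0
good = np.where(target == 1.0, 0.9, 0.1)  # confident and correct prediction
bad = 1.0 - good                          # confident and wrong prediction
print(segmentation_loss(good, target) < segmentation_loss(bad, target))  # True
```

Binary cross-entropy scores each pixel independently, while Dice measures region overlap, so the two terms are commonly combined when the foreground occupies a small part of the image.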

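Experimental results

Research questions

RQ1: What is an effective key-value set for guiding vision queries in RIS segmentation?
RQ2: Can informative visual context tokens (the visual expression) improve cross-modal alignment and fine-grained segmentation compared with language-based keys and values?
RQ3: Can the Visual Expression Generator use local-global linguistic cues to effectively retrieve and refine informative visual tokens that guide attention?
RQ4: Is VIPA robust to different encoders and fusion strategies, and does it generalize to unseen targets?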

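Key findings

VIPA outperforms existing state-of-the-art RIS methods on four public benchmarks.
The visual expression provides key-value representations that are aligned in the visual feature space, reducing the entropy of the cross-modal projection compared with language-based keys.
The Visual Expression Generator (VEG) improves the retrieval and refinement of informative tokens, yielding large gains on challenging datasets, especially RefCOCOg.
VIPA is agnostic to the encoder type and is effective in early-, late-, and no-fusion configurations.
Ablation studies show that removing either retrieval or refinement degrades performance, and that retrieval using local-global linguistic cues is beneficial.
Compared with LLM-based RIS methods, VIPA maintains competitive accuracy at a much lower computational cost and with faster inference.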


This review was generated by AI and checked by a human editor.