QUICK REVIEW

[Paper Review] VIPA: Visual Informative Part Attention for Referring Image Segmentation

Yubin Cho, Hyunwoo Yu | arXiv (Cornell University) | 2026. 02. 16.

Multimodal Machine Learning Applications | Citations: 0

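One-Line Summary

VIPA introduces Visual Informative Part Attention, which uses a visual expression as the key-value set in a Transformer-based RIS decoder, and guides fine-grained segmentation through a Visual Expression Generator that retrieves and refines informative visual tokens using local-global linguistic cues.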

ABSTRACT

Referring Image Segmentation (RIS) aims to segment a target object described by a natural language expression. Existing methods have evolved by leveraging the vision information into the language tokens. To more effectively exploit visual contexts for fine-grained segmentation, we propose a novel Visual Informative Part Attention (VIPA) framework for referring image segmentation. VIPA leverages the informative parts of visual contexts, called a visual expression, which can effectively provide the structural and semantic visual target information to the network. This design reduces high-variance cross-modal projection and enhances semantic consistency in an attention mechanism of the referring image segmentation. We also design a visual expression generator (VEG) module, which retrieves informative visual tokens via local-global linguistic context cues and refines the retrieved tokens for reducing noise information and sharing informative visual attributes. This module allows the visual expression to consider comprehensive contexts and capture semantic visual contexts of informative regions. In this way, our framework enables the network's attention to robustly align with the fine-grained regions of interest. Extensive experiments and visual analysis demonstrate the effectiveness of our approach. Our VIPA outperforms the existing state-of-the-art methods on four public RIS benchmarks.

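Research Motivation and Objectives

Motivate the use of informative visual contexts to improve cross-modal alignment in RIS.
Introduce Visual Informative Part Attention (VIPA) to supply semantic and structural visual target information to the segmentation decoder.
Develop a Visual Expression Generator (VEG) that retrieves and refines informative visual tokens using local-global linguistic cues.
Show that VIPA improves attention consistency and segmentation accuracy on four public RIS benchmarks.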

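Proposed Method

Propose VIPA, in which the informative visual parts (the visual expression) serve as the key-value set for the visual queries of a Transformer-based segmentation decoder.
Introduce a two-stage Visual Expression Generator (VEG): (i) Visual Informative Token Retrieval, which selects informative visual tokens using local-global linguistic cues (via cosine similarity and differentiable sampling); (ii) Visual Context Refinement, which uses dynamic masked cross-attention to suppress noise and share attributes among tokens.
Use the retrieved visual expression tokens as the key-value set in the masked multi-head cross-attention of the segmentation decoder to guide attention toward fine-grained regions.
Train the model with a combination of binary cross-entropy and Dice losses for segmentation, and use a pixel contrastive loss to supervise the relevance map of the retrieved tokens.
Demonstrate that VIPA is agnostic to encoder type by showing performance gains across various vision-language encoder fusion strategies.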
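
The steps above can be illustrated with a small, dependency-free Python sketch. It is not the authors' implementation: the function names, the equal weighting of local and global similarity, the hard top-k selection (standing in for differentiable sampling), the single-head attention without learned projections or masking, and the equal weighting of the two loss terms are all assumptions made for illustration. The refinement stage and the pixel contrastive loss are omitted.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv + 1e-8)

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def retrieve_tokens(visual, words, sentence, k):
    """Return indices of the k visual tokens most similar to the language cues.

    Assumption: relevance is the mean of the best word-level (local) similarity
    and the sentence-level (global) similarity. Hard top-k is a stand-in for
    the differentiable sampling mentioned in the review.
    """
    scores = []
    for v in visual:
        local = max(cosine(v, w) for w in words)
        global_ = cosine(v, sentence)
        scores.append(0.5 * (local + global_))
    order = sorted(range(len(visual)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:k]), scores

def cross_attention(queries, keys_values):
    """Single-head scaled dot-product attention where keys and values are the
    same token set (hypothetical simplification: no projections, no mask)."""
    d = len(keys_values[0])
    out = []
    for q in queries:
        logits = [sum(a * b for a, b in zip(q, kv)) / math.sqrt(d)
                  for kv in keys_values]
        w = softmax(logits)
        out.append([sum(w[j] * keys_values[j][c] for j in range(len(keys_values)))
                    for c in range(d)])
    return out

def bce_dice_loss(probs, target, eps=1e-6):
    """Binary cross-entropy plus Dice loss on flattened per-pixel probabilities.
    Equal weighting of the two terms is an assumption."""
    n = len(probs)
    bce = -sum(t * math.log(p + eps) + (1 - t) * math.log(1 - p + eps)
               for p, t in zip(probs, target)) / n
    inter = sum(p * t for p, t in zip(probs, target))
    dice = 1 - (2 * inter + eps) / (sum(probs) + sum(target) + eps)
    return bce + dice

# Toy example: 4 visual tokens, 2 word embeddings, 1 sentence embedding.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
words = [[1.0, 0.0], [0.7, 0.7]]
sentence = [1.0, 0.2]

keep, scores = retrieve_tokens(visual, words, sentence, k=2)
visual_expression = [visual[i] for i in keep]           # retrieved key-value set
attended = cross_attention(visual, visual_expression)   # visual queries attend to it
```

Because the keys and values come from the same visual feature space as the queries, each attended vector is a convex combination of the retrieved tokens, which is the intuition behind using a visual expression rather than language tokens as the key-value set.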

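Experimental Results

Research Questions

RQ1: What is an effective key-value set for guiding the vision queries in RIS?
RQ2: Can informative visual context tokens (the visual expression) improve cross-modal alignment and fine-grained segmentation compared with language-based keys and values?
RQ3: Does a Visual Expression Generator that retrieves and refines informative tokens using local-global linguistic cues guide attention effectively?
RQ4: Is VIPA robust across different encoders and fusion strategies, and does it generalize to unseen targets?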

Key Results

VIPA outperforms existing state-of-the-art RIS methods on four public benchmarks.
The visual expression provides aligned key-value representations in the visual feature space, reducing the high-variance cross-modal projection incurred by language-based keys.
The Visual Expression Generator (VEG) improves retrieval and refinement of informative tokens, yielding substantial gains on challenging datasets (notably RefCOCOg).
VIPA demonstrates encoder-type agnosticism and remains effective across early-, late-, and no-fusion configurations.
Ablation studies show that removing retrieval or refinement steps degrades performance, and that using local-global linguistic cues for retrieval is beneficial.
Compared to LLM-based RIS methods, VIPA achieves competitive accuracy with significantly lower computational cost and faster inference.

This review was generated by AI and reviewed by a human editor.