QUICK REVIEW

[Paper Review] SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Zhuoyan Luo, Yicheng Xiao | arXiv (Cornell University) | May 26, 2023
Multimodal Machine Learning Applications | Citations: 14
One-Line Summary

SOC brings temporal modeling and cross-modal alignment to RVOS through a Semantic Integration Module and a video-level object cluster and, aided by visual-linguistic contrastive learning, achieves state-of-the-art results with faster inference.

ABSTRACT

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code will be available.

Research Motivation and Objectives

  • Argue that RVOS needs a global, video-level view to model inter-frame relationships and to understand language descriptions of temporal variations.
  • Propose a Semantic Integration Module (SIM) to aggregate intra-frame and inter-frame information for video-level understanding.
  • Introduce a two-stream multi-modal fusion (MMF) and a video-level object cluster to jointly model objects across time and language guidance.
  • Apply a visual-linguistic contrastive loss to align video-level object representations with textual guidance.
  • Demonstrate state-of-the-art performance on RVOS benchmarks with improved segmentation stability and faster inference than prior methods.

Proposed Method

  • Encode video with a spatial-temporal backbone and text with a transformer-based language encoder.
  • Use a two-stream MMF (Language-to-Vision and Vision-to-Language) to perform cross-modal alignment across multiple visual scales.
  • Develop a Semantic Integration Module (SIM) with two parts: frame-level content aggregation via deformable transformers, and a video-level object cluster that groups the same object across frames using video-level queries initialized from language features.
  • Introduce a visual-linguistic contrastive loss to align video-level object queries with a textual guidance embedding.
  • Incorporate three lightweight prediction heads (classification, box, and dynamic mask kernels) and apply a Hungarian assignment for trajectory supervision.
  • Train with a combination of mask, box, class, and contrastive losses to optimize the joint video-language space and segmentation quality.
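The contrastive term in the list above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes an InfoNCE-style form in which the video-level query matched to the referred object is the positive and the remaining queries are negatives. The function names, the cosine similarity, and the `temperature` value are all hypothetical choices made for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against zero-length vectors (assumption: treat them as dissimilar).
    return dot / max(norm_u * norm_v, 1e-12)


def contrastive_loss(queries, text_embedding, positive_index, temperature=0.07):
    """InfoNCE-style loss over video-level object queries (assumed form).

    queries:         list of query embeddings, one per candidate object
    text_embedding:  sentence-level embedding of the referring expression
    positive_index:  index of the query matched to the referred object
    """
    logits = [cosine(q, text_embedding) / temperature for q in queries]
    # Numerically stable log-sum-exp over all queries.
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    # Negative log-probability of the positive query.
    return log_sum_exp - logits[positive_index]


# Usage: the loss is small when the positive query points the same way as the text.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text = [1.0, 0.0]
aligned = contrastive_loss(queries, text, positive_index=0)
misaligned = contrastive_loss(queries, text, positive_index=1)
```

In a full training setup this term would be added to the mask, box, and class losses with a weighting coefficient. The summary does not state those weights, so they are left out here.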

Experimental Results

Research Questions

  • RQ1: How can RVOS benefit from a global video-level view to better capture temporal variations described by language?
  • RQ2: Can aggregating frame-level object embeddings into video-level clusters improve cross-modal alignment and segmentation stability across frames?
  • RQ3: Does a visual-linguistic contrastive objective help align video-level representations with textual guidance for referring segmentation?
  • RQ4: What is the impact of video-level modeling on inference speed and robustness to temporal expressions?

Key Results

  • SOC outperforms state-of-the-art RVOS methods on major benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, JHMDB-Sentences).
  • The Video-Level Object Cluster (VOC) and Visual-Linguistic (VL) contrastive learning individually improve performance, with their combination yielding further gains in J&F, J, and F.
  • SOC runs at 32.3 FPS on a single 3090 GPU, roughly 1.5x faster than the prior state of the art (ReferFormer at 21.4 FPS).
  • Temporal coherence is enhanced, reducing segmentation variance across frames when processing text expressions with temporal variations.
  • Ablation shows the necessity of both L2V fusion and temporal inter-frame aggregation for strong performance.
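The results above are reported in J, F, and J&F. As a reminder of what these measure, here is a small sketch under the usual video object segmentation convention: J is region similarity (intersection over union of predicted and ground-truth masks), F is contour accuracy, and J&F is the mean of the two. The function names are hypothetical, F scores are taken as given inputs rather than computed, and scoring two empty masks as 1.0 is an assumed convention.

```python
def region_similarity(pred_mask, gt_mask):
    """J: intersection over union of two binary masks given as nested lists."""
    intersection = 0
    union = 0
    for pred_row, gt_row in zip(pred_mask, gt_mask):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                intersection += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else intersection / union


def j_and_f(j_scores, f_scores):
    """J&F: average of mean region similarity and mean contour accuracy."""
    mean_j = sum(j_scores) / len(j_scores)
    mean_f = sum(f_scores) / len(f_scores)
    return (mean_j + mean_f) / 2.0


# Usage: one overlapping pixel out of two covered pixels.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
j = region_similarity(pred, gt)
```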

This review was generated by AI and reviewed by a human editor.