[Paper Review] SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation
SOC brings temporal modeling and cross-modal alignment to RVOS through a Semantic Integration Module and a video-level object cluster. With the help of visual-linguistic contrastive learning, it achieves state-of-the-art results at a faster inference speed.
This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code will be available.
Research Motivation and Goals
- Give RVOS a global view of the video so that it can better model inter-frame relationships and the temporal variations that language descriptions refer to.
- Propose a Semantic Integration Module (SIM) to aggregate intra-frame and inter-frame information for video-level understanding.
- Introduce a two-stream multi-modal fusion (MMF) and a video-level object cluster to jointly model objects across time and language guidance.
- Apply a visual-linguistic contrastive loss to align video-level object representations with textual guidance.
- Demonstrate state-of-the-art performance on RVOS benchmarks with improved stability and real-time inference speed.
Proposed Method
- Encode video with a spatial-temporal backbone and text with a transformer-based language encoder.
- Use a two-stream MMF (Language-to-Vision and Vision-to-Language) to perform cross-modal alignment across multiple visual scales.
- Develop a Semantic Integration Module (SIM) with two parts: frame-level content aggregation via deformable transformers, and a video-level object cluster that groups the same object across frames using video-level queries initialized from language features.
- Introduce a visual-linguistic contrastive loss to align video-level object queries with a textual guidance embedding.
- Incorporate three lightweight prediction heads (classification, box, and dynamic mask kernels) and apply a Hungarian assignment for trajectory supervision.
- Train with a combination of mask, box, class, and contrastive losses to optimize the joint video-language space and segmentation quality.
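The contrastive objective listed above can be sketched as follows. The summary does not give the exact form of the loss, so this is a minimal illustration that assumes a symmetric InfoNCE-style loss over a batch, with one matched video-level query embedding per text embedding. The function name `contrastive_loss`, the temperature of 0.07, and the pure-Python vector representation are all assumptions made for illustration, not details from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(query_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE-style loss (assumed form, not taken from the paper).

    query_embs[i] is the video-level query matched to sentence i, and
    text_embs[i] is the embedding of sentence i. Every other pairing in
    the batch is treated as a negative.
    """
    q = [l2_normalize(v) for v in query_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(q)
    # Cosine similarities scaled by the temperature.
    logits = [[dot(q[i], t[j]) / temperature for j in range(n)] for i in range(n)]

    def cross_entropy_diag(rows):
        # Mean cross-entropy where the correct class for row i is column i.
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in row))
            total += log_z - row[i]
        return total / len(rows)

    cols = [list(c) for c in zip(*logits)]  # text-to-query direction
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(cols))


if __name__ == "__main__":
    queries = [[1.0, 0.0], [0.0, 1.0]]
    aligned = contrastive_loss(queries, [[1.0, 0.0], [0.0, 1.0]])
    shuffled = contrastive_loss(queries, [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

The loss is small when each query is closest to its own sentence and large when the pairs are swapped, which is the behavior a video-level alignment objective needs. In training, a term of this kind would be added to the mask, box, and class losses with some weight; the summary does not state the weights.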
Experimental Results
Research Questions
- RQ1: How can RVOS benefit from a global video-level view to better capture temporal variations described by language?
- RQ2: Can aggregating frame-level object embeddings into video-level clusters improve cross-modal alignment and segmentation stability across frames?
- RQ3: Does a visual-linguistic contrastive objective help align video-level representations with textual guidance for referring segmentation?
- RQ4: What is the impact of video-level modeling on inference speed and robustness to temporal expressions?
Key Results
- SOC outperforms state-of-the-art RVOS methods on major benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, JHMDB-Sentences).
- The Video-Level Object Cluster (VOC) and Visual-Linguistic (VL) contrastive learning individually improve performance, with their combination yielding further gains in J&F, J, and F.
- SOC runs at near-real-time speed: 32.3 FPS on a single 3090 GPU, versus 21.4 FPS for the prior SOTA, ReferFormer.
- Temporal coherence is enhanced, reducing segmentation variance across frames when processing text expressions with temporal variations.
- Ablation shows the necessity of both L2V fusion and temporal inter-frame aggregation for strong performance.
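For readers unfamiliar with the metrics named in the results, J is the region similarity (the Jaccard index, or IoU, between predicted and ground-truth masks), F is the contour accuracy, and J&F is their mean. The sketch below computes J for one frame and combines given J and F values. The function names are illustrative, masks are assumed to be nested lists of 0/1, and treating two empty masks as a perfect match is an assumed convention. F is not computed here because it requires boundary extraction. The numbers in the example are made up and are not results from the paper.

```python
def region_similarity(pred, gt):
    """Jaccard index (IoU) between two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """Mean of region similarity J and contour accuracy F."""
    return (j + f) / 2.0


if __name__ == "__main__":
    pred = [[1, 1], [0, 0]]
    gt = [[1, 0], [0, 0]]
    j = region_similarity(pred, gt)  # 1 overlapping pixel out of 2 in the union
    print(j, j_and_f(j, 0.7))
```

In a benchmark, J and F are averaged over frames and objects before being combined, so a per-frame value like the one above is only the innermost step.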
This review was generated by AI and reviewed by a human editor.