QUICK REVIEW

[Paper Review] SOC: Semantic-Assisted Object Cluster for Referring Video Object Segmentation

Zhuoyan Luo, Yicheng Xiao | arXiv (Cornell University) | 2023-05-26

Multimodal Machine Learning Applications | Citations: 14

One-Line Summary

SOC brings temporal modeling and cross-modal alignment to RVOS through a Semantic Integration Module and a video-level object cluster and, aided by visual-linguistic contrastive learning, achieves state-of-the-art results with faster inference.

ABSTRACT

This paper studies referring video object segmentation (RVOS) by boosting video-level visual-linguistic alignment. Recent approaches model the RVOS task as a sequence prediction problem and perform multi-modal interaction as well as segmentation for each frame separately. However, the lack of a global view of video content leads to difficulties in effectively utilizing inter-frame relationships and understanding textual descriptions of object temporal variations. To address this issue, we propose Semantic-assisted Object Cluster (SOC), which aggregates video content and textual guidance for unified temporal modeling and cross-modal alignment. By associating a group of frame-level object embeddings with language tokens, SOC facilitates joint space learning across modalities and time steps. Moreover, we present multi-modal contrastive supervision to help construct well-aligned joint space at the video level. We conduct extensive experiments on popular RVOS benchmarks, and our method outperforms state-of-the-art competitors on all benchmarks by a remarkable margin. Besides, the emphasis on temporal coherence enhances the segmentation stability and adaptability of our method in processing text expressions with temporal variations. Code will be available.

Motivation and Objectives

Motivate the use of a global, video-level view in RVOS to better model inter-frame relationships and the temporal variations described in language expressions.
Propose a Semantic Integration Module (SIM) to aggregate intra-frame and inter-frame information for video-level understanding.
Introduce a two-stream multi-modal fusion (MMF) and a video-level object cluster to jointly model objects across time and language guidance.
Apply a visual-linguistic contrastive loss to align video-level object representations with textual guidance.
Demonstrate state-of-the-art performance on RVOS benchmarks with improved stability and near-real-time inference speed.

Proposed Method

Encode video with a spatial-temporal backbone and text with a transformer-based language encoder.
Use a two-stream MMF (Language-to-Vision and Vision-to-Language) to perform cross-modal alignment across multiple visual scales.
Develop a Semantic Integration Module (SIM) with two parts: frame-level content aggregation via deformable transformers, and a video-level object cluster that groups the same object across frames using video-level queries initialized from language features.
Introduce a visual-linguistic contrastive loss to align video-level object queries with a textual guidance embedding.
Incorporate three lightweight prediction heads (classification, box, and dynamic mask kernels) and apply a Hungarian assignment for trajectory supervision.
Train with a combination of mask, box, class, and contrastive losses to optimize the joint video-language space and segmentation quality.
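
The visual-linguistic contrastive loss listed above can be illustrated with a minimal sketch. This review does not give the loss formula, so the InfoNCE-style form, the cosine similarity, the temperature value, and the names contrastive_loss, queries, text, and positive_idx are all assumptions made for illustration, not the authors' implementation. The idea shown is only that the video-level query matched to the referred object should score higher against the text embedding than the other queries do.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)

def contrastive_loss(queries, text, positive_idx, temperature=0.1):
    """InfoNCE-style loss (assumed form, not taken from the paper).

    queries      : list of video-level object query embeddings
    text         : sentence-level text embedding
    positive_idx : index of the query assigned to the referred object
    """
    logits = [cosine(q, text) / temperature for q in queries]
    # Numerically stable log-sum-exp over all queries.
    max_logit = max(logits)
    log_denominator = max_logit + math.log(
        sum(math.exp(l - max_logit) for l in logits)
    )
    # Negative log-softmax probability of the positive query.
    return log_denominator - logits[positive_idx]

# Toy embeddings: query 0 points roughly the same way as the text.
queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
text = [0.9, 0.1]

loss_matched = contrastive_loss(queries, text, positive_idx=0)
loss_mismatched = contrastive_loss(queries, text, positive_idx=2)
print(loss_matched < loss_mismatched)
```

In this toy setup the loss is small when the positive query is the one aligned with the text and large when it is not, which is the behavior a contrastive objective of this kind is meant to reward.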

Experimental Results

Research Questions

RQ1: How can RVOS benefit from a global video-level view to better capture temporal variations described by language?
RQ2: Can aggregating frame-level object embeddings into video-level clusters improve cross-modal alignment and segmentation stability across frames?
RQ3: Does a visual-linguistic contrastive objective help align video-level representations with textual guidance for referring segments?
RQ4: What is the impact of video-level modeling on inference speed and robustness to temporal expressions?

Key Results

SOC outperforms state-of-the-art RVOS methods on major benchmarks (Ref-YouTube-VOS, Ref-DAVIS17, A2D-Sentences, JHMDB-Sentences).
The Video-Level Object Cluster (VOC) and Visual-Linguistic (VL) contrastive learning individually improve performance, with their combination yielding further gains in J&F, J, and F.
SOC runs at near-real-time speed (32.3 FPS on a single 3090 GPU), faster than the prior SOTA ReferFormer (21.4 FPS).
Temporal coherence is enhanced, reducing segmentation variance across frames when processing text expressions with temporal variations.
Ablations show that both Language-to-Vision (L2V) fusion and inter-frame temporal aggregation are necessary for strong performance.
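
To put the reported throughput figures in perspective, the short calculation below converts them into a relative speedup and per-frame latency. It uses only the two FPS values quoted above and assumes they were measured under comparable conditions, which this review does not confirm; the function name fps_comparison is hypothetical.

```python
def fps_comparison(fps_new, fps_old):
    """Return (speedup factor, new latency in ms, old latency in ms)."""
    speedup = fps_new / fps_old
    latency_new_ms = 1000.0 / fps_new
    latency_old_ms = 1000.0 / fps_old
    return speedup, latency_new_ms, latency_old_ms

# FPS values as quoted in this review: SOC 32.3, ReferFormer 21.4.
speedup, soc_ms, referformer_ms = fps_comparison(32.3, 21.4)

print(f"speedup: {speedup:.2f}x")                    # speedup: 1.51x
print(f"SOC latency: {soc_ms:.2f} ms/frame")         # 30.96 ms/frame
print(f"ReferFormer latency: {referformer_ms:.2f} ms/frame")  # 46.73 ms/frame
```

Under that assumption, SOC processes roughly 1.5 times as many frames per second, or about 16 ms less per frame.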

This review was generated by AI and reviewed by a human editor.