QUICK REVIEW

[Paper Review] SeaVIS: Sound-Enhanced Association for Online Audio-Visual Instance Segmentation

Yingjian Zhu, Ying Wang | arXiv (Cornell University) | 2026-03-02

Music and Audio Processing | Citations: 0

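One-Line Summary

SeaVIS is an online audio-visual instance segmentation framework that improves sound-driven instance association by fusing the audio history with the current visual frame through causal cross-attention and by applying audio-guided contrastive learning. It achieves state-of-the-art results on AVISeg at real-time speed.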

ABSTRACT

Recently, an audio-visual instance segmentation (AVIS) task has been introduced, aiming to identify, segment and track individual sounding instances in videos. However, prevailing methods primarily adopt the offline paradigm, that cannot associate detected instances across consecutive clips, making them unsuitable for real-world scenarios that involve continuous video streams. To address this limitation, we introduce SeaVIS, the first online framework designed for audio-visual instance segmentation. SeaVIS leverages the Causal Cross Attention Fusion (CCAF) module to enable efficient online processing, which integrates visual features from the current frame with the entire audio history under strict causal constraints. A major challenge for conventional VIS methods is that appearance-based instance association fails to distinguish between an object's sounding and silent states, resulting in the incorrect segmentation of silent objects. To tackle this, we employ an Audio-Guided Contrastive Learning (AGCL) strategy to generate instance prototypes that encode not only visual appearance but also sounding activity. In this way, instances preserved during per-frame prediction that do not emit sound can be effectively suppressed during instance association process, thereby significantly enhancing the audio-following capability of SeaVIS. Extensive experiments conducted on the AVISeg dataset demonstrate that SeaVIS surpasses existing state-of-the-art models across multiple evaluation metrics while maintaining a competitive inference speed suitable for real-time processing.

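Motivation and Objectives

Motivate online AVIS so that instances can be associated continuously, frame by frame, in streaming video.
Develop a fusion mechanism that uses the entire audio history under causal constraints for online processing.
Learn instance embeddings that encode both visual appearance and sounding state in order to suppress silent objects.
Introduce contrastive learning at the frame level and the instance level to improve sound-based instance discrimination.
Demonstrate real-time performance and higher accuracy than baselines on the AVISeg benchmark.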

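Proposed Method

Proposes the Causal Cross Attention Fusion (CCAF) module, which fuses the current frame's visual features with the entire audio history under a causal mask.
Projects audio features to the visual embedding dimension and fuses them with multi-scale visual features through cross-attention.
Uses a Transformer-based decoder with learnable queries for per-frame segmentation, followed by an MLP that produces instance embeddings.
Introduces Audio-Guided Contrastive Learning (AGCL) at the frame and instance levels to encode sounding activity into the embeddings.
Applies frame-level and instance-level InfoNCE-style contrastive losses to separate sounding from silent instances and to keep tracking sound-aware across frames.
Trains with a joint loss that combines the standard frame-level segmentation losses with the embedding and AGCL losses; at inference, a memory-based tracker with momentum embeddings performs cross-frame association.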
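
The three mechanisms named above (causally masked cross-attention over the audio history, an InfoNCE-style contrastive loss, and a momentum-updated memory embedding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the single-head attention without learned projections, the residual fusion, the temperature of 0.1, and the momentum of 0.9 are all assumptions chosen for illustration, and the review does not state the exact formulas SeaVIS uses.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked (-inf) entries become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def causal_cross_attention(visual_tokens, audio_history, t):
    """Fuse frame-t visual tokens with audio frames 0..t only (hypothetical helper).

    visual_tokens: (N, D) tokens of the current frame, used as queries.
    audio_history: (T, D) audio features, assumed already projected to D dims.
    t: index of the current frame; audio frames with index > t are masked.
    """
    d = visual_tokens.shape[-1]
    scores = visual_tokens @ audio_history.T / np.sqrt(d)   # (N, T)
    future = np.arange(audio_history.shape[0]) > t          # causal mask
    scores[:, future] = -np.inf
    weights = softmax(scores, axis=-1)
    # Residual fusion is an assumption, not taken from the paper.
    return visual_tokens + weights @ audio_history

def info_nce(anchor, positive, negatives, temperature=0.1):
    """InfoNCE loss for one anchor (D,), one positive (D,), negatives (K, D)."""
    def unit(v):
        return v / np.linalg.norm(v, axis=-1, keepdims=True)
    a, p, n = unit(anchor), unit(positive), unit(negatives)
    # Cosine similarities, positive first, scaled by the temperature.
    logits = np.concatenate([[a @ p], n @ a]) / temperature
    logits -= logits.max()
    return -(logits[0] - np.log(np.exp(logits).sum()))

def momentum_update(memory, embedding, momentum=0.9):
    """Exponential moving average of an instance embedding (assumed update rule)."""
    return momentum * memory + (1.0 - momentum) * embedding
```

In this sketch, causality means that changing any audio frame after index `t` leaves the fused output for frame `t` unchanged, which is the property an online model needs when future audio is not yet available.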

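Experimental Results

Research Questions

RQ1: Can online AVIS effectively fuse the audio history with the current visual input under causal constraints?
RQ2: Can instance embeddings be made sensitive to sounding state so that sounding and silent objects are distinguished during association?
RQ3: Does audio-guided contrastive learning improve the robustness of online AVIS by suppressing silent instances during tracking?
RQ4: How do the online CCAF and AGCL affect AVISeg performance and real-time inference speed?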

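Key Results

SeaVIS achieves state-of-the-art results on AVISeg in FSLA, HOTA, and mAP.
CCAF effectively integrates the temporal context of sound into multi-scale visual features under online constraints, improving segmentation accuracy.
AGCL improves frame-level metrics (mainly FSLA) and cross-frame instance association by encoding sounding activity into the embeddings.
SeaVIS is more accurate than previous online methods while keeping a practical FPS, so its real-time performance is competitive.
On the AVISeg benchmark, SeaVIS reaches 47.09 FSLA, 66.47 HOTA, and 46.28 mAP at 34.65 FPS with a ResNet-50 backbone, and 54.65 FSLA, 73.85 HOTA, and 54.29 mAP at 19.39 FPS with a Swin-L backbone.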
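
To put the reported throughput in per-frame terms, the short sketch below converts FPS to an average latency per frame. The FPS values are the ones reported above; the 30 FPS stream rate is a hypothetical reference point chosen for illustration and does not come from the paper.

```python
def frame_latency_ms(fps: float) -> float:
    """Average time spent per frame, in milliseconds, at a given throughput."""
    return 1000.0 / fps

# Throughput figures reported in the results above.
reported_fps = {"ResNet-50": 34.65, "Swin-L": 19.39}

# Hypothetical stream rate used only for illustration; not from the paper.
STREAM_FPS = 30.0

for backbone, fps in reported_fps.items():
    latency = frame_latency_ms(fps)
    keeps_up = fps >= STREAM_FPS
    print(f"{backbone}: {latency:.2f} ms/frame, "
          f"keeps up with a {STREAM_FPS:.0f} FPS stream: {keeps_up}")
```

Under that assumed stream rate, only the ResNet-50 configuration processes frames as fast as they arrive, while the Swin-L configuration trades speed for its higher accuracy.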

This review was generated by AI and reviewed by a human editor.