QUICK REVIEW

[Paper Review] How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

Lee, Yujian, Gao, Peng|arXiv (Cornell University)|Jan 13, 2026

Speech and Audio Processing|Citations: 0

One-Line Summary

SSP combines an optical-flow-based pre-mask with two textual prompts and a visual-textual alignment module, achieving state-of-the-art results on AVSS benchmarks.

ABSTRACT

Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. Contrary to a previous methodology, by decomposing the AVSS task into two discrete subtasks by initially providing a prompted segmentation mask to facilitate subsequent semantic analysis, our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.

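Motivation and Objectives

Improve AVSS by using motion information and textual context to make sound-emitting objects easier to identify.
Decompose AVSS into a pre-mask stage and a semantic analysis stage, and exploit motion information.
Introduce optical flow as an auxiliary prompt that guides mask generation during segmentation.
Incorporate two textual prompts and a visual-textual alignment module to handle stationary sound sources and to integrate the modalities.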
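
As a rough illustration of how motion can seed a mask, the sketch below thresholds optical-flow magnitude to obtain a binary pre-mask. It is a minimal sketch and not the paper's pipeline: the function name `flow_to_premask`, the `rel_threshold` parameter, and the relative-threshold rule are all assumptions made for illustration, and the flow field is taken as given rather than estimated from video frames.

```python
import numpy as np


def flow_to_premask(flow: np.ndarray, rel_threshold: float = 0.5) -> np.ndarray:
    """Turn a dense optical-flow field into a binary motion mask.

    flow: array of shape (H, W, 2) holding per-pixel (dx, dy) displacements.
    rel_threshold: hypothetical knob; pixels whose motion magnitude exceeds
        this fraction of the frame's peak magnitude count as "moving".
    """
    magnitude = np.linalg.norm(flow, axis=-1)  # (H, W) motion strength
    peak = magnitude.max()
    if peak == 0:  # static frame: optical flow offers no cue
        return np.zeros(magnitude.shape, dtype=bool)
    return magnitude > rel_threshold * peak


# Toy frame: a 2x2 patch moves 3 px to the right, everything else is static.
flow = np.zeros((4, 4, 2))
flow[1:3, 1:3, 0] = 3.0
premask = flow_to_premask(flow)
print(premask.astype(int))
```

A frame with no motion yields an empty mask in this sketch, which is the stationary-source case that the textual prompts are meant to cover.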

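Proposed Method

Propose a pre-mask technique that combines optical-flow-derived masks with ground-truth masks to improve segmentation before encoding.
Use two textual prompts generated by a multimodal LLM to capture the scene description and potential stationary sound sources.
Implement a BERT-based Visual-Textual Alignment (VTA) module that integrates visual and textual features across modalities.
Add a post-mask loss so that, during training, the model learns motion- and sound-related features beyond the GT mask.
Adopt a joint training objective comprising a mask loss, a Dice loss, and a BCE loss, plus an auxiliary L'_mask loss intended to improve generalization.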
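
The Dice and BCE terms named in the training objective can be sketched in NumPy as follows. This review does not give the exact form of the mask loss or of the auxiliary mask loss, so `joint_loss`, its `post_mask` target, and `aux_weight` are hypothetical stand-ins for illustration only; `bce_loss` and `dice_loss` follow the standard textbook definitions.

```python
import numpy as np


def bce_loss(prob, target, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and a 0/1 mask."""
    prob = np.clip(prob, eps, 1.0 - eps)  # avoid log(0)
    return float(-np.mean(target * np.log(prob) + (1 - target) * np.log(1 - prob)))


def dice_loss(prob, target, eps=1e-7):
    """1 - Dice coefficient; 0 when prediction and mask overlap perfectly."""
    intersection = np.sum(prob * target)
    dice = (2.0 * intersection + eps) / (np.sum(prob) + np.sum(target) + eps)
    return float(1.0 - dice)


def joint_loss(prob, gt_mask, post_mask, aux_weight=1.0):
    """Assumed combination: BCE + Dice against the GT mask, plus an auxiliary
    BCE term against a motion-derived post-mask (form and weight are guesses)."""
    main = bce_loss(prob, gt_mask) + dice_loss(prob, gt_mask)
    aux = bce_loss(prob, post_mask)
    return main + aux_weight * aux


gt = np.array([[0.0, 1.0], [1.0, 1.0]])
uncertain = np.full((2, 2), 0.5)  # a prediction that commits to nothing
print(joint_loss(gt, gt, gt), joint_loss(uncertain, gt, gt))
```

A perfect prediction drives every term toward zero, while the uniform 0.5 prediction is penalized by both the BCE and the Dice term.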

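Experimental Results

Research Questions

RQ1: Can optical flow used as a preprocessing step improve AVSS segmentation when combined with semantic prompts?
RQ2: How do the dual textual prompts and VTA affect cross-modal alignment and segmentation quality?
RQ3: Does the post-mask training objective increase robustness when GT masks are unavailable at inference time?
RQ4: How does SSP perform against state-of-the-art AVS/AVSS models on the S4, MS3, and AVSS datasets?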

Key Findings

Method	Audio backbone	Visual backbone	S4 mIoU	S4 F-score	MS3 mIoU	MS3 F-score	AVSS mIoU	AVSS F-score
AAVS [29]	VGGish	Swin-Base	83.2	91.3	67.3	77.6	48.5	53.2
SSP	VGGish	Swin-Base	85.4	93.3	72.3	84.6	50.1	54.5
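
For readers unfamiliar with the two metrics in the table, the sketch below computes IoU and an F-score for a single pair of binary masks. It is a minimal illustration, not the benchmark's evaluation script: the default `beta_sq=0.3` is a value commonly used in AVS evaluation and is assumed here, since this review does not state it, and the handling of empty masks is a convention chosen for the example. mIoU is then the mean of such IoU values over samples or classes.

```python
import numpy as np


def iou(pred, gt):
    """Intersection over union of two binary masks."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0  # both masks empty: counted as perfect agreement (a convention)
    return float(np.logical_and(pred, gt).sum() / union)


def f_score(pred, gt, beta_sq=0.3):
    """F-measure with weight beta_sq on precision (assumed default, see lead-in)."""
    pred, gt = np.asarray(pred, dtype=bool), np.asarray(gt, dtype=bool)
    tp = np.logical_and(pred, gt).sum()
    precision = tp / pred.sum() if pred.sum() else 0.0
    recall = tp / gt.sum() if gt.sum() else 0.0
    denom = beta_sq * precision + recall
    if denom == 0:
        return 0.0
    return float((1 + beta_sq) * precision * recall / denom)


# Prediction covers two pixels, only one of which is in the ground truth.
pred = [[1, 1], [0, 0]]
gt = [[1, 0], [0, 0]]
print(iou(pred, gt), f_score(pred, gt))
```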

SSP outperforms the strong AVS baseline (AAVS) on S4 by 2.2 points in mIoU and 2.0 points in F-score (93.3 vs. 91.3 in the table above).
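SSP outperforms AAVS on MS3 by 5.0 points in mIoU and 7.0 points in F-score.
SSP outperforms AAVS on AVSS by 1.6 points in mIoU and 1.3 points in F-score.
The Visual-Textual Alignment (VTA) module yields an average improvement of about 1.1% mIoU and 0.5% F-score over alternative approaches.
Ablations show that the optical-flow pre-mask provides a notable gain, and that combining the pre-mask with the post-mask and VTA brings performance close to state-of-the-art methods.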
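
The per-benchmark gains can be recomputed directly from the results table. The short script below does only that arithmetic; the scores are copied from the AAVS and SSP rows, and the dictionary layout is an arbitrary choice for the example.

```python
# Scores copied from the results table: (mIoU, F-score) per benchmark.
scores = {
    "AAVS": {"S4": (83.2, 91.3), "MS3": (67.3, 77.6), "AVSS": (48.5, 53.2)},
    "SSP": {"S4": (85.4, 93.3), "MS3": (72.3, 84.6), "AVSS": (50.1, 54.5)},
}

gains = {}
for bench, (base_miou, base_f) in scores["AAVS"].items():
    ssp_miou, ssp_f = scores["SSP"][bench]
    # round() strips floating-point noise from the subtraction
    gains[bench] = (round(ssp_miou - base_miou, 1), round(ssp_f - base_f, 1))
    print(f"{bench}: +{gains[bench][0]} mIoU, +{gains[bench][1]} F-score")
```

Because the table values are percentages, these differences are absolute percentage points rather than relative improvements.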

This review was generated by AI and checked by a human editor.