QUICK REVIEW

[Paper Review] How Do Optical Flow and Textual Prompts Collaborate to Assist in Audio-Visual Semantic Segmentation?

Lee, Yujian, Gao, Peng|arXiv (Cornell University)|Jan 13, 2026

Speech and Audio Processing | Cited by 0

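One-Sentence Summary

SSP combines an optical-flow-based pre-mask with two textual prompts and a visual-textual alignment module to improve audio-visual semantic segmentation, achieving state-of-the-art results on the AVSS benchmarks.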

ABSTRACT

Audio-visual semantic segmentation (AVSS) represents an extension of the audio-visual segmentation (AVS) task, necessitating a semantic understanding of audio-visual scenes beyond merely identifying sound-emitting objects at the visual pixel level. Contrary to a previous methodology, by decomposing the AVSS task into two discrete subtasks by initially providing a prompted segmentation mask to facilitate subsequent semantic analysis, our approach innovates on this foundational strategy. We introduce a novel collaborative framework, Stepping Stone Plus (SSP), which integrates optical flow and textual prompts to assist the segmentation process. In scenarios where sound sources frequently coexist with moving objects, our pre-mask technique leverages optical flow to capture motion dynamics, providing essential temporal context for precise segmentation. To address the challenge posed by stationary sound-emitting objects, such as alarm clocks, SSP incorporates two specific textual prompts: one identifies the category of the sound-emitting object, and the other provides a broader description of the scene. Additionally, we implement a visual-textual alignment module (VTA) to facilitate cross-modal integration, delivering more coherent and contextually relevant semantic interpretations. Our training regimen involves a post-mask technique aimed at compelling the model to learn the diagram of the optical flow. Experimental results demonstrate that SSP outperforms existing AVS methods, delivering efficient and precise segmentation results.

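Motivation and Objectives

Improve AVSS by using motion cues and textual context to better identify sound-emitting objects.
Decompose AVSS into a pre-mask stage and a semantic analysis stage to exploit motion information.
Introduce optical flow as an auxiliary prompt during segmentation to guide mask generation.
Combine two textual prompts with a visual-textual alignment module to handle stationary sound sources and cross-modal integration.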

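Proposed Method

Proposes a pre-mask technique that combines optical-flow-derived masks with ground-truth (GT) masks to refine segmentation before encoding.
Uses two textual prompts generated by a multimodal large language model to capture the scene description and potential stationary sound sources.
Implements a BERT-based visual-textual alignment (VTA) module to fuse visual and textual features across modalities.
Adds a post-mask loss that forces the model, during training, to learn motion dynamics and sound-related features beyond what the GT mask provides.
Adopts a joint training objective of mask, Dice, and BCE losses, plus an auxiliary L'_mask loss to improve generalization.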
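
To make two of the ingredients above concrete, here is a minimal NumPy sketch of a flow-derived pre-mask and of a combined Dice + BCE term. It is an illustration under stated assumptions, not the authors' implementation: the function names, the fixed magnitude threshold, the logical-OR combination with the GT mask, and the equal loss weights are all hypothetical, and the mask loss and the auxiliary post-mask loss are omitted.

```python
import numpy as np

def flow_premask(flow, gt_mask, thresh=1.0):
    """Hypothetical pre-mask: pixels that move OR belong to the GT mask.

    flow:    (H, W, 2) per-pixel displacement (dx, dy)
    gt_mask: (H, W) binary ground-truth mask
    thresh:  assumed fixed threshold on flow magnitude, in pixels
    """
    magnitude = np.linalg.norm(flow, axis=-1)   # motion strength per pixel
    motion_mask = magnitude > thresh            # keep pixels with enough motion
    return np.logical_or(motion_mask, gt_mask.astype(bool))

def dice_loss(prob, target, eps=1e-6):
    """Soft Dice loss between predicted probabilities and a binary target."""
    inter = np.sum(prob * target)
    return 1.0 - (2.0 * inter + eps) / (np.sum(prob) + np.sum(target) + eps)

def bce_loss(prob, target, eps=1e-7):
    """Binary cross-entropy averaged over pixels; probabilities are clipped."""
    p = np.clip(prob, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))))

def joint_loss(prob, target, w_dice=1.0, w_bce=1.0):
    """Weighted sum of Dice and BCE; the weights here are assumptions."""
    return w_dice * dice_loss(prob, target) + w_bce * bce_loss(prob, target)

if __name__ == "__main__":
    flow = np.zeros((4, 4, 2))
    flow[0, 0] = (3.0, 4.0)                     # one moving pixel, magnitude 5
    gt = np.zeros((4, 4))
    gt[3, 3] = 1.0                              # one stationary labelled pixel
    pre = flow_premask(flow, gt)
    print("pre-mask pixels:", int(pre.sum()))
    print("loss for a perfect prediction:", joint_loss(gt, gt))
```

In this toy case the pre-mask keeps both the moving pixel and the stationary labelled one, which mirrors the stated motivation for pairing motion cues with a signal that also covers static sound sources.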

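Experimental Results

Research Questions

RQ1: Does optical flow used as a pre-mask improve AVSS segmentation when combined with semantic prompts?
RQ2: How do the dual textual prompts and VTA affect cross-modal alignment and segmentation quality?
RQ3: Does the post-mask training objective improve robustness when GT masks are unavailable at inference?
RQ4: How does SSP compare with state-of-the-art AVS/AVSS models on the S4, MS3, and AVSS datasets?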

Key Findings

Method	Audio-backbone	Visual-backbone	S4 mIoU	S4 F-score	MS3 mIoU	MS3 F-score	AVSS mIoU	AVSS F-score
AAVS [29]	VGGish	Swin-Base	83.2	91.3	67.3	77.6	48.5	53.2
SSP	VGGish	Swin-Base	85.4	93.3	72.3	84.6	50.1	54.5
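
The per-metric gaps between the two rows can be recomputed directly from the table. The short sketch below does only that arithmetic; the variable names are illustrative, and the values are copied from the rows above.

```python
# Metric columns and row values copied from the results table.
columns = ["S4 mIoU", "S4 F-score", "MS3 mIoU", "MS3 F-score", "AVSS mIoU", "AVSS F-score"]
aavs = [83.2, 91.3, 67.3, 77.6, 48.5, 53.2]
ssp = [85.4, 93.3, 72.3, 84.6, 50.1, 54.5]

# Absolute differences (SSP minus AAVS), rounded to one decimal place
# to suppress floating-point noise.
deltas = {name: round(s - a, 1) for name, a, s in zip(columns, aavs, ssp)}

for name, delta in deltas.items():
    print(f"{name}: +{delta:.1f}")
```

The differences are absolute, so they are best read as percentage points rather than relative improvements.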

On S4, SSP improves over the strong AVS baseline AAVS by 2.2 points in mIoU and 2.0 points in F-score.
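On MS3, SSP improves over AAVS by 5.0 points in mIoU and 7.0 points in F-score.
On AVSS, SSP improves over AAVS by 1.6 points in mIoU and 1.3 points in F-score.
The visual-textual alignment (VTA) module outperforms alternative methods, with average gains of about 1.1% in mIoU and 0.5% in F-score.
Ablation experiments show that the optical-flow pre-mask yields significant gains; combining the pre-mask, the post-mask, and VTA brings results close to the state of the art.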


This review was generated by AI and reviewed by a human editor.