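[Paper Explained] Conditional Flow Matching for Visually-Guided Acoustic Highlighting
The paper reframes visually-guided acoustic highlighting as a conditional flow matching generation task, proposes a rollout loss to stabilize multi-step trajectories, and introduces an improved audio-visual conditioning module, achieving state-of-the-art results on the Muddy Mix dataset.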
Visually-guided acoustic highlighting seeks to rebalance audio in alignment with the accompanying video, creating a coherent audio-visual experience. While visual saliency and enhancement have been widely studied, acoustic highlighting remains underexplored, often leading to misalignment between visual and auditory focus. Existing approaches use discriminative models, which struggle with the inherent ambiguity in audio remixing, where no natural one-to-one mapping exists between poorly-balanced and well-balanced audio mixes. To address this limitation, we reframe this task as a generative problem and introduce a Conditional Flow Matching (CFM) framework. A key challenge in iterative flow-based generation is that early prediction errors -- in selecting the correct source to enhance -- compound over steps and push trajectories off-manifold. To address this, we introduce a rollout loss that penalizes drift at the final step, encouraging self-correcting trajectories and stabilizing long-range flow integration. We further propose a conditioning module that fuses audio and visual cues before vector field regression, enabling explicit cross-modal source selection. Extensive quantitative and qualitative evaluations show that our method consistently surpasses the previous state-of-the-art discriminative approach, establishing that visually-guided audio remixing is best addressed through generative modeling.
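Research motivation and goals
- Reframe visually-guided acoustic highlighting (VisAH) as a generative, distribution-to-distribution problem rather than a discriminative one.
- Develop a rollout loss to mitigate error accumulation in iterative flow-based generation.
- Design an early cross-modal conditioning mechanism that fuses audio features into the visual encoder to improve source selection and regression.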
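Proposed method
- Use conditional flow matching to transport the audio distribution from poorly balanced to well-balanced mixes, conditioned on video cues.
- Introduce a rollout loss that unrolls a small number of flow steps and supervises the resulting predicted trajectory to prevent drift.
- Add an audio-aware conditioning adapter that injects audio features into the CLIP-based visual encoder, enabling early cross-modal fusion.
- Use sinusoidal time conditioning for the flow steps and train the entire flow end to end with backpropagation.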
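The method bullets above can be made concrete with a small sketch. It is not the paper's implementation: the straight-line probability path, the fixed-step Euler integrator, the mean-squared-error penalties, and every name in the code (`cfm_loss`, `rollout_loss`, the `field(x_t, t, cond)` signature) are assumptions chosen for illustration, and plain Python lists stand in for audio features. In practice the vector field would be a neural network and both losses would be minimized with an autodiff framework.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
# Hypothetical signature for the learned vector field: field(x_t, t, cond) -> velocity.
VectorField = Callable[[Sequence[float], float, Sequence[float]], Vector]


def sinusoidal_time_embedding(t: float, dim: int = 8, max_period: float = 10000.0) -> Vector:
    """Transformer-style sinusoidal embedding of a scalar flow time t (dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(max_period) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equally sized vectors."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def cfm_loss(field: VectorField, x0: Vector, x1: Vector, cond: Vector, t: float) -> float:
    """Flow matching loss at time t, assuming a straight path from x0 to x1.

    x0 is the poorly balanced mix, x1 the well-balanced target.
    """
    x_t = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]  # point on the path
    target = [b - a for a, b in zip(x0, x1)]               # constant target velocity
    return mse(field(x_t, t, cond), target)


def rollout(field: VectorField, x0: Vector, cond: Vector, num_steps: int = 4) -> Vector:
    """Integrate the field from t=0 to t=1 with fixed-step Euler updates."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = field(x, k * dt, cond)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def rollout_loss(field: VectorField, x0: Vector, x1: Vector, cond: Vector,
                 num_steps: int = 4) -> float:
    """Penalize the gap between the end of a short rollout and the target."""
    return mse(rollout(field, x0, cond, num_steps), x1)
```

A field that always returns `x1 - x0` drives both losses to zero, which is a quick sanity check. The rollout term differs from the per-time-step term in that each Euler step consumes the model's own previous output, so early mistakes show up in the final-step penalty.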

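Experimental results
Research questions
- RQ1: Is visually-guided acoustic highlighting better handled by a generative flow model than by a discriminative mapping?
- RQ2: Does the rollout loss stabilize multi-step flow integration in visually guided audio remixing?
- RQ3: Does fusing audio features with the visual condition at an early stage improve source selection and mix quality?
Main findings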
| Model | Conditioning encoder | IB Score | KLD | LDif | Mag | Env | Was |
|---|---|---|---|---|---|---|---|
| Input | - | 28.14 | 20.74 | 18.36 | 22.69 | 6.29 | 1.96 |
| VisAH | CLIP | 28.84 | 11.37 | 9.66 | 9.99 | 3.38 | 0.84 |
| VisAH | T5 | 28.92 | 11.71 | 9.63 | 10.22 | 3.44 | 0.88 |
| VisAH-FM (Ours) | CLIP-CLAP | 29.12 | 9.70 | 7.77 | 8.28 | 2.74 | 0.63 |
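- CFM with the rollout loss outperforms the earlier discriminative VisAH model on all six reported metrics.
- The conditioning module that injects audio features into the visual encoder brings a clear improvement over visual-only conditioning.
- The rollout loss stabilizes trajectories and reduces drift, improving long-range alignment of the trajectory with the ground truth.
- Ablations indicate that audio-aware conditioning (CLAP) benefits this task more than text-only conditioning.
- Subjective tests indicate that VisAH-FM aligns the audio with the visual scene better than VisAH.
- Quantitative results on the Muddy Mix dataset show gains over the baselines on IB Score, KLD, LDif, Mag, Env, and Was; for example, KLD drops from 11.37 (VisAH with CLIP) to 9.70.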
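To put the table's gains in relative terms, the short calculation below computes the percentage reduction of VisAH-FM against the VisAH (CLIP) row. It assumes that KLD, LDif, Mag, Env, and Was are lower-is-better metrics, which the summary does not state explicitly but which matches the unprocessed input row having the largest values. IB Score is left out because its direction is not stated. The dictionary and function names are illustrative.

```python
# Metric values copied from the results table (Muddy Mix dataset).
VISAH_CLIP = {"KLD": 11.37, "LDif": 9.66, "Mag": 9.99, "Env": 3.38, "Was": 0.84}
VISAH_FM = {"KLD": 9.70, "LDif": 7.77, "Mag": 8.28, "Env": 2.74, "Was": 0.63}


def relative_reduction(baseline: float, ours: float) -> float:
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


# Assumption: lower is better for every metric listed here.
reductions = {
    name: relative_reduction(VISAH_CLIP[name], VISAH_FM[name])
    for name in VISAH_CLIP
}

if __name__ == "__main__":
    for name, pct in reductions.items():
        print(f"{name}: {pct:.1f}% lower than VisAH (CLIP)")
```

Under that assumption the reductions range from roughly 15% (KLD) to 25% (Was).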

This explainer was generated by AI and reviewed by human editors.