QUICK REVIEW

[Paper Review] Audiovisual SlowFast Networks for Video Recognition

Fanyi Xiao, Yong Jae Lee|arXiv (Cornell University)|Jan 23, 2020

Music and Audio Processing | References: 86 | Cited by: 158

One-Sentence Summary

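Introduces Audiovisual SlowFast (AVSlowFast) networks, which fuse an Audio pathway with the SlowFast visual pathways at multiple levels and combine this with DropPathway and audiovisual synchronization to improve video action recognition and self-supervised audiovisual feature learning.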

ABSTRACT

We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features. Code will be made available at: https://github.com/facebookresearch/SlowFast.

Motivation and Objectives

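Move integrated audiovisual perception beyond fusing the audio and visual streams only at a late stage.
Develop an architecture that fuses audio with the SlowFast visual pathways at multiple hierarchical levels.
Address the different learning dynamics of the audio and visual modalities through the training strategy.
Demonstrate state-of-the-art performance on multiple action classification and detection datasets.
Show that the audiovisual representation generalizes to self-supervised learning.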

Proposed Method

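Extend SlowFast with a dedicated Audio pathway that takes log-mel-spectrogram input.
Introduce hierarchical audiovisual fusion by connecting the Audio pathway to the Slow and Fast visual pathways at intermediate stages.
Propose DropPathway, which regularizes joint training by randomly dropping the Audio pathway during training.
Use audiovisual synchronization (AVS) as an auxiliary task to learn cross-modal features.
Explore several fusion schemes (A→F→S, A→FS, and Audiovisual Nonlocal) and evaluate their effect on alignment and performance.
Provide ablation studies on fusion stages, lateral connections, and synchronization to understand the design trade-offs.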
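
The DropPathway idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `fuse_with_drop_pathway`, the `drop_prob` parameter, and the use of an element-wise sum as the fusion operation are all assumptions made for illustration, and plain Python lists stand in for feature maps.

```python
import random


def fuse_with_drop_pathway(visual_feat, audio_feat, drop_prob, training, rng=random):
    """Fuse audio features into visual features, unless the Audio pathway is dropped.

    visual_feat, audio_feat: equal-length lists of floats (stand-ins for feature maps).
    drop_prob: hypothetical probability of dropping the Audio pathway in a training step.
    training: dropping only happens in training mode; at inference audio is always used.
    """
    if training and rng.random() < drop_prob:
        # Audio pathway dropped for this step: visual features pass through unchanged.
        return list(visual_feat)
    # Audio pathway kept: fuse by element-wise sum (an assumed, simple fusion choice).
    return [v + a for v, a in zip(visual_feat, audio_feat)]


# Usage: one simulated training step with a seeded generator for repeatability.
rng = random.Random(0)
visual = [1.0, 2.0, 3.0]
audio = [0.5, 0.5, 0.5]
fused = fuse_with_drop_pathway(visual, audio, drop_prob=0.8, training=True, rng=rng)
```

Dropping the whole pathway, rather than individual units, means that on some steps the visual pathways train as if no audio were present, which is one way to read the summary's claim that DropPathway balances the learning pace of the two modalities.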

Experimental Results

Research Questions

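RQ1: Can audio information be integrated effectively into hierarchical visual representations to improve action recognition and detection?
RQ2: Which fusion strategies and training techniques best balance the learning dynamics of the audio and visual streams?
RQ3: Does hierarchical audiovisual synchronization help learn representations that generalize across modalities, including self-supervised features?
RQ4: What is the trade-off between computational cost and accuracy when an Audio pathway is added to SlowFast?
RQ5: How does AVSlowFast compare with visual-only models on diverse datasets (for example egocentric, environmental, and standard benchmarks)?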

Key Findings

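AVSlowFast consistently improves on SlowFast across datasets. On EPIC-Kitchens, for example, audio raises verb/noun/action top-1 accuracy by +2.9/+4.3/+2.3 points at a compute cost of about 20%.
On Kinetics, AVSlowFast reaches higher top-1 accuracy than SlowFast with the same backbone, showing that the audio stream is effective at a moderate compute cost (roughly 10–20%).
On AVA action detection, AVSlowFast brings improvements for a relatively small amount of extra computation (about 2% of the total).
Hierarchical fusion (integrating Audio at intermediate visual stages) outperforms late fusion, and multi-level fusion peaks when the res3, res4, and pool5 connections are combined.
DropPathway is essential for stable joint training; by regulating the pace of audio and visual learning, it markedly improves generalization.
Audiovisual synchronization (AVS) further strengthens cross-modal representations and benefits self-supervised audiovisual feature learning.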
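
To make the synchronization objective more concrete, here is a small sketch of one common way to pose it: a margin-based contrastive loss that pulls in-sync audio and visual embeddings together and pushes out-of-sync pairs apart. This review does not specify the loss used in the paper, so the formulation, the function names, and the `margin` parameter are assumptions for illustration only.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def sync_contrastive_loss(visual_emb, audio_emb, in_sync, margin=1.0):
    """Margin-based contrastive loss for a single audio-visual pair.

    in_sync=True  : the clip and audio are aligned, so their distance is penalized.
    in_sync=False : the pair is misaligned, so it is penalized only if the
                    embeddings are closer than `margin` (a hypothetical hyperparameter).
    """
    d = l2_distance(visual_emb, audio_emb)
    if in_sync:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Usage: an aligned pair that is close incurs a small loss, and a misaligned
# pair that is already far apart incurs none.
aligned_loss = sync_contrastive_loss([0.1, 0.2], [0.1, 0.3], in_sync=True)
misaligned_loss = sync_contrastive_loss([0.0, 0.0], [3.0, 4.0], in_sync=False)
```

Because the labels (in sync or not) come from the video itself, an objective of this kind needs no human annotation, which is what allows it to be used for self-supervised feature learning.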

This review was generated by AI and checked by a human editor.