QUICK REVIEW

[Paper Review] Audiovisual SlowFast Networks for Video Recognition

Fanyi Xiao, Yong Jae Lee | arXiv (Cornell University) | Jan 23, 2020

Music and Audio Processing | 86 references | 158 citations

TL;DR

Introduces Audiovisual SlowFast (AVSlowFast) networks that fuse audio with SlowFast visual pathways across multiple layers, along with DropPathway and audiovisual synchronization to improve video action recognition and self-supervised audiovisual features.

ABSTRACT

We present Audiovisual SlowFast Networks, an architecture for integrated audiovisual perception. AVSlowFast has Slow and Fast visual pathways that are deeply integrated with a Faster Audio pathway to model vision and sound in a unified representation. We fuse audio and visual features at multiple layers, enabling audio to contribute to the formation of hierarchical audiovisual concepts. To overcome training difficulties that arise from different learning dynamics for audio and visual modalities, we introduce DropPathway, which randomly drops the Audio pathway during training as an effective regularization technique. Inspired by prior studies in neuroscience, we perform hierarchical audiovisual synchronization to learn joint audiovisual features. We report state-of-the-art results on six video action classification and detection datasets, perform detailed ablation studies, and show the generalization of AVSlowFast to learn self-supervised audiovisual features. Code will be made available at: https://github.com/facebookresearch/SlowFast.

Motivation & Objective

Motivate integrated audiovisual perception beyond late fusion of audio and visual streams.
Develop an architecture that fuses audio with SlowFast visual pathways at multiple hierarchical levels.
Address asynchronous learning dynamics between audio and visual modalities with training strategies.
Demonstrate state-of-the-art performance on multiple action classification and detection datasets.
Show generalization of the audiovisual representation to self-supervised learning.

Proposed method

Extend SlowFast with a dedicated Audio pathway that processes log-mel-spectrogram inputs.
Introduce hierarchical audiovisual fusion by connecting Audio with Slow and Fast visual pathways at intermediate stages.
Propose DropPathway to regularize joint training by randomly dropping the Audio pathway during training.
Implement audiovisual synchronization (AVS) as an auxiliary task to learn cross-modal features.
Explore multiple fusion schemes (A→F→S, A→FS, and Audiovisual Nonlocal) and evaluate their impact on alignment and performance.
Provide ablations on fusion stages, lateral connections, and synchronization to understand design trade-offs.
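
The interaction between multi-stage fusion and DropPathway can be illustrated with a minimal sketch. It assumes features are plain lists of floats, that fusion is an element-wise sum (the review does not state the fusion operator), and that the drop decision is sampled once per training iteration with a hypothetical probability p_drop. The function names are illustrative and are not taken from the SlowFast codebase.

```python
import random


def fuse(visual, audio):
    """Element-wise sum of an audio feature into a visual feature (assumed operator)."""
    return [v + a for v, a in zip(visual, audio)]


def forward_stage(visual, audio, drop_audio):
    """One fusion stage: add the audio feature unless the Audio pathway is dropped."""
    if drop_audio:
        return list(visual)
    return fuse(visual, audio)


def avslowfast_forward(visual_feats, audio_feats, p_drop, training, rng=random):
    """Run every fusion stage, dropping the whole Audio pathway with probability p_drop.

    The decision is made once per call, so either all stages receive audio
    or none do. Outside training the Audio pathway is always kept.
    """
    drop_audio = training and rng.random() < p_drop
    fused = [
        forward_stage(v, a, drop_audio)
        for v, a in zip(visual_feats, audio_feats)
    ]
    return fused, drop_audio


if __name__ == "__main__":
    # Two fusion stages, each with a 2-dimensional toy feature.
    visual = [[1.0, 2.0], [3.0, 4.0]]
    audio = [[0.5, 0.5], [1.0, 1.0]]
    print(avslowfast_forward(visual, audio, p_drop=0.5, training=True))
```

Dropping the pathway as a whole, rather than individual units, means the visual pathways must still produce useful features on iterations where audio is absent, which is one plausible reading of how DropPathway slows the faster-learning audio stream.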

Experimental results

Research questions

RQ1: Can audio information be effectively integrated into hierarchical visual representations to improve action recognition and detection?
RQ2: What fusion strategies and training techniques best balance learning dynamics between audio and visual streams?
RQ3: Does hierarchical audiovisual synchronization help learn modality-general representations, including self-supervised features?
RQ4: What is the computational cost and accuracy trade-off when adding an Audio pathway to SlowFast?
RQ5: How does AVSlowFast perform across diverse datasets (egocentric, ambient, and standard benchmarks) compared to visual-only models?

Key findings

AVSlowFast consistently improves over SlowFast across datasets. On EPIC-Kitchens, for example, adding audio raises top-1 accuracy for verb/noun/action by +2.9/+4.3/+2.3 points, with the audio stream at about 20% of compute.
On Kinetics, AVSlowFast achieves higher top-1 accuracy than SlowFast with the same backbone, demonstrating effectiveness of the audio stream at modest compute (~10–20%).
On AVA action detection, AVSlowFast yields improvements with relatively small additional compute (~2% overall).
Hierarchical fusion (Audio integrated at intermediate visual stages) outperforms late fusion, with multi-level fusion peaking when incorporating res3, res4, and pool5 connections.
DropPathway is essential for stable joint training, significantly improving generalization by regulating audio-visual learning pace.
Audio-visual synchronization (AVS) further enhances cross-modal representations and benefits self-supervised audiovisual feature learning.
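
To make the synchronization objective concrete, here is a minimal sketch of one way such an auxiliary loss could be written. The review does not specify the loss, so the margin-based contrastive form, the L2 distance, the default margin, and all function names below are assumptions made for illustration. An in-sync audio/visual pair is pulled together, an out-of-sync pair (for example, temporally shifted audio) is pushed at least a margin apart, and the per-level losses are summed to mimic applying the objective hierarchically.

```python
import math


def l2_distance(x, y):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def sync_margin_loss(visual, audio_in_sync, audio_out_of_sync, margin=1.0):
    """Contrastive margin loss for a single feature level (assumed form).

    The in-sync pair is penalized by its squared distance; the out-of-sync
    pair is penalized only while it is closer than `margin`.
    """
    pos = l2_distance(visual, audio_in_sync)
    neg = l2_distance(visual, audio_out_of_sync)
    return pos ** 2 + max(0.0, margin - neg) ** 2


def hierarchical_sync_loss(levels, margin=1.0):
    """Sum the single-level loss over (visual, in_sync, out_of_sync) triples."""
    return sum(sync_margin_loss(v, p, n, margin) for v, p, n in levels)


if __name__ == "__main__":
    # Two hypothetical feature levels with 2-dimensional toy features.
    levels = [
        ([1.0, 0.0], [1.0, 0.0], [0.0, 1.0]),
        ([0.0, 2.0], [0.0, 1.5], [2.0, 0.0]),
    ]
    print(hierarchical_sync_loss(levels))
```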

This review was created by AI and reviewed by human editors.