QUICK REVIEW

[Paper Review] Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

Jun Yu, Naixiang Zheng | arXiv (Cornell University) | Mar 9, 2026

Emotion and Mood Recognition | Cited by: 0

ONE-SENTENCE SUMMARY

The paper presents a dual-branch Transformer model that fuses visual and audio features with safe cross-attention and modality dropout to handle missing modalities, achieving 60.79% accuracy and 0.5029 F1 on Aff-Wild2 validation.

ABSTRACT

Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.

RESEARCH MOTIVATION AND GOALS

Address emotion recognition in-the-wild with occlusions and missing modalities.
Improve robustness to long-tail class distribution in Aff-Wild2 via focal loss.
Capture dynamic spatiotemporal dependencies with sliding-window soft voting.
Enable graceful degradation to audio-only predictions when visuals are unavailable.
Evaluate architecture choices and modality contributions on Aff-Wild2.
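
The focal-loss objective named above can be illustrated with a minimal plain-Python sketch. It scales each frame's cross-entropy by (1 - p_t)^gamma, so frames the model already classifies confidently contribute less and rare, hard classes contribute relatively more. The value gamma = 2.0 and the marker IGNORE_LABEL = -1 for invalid frames are assumptions made for illustration; this review does not state the paper's settings, and the authors' implementation may differ.

```python
import math

IGNORE_LABEL = -1  # hypothetical marker for invalid frames (assumption)


def softmax(logits):
    """Numerically stable softmax over one frame's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(batch_logits, labels, gamma=2.0, ignore_label=IGNORE_LABEL):
    """Mean focal loss over valid frames.

    Frames whose label equals ignore_label are skipped entirely.
    gamma=2.0 is an assumed default, not a value taken from the paper.
    """
    losses = []
    for logits, y in zip(batch_logits, labels):
        if y == ignore_label:
            continue  # invalid frame: contributes nothing to the loss
        p_t = softmax(logits)[y]  # probability assigned to the true class
        losses.append(-((1.0 - p_t) ** gamma) * math.log(p_t))
    return sum(losses) / len(losses) if losses else 0.0
```

With gamma = 0 the function reduces to ordinary cross-entropy, which makes it easy to check against a known baseline.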

PROPOSED METHOD

Two-stage visual and audio feature extraction using BEiT-large for visuals and WavLM-large for audio.
Dual-branch Transformer with cross-attention for inter-modality interaction and a learnable gating fusion mechanism.
Modality dropout during training and a safe attention mechanism to handle complete visual absence.
Focal loss to mitigate long-tail class imbalance, with invalid frames ignored in loss.
Inference with overlapping sliding windows and logit-based soft voting, followed by median filtering for temporal smoothing.
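
To make the modality dropout and safe attention steps above concrete, here is a small plain-Python sketch of one plausible reading. A masked softmax returns all-zero weights when every visual key is missing (instead of dividing by zero), so a residual audio branch passes its features through unchanged. The function names, the single-query formulation, and the residual fusion are illustrative assumptions, not the authors' code; the review only states that a dropout probability of 0.10 worked best.

```python
import math
import random


def safe_masked_softmax(scores, valid):
    """Softmax over valid positions; all zeros if no position is valid."""
    kept = [s for s, v in zip(scores, valid) if v]
    if not kept:
        return [0.0] * len(scores)  # avoids the 0/0 of a fully masked softmax
    m = max(kept)
    exps = [math.exp(s - m) if v else 0.0 for s, v in zip(scores, valid)]
    total = sum(exps)
    return [e / total for e in exps]


def safe_cross_attention(query, keys, values, valid):
    """Single-query scaled dot-product attention over visual frames.

    Assumes at least one key/value pair is passed in. Returns a zero
    vector when every visual frame is marked invalid.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = safe_masked_softmax(scores, valid)
    dim = len(values[0])
    return [sum(w * val[d] for w, val in zip(weights, values)) for d in range(dim)]


def modality_dropout(visual_valid, p=0.10, rng=random):
    """With probability p, mark every visual frame of a sample as missing."""
    if rng.random() < p:
        return [False] * len(visual_valid)
    return list(visual_valid)


def fuse_audio_with_visual(audio_query, keys, values, valid):
    """Residual fusion (assumed): audio features plus attended visual context."""
    context = safe_cross_attention(audio_query, keys, values, valid)
    return [a + c for a, c in zip(audio_query, context)]
```

When all visual frames are masked, `fuse_audio_with_visual` returns the audio features as they are, which is the audio-only fallback behaviour described above.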

EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How can multimodal fusion be made robust to missing modalities in unconstrained facial expression recognition?
RQ2: Does safe cross-attention with modality dropout improve performance under occlusion or visual dropouts?
RQ3: Can focal loss and sliding-window inference mitigate long-tail and temporal jitter issues in Aff-Wild2?
RQ4: What is the relative contribution of visual vs. audio modalities for expression recognition in-the-wild?
RQ5: What architectural configurations balance performance and generalization on Aff-Wild2?

KEY FINDINGS

The framework achieves 60.79% accuracy and 0.5029 F1 on the Aff-Wild2 validation set.
Visual features are the dominant modality, but audio provides complementary cues that improve fusion performance.
Modality dropout (p = 0.10) improves robustness and fault tolerance; higher p degrades performance.
Safe cross-attention enables graceful degradation to audio-only predictions when visuals are missing.
Sliding-window soft voting and median filtering reduce frame-level jitter and capture emotional transitions.
Among the tested visual backbones, BEiT-large yields the best validation performance (accuracy 0.5421, F1 0.4268).
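
The sliding-window soft voting and median filtering reported in the findings can be sketched in plain Python as follows. Overlapping windows each produce per-frame logits, the logits for each frame are averaged across every window that covers it, and the resulting label sequence is smoothed with a median filter. Window length, stride, kernel size, and all function names are hypothetical choices for illustration; the review does not give the paper's values. Taking a median over class indices also treats the labels as ordered numbers, which is a simplification.

```python
def sliding_windows(num_frames, window, stride):
    """(start, end) index pairs of overlapping windows covering all frames."""
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)  # extra window so the tail is covered
    return [(s, s + window) for s in starts]


def soft_vote(num_frames, num_classes, window_logits):
    """Average per-frame logits over all windows that cover each frame.

    window_logits is a list of (start, logits) pairs, where logits holds
    one list of class logits per frame in that window.
    """
    sums = [[0.0] * num_classes for _ in range(num_frames)]
    counts = [0] * num_frames
    for start, logits in window_logits:
        for offset, frame_logits in enumerate(logits):
            t = start + offset
            counts[t] += 1
            for c, value in enumerate(frame_logits):
                sums[t][c] += value
    return [
        [v / counts[t] for v in sums[t]] if counts[t] else sums[t]
        for t in range(num_frames)
    ]


def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def median_filter(labels, kernel=5):
    """Median filter over a label sequence with edge replication (odd kernel)."""
    half = kernel // 2
    n = len(labels)
    smoothed = []
    for t in range(n):
        neighbours = [labels[min(max(i, 0), n - 1)] for i in range(t - half, t + half + 1)]
        smoothed.append(sorted(neighbours)[half])
    return smoothed
```

A typical use would be to run the model on each window from `sliding_windows`, pass the outputs to `soft_vote`, take `argmax` per frame, and then apply `median_filter` to remove isolated single-frame label flips.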

This review was generated by AI and reviewed by a human editor.