QUICK REVIEW

[Paper Review] Solution to the 10th ABAW Expression Recognition Challenge: A Robust Multimodal Framework with Safe Cross-Attention and Modality Dropout

Jun Yu, Naixiang Zheng|arXiv (Cornell University)|Mar 9, 2026

Emotion and Mood Recognition | Citations: 0

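One-Sentence Summary

This paper proposes a dual-branch Transformer that fuses visual and audio features, using safe cross-attention and modality dropout to cope with missing modalities, and reports 60.79% accuracy and an F1-score of 0.5029 on the Aff-Wild2 validation set.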

ABSTRACT

Emotion recognition in real-world environments is hindered by partial occlusions, missing modalities, and severe class imbalance. To address these issues, particularly for the Affective Behavior Analysis in-the-wild (ABAW) Expression challenge, we propose a multimodal framework that dynamically fuses visual and audio representations. Our approach uses a dual-branch Transformer architecture featuring a safe cross-attention mechanism and a modality dropout strategy. This design allows the network to rely on audio-based predictions when visual cues are absent. To mitigate the long-tail distribution of the Aff-Wild2 dataset, we apply focal loss optimization, combined with a sliding-window soft voting strategy to capture dynamic emotional transitions and reduce frame-level classification jitter. Experiments demonstrate that our framework effectively handles missing modalities and complex spatiotemporal dependencies, achieving an accuracy of 60.79% and an F1-score of 0.5029 on the Aff-Wild2 validation set.
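The focal loss mentioned in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the focusing parameter `gamma=2.0` and the `IGNORE_INDEX = -1` marker for frames without a valid label are assumptions made for illustration, and the source does not state either value.

```python
import math

IGNORE_INDEX = -1  # assumed marker for frames without a valid label


def softmax(logits):
    """Numerically stable softmax over one frame's class logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def focal_loss(frame_logits, labels, gamma=2.0, ignore_index=IGNORE_INDEX):
    """Mean focal loss over valid frames: -(1 - p_t)^gamma * log(p_t).

    gamma=2.0 is an assumed default, not a value taken from the paper.
    Frames labelled ignore_index contribute nothing to the loss.
    """
    losses = []
    for logits, label in zip(frame_logits, labels):
        if label == ignore_index:
            continue  # skip invalid frames
        p_t = max(softmax(logits)[label], 1e-12)  # guard against log(0)
        losses.append(-((1.0 - p_t) ** gamma) * math.log(p_t))
    return sum(losses) / len(losses) if losses else 0.0
```

With `gamma=0` the expression reduces to ordinary cross-entropy. Larger values shrink the loss on frames the model already classifies confidently, which is why focal loss is commonly used for long-tailed label distributions.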

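Research Motivation and Objectives

Study in-the-wild facial expression recognition under occlusion and missing modalities.
Improve robustness to the long-tailed class distribution of Aff-Wild2 with focal loss.
Capture dynamic spatiotemporal dependencies with sliding-window soft voting.
Degrade gracefully to audio-only prediction when visual information is unavailable.
Evaluate architecture choices and per-modality contributions on Aff-Wild2.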

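Proposed Method

Two-stage visual and audio feature extraction, using BEiT-large for vision and WavLM-large for audio.
A dual-branch Transformer with cross-attention for inter-modality interaction and a learnable gated fusion mechanism.
Modality dropout during training, plus a safe attention mechanism for the case where visual input is entirely missing.
Focal loss to mitigate long-tailed class imbalance; invalid frames are ignored in the loss.
Inference with overlapping sliding windows and logit-based soft voting, followed by median filtering for temporal smoothing.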
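One plausible reading of "safe" cross-attention combined with modality dropout is sketched below in pure Python. The review does not describe the mechanism in detail, so everything here is an assumption: the function names are hypothetical, a single audio query attends over visual frames, a zero context vector is returned when no visual frame is valid, and a plain residual sum stands in for the paper's learnable gated fusion.

```python
import math
import random


def safe_cross_attention(query, keys, values, valid):
    """Attend from one audio query over visual keys/values, skipping invalid frames.

    Returns a zero vector when no visual frame is valid, instead of the NaN
    that a softmax over an entirely masked row would produce.
    """
    dim = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
        for key, ok in zip(keys, valid) if ok
    ]
    kept = [v for v, ok in zip(values, valid) if ok]
    if not kept:
        # Assumed fallback: no visual evidence, so contribute nothing.
        return [0.0] * (len(values[0]) if values else dim)
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [
        sum(w * v[i] for w, v in zip(weights, kept)) / total
        for i in range(len(kept[0]))
    ]


def fuse(audio_feat, context):
    """Residual fusion (a simplification of gated fusion).

    With a zero context this reduces to the audio feature alone.
    """
    return [a + c for a, c in zip(audio_feat, context)]


def modality_dropout(valid, p=0.10, rng=random):
    """With probability p, mark every visual frame of a sample as missing."""
    if rng.random() < p:
        return [False] * len(valid)
    return list(valid)
```

During training, `modality_dropout` would be applied to each sample's visual validity mask before attention, so the model regularly sees the all-missing case and learns to predict from audio alone. At inference, the same zero-context path is taken whenever no visual frame is valid.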

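Experimental Results

Research Questions

RQ1: How can multimodal fusion be made robust to missing modalities in unconstrained expression recognition?
RQ2: Do safe cross-attention and modality dropout improve performance under missing or occluded input?
RQ3: Can focal loss and sliding-window inference mitigate the long-tail and temporal jitter problems in Aff-Wild2?
RQ4: How much do the visual and audio modalities each contribute to in-the-wild expression recognition?
RQ5: Which architecture configuration best balances performance and generalization on Aff-Wild2?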

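Key Findings

The framework achieves 60.79% accuracy and an F1-score of 0.5029 on the Aff-Wild2 validation set.
Visual features are the dominant modality, while audio supplies complementary cues that improve fused performance.
Modality dropout (p = 0.10) improves robustness and fault tolerance; higher values of p degrade performance.
Safe cross-attention allows graceful degradation to audio-only prediction when visual input is missing.
Sliding-window soft voting and median filtering reduce frame-level jitter and capture emotional transitions.
The BEiT-large visual backbone gives the best validation performance (BEiT-large: Acc 0.5421, F1 0.4268).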
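The inference-time smoothing described above (overlapping windows, logit-based soft voting, then median filtering) can be sketched as follows. This is an illustrative pure-Python version under stated assumptions, not the authors' code: the function names, the window layout given by `starts`, and the filter width `kernel=5` are all hypothetical, since the review does not report window length, stride, or kernel size.

```python
from statistics import median_low


def soft_vote(window_logits, starts, num_frames, num_classes):
    """Average the logits of every window covering a frame, then take the argmax.

    window_logits[i] holds per-frame logits for the window starting at starts[i].
    Frames covered by no window default to class 0.
    """
    sums = [[0.0] * num_classes for _ in range(num_frames)]
    counts = [0] * num_frames
    for logits, start in zip(window_logits, starts):
        for offset, frame_logits in enumerate(logits):
            t = start + offset
            if t >= num_frames:
                break  # window runs past the end of the video
            counts[t] += 1
            for c, value in enumerate(frame_logits):
                sums[t][c] += value
    preds = []
    for t in range(num_frames):
        n = max(counts[t], 1)
        mean = [s / n for s in sums[t]]
        preds.append(max(range(num_classes), key=lambda c: mean[c]))
    return preds


def median_filter(labels, kernel=5):
    """Sliding median over predicted class indices, truncated at the edges.

    median_low always returns a value present in the window, so the output
    stays a valid class index even for even-length edge windows.
    """
    half = kernel // 2
    return [
        median_low(labels[max(0, t - half): t + half + 1])
        for t in range(len(labels))
    ]
```

Averaging logits rather than hard labels lets a confident window outvote an uncertain one on the frames they share. The median filter then removes isolated single-frame flips that survive the vote.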


This review was generated by AI and checked by a human editor.