QUICK REVIEW

[Paper Review] Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

Elena Ryumina, Alexandr Axyonov|arXiv (Cornell University)|Mar 13, 2026

Emotion and Mood Recognition | Citations: 0

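One-Line Summary

The paper presents a four-modality fusion approach with prototype-augmented classification for video-level ambivalence/hesitancy recognition. The best fusion model reaches an average MF1 of 83.25%, and an ensemble of five prototype-augmented fusion models reaches a final test MF1 of 71.43%.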

ABSTRACT

Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.

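Motivation and Objectives

Address ambivalence/hesitancy recognition in unconstrained videos, a subtle, multimodal, and context-dependent behavioral state.
Develop a four-modality pipeline (scene, face, audio, text) that learns compact unimodal embeddings for joint fusion.
Explore transformer-based fusion, including a prototype-augmented objective, to model dependencies between modalities.
Show on the BAH corpus that multimodal fusion outperforms the unimodal baselines, and demonstrate robustness through ensembling.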

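Proposed Method

Extract scene dynamics with a VideoMAE-based visual model.
Encode facial information by statistically pooling frame-level emotional embeddings from an EfficientNetB0 fine-tuned on AffectNet, then feed the pooled vector to an MLP.
Extract acoustic emotion features with EmotionWav2Vec2.0, model them over time with a Mamba or Transformer encoder, then pool.
Fine-tune text models (EmotionDistilRoBERTa, EmotionTextClassifier, and others) on the transcripts to model linguistic cues and obtain dense text embeddings.
Fuse the unimodal embeddings with a transformer-style multimodal module that uses modality tokens, add a prototype-based classification objective, and include modality masks for missing data.
Train the system in two stages: first the unimodal encoder for each modality, then fusion in a shared latent space, optionally adding prototype and diversity regularization terms to the loss.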
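The pooling, masking, and prototype ideas above can be illustrated with a minimal, dependency-free Python sketch. It is not the authors' implementation. It assumes that "statistical pooling" means concatenating the per-dimension mean and standard deviation, it swaps the transformer fusion module for a plain masked average, and it assumes cosine similarity for prototype matching. All function names (`stat_pool`, `masked_mean_fusion`, `prototype_predict`) are hypothetical.

```python
import math
from statistics import fmean, pstdev


def stat_pool(frames):
    """Concatenate the per-dimension mean and std over frame embeddings.

    Assumption: mean + population std; the review does not say which
    statistics the paper actually pools.
    """
    dims = list(zip(*frames))  # transpose: one tuple per embedding dimension
    return [fmean(d) for d in dims] + [pstdev(d) for d in dims]


def masked_mean_fusion(embeddings, mask):
    """Average only the modality embeddings whose mask entry is truthy.

    Stand-in for the transformer fusion module: it only shows how a
    modality mask keeps missing inputs out of the fused vector.
    """
    present = [e for e, m in zip(embeddings, mask) if m]
    if not present:
        raise ValueError("at least one modality must be available")
    return [fmean(d) for d in zip(*present)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def prototype_predict(z, prototypes):
    """Return the label whose prototype is most cosine-similar to z."""
    return max(prototypes, key=lambda label: cosine(z, prototypes[label]))


# Toy usage: two modalities present, one missing (mask entry 0).
fused = masked_mean_fusion([[1.0, 0.0], [3.0, 0.0], [9.0, 9.0]], [1, 1, 0])
label = prototype_predict(fused, {"A/H": [1.0, 0.0], "no A/H": [0.0, 1.0]})
```

In a trained system the prototypes would be learned vectors in the shared latent space rather than hand-written ones, and the similarity scores would feed an auxiliary loss term alongside the main classifier.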

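Experimental Results

Research Questions

RQ1: Can complementary cues from scene, face, audio, and text be exploited to achieve robust video-level ambivalence/hesitancy recognition?
RQ2: Does prototype-augmented fusion improve discriminability and generalization over standard fusion?
RQ3: How much does each modality contribute to final performance, and how does multimodal fusion compare with the unimodal baselines?
RQ4: Is the ensemble fusion performance robust on the unreleased private test data of the ABAW 10 A/H challenge?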

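Key Findings

Multimodal fusion outperforms all unimodal baselines in both development and testing.
Best unimodal average MF1: EmotionDistilRoBERTa = 70.02%; best fusion average MF1: prototype-augmented four-modality model = 83.25%.
The peak final test MF1, 71.43%, was achieved by an ensemble of five prototype-augmented fusion models.
Ablations show the largest improvement from combining scene and text, and using all four modalities gives the best overall result.
Prototype-augmented fusion provides an auxiliary signal that strengthens the final prediction, and ensembling improves generalization on the private test split.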


This review was generated by AI and checked by a human editor.