QUICK REVIEW

[Paper Review] Team LEYA in 10th ABAW Competition: Multimodal Ambivalence/Hesitancy Recognition Approach

Elena Ryumina, Alexandr Axyonov|arXiv (Cornell University)|2026. 03. 13.

Emotion and Mood Recognition | Citations: 0

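One-line summary

The paper presents a four-modality (scene, face, audio, text) multimodal fusion method with prototype-augmented classification for video-level ambivalence/hesitancy recognition; its best single fusion model reaches an average MF1 of 83.25%, and an ensemble of five prototype-augmented fusion models reaches a final test MF1 of 71.43%.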

ABSTRACT

Ambivalence/hesitancy recognition in unconstrained videos is a challenging problem due to the subtle, multimodal, and context-dependent nature of this behavioral state. In this paper, a multimodal approach for video-level ambivalence/hesitancy recognition is presented for the 10th ABAW Competition. The proposed approach integrates four complementary modalities: scene, face, audio, and text. Scene dynamics are captured with a VideoMAE-based model, facial information is encoded through emotional frame-level embeddings aggregated by statistical pooling, acoustic representations are extracted with EmotionWav2Vec2.0 and processed by a Mamba-based temporal encoder, and linguistic cues are modeled using fine-tuned transformer-based text models. The resulting unimodal embeddings are further combined using multimodal fusion models, including prototype-augmented variants. Experiments on the BAH corpus demonstrate clear gains of multimodal fusion over all unimodal baselines. The best unimodal configuration achieved an average MF1 of 70.02%, whereas the best multimodal fusion model reached 83.25%. The highest final test performance, 71.43%, was obtained by an ensemble of five prototype-augmented fusion models. The obtained results highlight the importance of complementary multimodal cues and robust fusion strategies for ambivalence/hesitancy recognition.

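Motivation and objectives

Motivate and address ambivalence/hesitancy recognition in unconstrained videos, a subtle and multimodal behavioral state.
Develop a four-modality pipeline (scene, face, audio, text) that learns compact unimodal embeddings for fusion.
Explore transformer-based fusion with prototype-augmented objectives to model cross-modal dependencies.
Show that multimodal fusion outperforms the unimodal baselines on the BAH corpus and that ensembling makes generalization more robust.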

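Proposed method

Scene dynamics are extracted with a VideoMAE-based visual model.
Frame-level emotion embeddings from an EfficientNetB0 fine-tuned on AffectNet are aggregated by statistical pooling and passed to an MLP.
Acoustic emotion features are extracted with EmotionWav2Vec2.0, modeled over time with a Mamba or Transformer encoder, and then pooled.
Transformer-based text models (e.g., EmotionDistilRoBERTa, EmotionTextClassifier) are fine-tuned on the transcripts to obtain dense text embeddings.
The unimodal embeddings are fused by a transformer-based multimodal module that uses modality tokens and a prototype-based classification objective, with modality masks to handle missing data.
Training proceeds in two stages: the encoder for each modality is trained first, then the shared latent fusion model is trained, optionally with additional prototype and diversity regularization losses.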
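The statistical pooling step for the face branch can be illustrated with a minimal sketch. It assumes the pooled statistics are the per-dimension mean and standard deviation; the review does not list which statistics the authors actually use, and the function name `statistical_pooling` and the toy numbers are hypothetical.

```python
from statistics import fmean, pstdev


def statistical_pooling(frame_embeddings):
    """Collapse a variable-length list of frame embeddings into one vector.

    Assumption: per-dimension mean and population standard deviation.
    The exact statistics used in the paper are not given in this review.
    """
    if not frame_embeddings:
        raise ValueError("need at least one frame embedding")
    # Transpose so that each tuple holds one dimension across all frames.
    dims = list(zip(*frame_embeddings))
    means = [fmean(d) for d in dims]
    stds = [pstdev(d) for d in dims]
    # Concatenate the statistics: the output size is fixed at 2 * embedding_dim.
    return means + stds


# Toy example: 3 frames with 2-dimensional embeddings (made-up numbers).
frames = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
pooled = statistical_pooling(frames)
```

Whatever the number of frames, the pooled vector has a fixed length, which is what lets a plain MLP consume videos of different durations.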

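Experimental results

Research questions

RQ1: Can robust video-level ambivalence/hesitancy recognition be achieved by exploiting complementary cues from scene, face, audio, and text?
RQ2: Does prototype-augmented fusion improve discriminability and generalization over standard fusion?
RQ3: How much does each modality contribute to final performance, and how does multimodal fusion compare with the unimodal baselines?
RQ4: Is ensemble fusion performance robust on the unseen, private test data of the 10th ABAW A/H challenge?

Key results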

Type	ID	Modality	Features	Classifier	Devel. / Valid. (MF1, %)	Test (MF1, %)	Average (MF1, %)	Final test (MF1, %)
Face	1	Face	EmotionEfficientNetB0 + Statistical Features	MLP	65.29	60.05	62.67	–
Scene	2	Scene	VideoMAE	Linear Layer	61.71	62.21	61.96	–
Audio	3	Audio	EmotionWav2Vec2.0 + Mamba	Linear Layer	67.20	70.87	69.03	–
Text	4	Text	TF-IDF	Logistic Regression	68.30	67.75	68.03	–
Text	5	Text	TF-IDF	CatBoost	65.56	72.02	68.79	–
Text	6	Text	Fine-tuned EmotionTextClassifier	MLP	69.28	70.72	70.00	–
Text	7	Text	Fine-tuned EmotionDistilRoBERTa	MLP	68.54	71.49	70.02	–
Multimodal	8	Models IDs 1, 2, 3 and 4	Multimodal Fusion Model	Linear Layer	80.79	77.03	78.91	–
Multimodal	9	Models IDs 1, 2, 3 and 5	Multimodal Fusion Model	Linear Layer	77.91	78.54	78.22	–
Multimodal	10	Models IDs 1, 2, 3 and 6	Multimodal Fusion Model	Linear Layer	78.35	77.03	77.69	–
Multimodal	11	Models IDs 1, 2, 3 and 7	Multimodal Fusion Model	Linear Layer	85.38	79.94	82.66	68.32
Multimodal	12	Models IDs 1, 2, 3 and 7	Multimodal Fusion Model with Prototype Head	Linear Layer	83.79	82.72	83.25	65.21
Multimodal	13	Models IDs 1, 2, 3 and 7	Ensemble of Five Multimodal Fusion Models	Linear Layer	81.94	80.64	81.29	70.17
Multimodal	14	Models IDs 1, 2, 3 and 7	Ensemble of Five Multimodal Fusion Models with Prototype Head	Linear Layer	83.00	80.77	81.89	71.43
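The Average column in the table is consistent with the arithmetic mean of the development and test scores. The sketch below shows how such numbers could be computed, assuming MF1 denotes macro-averaged F1 over the two classes, which the review does not define explicitly. The helper names and toy labels are hypothetical.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for a single class treated as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 (assumed meaning of MF1)."""
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def average_mf1(devel, test):
    """Arithmetic mean of two MF1 scores, as in the Average column."""
    return (devel + test) / 2


# Toy labels (made-up): 1 = ambivalence/hesitancy present, 0 = absent.
y_true = [1, 1, 0, 0]
y_pred = [1, 0, 0, 0]
score = macro_f1(y_true, y_pred)

# Row 11 of the table: devel 85.38 and test 79.94.
row11_average = average_mf1(85.38, 79.94)
```

For row 11, (85.38 + 79.94) / 2 = 82.66, which matches the reported Average value.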

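Multimodal fusion outperforms every unimodal baseline on both the development and test sets.
Best unimodal average MF1: EmotionDistilRoBERTa, 70.02%. Best fusion average MF1: the prototype-augmented four-modality model, 83.25%.
The peak final test MF1, 71.43%, is achieved by an ensemble of five prototype-augmented fusion models.
In the ablation analysis, combining scene and text yields the largest gain, and combining all four modalities gives the best overall result.
Prototype-augmented fusion provides an auxiliary signal that improves the final predictions, and ensembling improves generalization on the private test split.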
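To make the prototype head and the ensemble more concrete, here is a minimal sketch under stated assumptions. It assumes class scores come from the cosine similarity between a fused embedding and one prototype per class, and that the ensemble averages class probabilities across models. The review confirms neither detail. All names, the temperature value, and the toy vectors are hypothetical, and for brevity every model shares the same prototypes.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def prototype_probs(embedding, prototypes, temperature=0.1):
    """Class probabilities from similarity to class prototypes (assumed form)."""
    sims = [cosine(embedding, p) / temperature for p in prototypes]
    return softmax(sims)


def ensemble_predict(per_model_probs):
    """Average class probabilities over models and pick the most likely class."""
    n = len(per_model_probs)
    mean = [sum(col) / n for col in zip(*per_model_probs)]
    label = max(range(len(mean)), key=mean.__getitem__)
    return label, mean


# Toy setup (made-up): one prototype per class, shared by all models.
prototypes = [[1.0, 0.0], [0.0, 1.0]]  # class 0, class 1
# Fused embeddings of the same video from three hypothetical fusion models.
embeddings_from_models = [[0.9, 0.1], [0.2, 0.8], [0.7, 0.3]]
probs = [prototype_probs(e, prototypes) for e in embeddings_from_models]
label, mean_probs = ensemble_predict(probs)
```

In this toy case two of the three models place the video close to the class 0 prototype, so the averaged probabilities favor class 0 even though one model disagrees.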

This review was generated by AI and reviewed by a human editor.