QUICK REVIEW

[Paper Review] MEGC2026: Micro-Expression Grand Challenge on Visual Question Answering

Xinqi Fan, Jingting Li|arXiv (Cornell University)|Mar 9, 2026

Multimodal Machine Learning Applications | Citations: 0

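One-line summary

MEGC2026 introduces short-video and long-video VQA tasks that leverage multimodal large language models; baseline results reveal major challenges in micro-expression understanding and temporal reasoning.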

ABSTRACT

Facial micro-expressions (MEs) are involuntary movements of the face that occur spontaneously when a person experiences an emotion but attempts to suppress or repress the facial expression, typically found in a high-stakes environment. In recent years, substantial advancements have been made in the areas of ME recognition, spotting, and generation. The emergence of multimodal large language models (MLLMs) and large vision-language models (LVLMs) offers promising new avenues for enhancing ME analysis through their powerful multimodal reasoning capabilities. The ME grand challenge (MEGC) 2026 introduces two tasks that reflect these evolving research directions: (1) ME video question answering (ME-VQA), which explores ME understanding through visual question answering on relatively short video sequences, leveraging MLLMs or LVLMs to address diverse question types related to MEs; and (2) ME long-video question answering (ME-LVQA), which extends VQA to long-duration video sequences in realistic settings, requiring models to handle temporal reasoning and subtle micro-expression detection across extended time periods. All participating algorithms are required to submit their results on a public leaderboard. More details are available at https://megc2026.github.io.

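Motivation and objectives

Motivates micro-expression (ME) analysis through multimodal reasoning with visual question answering.
Adds the short-video ME-VQA task and the long-video ME-LVQA task to test temporal reasoning.
Provides ME-VQA and ME-LVQA datasets and baselines to benchmark VLM/LVLM approaches.
Evaluates coarse-grained and fine-grained emotion understanding, together with language generation quality, in the ME context.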

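Proposed method

VQA is performed by combining frame-level and video-level ME cues with natural-language questions.
Two tasks: ME-VQA (short clips) and ME-LVQA (long videos).
The baselines evaluate Qwen2.5VL-3B and Qwen3VL-4B in zero-shot and fine-tuned settings.
Fine-tuning uses QLoRA adapters on the visual encoder and the multimodal projection layer.
Evaluation uses UF1/UAR for emotion classification and BLEU/ROUGE for free-text answers.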
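The classification metrics named above can be sketched as follows. This is a minimal illustration, assuming the usual definitions: UF1 is the unweighted (macro) average of per-class F1 scores, and UAR is the unweighted average of per-class recall. The function name `uf1_uar` and the emotion labels are hypothetical, and the review does not say how the challenge aggregates scores (for example per dataset or per subject), so this should not be read as the official scoring script.

```python
from collections import Counter


def uf1_uar(y_true, y_pred, classes=None):
    """Return (UF1, UAR) for two equal-length label sequences.

    Assumed definitions (not taken from the challenge's scoring code):
      UF1 = mean over classes of 2*TP / (2*TP + FP + FN)
      UAR = mean over classes of TP / (TP + FN)
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if classes is None:
        # Average over the classes present in the ground truth.
        classes = sorted(set(y_true))

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fn[truth] += 1  # the true class was missed
            fp[pred] += 1   # the predicted class was a false alarm

    f1_scores, recalls = [], []
    for c in classes:
        f1_denom = 2 * tp[c] + fp[c] + fn[c]
        f1_scores.append(2 * tp[c] / f1_denom if f1_denom else 0.0)
        support = tp[c] + fn[c]
        recalls.append(tp[c] / support if support else 0.0)

    return sum(f1_scores) / len(classes), sum(recalls) / len(classes)


# Hypothetical coarse-grained labels, for illustration only.
truth = ["positive", "positive", "negative", "negative", "surprise", "surprise"]
preds = ["positive", "negative", "negative", "negative", "surprise", "positive"]
uf1, uar = uf1_uar(truth, preds)
print(f"UF1={uf1:.3f} UAR={uar:.3f}")
```

Because both metrics average over classes rather than samples, a model that only predicts the majority emotion scores poorly, which is why they suit the imbalanced label distributions typical of ME datasets.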

Figure 1: An overview of Micro-Expression Grand Challenges (MEGCs).

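Experimental results

Research questions

RQ1: Can MLLMs/LVLMs perform both coarse-grained and fine-grained ME understanding in a short-video QA setting?
RQ2: How well can models handle temporal reasoning and ME detection in long, naturalistic videos?
RQ3: What are the limits of fine-tuning for ME counting, action unit (AU) recognition, and answer quality in the ME context?
RQ4: Does fine-tuning improve language generation quality more than ME recognition accuracy?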

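Key findings

Zero-shot coarse-grained ME-VQA reaches moderate performance (UF1/UAR of about 0.24–0.33), but fine-grained ME recognition is near zero in most cases.
Fine-tuning yields modest improvements on ME-VQA for CAS(ME)³, especially on coarse-grained metrics.
For some models, fine-tuning improves the language-quality metrics (BLEU/ROUGE) more than the ME classification metrics.
ME-LVQA results show substantially higher error rates, indicating serious challenges in temporal localization and fine-grained AU modeling for long videos.
Limited fine-tuning (10 subjects) may reduce generalization to unseen individuals, suggesting robustness issues in long-video ME understanding.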
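To make concrete why BLEU/ROUGE can improve while classification accuracy does not, here is a minimal sketch of unigram versions of both metrics. The review does not state which BLEU or ROUGE variants the challenge uses, so BLEU-1 with a brevity penalty and ROUGE-1 F1 are assumptions chosen for brevity, and the function names and example sentences are hypothetical.

```python
import math
from collections import Counter


def _unigram_stats(reference, candidate):
    """Return (clipped overlap, reference length, candidate length)."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per token (clipping).
    overlap = sum((ref & cand).values())
    return overlap, sum(ref.values()), sum(cand.values())


def bleu1(reference, candidate):
    """Unigram precision multiplied by the BLEU brevity penalty."""
    overlap, ref_len, cand_len = _unigram_stats(reference, candidate)
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    brevity = 1.0 if cand_len >= ref_len else math.exp(1 - ref_len / cand_len)
    return brevity * precision


def rouge1_f(reference, candidate):
    """Harmonic mean of unigram precision and unigram recall."""
    overlap, ref_len, cand_len = _unigram_stats(reference, candidate)
    if overlap == 0:
        return 0.0
    precision = overlap / cand_len
    recall = overlap / ref_len
    return 2 * precision * recall / (precision + recall)


# Hypothetical answers: the wording matches closely, but the emotion is wrong.
reference = "the micro-expression shows surprise"
candidate = "the micro-expression shows disgust"
print(bleu1(reference, candidate), rouge1_f(reference, candidate))
```

In this illustration three of the four tokens match, so both scores are high even though the predicted emotion is incorrect. Overlap metrics reward learning the answer template, which is consistent with the finding that fine-tuning helps language quality more than ME recognition.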

This review was generated by AI and checked by a human editor.