QUICK REVIEW

[Paper Review] Expression, Affect, Action Unit Recognition: Aff-Wild2, Multi-Task Learning and ArcFace

Dimitrios Kollias, Stefanos Zafeiriou | arXiv (Cornell University) | Sep 25, 2019

Emotion and Mood Recognition | References: 45 | Citations: 90

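One-Line Summary

This paper introduces Aff-Wild2, an in-the-wild audiovisual database annotated for valence-arousal, action units, and basic expressions. Multi-task CNN/CNN-RNN models trained on it achieve state-of-the-art results on 10 emotion databases, and networks trained with an adapted ArcFace loss further improve expression recognition.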

ABSTRACT

Affective computing has been largely limited in terms of available data resources. The need to collect and annotate diverse in-the-wild datasets has become apparent with the rise of deep learning models, as the default approach to address any computer vision task. Some in-the-wild databases have been recently proposed. However: i) their size is small, ii) they are not audiovisual, iii) only a small part is manually annotated, iv) they contain a small number of subjects, or v) they are not annotated for all main behavior tasks (valence-arousal estimation, action unit detection and basic expression classification). To address these, we substantially extend the largest available in-the-wild database (Aff-Wild) to study continuous emotions such as valence and arousal. Furthermore, we annotate parts of the database with basic expressions and action units. As a consequence, for the first time, this allows the joint study of all three types of behavior states. We call this database Aff-Wild2. We conduct extensive experiments with CNN and CNN-RNN architectures that use visual and audio modalities; these networks are trained on Aff-Wild2 and their performance is then evaluated on 10 publicly available emotion databases. We show that the networks achieve state-of-the-art performance for the emotion recognition tasks. Additionally, we adapt the ArcFace loss function in the emotion recognition context and use it for training two new networks on Aff-Wild2 and then re-train them in a variety of diverse expression recognition databases. The networks are shown to improve the existing state-of-the-art. The database, emotion recognition models and source code are available at http://ibug.doc.ic.ac.uk/resources/aff-wild2.

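Motivation and Objectives

Extend Aff-Wild with continuous valence-arousal (VA) annotations and add action unit (AU) and basic expression (Expr) labels to create Aff-Wild2.
Enable joint study and joint learning of the VA, AU, and Expr tasks within a single framework.
Develop multi-task CNN/CNN-RNN architectures (visual, audio, and audiovisual) that use Aff-Wild2 for robust emotion recognition.
Adapt the ArcFace loss to the emotion recognition context to improve discriminative power and advance the state of the art.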
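
As a rough illustration of joint training when only parts of the database carry AU or Expr labels, the sketch below sums per-task losses and skips any task whose label is missing for a frame. The function names, the dictionary layout, the unweighted sum, and the use of mean squared error for VA are all assumptions made for illustration; the review does not state which losses or task weights the authors use.

```python
import math

EPS = 1e-12  # guards against log(0)

def mse(pred, target):
    """Mean squared error for a 2-value valence-arousal head (assumed loss)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def binary_cross_entropy(probs, labels):
    """Mean binary cross-entropy over independent AU probabilities."""
    return -sum(
        y * math.log(p + EPS) + (1 - y) * math.log(1 - p + EPS)
        for p, y in zip(probs, labels)
    ) / len(probs)

def cross_entropy(probs, label):
    """Cross-entropy for a single-label expression head."""
    return -math.log(probs[label] + EPS)

def multitask_loss(outputs, labels):
    """Sum the losses of the tasks that are annotated for this frame.

    `labels[task]` is None when the frame has no annotation for that task,
    so a frame with only VA labels contributes only the VA term.
    """
    total = 0.0
    if labels.get("va") is not None:
        total += mse(outputs["va"], labels["va"])
    if labels.get("au") is not None:
        total += binary_cross_entropy(outputs["au"], labels["au"])
    if labels.get("expr") is not None:
        total += cross_entropy(outputs["expr"], labels["expr"])
    return total
```

Masking missing tasks this way lets every annotated frame contribute to training, whichever subset of the three label types it carries.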

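Proposed Method

Aff-Wild2 is built by combining Aff-Wild with 260 new YouTube videos, for a total of 2,786,201 frames and 458 subjects.
Four experts annotate valence and arousal, and their annotations are averaged to give the final valence/arousal values; the process aims for the highest possible inter-annotator agreement.
A subset of Aff-Wild2 is annotated for 8 AUs (1, 2, 4, 6, 12, 15, 20, 25), and dedicated videos are annotated for the 7 basic expressions.
CNN-based models (MT-VGG, MT-VGG-RNN, A/V-MT-VGG-RNN) are pre-trained and fine-tuned for the VA, AU, and Expr tasks; temporal dynamics are modeled with GRUs.
ArcFace-based networks (MT-ArcRes, MT-ArcVGG) are introduced for expression recognition and re-trained on multiple databases to evaluate cross-database performance.
The ArcFace loss produces embeddings separated by an angular margin, and expression predictions are made by assigning each embedding to a class center based on cosine similarity.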
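
The angular-margin idea in the last item can be sketched in plain Python. This is a minimal illustration of the general ArcFace formulation (add a margin `m` to the angle between an embedding and its target class center, then scale by `s`), not the authors' implementation. The function names, the values `s=64.0` and `m=0.5`, and the toy class centers in the test are assumptions.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return sum(a * b for a, b in zip(l2_normalize(u), l2_normalize(v)))

def arcface_logits(embedding, centers, target, s=64.0, m=0.5):
    """Scaled cosine logits with an additive angular margin on the target class.

    This sketch omits the special handling real implementations apply
    when theta + m exceeds pi.
    """
    logits = []
    for j, center in enumerate(centers):
        cos_theta = max(-1.0, min(1.0, cosine(embedding, center)))
        if j == target:
            theta = math.acos(cos_theta)
            logits.append(s * math.cos(theta + m))  # penalized target logit
        else:
            logits.append(s * cos_theta)
    return logits

def softmax_cross_entropy(logits, target):
    """Numerically stable softmax cross-entropy for one sample."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]

def predict(embedding, centers):
    """Assign the embedding to the class center with the highest cosine similarity."""
    sims = [cosine(embedding, c) for c in centers]
    return max(range(len(centers)), key=sims.__getitem__)
```

During training the margin lowers the target-class logit, so the network has to pull each embedding closer in angle to its own class center to reach the same loss. At prediction time no margin is applied and the nearest center by cosine similarity is chosen.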

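Experimental Results

Research Questions

RQ1: Can Aff-Wild2's joint VA, AU, and Expr annotations effectively support multi-task learning?
RQ2: Do multi-task CNN/CNN-RNN architectures trained on Aff-Wild2 generalize to other in-the-wild emotion databases?
RQ3: Does the ArcFace loss improve expression recognition performance on emotion recognition benchmarks?
RQ4: How does audiovisual (A/V) fusion affect VA, AU, and Expr recognition compared with visual-only models?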

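Key Findings

Aff-Wild2 adds 1.413 million newly VA-annotated frames to Aff-Wild, for a total of 2.786 million frames, 558 videos, and 458 subjects.
The AU annotations cover 8 AUs over 397,800 frames, and the Expr annotations cover the 7 basic expressions over 403,758 frames.
The multi-task MT-VGG and MT-VGG-RNN, trained on Aff-Wild2, achieve state-of-the-art performance on the 10 emotion databases, with exceptions on only a few Expr-specific datasets.
Audiovisual (A/V) fusion generally improves VA performance over visual-only models: A/V-MT-VGG-RNN outperforms MT-VGG-RNN.
The ArcFace-based networks (MT-ArcRes, MT-ArcVGG), trained on Aff-Wild2 and then re-trained on a variety of databases, outperform several existing expression recognition methods.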
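
The frame counts reported above can be put in proportion with a few lines of arithmetic. The snippet uses only the figures quoted in this review; the new-frame count is the rounded 1.413 million, so its share is approximate. The variable names are illustrative.

```python
# Figures as quoted in this review.
TOTAL_FRAMES = 2_786_201
NEW_VA_FRAMES = 1_413_000   # rounded value ("1.413 million")
AU_FRAMES = 397_800
EXPR_FRAMES = 403_758
TOTAL_VIDEOS = 558
NEW_VIDEOS = 260

def share(part, whole):
    """Return `part` as a percentage of `whole`."""
    return 100.0 * part / whole

# Videos contributed by the original Aff-Wild (total minus new videos).
original_videos = TOTAL_VIDEOS - NEW_VIDEOS

au_share = share(AU_FRAMES, TOTAL_FRAMES)
expr_share = share(EXPR_FRAMES, TOTAL_FRAMES)
new_va_share = share(NEW_VA_FRAMES, TOTAL_FRAMES)

print(f"Original Aff-Wild videos: {original_videos}")
print(f"AU-annotated share of frames:   {au_share:.1f}%")
print(f"Expr-annotated share of frames: {expr_share:.1f}%")
print(f"Newly added share of frames:    {new_va_share:.1f}%")
```

By these figures, roughly half of the frames are new, while the AU and Expr labels each cover only about 14% of all frames. That matches the abstract's statement that only parts of the database carry those annotations.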


This review was generated by AI and checked by a human editor.