QUICK REVIEW

[Paper Review] Expression, Affect, Action Unit Recognition: Aff-Wild2, Multi-Task Learning and ArcFace

Stefanos Zafeiriou|arXiv (Cornell University)|Jun 10, 2019

Emotion and Mood Recognition|Citations: 122

One-Line Summary

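The authors introduce Aff-Wild2, a large-scale in-the-wild audiovisual dataset annotated for valence and arousal (VA), action units (AUs), and basic expressions, and show that multi-task and ArcFace-based learning pipelines achieve state-of-the-art results on several emotion recognition databases.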

ABSTRACT

Affective computing has been largely limited in terms of available data resources. The need to collect and annotate diverse in-the-wild datasets has become apparent with the rise of deep learning models, as the default approach to address any computer vision task. Some in-the-wild databases have been recently proposed. However: i) their size is small, ii) they are not audiovisual, iii) only a small part is manually annotated, iv) they contain a small number of subjects, or v) they are not annotated for all main behavior tasks (valence-arousal estimation, action unit detection and basic expression classification). To address these, we substantially extend the largest available in-the-wild database (Aff-Wild) to study continuous emotions such as valence and arousal. Furthermore, we annotate parts of the database with basic expressions and action units. As a consequence, for the first time, this allows the joint study of all three types of behavior states. We call this database Aff-Wild2. We conduct extensive experiments with CNN and CNN-RNN architectures that use visual and audio modalities; these networks are trained on Aff-Wild2 and their performance is then evaluated on 10 publicly available emotion databases. We show that the networks achieve state-of-the-art performance for the emotion recognition tasks. Additionally, we adapt the ArcFace loss function in the emotion recognition context and use it for training two new networks on Aff-Wild2 and then re-train them in a variety of diverse expression recognition databases. The networks are shown to improve the existing state-of-the-art. The database, emotion recognition models and source code are available at http://ibug.doc.ic.ac.uk/resources/aff-wild2.

Motivation and Objectives

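Motivate the need for large, diverse in-the-wild datasets annotated for VA, AUs, and basic expressions.
Extend Aff-Wild into Aff-Wild2 with additional VA annotations and new AU and expression (Expr) annotations, enabling joint analysis of the three tasks.
Develop multi-task CNN and CNN-RNN architectures trained on Aff-Wild2 and evaluate them on 10 external databases.
Apply the ArcFace loss to emotional expression recognition: train ArcFace-based networks on Aff-Wild2, then re-train them on various expression databases to test their effectiveness.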

Proposed Method

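Introduce three preprocessing streams covering the visual modality (face crops) and the audio modality (spectrograms).
Train single-task and multi-task CNNs based on SphereFace-20, VGGFace, and Inception-ResNet, then extend them to multi-task CNN-RNNs and an audiovisual fusion model (A/V-MT-VGG-RNN).
Use standard multi-task learning losses: cross-entropy for expressions, binary cross-entropy for AUs, and MSE or CCC (concordance correlation coefficient) for VA, summed into a single multi-task objective.
Adapt the ArcFace loss (additive angular margin) to emotional expression recognition, yielding the MT-ArcRes and MT-ArcVGG networks.
Pre-train on Aff-Wild2 and evaluate on 10 public databases to measure cross-database generalization.
Provide two ArcFace-based networks, trained on Aff-Wild2 and re-trained on multiple expression databases, that improve on the state of the art.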
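
A minimal NumPy sketch of the multi-task objective described above, assuming a batch of per-frame predictions: softmax cross-entropy over expression classes, binary cross-entropy over AU activations, and a CCC-based loss (1 minus the mean CCC of valence and arousal) for VA, combined as a weighted sum. The function names, the equal default weights, the class/AU counts in the usage lines, and the choice of the CCC variant over MSE are illustrative assumptions, not the authors' code.

```python
import numpy as np


def softmax_cross_entropy(logits, labels):
    """Mean categorical cross-entropy for expression classification.

    logits: (N, C) raw class scores; labels: (N,) integer class ids.
    """
    shifted = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


def binary_cross_entropy(logits, targets):
    """Mean binary cross-entropy for multi-label AU detection.

    logits, targets: (N, K); targets hold a 0/1 activation per action unit.
    Uses the stable identity BCE(x, y) = log(1 + e^x) - y * x.
    """
    return np.mean(np.logaddexp(0.0, logits) - targets * logits)


def ccc(pred, gold):
    """Concordance correlation coefficient between two 1-D arrays."""
    pred_mean, gold_mean = pred.mean(), gold.mean()
    covariance = np.mean((pred - pred_mean) * (gold - gold_mean))
    return 2.0 * covariance / (pred.var() + gold.var() + (pred_mean - gold_mean) ** 2)


def multitask_loss(expr_logits, expr_labels, au_logits, au_labels,
                   va_pred, va_gold, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three task losses (hypothetical weighting).

    va_pred, va_gold: (N, 2) arrays, valence in column 0, arousal in column 1.
    """
    va_loss = 1.0 - 0.5 * (ccc(va_pred[:, 0], va_gold[:, 0])
                           + ccc(va_pred[:, 1], va_gold[:, 1]))
    w_expr, w_au, w_va = weights
    return (w_expr * softmax_cross_entropy(expr_logits, expr_labels)
            + w_au * binary_cross_entropy(au_logits, au_labels)
            + w_va * va_loss)


# Usage with random stand-in predictions: batch of 4, with 7 expression
# classes and 8 AUs chosen only for illustration.
rng = np.random.default_rng(0)
loss = multitask_loss(
    expr_logits=rng.normal(size=(4, 7)),
    expr_labels=np.array([0, 3, 6, 2]),
    au_logits=rng.normal(size=(4, 8)),
    au_labels=rng.integers(0, 2, size=(4, 8)),
    va_pred=rng.uniform(-1, 1, size=(4, 2)),
    va_gold=rng.uniform(-1, 1, size=(4, 2)),
)
print(f"multi-task loss: {loss:.4f}")
```

Using 1 - CCC for the VA term optimizes the same agreement measure usually reported for valence-arousal estimation; an MSE term can be swapped in if a per-sample loss is preferred. Because only parts of Aff-Wild2 carry expression and AU labels, a practical version would likely need to mask out task terms for frames lacking a given annotation; that masking is omitted here.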

実験結果

Research Questions

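RQ1: Can Aff-Wild2 support the joint in-the-wild recognition of VA, AUs, and Expr?
RQ2: Do multi-task CNN/CNN-RNN architectures trained on Aff-Wild2 generalize to other emotion databases?
RQ3: Is audiovisual fusion beneficial for the VA, AU, and Expr tasks in in-the-wild settings?
RQ4: Does the ArcFace loss, carried over from face recognition to emotion tasks, improve expression recognition performance?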

Key Findings

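Aff-Wild2 is the first large-scale in-the-wild audiovisual dataset annotated for VA, AUs, and basic expressions, enabling joint analysis of the three tasks.
The MT-VGG and MT-VGG-RNN architectures trained on Aff-Wild2 achieve state-of-the-art performance on the VA and Expr tasks across the 10 external emotion databases, and audiovisual fusion brings further improvement.
The ArcFace-based networks (MT-ArcRes, MT-ArcVGG), trained on Aff-Wild2 and re-trained on various expression databases, outperform competing methods and set new state-of-the-art results on several databases.
Cross-database evaluation across static (image) and video databases shows that Aff-Wild2 is a rich pre-training resource for robust emotion recognition models.
The ArcFace loss proves effective in the emotion recognition context, suggesting that angular-margin approaches are valuable for tasks beyond face identification.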
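
To make the angular-margin idea concrete, here is a hedged NumPy sketch of ArcFace-style logits, assuming L2-normalized embeddings and class weight vectors: the target-class logit becomes s·cos(θ_y + m), the other classes keep s·cos(θ_j), and ordinary softmax cross-entropy is applied on top. The scale s = 64 and margin m = 0.5 are common defaults from the face-recognition setting, not values reported for MT-ArcRes or MT-ArcVGG, and the function and variable names are hypothetical.

```python
import numpy as np


def arcface_logits(features, weight_matrix, labels, scale=64.0, margin=0.5):
    """ArcFace-style logits with an additive angular margin (sketch).

    features: (N, D) embeddings from the backbone's last hidden layer.
    weight_matrix: (C, D) one learnable vector per expression class.
    labels: (N,) integer class ids.
    """
    # L2-normalise both sides so the dot product equals cos(theta).
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weight_matrix / np.linalg.norm(weight_matrix, axis=1, keepdims=True)
    cos_theta = np.clip(f @ w.T, -1.0, 1.0)  # (N, C)
    rows = np.arange(len(labels))
    theta_target = np.arccos(cos_theta[rows, labels])
    logits = cos_theta.copy()
    # Add the margin m to the ground-truth angle only: cos(theta_y + m).
    # Common implementations also add a fallback for theta_y + m > pi;
    # it is omitted here for brevity.
    logits[rows, labels] = np.cos(theta_target + margin)
    return scale * logits


def arcface_loss(features, weight_matrix, labels, scale=64.0, margin=0.5):
    """Softmax cross-entropy on top of the margin-adjusted logits."""
    logits = arcface_logits(features, weight_matrix, labels, scale, margin)
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()


# Usage: with the margin, the same random embeddings typically incur a
# higher loss, i.e. the network is pushed toward tighter class clusters.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 16))
class_vectors = rng.normal(size=(7, 16))
labels = np.array([0, 3, 6, 2])
print(arcface_loss(feats, class_vectors, labels, margin=0.0),
      arcface_loss(feats, class_vectors, labels, margin=0.5))
```

Because the margin is added only to the angle of the correct class, an embedding must sit closer to its class vector than plain softmax would require before the loss becomes small, which is the intuition behind carrying the loss over from face identification to expression recognition.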

This review was generated by AI and checked by a human editor.