QUICK REVIEW

[Paper Review] Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning

Samarth Tripathi, Sarthak Tripathi | arXiv (Cornell University) | Apr 16, 2018

Emotion and Mood Recognition | References: 20 | Cited by: 69

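One-Sentence Summary

This paper proposes a modular multimodal emotion recognition system for IEMOCAP that uses speech, text, and motion capture data, fuses the per-modality models at their final layers, and tunes hyperparameters with the OpenOPT tool.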

ABSTRACT

Emotion recognition has become an important field of research in Human Computer Interactions as we improve upon the techniques for modelling the various aspects of behaviour. With the advancement of technology our understanding of emotions are advancing, there is a growing need for automatic emotion recognition systems. One of the directions the research is heading is the use of Neural Networks which are adept at estimating complex functions that depend on a large number and diverse source of input data. In this paper we attempt to exploit this effectiveness of Neural networks to enable us to perform multimodal Emotion recognition on IEMOCAP dataset using data from Speech, Text, and Motion capture data from face expressions, rotation and hand movements. Prior research has concentrated on Emotion detection from Speech on the IEMOCAP dataset, but our approach is the first that uses the multiple modes of data offered by IEMOCAP for a more robust and accurate emotion detection.

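Motivation and Objectives

Advance automatic emotion recognition for human-computer interaction.
Use multiple modalities (speech, text, MoCap) to improve robustness and accuracy.
Identify the best architecture for each modality before late fusion.
Achieve modularity, so that a missing modality does not require retraining the other components.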

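Proposed Method

Evaluate modality-specific architectures for speech, text, and MoCap to identify the top model for each modality.
Fuse the final-layer features of the best per-modality models, then classify with a 256-neuron fully connected layer followed by softmax.
Apply hyperparameter optimization (Auptimizer) to the final multimodal network.
Train on 77.7% of the data and validate on 22.2% (speaker-independent split).
Use 2D convolutions on the MoCap data, avoiding 3D CNNs, to speed up training.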
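
The final-layer fusion step described above can be sketched in plain Python. This is a minimal illustration rather than the authors' implementation: the feature sizes, the ReLU activation, the four-class output, and the random weights (standing in for trained parameters) are all assumptions, and names such as `late_fusion_predict` are hypothetical. Only the 256-unit fully connected layer followed by softmax comes from the review.

```python
import math
import random

NUM_CLASSES = 4      # assumption: a four-class emotion setup
HIDDEN_UNITS = 256   # from the review: 256-neuron fully connected layer


def dense(inputs, weights, biases):
    """Fully connected layer: one output per row of `weights`."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]


def relu(values):
    """Element-wise ReLU (the activation choice is an assumption)."""
    return [max(0.0, v) for v in values]


def softmax(logits):
    """Numerically stable softmax."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_out, n_in):
    """Small random weights standing in for trained parameters."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def late_fusion_predict(features_by_modality, hidden, output):
    """Concatenate final-layer features, then dense(256) + softmax."""
    fused = [x for name in sorted(features_by_modality)
             for x in features_by_modality[name]]
    h = relu(dense(fused, *hidden))
    return softmax(dense(h, *output))


rng = random.Random(0)
# Hypothetical final-layer feature vectors; the sizes are illustrative only.
features = {
    "speech": [rng.random() for _ in range(8)],
    "text": [rng.random() for _ in range(8)],
    "mocap": [rng.random() for _ in range(8)],
}
fused_dim = sum(len(v) for v in features.values())
hidden = random_layer(rng, HIDDEN_UNITS, fused_dim)
output = random_layer(rng, NUM_CLASSES, HIDDEN_UNITS)
probs = late_fusion_predict(features, hidden, output)
print(probs)  # one probability per emotion class
```

In a real system the three feature vectors would come from the trained per-modality networks, and the two layers would be learned on the fused features.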

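Experimental Results

Research Questions

RQ1: Can modality-specific deep learning models achieve strong emotion recognition on IEMOCAP for each modality?
RQ2: Does late fusion of the best per-modality models yield competitive multimodal performance?
RQ3: How does motion capture data (as opposed to video) affect multimodal emotion recognition?
RQ4: How does the proposed modular fusion compare with state-of-the-art multimodal architectures on IEMOCAP?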

Key Findings

Model	Accuracy
Text + Speech + Mocap Combined	71.04%
Poria [11]	71.59%

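The final multimodal model (Text_Model2 + Speech_Model4 + Mocap_Model1) achieves 71.04% accuracy.
Poria et al. [11] report 71.59% on the same task, so the proposed model is competitive, trailing by 0.55 percentage points.
Speech_Model4 (attention-based bidirectional LSTM) reaches 55.65% when evaluated as a single modality.
Text_Model2 (stacked LSTM with GloVe embeddings) reaches 64.68% accuracy.
The CNN+LSTM combination on MoCap face data (Face_Model2) gives the best single-modality performance among the MoCap variants, reaching 48.58–48.99% on head/hand/face data.
The modular late-fusion design allows any single-modality model to be replaced without retraining the other modality models.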
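
The modularity claim can be illustrated with a small sketch in which each modality model is an independent function that emits a fixed-size feature vector. Everything here is hypothetical: the encoder functions are toy stand-ins for trained, frozen networks, and `fuse` is an illustrative helper, not code from the paper. The sketch assumes the replacement model emits features of the same size as the one it replaces.

```python
def fuse(encoders, sample):
    """Concatenate each modality's final-layer features in a fixed order."""
    fused = []
    for name in sorted(encoders):
        fused.extend(encoders[name](sample[name]))
    return fused


# Hypothetical stand-ins for trained, frozen per-modality models.
def text_encoder(tokens):
    return [float(len(tokens)), float(sum(len(t) for t in tokens))]


def speech_encoder_v1(frames):
    return [sum(frames) / len(frames), max(frames)]


def speech_encoder_v2(frames):
    # Replacement speech model; it must emit the same feature size.
    return [min(frames), max(frames)]


def mocap_encoder(points):
    return [sum(points), float(len(points))]


encoders = {
    "text": text_encoder,
    "speech": speech_encoder_v1,
    "mocap": mocap_encoder,
}
sample = {
    "text": ["so", "happy"],
    "speech": [0.1, 0.4, 0.3],
    "mocap": [1.0, 2.0],
}

before = fuse(encoders, sample)
encoders["speech"] = speech_encoder_v2  # swap one modality model only
after = fuse(encoders, sample)

# The fused vector is ordered mocap, speech, text (sorted by name):
# only the speech slice changes after the swap.
print(before)
print(after)
```

Because the text and MoCap encoders are untouched, their slices of the fused vector are identical before and after the swap. Whether the fusion layer itself needs adjusting after a swap is not stated in the review.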

This review was generated by AI and checked by a human editor.