QUICK REVIEW

[Paper Review] Multi-Modal Emotion recognition on IEMOCAP Dataset using Deep Learning

Samarth Tripathi, Sarthak Tripathi | arXiv (Cornell University) | April 16, 2018

Emotion and Mood Recognition | References: 20 | Citations: 69

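One-Line Summary

This paper presents a modular multimodal emotion recognition system for IEMOCAP that uses speech, text, and motion capture data, fuses the modality-specific models at their final layers, and tunes hyperparameters with the OpenOPT tool.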

ABSTRACT

Emotion recognition has become an important field of research in Human Computer Interactions as we improve upon the techniques for modelling the various aspects of behaviour. With the advancement of technology our understanding of emotions are advancing, there is a growing need for automatic emotion recognition systems. One of the directions the research is heading is the use of Neural Networks which are adept at estimating complex functions that depend on a large number and diverse source of input data. In this paper we attempt to exploit this effectiveness of Neural networks to enable us to perform multimodal Emotion recognition on IEMOCAP dataset using data from Speech, Text, and Motion capture data from face expressions, rotation and hand movements. Prior research has concentrated on Emotion detection from Speech on the IEMOCAP dataset, but our approach is the first that uses the multiple modes of data offered by IEMOCAP for a more robust and accurate emotion detection.

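Motivation and Objectives

Motivate automatic emotion recognition for human-computer interaction.
Use multiple modalities (speech, text, MoCap) to improve robustness and accuracy.
Identify the best architecture for each modality before late fusion.
Keep the system modular so that a missing modality does not force every component to be retrained.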

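Proposed Method

Evaluate modality-specific architectures for speech, text, and MoCap to identify the best model for each.
Fuse the final-layer features of the best model from each modality, then classify them with a 256-node fully connected (FC) layer followed by softmax.
Run hyperparameter optimization on the final multimodal network.
Train on 77.7% of the data and test on the remaining 22.2%, using a speaker-agnostic split.
For MoCap data, avoid 3D CNNs and use 2D convolutions instead, which allows faster training.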
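
The late-fusion step described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the feature sizes (128, 256, 64), the ReLU activation, the random weights, and the choice of four emotion classes are all assumptions made for the example, and the names `late_fusion`, `dense`, and `init_layer` are hypothetical. Only the 256-node FC layer and the softmax come from the review.

```python
import math
import random

NUM_CLASSES = 4      # assumption: the review does not state the number of classes
HIDDEN_UNITS = 256   # from the review: a 256-node fully connected layer


def dense(x, weights, bias):
    """Affine map producing one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    # Assumption: the review does not name the hidden activation.
    return [max(0.0, v) for v in x]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def init_layer(n_in, n_out, rng):
    """Small random weights and zero biases; a trained network would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def late_fusion(features_by_modality, hidden, output):
    """Concatenate per-modality final-layer features, then apply FC(256) and softmax."""
    fused = [v for name in sorted(features_by_modality) for v in features_by_modality[name]]
    h = relu(dense(fused, *hidden))
    return softmax(dense(h, *output))


rng = random.Random(0)
# Hypothetical stand-ins for the final-layer outputs of the three unimodal models.
features = {
    "speech": [rng.random() for _ in range(128)],
    "text": [rng.random() for _ in range(256)],
    "mocap": [rng.random() for _ in range(64)],
}
n_in = sum(len(v) for v in features.values())
hidden = init_layer(n_in, HIDDEN_UNITS, rng)
output = init_layer(HIDDEN_UNITS, NUM_CLASSES, rng)
probs = late_fusion(features, hidden, output)
print(len(probs), round(sum(probs), 6))
```

Because the fusion head only sees feature vectors, any one unimodal model can be replaced by another that emits features of the same size without touching the other two, which is the modularity the review refers to.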

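Experimental Results

Research Questions

RQ1: Can modality-specific deep learning models achieve strong emotion recognition performance on IEMOCAP for every modality?
RQ2: Does combining the best modality models through late fusion yield competitive multimodal performance?
RQ3: What effect does using motion capture data (rather than video) have on multimodal emotion recognition?
RQ4: How does the proposed modular fusion compare with state-of-the-art multimodal architectures on IEMOCAP?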

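Key Results

Model	Accuracy
Text + Speech + Mocap Combined	71.04%
Poria [11]	71.59%

The final multimodal model (Text_Model2 + Speech_Model4 + Mocap_Model1) achieves 71.04% accuracy.
Poria et al. [11] reach 71.59% on the same task, which suggests the proposed model is competitive.
Speech_Model4 (an attention-based bidirectional LSTM) reaches 55.65% when evaluated as a single modality.
Text_Model2 (a stacked LSTM with GloVe embeddings) reaches 64.68% accuracy.
Among the MoCap variants, facial MoCap data with the CNN+LSTM combination (Face_Model2) gives the best single-modality performance (the head, hand, and face models score between 48.58% and 48.99%).
The modular late-fusion design allows a single-modality model to be swapped out without retraining the models for the other modalities.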
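
To make the comparison concrete, the reported figures can be turned into percentage-point gaps. The short sketch below uses only the accuracies listed above; the helper name `gap` is hypothetical.

```python
# Accuracies (%) exactly as reported in the results above.
accuracy = {
    "Text + Speech + Mocap Combined": 71.04,
    "Poria [11]": 71.59,
    "Speech_Model4": 55.65,
    "Text_Model2": 64.68,
}


def gap(a, b):
    """Difference accuracy[a] - accuracy[b] in percentage points, rounded to two decimals."""
    return round(accuracy[a] - accuracy[b], 2)


combined = "Text + Speech + Mocap Combined"
gap_to_poria = gap("Poria [11]", combined)         # how far the fused model trails Poria [11]
gain_over_text = gap(combined, "Text_Model2")      # gain of fusion over the text-only model
gain_over_speech = gap(combined, "Speech_Model4")  # gain of fusion over the speech-only model
print(gap_to_poria, gain_over_text, gain_over_speech)
```

On these numbers the fused model trails Poria [11] by 0.55 points while improving on the text-only and speech-only models by 6.36 and 15.39 points respectively.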


This review was generated by AI and reviewed by a human editor.