QUICK REVIEW

[Paper Review] EmoTech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information with Hybrid Recurrent Network

Shamin Bin Habib Avro, Taieba Taher|ArXiv.org|2025. 01. 22.

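Emotion and Mood Recognition | Citations: 3

One-Line Summary

EmoTech is a multimodal emotion recognition system that combines audio (MFCCs processed by a BiLSTM and Conv2D) with text (word embeddings processed by a BiLSTM and Conv1D), reaching about 84% accuracy over five emotion classes on IEMOCAP.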

ABSTRACT

Emotion recognition is a critical task in human-computer interaction, enabling more intuitive and responsive systems. This study presents a multimodal emotion recognition system that combines low-level information from audio and text, leveraging both Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory Networks (BiLSTMs). The proposed system consists of two parallel networks: an Audio Block and a Text Block. Mel Frequency Cepstral Coefficients (MFCCs) are extracted and processed by a BiLSTM network and a 2D convolutional network to capture low-level intrinsic and extrinsic features from speech. Simultaneously, a combined BiLSTM-CNN network extracts the low-level sequential nature of text from word embeddings corresponding to the available audio. This low-level information from speech and text is then concatenated and processed by several fully connected layers to classify the speech emotion. Experimental results demonstrate that the proposed EmoTech accurately recognizes emotions from combined audio and text inputs, achieving an overall accuracy of 84%. This solution outperforms previously proposed approaches for the same dataset and modalities.

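Motivation and Objectives

Motivate the use of complementary audio and text modalities to improve speech emotion recognition (SER).
Propose a two-branch architecture (an Audio Block and a Text Block) for extracting low-level features.
Fuse the audio and text features and classify emotions with a dense classifier.
Address class imbalance through data augmentation and evaluate on IEMOCAP.
Show that multimodal fusion outperforms unimodal approaches.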

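Proposed Method

The Audio Block takes MFCCs extracted from speech as input and processes them with a BiLSTM and a 2D CNN.
The Text Block processes the text transcripts by passing word embeddings through a BiLSTM and Conv1D, followed by global max pooling.
The outputs of the Audio Block and Text Block are concatenated and fed to a shared classifier with three dense layers and a softmax output.
The model is trained with the Adam optimizer and categorical cross-entropy loss, using 5-fold cross-validation on 5,633 augmented samples.
Data augmentation is applied to balance the classes and improve performance.
Total model parameters: 7,295,821.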
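The fusion step described above (concatenate the two branch outputs, pass them through dense layers, apply softmax, score with categorical cross-entropy) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the branch output sizes, the dense layer widths, the ReLU activations, the separate softmax output layer, and the random weights are all hypothetical, and the Audio Block and Text Block themselves are replaced by random stand-in vectors.

```python
import math
import random

NUM_CLASSES = 5  # the review reports five emotion classes on IEMOCAP


def dense(x, weights, bias, relu=True):
    """Fully connected layer: y = W x + b, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def categorical_cross_entropy(probs, true_class):
    """Loss for one sample whose one-hot target is at index true_class."""
    return -math.log(max(probs[true_class], 1e-12))


def init_layer(rng, n_in, n_out):
    """Random weights for illustration only (no training happens here)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def fuse_and_classify(audio_feat, text_feat, layers):
    """Concatenate both branch outputs, then apply the dense stack and softmax."""
    x = audio_feat + text_feat  # list concatenation acts as feature-level fusion
    for weights, bias in layers[:-1]:
        x = dense(x, weights, bias, relu=True)
    weights, bias = layers[-1]
    return softmax(dense(x, weights, bias, relu=False))


rng = random.Random(0)
AUDIO_DIM, TEXT_DIM = 8, 6  # hypothetical branch output sizes
HIDDEN = [16, 8, 4]         # hypothetical widths of the three dense layers

sizes = [AUDIO_DIM + TEXT_DIM] + HIDDEN + [NUM_CLASSES]
layers = [init_layer(rng, n_in, n_out) for n_in, n_out in zip(sizes, sizes[1:])]

audio_feat = [rng.random() for _ in range(AUDIO_DIM)]  # stand-in for Audio Block output
text_feat = [rng.random() for _ in range(TEXT_DIM)]    # stand-in for Text Block output

probs = fuse_and_classify(audio_feat, text_feat, layers)
loss = categorical_cross_entropy(probs, true_class=2)
print(len(probs), round(sum(probs), 6), loss > 0)
```

In a real training setup the weights would be updated by an optimizer such as Adam to minimize this loss; here the forward pass only shows how the two feature vectors meet in a single classifier.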

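Experimental Results

Research Questions

RQ1: Can a multimodal architecture that combines low-level audio and text features improve SER accuracy on IEMOCAP?
RQ2: How does data augmentation affect minority classes and overall accuracy?
RQ3: How does EmoTech compare with existing unimodal and multimodal SER approaches on the same dataset?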

Key Results

Model	Features	Accuracy (%)
Yoon et al. (2018)	Speech+Text	71.80
Yenigalla et al. (2018)	Speech+Phoneme	73.90
Atmaja et al. (2019)	Speech+Text	75.40
EmoTech	Speech+Text	83.52
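
To put the comparison in perspective, the margins can be computed directly from the four accuracies listed in the table. The sketch below uses only those numbers; the helper name `margins_over_baselines` is ours and does not come from the paper.

```python
# Accuracy (%) exactly as listed in the comparison table of this review.
reported = {
    "Yoon et al. (2018)": 71.80,
    "Yenigalla et al. (2018)": 73.90,
    "Atmaja et al. (2019)": 75.40,
    "EmoTech": 83.52,
}


def margins_over_baselines(results, model="EmoTech"):
    """Return (absolute, relative) accuracy gains of `model` over every other entry."""
    ours = results[model]
    gains = {}
    for name, acc in results.items():
        if name == model:
            continue
        absolute = round(ours - acc, 2)                # percentage points
        relative = round(100 * (ours - acc) / acc, 2)  # percent of the baseline accuracy
        gains[name] = (absolute, relative)
    return gains


gains = margins_over_baselines(reported)
for name, (abs_gain, rel_gain) in gains.items():
    print(f"{name}: +{abs_gain:.2f} points ({rel_gain:.2f}% relative)")
```

With the table's values, the absolute gains are 11.72, 9.62, and 8.12 percentage points over the three earlier models, in the order listed.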

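Combining speech and text features yields higher accuracy than either modality alone, and augmentation improves performance further.
After augmentation, EmoTech reaches an overall accuracy of 83.52% with Speech+Text.
Class-wise metrics show high precision/recall for Anger (≈0.9728), Sad (≈0.9695), and Excited (≈0.9252).
Neutral is more challenging, with lower accuracy (≈0.8153).
EmoTech outperforms several existing models on IEMOCAP that use the same modality pairing (Speech+Text).
The proposed hybrid BiLSTM-CNN architecture effectively captures both temporal and local features in audio and text.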
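
For readers less familiar with class-wise metrics, the following sketch shows how per-class precision and recall, and overall accuracy, are derived from a confusion matrix. The three generic classes and their counts are toy values invented for illustration; they are not the paper's confusion matrix, and the function names are ours.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = number of samples with true class i predicted as class j."""
    metrics = {}
    n = len(labels)
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        metrics[label] = (precision, recall)
    return metrics


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Toy counts, NOT taken from the paper.
labels = ["class_a", "class_b", "class_c"]
toy_confusion = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 1, 9],
]

metrics = per_class_metrics(toy_confusion, labels)
accuracy = overall_accuracy(toy_confusion)
for label, (precision, recall) in metrics.items():
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
print(f"overall accuracy={accuracy:.4f}")
```

A class can have high recall but lower precision when other classes are often mistaken for it, which is one common reason a single class trails the rest in a report like the one above.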

This review was generated by AI and reviewed by a human editor.