QUICK REVIEW

[Paper Review] EmoTech: A Multi-modal Speech Emotion Recognition Using Multi-source Low-level Information with Hybrid Recurrent Network

Shamin Bin Habib Avro, Taieba Taher | arXiv.org | Jan 22, 2025

Emotion and Mood Recognition | Cited by 3

One-Sentence Summary

TL;DR: EmoTech presents a multimodal emotion recognition system that fuses audio (MFCC-based BiLSTM and Conv2D) and text (embeddings with BiLSTM and Conv1D) features, achieving about 84% accuracy on IEMOCAP for five emotions.

ABSTRACT

Emotion recognition is a critical task in human-computer interaction, enabling more intuitive and responsive systems. This study presents a multimodal emotion recognition system that combines low-level information from audio and text, leveraging both Convolutional Neural Networks (CNNs) and Bidirectional Long Short-Term Memory Networks (BiLSTMs). The proposed system consists of two parallel networks: an Audio Block and a Text Block. Mel Frequency Cepstral Coefficients (MFCCs) are extracted and processed by a BiLSTM network and a 2D convolutional network to capture low-level intrinsic and extrinsic features from speech. Simultaneously, a combined BiLSTM-CNN network extracts the low-level sequential nature of text from word embeddings corresponding to the available audio. This low-level information from speech and text is then concatenated and processed by several fully connected layers to classify the speech emotion. Experimental results demonstrate that the proposed EmoTech accurately recognizes emotions from combined audio and text inputs, achieving an overall accuracy of 84%. This solution outperforms previously proposed approaches for the same dataset and modalities.

Motivation and Objectives

Improve speech emotion recognition (SER) by leveraging complementary audio and text modalities.
Propose a two-branch architecture (Audio Block and Text Block) to extract low-level features.
Fuse audio and text features and classify emotions with a dense classifier.
Evaluate on IEMOCAP with data augmentation to address class imbalance.
Demonstrate that multimodal integration outperforms single-modality approaches.

Proposed Method

Use MFCCs from speech as input to a BiLSTM and a 2D CNN in the Audio Block.
Process text transcripts via embeddings feeding a BiLSTM and a Conv1D with global max pooling in the Text Block.
Concatenate the audio and text block outputs into a shared classifier with three dense layers and softmax output.
Train with 5-fold cross-validation on 5,633 augmented samples using the Adam optimizer and a categorical cross-entropy loss.
Apply data augmentation to balance classes and improve performance.
Total model parameters: 7,295,821.
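
The 5-fold protocol mentioned above can be sketched with the standard library alone. This is a minimal illustration, not the authors' code: the helper name `k_fold_indices`, the shuffling step, and the seed are assumptions. Only the sample count (5,633) and the number of folds (5) come from the review.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, val_idx) index lists for k-fold cross-validation.

    Hypothetical helper for illustration; shuffling with a fixed seed is an
    assumption, since the review does not say how folds were drawn.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)

    # Spread the remainder over the first `extra` folds so sizes differ by at most 1.
    base, extra = divmod(n_samples, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


# 5,633 augmented samples and 5 folds, as stated in the review.
folds = list(k_fold_indices(5633, k=5))
fold_sizes = [len(val_idx) for _, val_idx in folds]
print(fold_sizes)  # [1127, 1127, 1127, 1126, 1126]
```

With these numbers, each model is validated on 1,126 or 1,127 samples and trained on the remaining 4,506 or 4,507. Every sample is used for validation exactly once across the five folds.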

Experimental Results

Research Questions

RQ1: Can a multimodal architecture combining low-level audio and text features improve SER accuracy on IEMOCAP?
RQ2: What is the impact of data augmentation on minority classes and overall accuracy?
RQ3: How does EmoTech compare to existing single- and multi-modal SER approaches on the same dataset?

Key Findings

Model	Modalities	Accuracy (%)
Yoon et al. (2018)	Speech+Text	71.80
Yenigalla et al. (2018)	Speech+Phoneme	73.90
Atmaja et al. (2019)	Speech+Text	75.40
EmoTech	Speech+Text	83.52
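
To read the comparison as margins, the gap between EmoTech and each baseline can be computed directly from the table. The snippet below is a small sketch: the variable names are illustrative, the accuracies are the ones listed above, and `decimal.Decimal` is used only to avoid floating-point noise in the subtraction.

```python
from decimal import Decimal

# Accuracies (%) copied from the comparison table above.
baselines = {
    "Yoon et al. (2018)": Decimal("71.80"),
    "Yenigalla et al. (2018)": Decimal("73.90"),
    "Atmaja et al. (2019)": Decimal("75.40"),
}
emotech = Decimal("83.52")

# Absolute gain in percentage points over each baseline.
margins = {name: emotech - acc for name, acc in baselines.items()}

for name, gain in margins.items():
    print(f"{name}: +{gain} points")
```

The gains are 11.72, 9.62, and 8.12 percentage points respectively. Yenigalla et al. (2018) use Speech+Phoneme input, so only the other two rows share EmoTech's Speech+Text pairing.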

Combined speech and text features yield higher accuracy than single modalities, with augmentation further improving performance.
Overall accuracy for the EmoTech model on Speech+Text after augmentation is 83.52%.
Per-class metrics show high precision/recall for Anger (≈0.9728), Sad (≈0.9695), and Excited (≈0.9252).
Neutral is harder to recognize, with lower accuracy (≈0.8153).
EmoTech outperforms the prior Speech+Text models on IEMOCAP in the comparison above (Yoon et al., 2018; Atmaja et al., 2019).
The proposed hybrid BiLSTM-CNN architecture effectively captures temporal and local features in both audio and text.
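
The per-class figures above depend on how precision, recall, and accuracy are defined. As a hedged illustration, the sketch below derives them from a confusion matrix. The function names are hypothetical, and the 3x3 matrix is a made-up toy example, not data from the paper.

```python
def per_class_metrics(confusion, labels):
    """Compute precision and recall per class.

    confusion[i][j] = number of samples whose true class is i and whose
    predicted class is j.
    """
    metrics = {}
    for i, label in enumerate(labels):
        tp = confusion[i][i]
        fn = sum(confusion[i]) - tp                 # true class i, predicted otherwise
        fp = sum(row[i] for row in confusion) - tp  # predicted i, true class otherwise
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        metrics[label] = {"precision": precision, "recall": recall}
    return metrics


def overall_accuracy(confusion):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Toy counts invented for illustration only (NOT results from the paper).
toy_labels = ["anger", "sad", "neutral"]
toy_confusion = [
    [9, 0, 1],
    [0, 8, 2],
    [2, 1, 7],
]

toy_metrics = per_class_metrics(toy_confusion, toy_labels)
toy_accuracy = overall_accuracy(toy_confusion)
```

In the toy matrix, neutral has the lowest precision and recall (0.7 each) because it is most often confused with the other classes. This is the same kind of pattern the review reports for the Neutral class.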


This review was generated by AI and checked by a human editor.