QUICK REVIEW

[Paper Review] Deep Learning based Emotion Recognition System Using Speech Features and Transcriptions

Suraj Tripathi, Abhay Kumar | arXiv (Cornell University) | Jun 11, 2019

Emotion and Mood Recognition | References: 24 | Cited by: 56

One-sentence summary

The paper proposes a speech emotion recognition method that combines speech features (spectrogram, MFCC) with text transcriptions using various deep neural network architectures, with MFCC-Text CNN achieving the best accuracy on IEMOCAP data.

ABSTRACT

This paper proposes a speech emotion recognition method based on speech features and speech transcriptions (text). Speech features such as Spectrogram and Mel-frequency Cepstral Coefficients (MFCC) help retain emotion-related low-level characteristics in speech whereas text helps capture semantic meaning, both of which help in different aspects of emotion detection. We experimented with several Deep Neural Network (DNN) architectures, which take in different combinations of speech features and text as inputs. The proposed network architectures achieve higher accuracies when compared to state-of-the-art methods on a benchmark dataset. The combined MFCC-Text Convolutional Neural Network (CNN) model proved to be the most accurate in recognizing emotions in IEMOCAP data.

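Research motivation and objectives

Improve emotion recognition by drawing on both acoustic features and the semantic information in transcribed text.
Evaluate how different combinations of speech features and text inputs affect recognition accuracy.
Determine which network architecture makes the best use of multimodal inputs for speech emotion recognition.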

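Proposed method

Extract acoustic features, such as the spectrogram and MFCC, to retain low-level emotional cues.
Incorporate speech transcriptions to capture emotion-related semantic meaning.
Experiment with several DNN architectures that take different combinations of features as input.
Train and evaluate on the IEMOCAP benchmark dataset.
Compare the proposed models with state-of-the-art methods.
Identify the MFCC-Text CNN, which takes the combined inputs, as the most accurate model.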
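
The steps above can be sketched as a small two-branch network: one 1D convolutional branch over MFCC frames, one over word vectors, with the two pooled outputs concatenated before a softmax classifier. This is only an illustrative sketch, not the authors' architecture. The concatenation-based fusion, the four emotion labels, the 13 MFCC coefficients, the 8-dimensional word vectors, the filter counts, and all function names are assumptions made for the example, and the weights and inputs are random rather than trained or real.

```python
import math
import random

# Assumed label set, for illustration only; this summary does not list the classes.
EMOTIONS = ["angry", "happy", "neutral", "sad"]

def conv1d_relu(frames, kernel, bias=0.0):
    """Slide one filter over time. frames: T x D, kernel: K x D.
    Returns T - K + 1 ReLU activations."""
    k = len(kernel)
    out = []
    for t in range(len(frames) - k + 1):
        s = bias
        for i in range(k):
            s += sum(x * w for x, w in zip(frames[t + i], kernel[i]))
        out.append(max(0.0, s))
    return out

def branch_embedding(frames, kernels):
    """One CNN branch: each filter is followed by max-over-time pooling."""
    return [max(conv1d_relu(frames, kern)) for kern in kernels]

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_classify(mfcc_frames, token_vectors, params):
    """Feature-level fusion (assumed): concatenate both branch outputs."""
    audio_vec = branch_embedding(mfcc_frames, params["audio_kernels"])
    text_vec = branch_embedding(token_vectors, params["text_kernels"])
    fused = audio_vec + text_vec  # list concatenation
    logits = [sum(w * x for w, x in zip(row, fused)) + b
              for row, b in zip(params["W"], params["b"])]
    return softmax(logits)

def random_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

def make_params(rng, n_mfcc, emb_dim, n_filters, kernel_size, n_classes):
    """Random, untrained weights; a real model would learn these."""
    return {
        "audio_kernels": [random_matrix(rng, kernel_size, n_mfcc)
                          for _ in range(n_filters)],
        "text_kernels": [random_matrix(rng, kernel_size, emb_dim)
                         for _ in range(n_filters)],
        "W": random_matrix(rng, n_classes, 2 * n_filters),
        "b": [0.0] * n_classes,
    }

rng = random.Random(0)
params = make_params(rng, n_mfcc=13, emb_dim=8, n_filters=4,
                     kernel_size=3, n_classes=len(EMOTIONS))
mfcc_frames = random_matrix(rng, 20, 13)   # stand-in for 20 frames x 13 MFCCs
token_vectors = random_matrix(rng, 6, 8)   # stand-in for 6 tokens x 8-dim embeddings
probs = fuse_and_classify(mfcc_frames, token_vectors, params)
print(dict(zip(EMOTIONS, probs)))
```

In practice the MFCC frames would come from an audio feature extractor and the token vectors from a learned or pretrained embedding layer. Both are replaced by random values here so the sketch runs with the standard library alone.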

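Experimental results

Research questions

RQ1: Does combining speech features with transcribed text yield higher accuracy than using either modality alone?
RQ2: Which neural network architecture best fuses acoustic and textual information for emotion recognition?
RQ3: How do spectrogram and MFCC features interact with text inputs in CNN/DNN models on this task?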

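Key findings

In the authors' experiments, a CNN taking combined MFCC and text inputs achieved the highest accuracy on IEMOCAP.
Speech features help retain low-level emotional cues, while transcribed text captures semantic meaning, which improves discrimination between emotions.
The proposed networks outperformed state-of-the-art methods on the benchmark dataset.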

This review was generated by AI and reviewed by human editors.