QUICK REVIEW

[Paper Review] Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Xiaoyu Tang, Yixin Lin | arXiv (Cornell University) | Mar 7, 2024

Speech and Audio Processing | Cited by 7

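One-Sentence Summary

This paper proposes a CNN-Transformer with a multidimensional (time-channel-space) attention mechanism to model local and global speech information for SER, and reports improved results on IEMOCAP and Emo-DB.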

ABSTRACT

Speech Emotion Recognition (SER) is crucial in human-machine interactions. Mainstream approaches utilize Convolutional Neural Networks or Recurrent Neural Networks to learn local energy feature representations of speech segments from speech information, but struggle with capturing global information such as the duration of energy in speech. Some use Transformers to capture global information, but there is room for improvement in terms of parameter count and performance. Furthermore, existing attention mechanisms focus on spatial or channel dimensions, hindering learning of important temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and capture temporal, spatial and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time-frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on IEMOCAP and Emo-DB datasets and show our approach significantly improves the performance over the state-of-the-art methods.

Motivation and Objectives

Improve SER by capturing both local and global speech information, which conventional CNN or RNN approaches struggle to model together.
Develop a framework that fuses CNN-based local feature extraction with Transformer-based global modeling.
Introduce a temporal-channel-space attention mechanism (T-Sa) to enhance features across time, space, and channels.
Propose a lightweight convolution Transformer (LCT) block to efficiently model long-range dependencies while preserving local details.
Demonstrate effectiveness on benchmark SER datasets and provide open-source code for reproducibility.

Proposed Method

Use a CNN block to extract local time-frequency speech features via irregular convolutions (3x1 and 1x3) and pooling.
Introduce a Time-Channel-Space (T-Sa) attention module comprising a BiLSTM-based timing attention and a Shuffle-based space-channel attention to enrich multi-dimensional features.
Design an LCT (Lightweight Convolution Transformer) block that combines Large-Kernel Lightweight Convolutions (LLC), Coordinate Attention-enhanced Multi-Head Attention (CA-LMAM), and SE-IBFFN for local-global feature fusion.
Within LCT, LLC captures local information, CA-LMAM models long-range dependencies with the help of Coordinate Attention, and SE-IBFFN uses inverted residuals to strengthen the representation.
Extract MFCC features as input, split variable-length speech into 1.8 s segments, and average the segment-level predictions of each sentence for the final decision.
Train for 150 epochs on GPU with mixup (alpha = 0.2), cross-entropy loss, the Adam optimizer, and a decaying learning rate.
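
The mixup step in the training recipe above can be sketched as follows. This is a minimal illustration in plain Python, not the authors' implementation: the function names `mixup_pair` and `soft_cross_entropy` are hypothetical, features are treated as flat lists, and labels are assumed to be one-hot vectors. Only alpha = 0.2 comes from the summary.

```python
import math
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) examples.

    The mixing weight lam is drawn from Beta(alpha, alpha); alpha=0.2 is the
    value reported in the summary. Everything else here is an assumption.
    """
    lam = rng.betavariate(alpha, alpha)
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x_mix, y_mix, lam


def soft_cross_entropy(probs, soft_target):
    """Cross-entropy between predicted probabilities and a mixed (soft) label."""
    return -sum(t * math.log(p) for p, t in zip(probs, soft_target) if t > 0)


# Usage: mix an example of class 0 with an example of class 1.
rng = random.Random(0)
x_mix, y_mix, lam = mixup_pair([1.0, 0.0], [1.0, 0.0],
                               [0.0, 1.0], [0.0, 1.0], rng=rng)
loss = soft_cross_entropy([0.5, 0.5], y_mix)
```

With a small alpha such as 0.2, the Beta distribution concentrates near 0 and 1, so most mixed examples stay close to one of the two originals.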

Experimental Results

Research Questions

RQ1: Can CNN blocks combined with Transformer modules better capture both local and global speech features for SER?
RQ2: Does a temporal-channel-space attention mechanism improve emotion recognition by leveraging temporal dynamics and spatial-channel dependencies?
RQ3: Is a lightweight LCT block able to achieve competitive performance with fewer parameters than standard Transformer approaches?
RQ4: How does the proposed framework perform on IEMOCAP and Emo-DB compared to state-of-the-art methods?

Key Findings

The proposed framework improves SER performance on IEMOCAP and Emo-DB compared to state-of-the-art methods.
The T-Sa attention module enhances temporal, spatial, and channel information with a small parameter footprint.
The Lightweight Convolution Transformer (LCT) effectively captures local and global dependencies with reduced parameter count.
Irregular time-frequency CNN blocks effectively pre-learn local features before Transformer modules, aiding convergence on small SER datasets.
Experimental setup includes MFCC features, 1.8s segments with 1.6s overlap, mixup training, and standard optimization settings.
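
The segmentation and sentence-level averaging described above can be sketched as below. Only the 1.8 s window and 1.6 s overlap (a 0.2 s hop) come from the summary. The rest is assumed for illustration: the function names are hypothetical, the 16 kHz sample rate is a placeholder, splitting is done on raw samples rather than MFCC frames, short inputs are zero-padded, and a trailing remainder shorter than one hop is dropped.

```python
def split_into_segments(samples, sample_rate=16000, win_s=1.8, overlap_s=1.6):
    """Cut a waveform into fixed-length, overlapping segments."""
    win = round(win_s * sample_rate)                # samples per segment
    hop = round((win_s - overlap_s) * sample_rate)  # step between segment starts
    if len(samples) < win:
        # Assumption: zero-pad a short utterance to exactly one segment.
        return [list(samples) + [0.0] * (win - len(samples))]
    starts = range(0, len(samples) - win + 1, hop)
    return [list(samples[s:s + win]) for s in starts]


def average_predictions(segment_probs):
    """Average per-segment class probabilities and pick the top class."""
    n = len(segment_probs)
    num_classes = len(segment_probs[0])
    mean = [sum(p[c] for p in segment_probs) / n for c in range(num_classes)]
    return mean.index(max(mean)), mean


# Usage: a 2.0 s utterance at 16 kHz yields two 1.8 s segments.
utterance = [0.0] * 32000
segments = split_into_segments(utterance)
# Placeholder probabilities standing in for the model's per-segment outputs.
label, mean_probs = average_predictions([[0.6, 0.4], [0.2, 0.8]])
```

Averaging probabilities rather than taking a majority vote lets confident segments outweigh ambiguous ones; which of the two the paper uses beyond "averaging predictions" is not stated in this summary.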

This review was generated by AI and checked by a human editor.