QUICK REVIEW

[Paper Review] Speech Emotion Recognition Via CNN-Transformer and Multidimensional Attention Mechanism

Xiaoyu Tang, Yixin Lin|arXiv (Cornell University)|Mar 7, 2024

Speech and Audio Processing | Cited by 7

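One-Sentence Summary

This paper proposes a CNN-Transformer with a multidimensional time-channel-space attention mechanism that models both local and global information for SER, and reports improved performance on IEMOCAP and Emo-DB.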

ABSTRACT

Speech Emotion Recognition (SER) is crucial in human-machine interactions. Mainstream approaches utilize Convolutional Neural Networks or Recurrent Neural Networks to learn local energy feature representations of speech segments from speech information, but struggle with capturing global information such as the duration of energy in speech. Some use Transformers to capture global information, but there is room for improvement in terms of parameter count and performance. Furthermore, existing attention mechanisms focus on spatial or channel dimensions, hindering learning of important temporal information in speech. In this paper, to model local and global information at different levels of granularity in speech and capture temporal, spatial and channel dependencies in speech signals, we propose a Speech Emotion Recognition network based on CNN-Transformer and multi-dimensional attention mechanisms. Specifically, a stack of CNN blocks is dedicated to capturing local information in speech from a time-frequency perspective. In addition, a time-channel-space attention mechanism is used to enhance features across three dimensions. Moreover, we model local and global dependencies of feature sequences using large convolutional kernels with depthwise separable convolutions and lightweight Transformer modules. We evaluate the proposed method on IEMOCAP and Emo-DB datasets and show our approach significantly improves the performance over the state-of-the-art methods.

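Motivation and Objectives

Improve SER by capturing both local and global information, going beyond conventional CNN- and RNN-based approaches.
Develop a framework that integrates CNN-based local feature extraction with Transformer-based global modeling.
Introduce a time-channel-space attention mechanism (T-Sa) that enhances features across the temporal, spatial, and channel dimensions.
Propose a Lightweight Convolution Transformer (LCT) block that models long-range dependencies efficiently while preserving local detail.
Demonstrate effectiveness on benchmark SER datasets and release open-source code for reproducibility.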

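Proposed Method

Extract local time-frequency speech features with CNN blocks that use asymmetric convolutions (3x1 and 1x3) and pooling.
Introduce the Time-Channel-Space (T-Sa) attention module, which combines a BiLSTM-based timing attention mechanism with a Shuffle-based spatial-channel attention mechanism to enrich multidimensional features.
Design the LCT (Lightweight Convolution Transformer) block, which combines Large-Kernel Lightweight Convolutions (LLC), Multi-Head Attention enhanced with Coordinate Attention (CA-LMAM), and SE-IBFFN to fuse local and global features.
Within the LCT block, LLC handles local information, CA-LMAM uses Coordinate Attention to capture long-range dependencies, and SE-IBFFN, which has an inverted residual structure, strengthens the representation.
Preprocess the input into MFCC features, split variable-length audio into 1.8 s segments, and average the segment-level predictions over each utterance to obtain the final decision.
Train on a GPU with mixup (alpha=0.2), cross-entropy loss, the Adam optimizer, and a decaying learning rate for 150 epochs.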
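The mixup step in the training recipe above can be sketched as follows. This is a minimal illustration of the commonly used mixup formulation (a blend weight drawn from Beta(alpha, alpha), applied to both features and one-hot labels), not the authors' code; the function names `mixup_pair` and `soft_cross_entropy` and the toy inputs are hypothetical, and the paper's implementation may differ in detail.

```python
import math
import random


def mixup_pair(x_a, y_a, x_b, y_b, alpha=0.2, rng=random):
    """Blend two (features, one-hot label) examples.

    lam is drawn from Beta(alpha, alpha); alpha=0.2 is the value quoted
    in the review. Everything else here is an illustrative assumption.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


def soft_cross_entropy(probs, soft_target, eps=1e-12):
    """Cross-entropy between predicted probabilities and a soft (mixed) label."""
    return -sum(t * math.log(max(p, eps)) for p, t in zip(probs, soft_target))


# Toy usage: two 2-dimensional feature vectors with 4-class one-hot labels.
rng = random.Random(0)
x_mix, y_mix, lam = mixup_pair(
    [1.0, 0.0], [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0], [0.0, 1.0, 0.0, 0.0],
    alpha=0.2, rng=rng,
)
loss = soft_cross_entropy([0.25, 0.25, 0.25, 0.25], y_mix)
```

With a small alpha such as 0.2, most sampled weights fall close to 0 or 1, so the mixed example usually stays near one of the two originals.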

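Experimental Results

Research Questions

RQ1: Does combining CNN blocks with Transformer modules capture both local and global features in SER more effectively?
RQ2: Does the time-channel-space attention mechanism improve emotion recognition by exploiting temporal dynamics and spatial-channel dependencies?
RQ3: Can the lightweight LCT block achieve competitive performance with fewer parameters than standard Transformer approaches?
RQ4: How does the proposed framework compare with state-of-the-art methods on IEMOCAP and Emo-DB?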
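The third research question turns on parameter count. As a rough illustration of why the depthwise separable convolutions mentioned in the abstract are cheaper than standard convolutions at large kernel sizes, the sketch below compares weight counts for a single layer. The channel width (64) and kernel size (7x7) are hypothetical values chosen for the example, not figures from the paper, and the helper names are made up for this sketch.

```python
def conv2d_params(c_in, c_out, k_h, k_w, bias=False):
    """Weights in a standard 2D convolution: every output channel sees every input channel."""
    return c_in * c_out * k_h * k_w + (c_out if bias else 0)


def depthwise_separable_params(c_in, c_out, k_h, k_w, bias=False):
    """Weights in a depthwise (per-channel k_h x k_w) plus pointwise (1x1) pair."""
    depthwise = c_in * k_h * k_w + (c_in if bias else 0)
    pointwise = c_in * c_out + (c_out if bias else 0)
    return depthwise + pointwise


# Hypothetical layer: 64 -> 64 channels with a 7x7 kernel, no bias.
standard = conv2d_params(64, 64, 7, 7)                 # 200704
separable = depthwise_separable_params(64, 64, 7, 7)   # 7232
print(standard, separable, round(standard / separable, 2))
```

For these assumed sizes the separable form needs under 4% of the weights of the standard layer, and the gap widens as the kernel grows, which is why large kernels are usually paired with depthwise convolutions.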

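Key Findings

The proposed framework improves SER performance over state-of-the-art methods on IEMOCAP and Emo-DB.
The Time-Shuffle Attention (T-Sa) module enhances temporal, spatial, and channel information with few parameters.
The Lightweight Convolution Transformer (LCT) captures local and global dependencies effectively while reducing the parameter count.
The asymmetric time-frequency CNN blocks learn local features ahead of the Transformer modules, which helps convergence on small SER datasets.
The experimental setup uses MFCC features, 1.8 s segments with 1.6 s overlap, mixup training, and standard optimization settings.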
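The segmentation and utterance-level averaging described in the setup can be sketched as below. The 1.8 s segment length and 1.6 s overlap come from the review; the 16 kHz sample rate, the handling of utterances shorter than one segment, the dropping of leftover samples, and all function names are assumptions made for illustration only.

```python
def segment_bounds(num_samples, sample_rate, seg_sec=1.8, overlap_sec=1.6):
    """Return (start, end) sample indices of overlapping fixed-length segments."""
    seg_len = round(seg_sec * sample_rate)
    hop = round((seg_sec - overlap_sec) * sample_rate)  # 0.2 s step between segments
    if hop <= 0:
        raise ValueError("overlap must be shorter than the segment length")
    if num_samples <= seg_len:
        # Assumption: a short utterance yields one segment that the caller pads.
        return [(0, seg_len)]
    last_start = num_samples - seg_len
    # Assumption: samples left over after the last full hop are dropped.
    return [(s, s + seg_len) for s in range(0, last_start + 1, hop)]


def average_predictions(segment_probs):
    """Average per-segment class probabilities into one utterance-level vector."""
    n = len(segment_probs)
    num_classes = len(segment_probs[0])
    return [sum(p[c] for p in segment_probs) / n for c in range(num_classes)]


def predict_utterance(segment_probs):
    """Pick the class with the highest averaged probability."""
    mean = average_predictions(segment_probs)
    return max(range(len(mean)), key=mean.__getitem__)


# A 3 s utterance at an assumed 16 kHz sample rate.
bounds = segment_bounds(num_samples=48000, sample_rate=16000)
# Made-up per-segment probabilities for a 2-class toy problem.
label = predict_utterance([[0.6, 0.4], [0.2, 0.8], [0.3, 0.7]])
```

Because the step between segments is only 0.2 s, even a short utterance produces several heavily overlapping segments, which gives the averaging step multiple votes per utterance.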


This review was generated by AI and checked by a human editor.