QUICK REVIEW

[Paper Review] How Attention Shapes Emotion: A Comparative Study of Attention Mechanisms for Speech Emotion Recognition

Marc Casals-Salvador, Federico Costa|arXiv (Cornell University)|Mar 16, 2026

Emotion and Mood Recognition | Cited by: 0

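One-line summary

This paper benchmarks efficient attention mechanisms (RetNet, LightNet, GSA, FoX, and KDA) against standard self-attention for Speech Emotion Recognition (SER) on MSP-Podcast and evaluates the trade-off between accuracy and efficiency.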

ABSTRACT

Speech Emotion Recognition (SER) plays a key role in advancing human-computer interaction. Attention mechanisms have become the dominant approach for modeling emotional speech due to their ability to capture long-range dependencies and emphasize salient information. However, standard self-attention suffers from quadratic computational and memory complexity, limiting its scalability. In this work, we present a systematic benchmark of optimized attention mechanisms for SER, including RetNet, LightNet, GSA, FoX, and KDA. Experiments on both MSP-Podcast benchmark versions show that while standard self-attention achieves the strongest recognition performance across test sets, efficient attention variants dramatically improve scalability, reducing inference latency and memory usage by up to an order of magnitude. These results highlight a critical trade-off between accuracy and efficiency, providing practical insights for designing scalable SER systems.

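Motivation and objectives

Evaluate how different seq2seq attention mechanisms affect SER performance and scalability.
Benchmark state-of-the-art efficient attention variants against standard self-attention under a unified setup.
Analyze memory, training/inference time, and robustness across MSP-Podcast versions.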

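Proposed method

Process speech and text with frozen feature extractors and fuse them in a seq2seq module that uses different attention mechanisms.
Compare Softmax Attention (SA) with RetNet, LightNet, GSA, FoX, and KDA under the same architecture and dataset.
Measure Macro F-score on Dev, Test1 (T1), and Test2 (T2) across multiple SSL backbones.
Quantify efficiency by evaluating inference latency and peak GPU memory.
Freeze the feature extractors and train only the seq2seq module, the attention pooling, and the classifier.
Use 20 epochs, AdamW optimization, and a 1x batch size to evaluate the efficiency results.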
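The Macro F-score used as the accuracy metric can be illustrated with a minimal pure-Python sketch. The function name `macro_f1` and the emotion labels are hypothetical and chosen only for illustration; this is not the paper's evaluation code. The sketch shows why a macro average suits an imbalanced corpus: every class counts equally, so ignoring a minority emotion is penalized even when accuracy looks high.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gets a false positive
            fn[t] += 1  # true class gets a false negative
    scores = []
    for c in labels:
        denom = 2 * tp[c] + fp[c] + fn[c]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class is never hit
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical imbalanced example: the classifier always answers "neutral".
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy = {accuracy:.3f}")                  # accuracy = 0.800
print(f"macro F1 = {macro_f1(y_true, y_pred):.3f}")  # macro F1 = 0.444
```

Here the majority class scores F1 = 16/18 and the ignored class scores 0, so the macro average drops to 4/9 although 80% of the predictions are correct.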

Figure 1: System's architecture. Experiments are made considering different attention mechanisms for the seq2seq module.
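The attention pooling stage named in the method list collapses a variable-length sequence into a single utterance-level vector before classification. The review does not give the exact parametrization, so the following is a sketch of one common form (a single learned scoring vector followed by a softmax over time); `attention_pool`, `frames`, and `w` are hypothetical names, and the numbers are made up.

```python
import math


def attention_pool(frames, w):
    """Pool a T x d sequence into one d-dim vector.

    frames: list of T feature vectors (each a list of d floats)
    w:      scoring vector of length d (learned in a real model; fixed here)
    Returns (pooled_vector, attention_weights).
    """
    # One scalar relevance score per frame.
    scores = [sum(wi * hi for wi, hi in zip(w, h)) for h in frames]
    # Numerically stable softmax over the time axis.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    # Weighted sum of frames.
    dim = len(frames[0])
    pooled = [sum(a * h[j] for a, h in zip(alphas, frames)) for j in range(dim)]
    return pooled, alphas


# Hypothetical 3-frame, 2-dimensional sequence.
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
pooled, alphas = attention_pool(frames, w=[0.5, 0.5])
print(alphas)  # the third frame has the largest score, hence the largest weight
print(pooled)
```

With an all-zero scoring vector the weights become uniform and the operation reduces to mean pooling, which is a quick sanity check for an implementation of this kind.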

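Experimental results

Research questions

RQ1: Which attention mechanism offers the best trade-off between SER accuracy and computational efficiency on MSP-Podcast?
RQ2: As sequence length increases, how do the efficient attention variants compare with SA in inference latency and memory usage?
RQ3: How robust is each mechanism across the Test1 and Test2 conditions of MSP-Podcast?
RQ4: How does the choice of SSL backbone affect the relative performance of each attention mechanism?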

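Key findings

SA generalizes best overall on the evaluation splits (Test1 and Test2), whereas the efficient variants offer better scalability.
LightNet has the highest average score on the development set overall, with the strongest single result, 38.11%, obtained using Wav2Vec2XLSR.
Inference time for the efficient mechanisms scales linearly with sequence length, whereas SA scales quadratically (0.55 ms at 10 s, 48.59 ms at 400 s).
KDA is the fastest on long sequences (5.96 ms at 400 s), and FoX is the most memory-efficient (0.328 GB at 400 s).
All methods lose performance from Test1 to Test2, indicating a robustness gap under more realistic, imbalanced conditions.
Efficient architectures can approach SA's accuracy while substantially reducing latency and memory usage.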
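The linear-versus-quadratic contrast can be made concrete with a simplified, single-head version of the retention operator from RetNet. Dropping the softmax lets the same output be computed either by comparing every pair of positions (quadratic in sequence length) or by carrying a fixed-size state forward one step at a time (linear). The sketch below is an illustration only: it is causal, omits RetNet's multi-scale heads, positional rotation, and normalization, and is not the implementation benchmarked in the paper. All names are hypothetical.

```python
import random


def retention_parallel(q, k, v, gamma):
    """Quadratic form: o_n = sum over m <= n of gamma^(n-m) * (q_n . k_m) * v_m."""
    d_v = len(v[0])
    out = []
    for n in range(len(q)):
        o = [0.0] * d_v
        for m in range(n + 1):  # every earlier position is revisited
            score = gamma ** (n - m) * sum(a * b for a, b in zip(q[n], k[m]))
            for j in range(d_v):
                o[j] += score * v[m][j]
        out.append(o)
    return out


def retention_recurrent(q, k, v, gamma):
    """Linear form: S_n = gamma * S_(n-1) + k_n^T v_n, then o_n = q_n S_n."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # size is independent of sequence length
    out = []
    for q_n, k_n, v_n in zip(q, k, v):
        for i in range(d_k):
            for j in range(d_v):
                state[i][j] = gamma * state[i][j] + k_n[i] * v_n[j]
        out.append([sum(q_n[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return out


# Both forms should agree up to floating-point error on random inputs.
random.seed(0)
steps, d_k, d_v = 6, 3, 2
q = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(steps)]
k = [[random.gauss(0, 1) for _ in range(d_k)] for _ in range(steps)]
v = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(steps)]

par = retention_parallel(q, k, v, gamma=0.9)
rec = retention_recurrent(q, k, v, gamma=0.9)
max_diff = max(abs(a - b) for ra, rb in zip(par, rec) for a, b in zip(ra, rb))
print(f"max abs difference: {max_diff:.2e}")
```

The recurrent form does a constant amount of work per step and keeps only a d_k x d_v state, which is the property behind the flat memory and linear latency curves reported for the efficient variants.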

Figure 2: Inference time and peak GPU memory usage of the seq2seq module as a function of sequence length on the MSP-Podcast dev set [8003425]. Panels (a–b) report results for all models. Panels (c–d) provide a zoomed view excluding SA to make the relative growth trends of the remaining alternatives…
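As a quick worked check, the latency figures quoted in the key findings can be turned into ratios. The sketch assumes that both numbers in the scaling finding (0.55 ms at 10 s and 48.59 ms at 400 s) refer to SA, which is how the sentence reads; the variable names are illustrative.

```python
# Latency values quoted in the key findings (milliseconds).
# Assumption: both SA values come from the parenthetical in the scaling finding.
sa_ms_10s = 0.55
sa_ms_400s = 48.59
kda_ms_400s = 5.96

length_ratio = 400 / 10                # the input is 40x longer
sa_growth = sa_ms_400s / sa_ms_10s     # how much slower SA becomes
kda_speedup = sa_ms_400s / kda_ms_400s # KDA relative to SA at 400 s

print(f"input length ratio:        {length_ratio:.0f}x")  # 40x
print(f"SA latency growth:         {sa_growth:.1f}x")     # 88.3x
print(f"KDA speedup over SA @400s: {kda_speedup:.1f}x")   # 8.2x
```

The roughly 8x speedup at 400 s is consistent with the abstract's statement that latency and memory improve by up to an order of magnitude.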


This review was generated by AI and checked by a human editor.