QUICK REVIEW

[Paper Review] How Attention Shapes Emotion: A Comparative Study of Attention Mechanisms for Speech Emotion Recognition

Marc Casals-Salvador, Federico Costa|arXiv (Cornell University)|Mar 16, 2026

Emotion and Mood Recognition | Cited by 0

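One-sentence summary

This paper benchmarks efficient attention mechanisms (RetNet, LightNet, GSA, FoX, KDA) against standard self-attention for speech emotion recognition (SER) on MSP-Podcast, evaluating the trade-off between accuracy and efficiency.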

ABSTRACT

Speech Emotion Recognition (SER) plays a key role in advancing human-computer interaction. Attention mechanisms have become the dominant approach for modeling emotional speech due to their ability to capture long-range dependencies and emphasize salient information. However, standard self-attention suffers from quadratic computational and memory complexity, limiting its scalability. In this work, we present a systematic benchmark of optimized attention mechanisms for SER, including RetNet, LightNet, GSA, FoX, and KDA. Experiments on both MSP-Podcast benchmark versions show that while standard self-attention achieves the strongest recognition performance across test sets, efficient attention variants dramatically improve scalability, reducing inference latency and memory usage by up to an order of magnitude. These results highlight a critical trade-off between accuracy and efficiency, providing practical insights for designing scalable SER systems.
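
The contrast between quadratic and linear cost can be illustrated with a simplified decayed linear attention, the recurrence at the core of retention-style mechanisms. This is only a sketch under stated assumptions: it is causal, it omits normalization, gating, and positional rotation, and it is not the implementation of RetNet or of any other model benchmarked in the paper. The function names and the decay value `gamma` are illustrative.

```python
import random


def decayed_attention_parallel(q, k, v, gamma):
    """Quadratic form: o_t = sum over s <= t of gamma**(t - s) * (q_t . k_s) * v_s."""
    dv = len(v[0])
    out = []
    for t in range(len(q)):
        o = [0.0] * dv
        for s in range(t + 1):  # every earlier frame is revisited: O(T^2) overall
            w = (gamma ** (t - s)) * sum(a * b for a, b in zip(q[t], k[s]))
            for j in range(dv):
                o[j] += w * v[s][j]
        out.append(o)
    return out


def decayed_attention_recurrent(q, k, v, gamma):
    """Linear form: S_t = gamma * S_{t-1} + k_t^T v_t, then o_t = q_t S_t."""
    dk, dv = len(k[0]), len(v[0])
    state = [[0.0] * dv for _ in range(dk)]  # fixed-size state, independent of T
    out = []
    for q_t, k_t, v_t in zip(q, k, v):
        for i in range(dk):
            for j in range(dv):
                state[i][j] = gamma * state[i][j] + k_t[i] * v_t[j]
        out.append([sum(q_t[i] * state[i][j] for i in range(dk)) for j in range(dv)])
    return out


# Toy check on random inputs (sizes are arbitrary illustration values).
random.seed(0)
T, DK, DV = 6, 3, 2
q = [[random.uniform(-1, 1) for _ in range(DK)] for _ in range(T)]
k = [[random.uniform(-1, 1) for _ in range(DK)] for _ in range(T)]
v = [[random.uniform(-1, 1) for _ in range(DV)] for _ in range(T)]

parallel = decayed_attention_parallel(q, k, v, 0.9)
recurrent = decayed_attention_recurrent(q, k, v, 0.9)
max_diff = max(abs(x - y) for ra, rb in zip(parallel, recurrent) for x, y in zip(ra, rb))
print("forms agree:", max_diff < 1e-9)
```

Both functions compute the same outputs, but the recurrent form does a constant amount of work per frame and stores only a `DK x DV` state, which is why time and memory grow linearly with sequence length in mechanisms of this family.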

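Motivation and objectives

Evaluate how different seq2seq attention mechanisms affect SER performance and scalability.
Compare state-of-the-art efficient attention variants with standard self-attention under a unified setup.
Analyze memory use, training and inference time, and robustness across MSP-Podcast versions.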

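Proposed method

Process speech and text with fixed feature extractors and fuse them through a seq2seq module equipped with different attention mechanisms.
Compare Softmax Attention (SA) with RetNet, LightNet, GSA, FoX, and KDA under the same architecture and datasets.
Measure macro F-score on Dev, Test1 (T1), and Test2 (T2) with several SSL backbones.
Evaluate inference latency and peak GPU memory to quantify efficiency.
Freeze the feature extractors; train only the seq2seq module, attention pooling, and classifier.
Train for 20 epochs with AdamW; efficiency results are measured with a batch size of 1.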
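
The macro F-score used above can be sketched as follows. This is a minimal illustration, assuming the standard unweighted mean of per-class F1; the paper's exact evaluation script and label set are not given here, and the function name and toy labels are hypothetical.

```python
def macro_f1(y_true, y_pred, labels=None):
    """Unweighted mean of per-class F1, so rare classes count as much as common ones."""
    if labels is None:
        labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy imbalanced example: a classifier that always predicts the majority class.
y_true = ["neutral"] * 8 + ["angry"] * 2
y_pred = ["neutral"] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case accuracy is 0.8 while macro F1 is about 0.44, because the ignored minority class contributes an F1 of zero. That sensitivity to class imbalance is the usual reason to report macro-averaged scores on emotion data.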

Figure 1: System architecture. Experiments vary the attention mechanism used in the seq2seq module.

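Experimental results

Research questions

RQ1: Which attention mechanism offers the best trade-off between SER accuracy and computational efficiency on MSP-Podcast?
RQ2: How do efficient attention variants compare with SA in inference latency and memory usage as sequence length increases?
RQ3: How robust is each mechanism under the MSP-Podcast Test1 and Test2 conditions?
RQ4: How does the choice of SSL backbone affect the relative performance of each attention mechanism?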

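Key findings

Softmax Attention (SA) generalizes best overall on the evaluation splits (Test1 and Test2), although the efficient variants scale better.
LightNet reaches the highest mean score on the development set (Dev) and the single best result, 38.11% (with Wav2Vec2XLSR).
Inference time and memory of the efficient mechanisms scale linearly with sequence length, whereas SA grows superlinearly, in line with its quadratic complexity (0.55 ms at 10 s versus 48.59 ms at 400 s, roughly an 88-fold increase for a 40-fold longer input).
At longer sequences, KDA is the fastest of the efficient mechanisms (5.96 ms at 400 s) and FoX is the most memory-efficient (0.328 GB at 400 s).
All methods lose performance from Test1 to Test2, indicating a robustness gap under more realistic and imbalanced conditions.
Efficient architectures come close to SA in accuracy while delivering substantial gains in latency and GPU memory.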
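
The latency figures quoted in the findings can be turned into a rough growth estimate. The sketch below uses only the numbers reported above; the power-law fit through two points is an assumption made for illustration, not an analysis from the paper, and the helper name is hypothetical.

```python
import math


def scaling_exponent(len_a, time_a, len_b, time_b):
    """Exponent p in time ~ length**p implied by two (length, time) measurements."""
    return math.log(time_b / time_a) / math.log(len_b / len_a)


# Reported seq2seq inference times (ms) at 10 s and 400 s of audio.
SA_10S, SA_400S = 0.55, 48.59
KDA_400S = 5.96

sa_ratio = SA_400S / SA_10S                  # growth of SA latency over a 40x longer input
sa_exp = scaling_exponent(10, SA_10S, 400, SA_400S)
kda_speedup = SA_400S / KDA_400S             # KDA versus SA at 400 s

print(f"SA latency ratio: {sa_ratio:.1f}x")
print(f"SA two-point exponent: {sa_exp:.2f}")
print(f"KDA speedup at 400 s: {kda_speedup:.1f}x")
```

An exponent above 1 indicates faster-than-linear growth for SA over this range, and the speedup at 400 s is consistent with the abstract's "up to an order of magnitude". Because this is a two-point estimate, it says nothing about the shape of the curve between the endpoints.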

Figure 2: Inference time and peak GPU memory usage of the seq2seq module as a function of sequence length on the MSP-Podcast dev set [8003425]. Panels (a–b) report results for all models. Panels (c–d) provide a zoomed view excluding SA to show the relative growth trends of the remaining alternatives.

This review was generated by AI and reviewed by a human editor.