QUICK REVIEW

[Paper Digest] StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

Haishu Zhao, Aokai Hao|arXiv (Cornell University)|Mar 8, 2026

Emotion and Mood Recognition | Cited by 0

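One-Sentence Summary

StyleBench is a multi-turn benchmark that quantifies how well speech language models (SLMs) control speaking style (emotion, speed, volume, and pitch) across dialogue turns. It reveals performance gaps between SLMs and omni language models (OLMs) and highlights the role of training data and speech tokenizers.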

ABSTRACT

Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantify and evaluate the style intensity control ability in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating the style intensity control ability across four dimensions: emotion, speed, volume, and pitch. Our results reveal the performance gaps between leading SLMs and omni language models (OLMs), suggesting the underlying reasons and promising approaches for future exploration.

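Research Motivation and Objectives

Motivation: There is a need to quantify how closely SLMs follow style prompts in multi-turn dialogue.
Objective: Build StyleBench to evaluate style intensity control along four dimensions in multi-turn dialogue.
Objective: Analyze factors that affect style control performance, such as training data and speech tokenizers.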

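Proposed Method

Build a bilingual three-turn dialogue dataset in which style intensity varies across turns along a single dimension.
Synthesize all utterances with CosyVoice2; the emotion dimension uses RAVDESS as reference audio, and the other dimensions are processed with FFmpeg.
Define metrics that combine automatic style measurements with human evaluation, quantifying validity and style variation across turns (VSP and SVD).
Evaluate 10 open-source SLMs on single-turn and multi-turn instruction following and speaking style control.
Analyze factors such as training data composition and speech tokenizer design to explain the performance gaps.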
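
As a rough illustration of the FFmpeg step above, the sketch below builds one FFmpeg command per dialogue turn so that a single dimension (speed, volume, or pitch) changes from turn to turn. The digest does not say which filters or scaling factors the authors used, so everything here is an assumption: the choice of the `atempo`, `volume`, `asetrate`, and `aresample` filters, the 24 kHz sample rate, the example factors, and the file names and helper functions `build_filter`, `build_command`, and `plan_dialogue` are all hypothetical.

```python
from typing import List


def build_filter(dimension: str, factor: float, sample_rate: int = 24000) -> str:
    """Return an FFmpeg audio-filter string that scales one style dimension.

    Assumption: the filters chosen here are one plausible way to do this,
    not the pipeline described in the paper.
    """
    if dimension == "volume":
        if factor <= 0:
            raise ValueError("volume factor must be positive")
        # 'volume' multiplies the amplitude by the given factor.
        return f"volume={factor:g}"

    # Keep tempo factors inside 0.5..2.0, a range every atempo version accepts.
    if not 0.5 <= factor <= 2.0:
        raise ValueError("factor must be between 0.5 and 2.0")

    if dimension == "speed":
        # 'atempo' changes the speaking rate while leaving pitch unchanged.
        return f"atempo={factor:g}"

    if dimension == "pitch":
        # Resampling trick: relabel the sample rate to shift pitch (which also
        # changes duration), resample back, then undo the tempo change.
        shifted_rate = round(sample_rate * factor)
        return (
            f"asetrate={shifted_rate},"
            f"aresample={sample_rate},"
            f"atempo={1 / factor:.4f}"
        )

    raise ValueError(f"unsupported dimension: {dimension}")


def build_command(src: str, dst: str, dimension: str, factor: float) -> List[str]:
    """Assemble the FFmpeg argument list for one utterance (not executed here)."""
    return ["ffmpeg", "-y", "-i", src, "-filter:a", build_filter(dimension, factor), dst]


def plan_dialogue(dimension: str, factors: List[float]) -> List[List[str]]:
    """One command per turn; only the chosen dimension varies across turns."""
    return [
        build_command(f"turn{i}.wav", f"turn{i}_{dimension}.wav", dimension, factor)
        for i, factor in enumerate(factors, start=1)
    ]
```

Calling `plan_dialogue("speed", [0.8, 1.0, 1.25])` would return three argument lists that could be passed to `subprocess.run`; the functions only construct the commands, so nothing is executed until the caller chooses to run them.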

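Experimental Results

Research Questions

RQ1: How well do current SLMs follow style prompts in single-turn and multi-turn dialogue?
RQ2: Can models maintain consistent intent and style control across dialogue turns?
RQ3: What are the models' relative strengths and weaknesses in controlling emotion, speed, volume, and pitch?
RQ4: How do training data and speech tokenizers affect speaking style control performance?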

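Key Findings

Most models score high on single-turn instruction relevance but differ in multi-turn coherence (MRD).
Only some models (e.g., Qwen2.5-omni, GLM-4-Voice, and Kimi-Audio) exceed 60% on MRD, indicating strong multi-turn consistency.
Kimi-Audio and GLM-4-Voice show the strongest style control for speed, volume, and pitch, with high VSP and SVD.
LLaMA-omni2 and Baichuan-omni-1.5 respond only weakly to prompts for emotion adjustment.
The performance gaps are associated with training data composition and with the use of tokenizers that preserve vocal cues in speech.
StyleBench offers insight into how tokenizer design (e.g., the GLM-4-Voice tokenizer) affects the preservation of acoustic variation.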

This digest was generated by AI and reviewed by human editors.