QUICK REVIEW

[Paper Review] StyleBench: Evaluating Speech Language Models on Conversational Speaking Style Control

Haishu Zhao, Aokai Hao|arXiv (Cornell University)|Mar 8, 2026

Emotion and Mood Recognition | Citations: 0

ONE-LINE SUMMARY

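StyleBench is a multi-turn benchmark that quantifies how well a model controls speaking style (emotion, speed, volume, and pitch) across dialogue turns. It highlights the gap between SLMs and omni language models, along with the training-data and tokenizer factors behind that gap.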

ABSTRACT

Speech language models (SLMs) have significantly extended the interactive capability of text-based Large Language Models (LLMs) by incorporating paralinguistic information. For a more realistic interactive experience with customized styles, current SLMs have managed to interpret and control speaking style intensity from user prompts during the dialogue process. However, there remains a lack of systematic benchmarks that quantify and evaluate the style intensity control ability in conversations. In this paper, we propose StyleBench, a multi-turn dialogue benchmark for comprehensively evaluating the style intensity control ability across four dimensions: emotion, speed, volume, and pitch. Our results reveal the performance gaps between leading SLMs and omni language models (OLMs), suggesting the underlying reasons and promising approaches for future exploration.

Motivation and Objectives

Motivation: there's a need to quantify how well SLMs follow stylistic prompts in multi-turn conversations.
Objective: construct StyleBench to evaluate style intensity control across four dimensions in multi-turn dialogues.
Objective: analyze factors like training data and speech tokenizers that affect style control performance.

Proposed Method

Create a bilingual three-turn dialogue dataset where style intensity varies across turns for a single dimension at a time.
Synthesize all utterances with CosyVoice2. For emotion, use RAVDESS recordings as reference audio; for speed, volume, and pitch, apply FFmpeg processing.
Define metrics combining automatic style measures and human evaluation; quantify validity and style variation across turns (VSP and SVD).
Evaluate 10 open-source SLMs on single-turn and multi-turn instruction-following and speaking style control.
Analyze factors such as training data composition and speech tokenizer design to explain performance gaps.
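
The FFmpeg step for speed, volume, and pitch can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' pipeline: the scaling factors in LEVELS, the Turn structure, the file names, and the 16 kHz sample rate are all hypothetical, and the review does not say which FFmpeg filters the paper used. The filters shown (atempo, volume, asetrate, aresample) are standard FFmpeg audio filters.

```python
from dataclasses import dataclass

# Hypothetical scaling factors for three intensity levels.
# Illustrative assumptions only; the paper's actual values are not given here.
LEVELS = {"low": 0.8, "neutral": 1.0, "high": 1.25}


def ffmpeg_filter(dimension: str, factor: float, sample_rate: int = 16000) -> str:
    """Return an FFmpeg audio filter string that scales one style dimension."""
    if dimension == "speed":
        # atempo changes tempo while leaving pitch unchanged.
        return f"atempo={factor:g}"
    if dimension == "volume":
        # volume applies a linear gain to the signal.
        return f"volume={factor:g}"
    if dimension == "pitch":
        # asetrate shifts pitch and tempo together, aresample restores the
        # sample rate, and atempo undoes the tempo change.
        shifted = round(sample_rate * factor)
        return f"asetrate={shifted},aresample={sample_rate},atempo={1 / factor:g}"
    # Emotion is handled through reference audio, not through FFmpeg.
    raise ValueError(f"no FFmpeg recipe for dimension {dimension!r}")


@dataclass
class Turn:
    wav: str    # path to the synthesized utterance for this turn (hypothetical)
    level: str  # key into LEVELS


def build_commands(dimension: str, turns: list[Turn]) -> list[list[str]]:
    """Build one ffmpeg argument list per turn, varying a single dimension."""
    commands = []
    for index, turn in enumerate(turns, start=1):
        audio_filter = ffmpeg_filter(dimension, LEVELS[turn.level])
        output = f"turn{index}_{dimension}.wav"
        commands.append(["ffmpeg", "-y", "-i", turn.wav, "-filter:a", audio_filter, output])
    return commands


# A three-turn dialogue whose speed rises from low to high.
dialogue = [Turn("t1.wav", "low"), Turn("t2.wav", "neutral"), Turn("t3.wav", "high")]
commands = build_commands("speed", dialogue)
```

Each argument list could be passed to subprocess.run to render the audio. Keeping the filter construction separate from execution makes it easy to inspect the settings before anything is processed.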

Experimental Results

Research Questions

RQ1: How well do current SLMs follow stylistic prompts in single-turn versus multi-turn dialogues?
RQ2: Can models maintain consistent intent and style control across dialogue turns?
RQ3: What are the relative strengths and weaknesses of models in controlling emotion, speed, volume, and pitch?
RQ4: How do training data and speech tokenizers influence speaking style control performance?

Key Findings

Most models show high single-turn instruction relevance but varying multi-turn coherence (MRD).
Only a subset of models (e.g., Qwen2.5-omni, GLM-4-Voice, Kimi-Audio) exceeds 60% MRD, indicating reliable multi-turn consistency.
Kimi-Audio and GLM-4-Voice demonstrate the strongest style control across speed, volume, and pitch, with high VSP and SVD.
LLaMA-omni2 and Baichuan-omni-1.5 show limited emotional adjustment responses to prompts.
Model performance gaps are linked to training data composition and the use of speech tokenizers that preserve paralinguistic cues.
StyleBench provides insight into how speech tokenizer design (e.g., the GLM-4-Voice tokenizer) affects how much acoustic variation is retained.

This review was generated by AI and checked by a human editor.