QUICK REVIEW

[Paper Review] Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs

Lalaram Arya, Mrinmoy Bhattacharjee|arXiv (Cornell University)|Jan 22, 2026

Speech Recognition and Synthesis|Cited by 0

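One-Sentence Summary

This paper presents DS2ST-LM, a single-stage direct speech-to-speech translation framework driven by a large language model, featuring a large-scale semantically aligned dataset, three projection architectures, and timbre-controlled synthesis across multiple language pairs.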

ABSTRACT

Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another, while reducing error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems continue to face notable challenges, including instability in semantic-acoustic alignment when parallel speech data is scarce, difficulty in preserving speaker identity, and limited multilingual scalability. In this work, we introduce DS2ST-LM, a scalable, single-stage direct S2ST framework leveraging a multilingual Large Language Model (LLM). The architecture integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder. We construct GigaS2S-1000, a 1000-hour bilingual corpus by extending the GigaST dataset with high-fidelity synthetic target speech, and show that this synthetic data alleviates data scarcity to some extent. We investigate two semantic token generation strategies: speech-derived S3 tokens and text-derived tokens generated by a pre-trained LLM, and analyze their impact on training stability and semantic consistency. We further evaluate three projection architectures (Linear, Conv1D-Linear, and Q-Former) and observe that while higher-capacity projectors converge faster, the simple Linear projector achieves higher performance. Extensive experiments demonstrate that DS2ST-LM outperforms traditional cascaded and ST (Qwen-Audio) + TTS baselines across both lexical (BLEU, METEOR) and semantic (BLEURT, COMET) metrics, while extending to multiple language pairs, including French, Spanish, German, Hindi, Bengali, and Urdu. Furthermore, we incorporate timbre-aware speech synthesis to preserve speaker information, enabling DS2ST-LM to surpass prior direct S2ST systems in both speaker similarity and perceptual naturalness.

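Research Motivation and Objectives

Address the instability in semantic-acoustic alignment caused by scarce parallel speech data.
Preserve speaker identity when translating across multiple language pairs.
Combine an LLM-based decoder with timbre-aware audio synthesis to enable scalable, single-stage direct S2ST.
Create and release a large-scale, semantically aligned S2ST dataset to support research.
Evaluate how projection architectures and semantic token generation strategies affect training stability and translation quality.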

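Proposed Method

Integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder into the single-stage DS2ST-LM framework.
Builds GigaS2S-1000, a 1000-hour Chinese-English bilingual corpus, using XTTS-v2 to generate high-fidelity synthetic Chinese speech.
Uses two semantic token generation strategies during training: speech-derived supervised semantic tokens (S3 tokens) and text-derived semantic tokens produced by a pre-trained LLM.
Explores three projection architectures (Linear, Conv1D-Linear, Q-Former) that map speech embeddings into the LLM space, and analyzes their convergence and translation quality.
Uses semantic group modeling to align audio and text token rates during decoding, together with a joint audio/text token loss.
Introduces a timbre-controlled neural vocoder, conditioned on a speaker prompt, to synthesize target speech while preserving timbre.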
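
The shapes involved in two of the steps above can be sketched in a few lines of plain Python. This is only an illustrative sketch, not the authors' implementation: the dimensions are toy values, the weights are random, and the names `linear_project` and `group_tokens`, the group size of 3, and the padding id are all assumptions introduced here. The first function shows what a Linear projector does (one matrix multiply per encoder frame, from encoder width to LLM width); the second shows one plausible reading of semantic grouping, where consecutive audio tokens are packed into fixed-size groups so that the audio sequence the LLM steps through is shorter.

```python
import random


def linear_project(frames, weight, bias):
    """Map each encoder frame (length d_in) to the LLM width (length d_out)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def group_tokens(tokens, group_size, pad_id):
    """Pack consecutive audio tokens into fixed-size groups, padding the last one."""
    padded = tokens + [pad_id] * (-len(tokens) % group_size)
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


# Toy sizes (assumptions): real encoder and LLM widths are much larger.
D_ENC, D_LLM, N_FRAMES = 4, 3, 5

rng = random.Random(0)
weight = [[rng.uniform(-1, 1) for _ in range(D_ENC)] for _ in range(D_LLM)]
bias = [0.0] * D_LLM
frames = [[rng.uniform(-1, 1) for _ in range(D_ENC)] for _ in range(N_FRAMES)]

# One projected vector per encoder frame, now in the LLM embedding width.
projected = linear_project(frames, weight, bias)
print(len(projected), len(projected[0]))  # 5 3

# Seven hypothetical audio token ids packed into groups of three.
audio_tokens = [11, 12, 13, 14, 15, 16, 17]
groups = group_tokens(audio_tokens, group_size=3, pad_id=0)
print(groups)  # [[11, 12, 13], [14, 15, 16], [17, 0, 0]]
```

With a group size of 3, seven audio tokens become three decoding steps instead of seven, which is the sense in which grouping brings the audio token rate closer to the text token rate.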

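Experimental Results

Research Questions

RQ1: How does DS2ST-LM perform on direct S2ST across multiple language pairs compared with cascaded and ST+TTS baselines?
RQ2: How do the projection architectures (Linear, Conv1D-Linear, Q-Former) affect training stability and translation quality?
RQ3: How do the semantic token generation strategies (speech-derived S3 tokens versus text-derived tokens) affect semantic alignment and model stability?
RQ4: Can timbre-aware synthesis preserve speaker identity in direct S2ST while maintaining translation quality?
RQ5: Does synthetic data (GigaS2S-1000) alleviate data scarcity in cross-lingual training for direct S2ST?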

Key Findings

Dataset (zh–en)	Model	BLEU	METEOR	BLEURT	COMET
Seamless-Align	Cascaded	4.78	0.25	0.30	0.34
Seamless-Align	ST + TTS	5.91	0.27	0.35	0.49
Seamless-Align	DS2ST-LM	7.11	0.37	0.42	0.58
GigaS2S-1000	Cascaded	6.84	0.16	0.37	0.39
GigaS2S-1000	ST + TTS	11.36	0.32	0.43	0.54
GigaS2S-1000	DS2ST-LM	14.71	0.45	0.53	0.71
FLEURS	Cascaded	5.78	0.23	0.36	0.38
FLEURS	ST + TTS	9.17	0.25	0.41	0.53
FLEURS	DS2ST-LM	11.46	0.45	0.53	0.68

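DS2ST-LM outperforms the cascaded and ST + TTS baselines on both lexical and semantic metrics across all three datasets.
On Seamless-Align zh–en, DS2ST-LM reaches BLEU 7.11 and BLEURT 0.42, versus 5.91 and 0.35 for the stronger ST + TTS baseline.
On GigaS2S-1000 zh–en, DS2ST-LM reaches BLEU 14.71 and BLEURT 0.53, versus 11.36 and 0.43 for ST + TTS.
On FLEURS zh–en, DS2ST-LM reaches BLEU 11.46 and BLEURT 0.53, versus 9.17 and 0.41 for ST + TTS.
Higher-capacity projectors converge faster, but the simple Linear projector achieves the highest performance in this setting.
Timbre-aware synthesis improves speaker similarity and perceptual naturalness over prior direct S2ST systems.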
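
To put the BLEU gaps in proportion, the reported scores can be turned into absolute and relative gains over the stronger ST + TTS baseline. The snippet below is a small arithmetic sketch that uses only the BLEU values from the results table; the dictionary layout and the helper name `relative_gain` are illustrative choices, not part of the paper.

```python
# BLEU scores copied from the zh-en results table above.
bleu = {
    "Seamless-Align": {"Cascaded": 4.78, "ST + TTS": 5.91, "DS2ST-LM": 7.11},
    "GigaS2S-1000": {"Cascaded": 6.84, "ST + TTS": 11.36, "DS2ST-LM": 14.71},
    "FLEURS": {"Cascaded": 5.78, "ST + TTS": 9.17, "DS2ST-LM": 11.46},
}


def relative_gain(new, old):
    """Percentage improvement of `new` over `old`."""
    return 100.0 * (new - old) / old


gains = {}
for dataset, row in bleu.items():
    absolute = row["DS2ST-LM"] - row["ST + TTS"]
    gains[dataset] = relative_gain(row["DS2ST-LM"], row["ST + TTS"])
    print(f"{dataset}: +{absolute:.2f} BLEU ({gains[dataset]:.1f}% relative)")

# Seamless-Align: +1.20 BLEU (20.3% relative)
# GigaS2S-1000: +3.35 BLEU (29.5% relative)
# FLEURS: +2.29 BLEU (25.0% relative)
```

The relative gains are sizeable (roughly 20 to 30 percent), but the absolute BLEU scores remain low on all three datasets, so the numbers are best read as a comparison between systems under the same evaluation setup.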

This review was generated by AI and reviewed by human editors.