QUICK REVIEW

[Paper Review] Timbre-Aware LLM-based Direct Speech-to-Speech Translation Extendable to Multiple Language Pairs

Lalaram Arya, Mrinmoy Bhattacharjee | arXiv (Cornell University) | 2026. 01. 22.

Speech Recognition and Synthesis | Citations: 0

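One-Line Summary

This paper presents DS2ST-LM, a single-stage, LLM-driven direct speech-to-speech translation framework that combines a large-scale semantically aligned dataset, a comparison of three projection architectures, and timbre-controlled synthesis across multiple language pairs.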

ABSTRACT

Direct Speech-to-Speech Translation (S2ST) has gained increasing attention for its ability to translate speech from one language to another, while reducing error propagation and latency inherent in traditional cascaded pipelines. However, existing direct S2ST systems continue to face notable challenges, including instability in semantic-acoustic alignment when parallel speech data is scarce, difficulty in preserving speaker identity, and limited multilingual scalability. In this work, we introduce DS2ST-LM, a scalable, single-stage direct S2ST framework leveraging a multilingual Large Language Model (LLM). The architecture integrates a Whisper speech encoder, a learnable projection module, a Qwen2-0.5B LLM, and a timbre-controlled vocoder. We construct GigaS2S-1000, a 1000-hour bilingual corpus by extending the GigaST dataset with high-fidelity synthetic target speech, and show that this synthetic data alleviates data scarcity to some extent. We investigate two semantic token generation strategies: speech-derived S3 tokens and text-derived tokens generated by a pre-trained LLM, and analyze their impact on training stability and semantic consistency. We further evaluate three projection architectures (Linear, Conv1D-Linear, and Q-Former) and observe that while higher-capacity projectors converge faster, the simple Linear projector achieves higher performance. Extensive experiments demonstrate that DS2ST-LM outperforms traditional cascaded and ST (Qwen-Audio) + TTS baselines across both lexical (BLEU, METEOR) and semantic (BLEURT, COMET) metrics, while extending to multiple language pairs, including French, Spanish, German, Hindi, Bengali, and Urdu. Furthermore, we incorporate timbre-aware speech synthesis to preserve speaker information, enabling DS2ST-LM to surpass prior direct S2ST systems in both speaker similarity and perceptual naturalness.

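Motivation and Objectives

Address the instability of semantic-acoustic alignment caused by scarce parallel speech data.
Translate across multiple language pairs while preserving speaker identity.
Build a scalable, single-stage direct S2ST system with an LLM-based decoder and timbre-aware vocoding.
Create and release a large-scale, semantically aligned S2ST dataset to support research.
Evaluate projection architectures and semantic token generation strategies for training stability and translation quality.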

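Proposed Method

Integrates a Whisper speech encoder, a learnable projection module, the Qwen2-0.5B LLM, and a timbre-controlled vocoder conditioned on a speaker prompt into the single-stage DS2ST-LM framework.
Builds GigaS2S-1000, a 1000-hour bilingual Chinese-English corpus, by using XTTS-v2 to generate high-fidelity synthetic Chinese speech.
Compares two semantic token generation strategies for training: speech-derived supervised semantic tokens (S3 tokens) and text-derived tokens generated by a pre-trained LLM.
Explores three projection architectures (Linear, Conv1D-Linear, and Q-Former) for mapping speech embeddings into the LLM space, and analyzes their convergence and translation quality.
Uses semantic group modeling to align the ratio of speech tokens to text tokens during decoding and to train with a joint speech/text token loss.
Introduces a timbre-controlled neural vocoder conditioned on a speaker prompt to synthesize target speech while preserving timbre.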
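
The semantic group modeling step can be pictured as packing several consecutive speech tokens into one decoding step, which brings the speech sequence length closer to the text sequence length. The sketch below shows only fixed-size grouping with padding. The group size, the padding id, the token ids, and the function name `group_tokens` are all hypothetical; this review does not give the paper's actual grouping scheme or how the joint loss is computed.

```python
from typing import List

PAD_ID = 0  # hypothetical padding token id, not taken from the paper


def group_tokens(tokens: List[int], group_size: int, pad_id: int = PAD_ID) -> List[List[int]]:
    """Split a semantic-token sequence into fixed-size groups, padding the last one."""
    if group_size < 1:
        raise ValueError("group_size must be >= 1")
    groups = []
    for start in range(0, len(tokens), group_size):
        chunk = tokens[start:start + group_size]
        # Pad the final chunk so every group has exactly group_size tokens.
        chunk = chunk + [pad_id] * (group_size - len(chunk))
        groups.append(chunk)
    return groups


speech_tokens = [11, 12, 13, 14, 15, 16, 17]  # made-up token ids
groups = group_tokens(speech_tokens, group_size=3)
print(groups)       # [[11, 12, 13], [14, 15, 16], [17, 0, 0]]
print(len(groups))  # 3 decoding steps instead of 7
```

With a group size of 3, seven speech tokens are emitted in three steps, so a larger group size trades finer per-step control for a shorter sequence.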

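Experimental Results

Research Questions

RQ1: How does DS2ST-LM compare with cascaded and ST + TTS baselines on direct S2ST across multiple language pairs?
RQ2: How does the projection architecture (Linear, Conv1D-Linear, Q-Former) affect training stability and translation quality?
RQ3: How does the semantic token generation strategy (speech-derived S3 tokens vs. text-derived tokens) affect semantic consistency and model stability?
RQ4: Can timbre-aware synthesis preserve speaker identity in direct S2ST while maintaining translation quality?
RQ5: Does synthetic data (GigaS2S-1000) alleviate data scarcity when training cross-lingual direct S2ST?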

Key Results

Model / Dataset	Seamless-Align (zh–en) BLEU	Seamless-Align (zh–en) METEOR	Seamless-Align (zh–en) BLEURT	Seamless-Align (zh–en) COMET	GigaS2S-1000 (zh–en) BLEU	GigaS2S-1000 (zh–en) METEOR	GigaS2S-1000 (zh–en) BLEURT	GigaS2S-1000 (zh–en) COMET	FLEURS (zh–en) BLEU	FLEURS (zh–en) METEOR	FLEURS (zh–en) BLEURT	FLEURS (zh–en) COMET
Cascaded	4.78	0.25	0.30	0.34	6.84	0.16	0.37	0.39	5.78	0.23	0.36	0.38
ST + TTS	5.91	0.27	0.35	0.49	11.36	0.32	0.43	0.54	9.17	0.25	0.41	0.53
DS2ST-LM	7.11	0.37	0.42	0.58	14.71	0.45	0.53	0.71	11.46	0.45	0.53	0.68
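
To put the BLEU columns in perspective, the short script below computes the absolute and relative gain of DS2ST-LM over each baseline. The scores are copied from the table above; the dictionary layout and the helper name `relative_gain` are illustrative choices, not part of the paper.

```python
# BLEU scores copied from the table above (zh-en direction).
bleu = {
    "Seamless-Align": {"Cascaded": 4.78, "ST + TTS": 5.91, "DS2ST-LM": 7.11},
    "GigaS2S-1000": {"Cascaded": 6.84, "ST + TTS": 11.36, "DS2ST-LM": 14.71},
    "FLEURS": {"Cascaded": 5.78, "ST + TTS": 9.17, "DS2ST-LM": 11.46},
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old


for dataset, scores in bleu.items():
    for baseline in ("Cascaded", "ST + TTS"):
        abs_gain = scores["DS2ST-LM"] - scores[baseline]
        rel = relative_gain(scores["DS2ST-LM"], scores[baseline])
        print(f"{dataset:15s} vs {baseline:9s}: +{abs_gain:.2f} BLEU ({rel:.1f}% relative)")
```

Against the stronger ST + TTS baseline, the relative BLEU gain works out to roughly 20% to 30% on each of the three test sets.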

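DS2ST-LM outperforms the cascaded and ST + TTS baselines on both lexical and semantic metrics across all three zh–en test sets.
On Seamless-Align zh–en, DS2ST-LM scores higher than the baselines in BLEU (7.11 vs. 5.91 for ST + TTS and 4.78 for Cascaded) and BLEURT (0.42 vs. 0.35 and 0.30).
On GigaS2S-1000 zh–en, DS2ST-LM reaches BLEU 14.71 and BLEURT 0.53, ahead of ST + TTS (11.36 and 0.43) and Cascaded (6.84 and 0.37).
On FLEURS zh–en, DS2ST-LM reaches BLEU 11.46 and BLEURT 0.53, ahead of ST + TTS (9.17 and 0.41) and Cascaded (5.78 and 0.36).
Higher-capacity projectors converge faster, but the simple Linear projector achieves the highest performance in this setting.
Timbre-aware synthesis improves speaker similarity and perceptual naturalness over prior direct S2ST systems.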
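
As a rough illustration of what the projector comparison is about, the sketch below implements a Linear projector and a Conv1D-Linear projector in plain Python and counts their parameters. Everything here is an assumption made for illustration: the toy dimensions, the kernel size and stride, the ReLU between the two layers, and all helper names. The review does not state the paper's actual layer sizes or whether its Conv1D-Linear projector downsamples in time. The Q-Former, which relies on learned queries and cross-attention, is left out.

```python
import random
from typing import List

Matrix = List[List[float]]


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights; a trained projector would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(frames: Matrix, weight: Matrix, bias: List[float]) -> Matrix:
    """Apply y = W x + b to each frame; weight has shape [d_out][d_in]."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weight, bias)]
        for frame in frames
    ]


def conv1d_linear(frames, conv_w, conv_b, lin_w, lin_b, kernel, stride):
    """Strided 1-D convolution over time (no padding), ReLU, then a linear layer.

    A convolution with `kernel` taps equals a linear map over each flattened
    window of `kernel` consecutive frames, so conv_w has shape
    [d_hidden][kernel * d_in].
    """
    windows = [
        [x for frame in frames[t:t + kernel] for x in frame]
        for t in range(0, len(frames) - kernel + 1, stride)
    ]
    hidden = [[max(0.0, h) for h in row] for row in linear(windows, conv_w, conv_b)]
    return linear(hidden, lin_w, lin_b)


# Toy sizes (hypothetical, far smaller than any real encoder or LLM).
T, D_ENC, D_HID, D_LLM, KERNEL, STRIDE = 8, 4, 5, 6, 2, 2
rng = random.Random(0)
speech = rand_matrix(T, D_ENC, rng)  # stand-in for encoder output frames

lin_w, lin_b = rand_matrix(D_LLM, D_ENC, rng), [0.0] * D_LLM
out_linear = linear(speech, lin_w, lin_b)

conv_w, conv_b = rand_matrix(D_HID, KERNEL * D_ENC, rng), [0.0] * D_HID
proj_w, proj_b = rand_matrix(D_LLM, D_HID, rng), [0.0] * D_LLM
out_conv = conv1d_linear(speech, conv_w, conv_b, proj_w, proj_b, KERNEL, STRIDE)

params_linear = D_LLM * D_ENC + D_LLM
params_conv = D_HID * KERNEL * D_ENC + D_HID + D_LLM * D_HID + D_LLM

print(len(out_linear), len(out_linear[0]), params_linear)  # 8 6 30
print(len(out_conv), len(out_conv[0]), params_conv)        # 4 6 81
```

In this toy setting the Conv1D-Linear variant has more parameters and halves the number of frames passed to the LLM, while the Linear variant keeps one output vector per input frame.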

This review was generated by AI and checked by a human editor.