QUICK REVIEW

[Paper Review] Learning pronunciation from a foreign language in speech synthesis networks

Younggun Lee, Suwon Shon|arXiv (Cornell University)|Nov 23, 2018

Speech Recognition and Synthesis | References: 15 | Cited by: 24

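One-Sentence Summary

This paper proposes a multilingual speech synthesis framework that exploits phoneme similarity across languages to improve text-to-speech quality for low-resource languages. The network is pre-trained on data from both a high-resource and a low-resource language and then fine-tuned on the limited low-resource data; it learns shared phoneme embeddings that reflect cross-lingual pronunciation similarity, significantly improves synthesis quality, and is shown to extend to 10 further languages.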

ABSTRACT

Although there are more than 6,500 languages in the world, the pronunciations of many phonemes sound similar across the languages. When people learn a foreign language, their pronunciation often reflects their native language's characteristics. This motivates us to investigate how the speech synthesis network learns the pronunciation from datasets from different languages. In this study, we are interested in analyzing and taking advantage of multilingual speech synthesis network. First, we train the speech synthesis network bilingually in English and Korean and analyze how the network learns the relations of phoneme pronunciation between the languages. Our experimental result shows that the learned phoneme embedding vectors are located closer if their pronunciations are similar across the languages. Consequently, the trained networks can synthesize the English speakers' Korean speech and vice versa. Using this result, we propose a training framework to utilize information from a different language. To be specific, we pre-train a speech synthesis network using datasets from both high-resource language and low-resource language, then we fine-tune the network using the low-resource language dataset. Finally, we conducted more simulations on 10 different languages to show it is generally extendable to other languages.

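Research Motivation and Goals

Investigate how a multilingual speech synthesis network learns and represents pronunciation similarity between languages.
Address the central challenge of low-resource TTS: too little training data limits model performance.
Develop a pre-training framework that uses high-resource language data, through shared phoneme representations, to improve low-resource TTS.
Verify that the proposed method generalizes to language pairs beyond English and Korean.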

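Proposed Method

Train a multilingual, multi-speaker Tacotron model on paired text-speech datasets in English and Korean, with a phoneme embedding table shared across the languages.
Normalize the phoneme embeddings, which represent pronunciation, so that phoneme similarity can be compared across languages.
Use speaker embedding vectors to separate voice characteristics from linguistic content, which supports multi-speaker and multilingual synthesis.
Use a two-stage training procedure: pre-train on datasets from both the high-resource language (e.g., English) and the low-resource language (e.g., Korean), then fine-tune on the limited low-resource dataset alone.
Extend the framework to 10 additional languages using the Common Voice dataset, with 2 hours of fine-tuning data per language.
Evaluate with human preference tests (7-point scale) and with automatic word error rate (WER) measured using the Google speech recognition API.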
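
The embedding comparison described above can be illustrated with a minimal sketch. It assumes that "normalization" means scaling each embedding to unit length and that similarity is measured by cosine similarity; the review does not state the exact procedure. The phoneme labels and the three-dimensional vectors are invented for illustration and are not taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (an all-zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def cosine_similarity(a, b):
    """Cosine similarity, computed as the dot product of the unit-length vectors."""
    ua, ub = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(ua, ub))


def nearest_phoneme(query, table):
    """Return the symbol in `table`, other than `query`, most similar to `query`."""
    return max(
        (sym for sym in table if sym != query),
        key=lambda sym: cosine_similarity(table[query], table[sym]),
    )


# Hypothetical embeddings: labels and values are assumptions for illustration only.
toy_embeddings = {
    "en:M": [0.9, 0.1, 0.0],
    "ko:m": [0.8, 0.2, 0.1],
    "en:S": [0.0, 0.9, 0.3],
    "ko:s": [0.1, 0.8, 0.4],
}

if __name__ == "__main__":
    for symbol in toy_embeddings:
        print(symbol, "->", nearest_phoneme(symbol, toy_embeddings))
```

With real embeddings taken from a trained network, the same nearest-neighbour lookup would show which phonemes of one language sit closest to a given phoneme of the other.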

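Experimental Results

Research Questions

RQ1: How does a multilingual speech synthesis network represent pronunciation similarity between phonemes from different languages?
RQ2: Does pre-training on a high-resource language improve a TTS model for a low-resource language?
RQ3: Does the learned phoneme embedding space reflect cross-lingual phoneme similarity even when the languages share no speakers?
RQ4: How well does the proposed pre-training framework generalize across language pairs?
RQ5: Can the model generate natural speech in a low-resource language using only limited target-language data and high-resource-language pre-training?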

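Key Findings

Phonemes from different languages lie closer together in the embedding space when their pronunciations are similar, indicating that the model learns cross-lingual phoneme relations.
The proposed pre-training framework (PA-HL) significantly outperforms the baseline on both the subjective preference test and objective WER, reaching a WER of 15.0% with 10 hours of fine-tuning data.
With 0.4 hours of fine-tuning data, PA-HL was preferred over the baseline in 54.0% of comparisons.
With 10 hours of fine-tuning data, PA-HL received 68.3% of the votes in the preference test and outperformed the other models in all tested language pairs.
The method also generalizes well to the 10 additional languages: PA-HL was preferred over PD-H in 9 of them in the preference test, supporting its broad applicability.
Models trained with insufficient data (e.g., 0.4 hours) had difficulty learning attention alignment, whereas PA-HL trained stably and performed consistently across all data settings.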
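
For readers unfamiliar with the WER figures quoted above, the metric is the word-level edit distance (substitutions, deletions, and insertions) between a reference text and a recognizer transcript, divided by the number of reference words. The sketch below covers only that arithmetic. It makes no call to any speech recognition service, and the whitespace tokenization is an assumption, since the review does not describe the text normalization used in the paper.

```python
def word_error_rate(reference, hypothesis):
    """Return (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,  # deletion of a reference word
                    curr[j - 1] + 1,  # insertion of a hypothesis word
                    prev[j - 1] + substitution_cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Invented sentences: one substitution and one deletion over six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of 15.0% therefore means roughly 15 word-level errors per 100 reference words in the transcripts of the synthesized speech.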


This review was generated by AI and reviewed by a human editor.