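[Paper Explained] Towards Transfer Learning for End-to-End Speech Synthesis from Deep Pre-Trained Language Models
This paper integrates BERT representations into Tacotron-2 as a parallel text encoder, using transfer learning to improve end-to-end text-to-speech synthesis. BERT's deep contextual embeddings are combined with the output of the Tacotron-2 encoder at every decoding step. The resulting model converges faster during training and babbles far less at the end of synthesized audio, although naturalness and objective metrics improve only marginally over the baseline.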
Modern text-to-speech (TTS) systems are able to generate audio that sounds almost as natural as human speech. However, the bar for developing high-quality TTS systems remains high, since a sizable set of studio-quality <text, audio> pairs is usually required. Compared to the commercial data used to develop state-of-the-art systems, publicly available data are usually worse in terms of both quality and size. Audio generated by TTS systems trained on publicly available data tends not only to sound less natural, but also to exhibit more background noise. In this work, we aim to lower TTS systems' reliance on high-quality data by providing them with the textual knowledge extracted by deep pre-trained language models during training. In particular, we investigate the use of BERT to assist the training of Tacotron-2, a state-of-the-art TTS system consisting of an encoder and an attention-based decoder. BERT representations learned from large amounts of unlabeled text data are shown to contain very rich semantic and syntactic information about the input text, and have the potential to be leveraged by a TTS system to compensate for the lack of high-quality data. We incorporate BERT as a parallel branch to the Tacotron-2 encoder with its own attention head. An input text is simultaneously passed into BERT and the Tacotron-2 encoder. The representations extracted by the two branches are concatenated and then fed to the decoder. As a preliminary study, although we have not found that incorporating BERT into Tacotron-2 generates more natural or cleaner speech at a human-perceivable level, we observe improvements in other aspects: the model is significantly better at knowing when to stop decoding, so there is much less babbling at the end of the synthesized audio, and it converges faster during training.
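Motivation and Goals
- Reduce the dependence of end-to-end text-to-speech (TTS) systems on high-quality data by leveraging pre-trained language models.
- Reduce the need for expensive, studio-quality <text, audio> paired data by injecting the rich linguistic knowledge captured by BERT.
- Improve training efficiency and inference behavior, in particular stop-token prediction, without sacrificing speech naturalness.
- Explore whether pre-trained language representations improve TTS performance in a low-resource setting that uses publicly available data.
Proposed Method
- Add BERT as a branch encoder that runs in parallel with the Tacotron-2 encoder and processes the same input text.
- Extract a contextual representation for each input token from BERT's final layer.
- At each decoder time step, concatenate the BERT-derived representation with the one derived from the Tacotron-2 encoder output.
- Use separate attention heads in the decoder, one attending to the Tacotron-2 encoder representations and one to the BERT representations.
- Feed the concatenated context vector into the decoder's autoregressive LSTM to predict spectral features.
- Train the whole model jointly end to end with the standard Tacotron-2 loss, fine-tuning both the TTS and the BERT components.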
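The two-branch attention described in the list above can be illustrated with a minimal sketch. It is not the authors' implementation: plain dot-product attention stands in for Tacotron-2's location-sensitive attention, the two query vectors are assumed to come from learned projections of the decoder state, and the names `attend` and `dual_context` are hypothetical. The point it shows is that the two memories may differ in length (characters versus BERT word pieces) and in width, because each gets its own attention head and only the resulting context vectors are concatenated.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, memory):
    """Dot-product attention (a simplification, not Tacotron-2's
    location-sensitive attention). Returns (context, weights)."""
    scores = [sum(q * m for q, m in zip(query, row)) for row in memory]
    weights = softmax(scores)
    width = len(memory[0])
    context = [sum(w * row[d] for w, row in zip(weights, memory))
               for d in range(width)]
    return context, weights

def dual_context(taco_query, bert_query, taco_memory, bert_memory):
    """One decoder step: attend to each branch separately, then
    concatenate the two context vectors for the decoder LSTM."""
    taco_ctx, taco_w = attend(taco_query, taco_memory)
    bert_ctx, bert_w = attend(bert_query, bert_memory)
    return taco_ctx + bert_ctx, taco_w, bert_w

# Toy memories: 5 character states of width 2, 3 word-piece states of width 3.
taco_memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5], [0.0, 0.0]]
bert_memory = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
context, taco_w, bert_w = dual_context(
    [1.0, 0.0], [0.0, 1.0, 0.0], taco_memory, bert_memory)
```

Concatenating after attention rather than before sidesteps the mismatch between character-level and word-piece-level sequence lengths, which a direct position-by-position concatenation of the two encoder outputs would have to resolve.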
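Experimental Results
Research Questions
- RQ1: When training on publicly available, lower-quality data, do representations from a pre-trained language model such as BERT improve end-to-end TTS performance?
- RQ2: Does integrating BERT representations make TTS training converge faster than standard Tacotron-2?
- RQ3: Does integrating BERT representations reduce common TTS artifacts such as babbling at the end of synthesis or over-generation?
- RQ4: To what extent do BERT representations improve the model's ability to predict when to stop decoding?
- RQ5: Even though naturalness changes little, do objective metrics (such as MCD13 and FFE) show a measurable improvement?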
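RQ3 and RQ4 both hinge on the stop token. In Tacotron-2 the decoder emits a stop probability alongside each spectrogram frame, and generation ends at the first frame whose probability exceeds 0.5. The sketch below illustrates why a poorly calibrated stop token produces babbling: if the threshold is never crossed, decoding runs until a step cap. The function name `frames_to_emit` and the `max_steps` cap are illustrative assumptions, not taken from the paper.

```python
def frames_to_emit(stop_probs, threshold=0.5, max_steps=1000):
    """Return (number_of_frames, stopped_cleanly) for a sequence of
    per-frame stop probabilities produced by the decoder."""
    capped = stop_probs[:max_steps]
    for step, prob in enumerate(capped, start=1):
        if prob > threshold:
            # First frame over the threshold ends generation.
            return step, True
    # Threshold never crossed: decoding ran to the cap, which is
    # where trailing babbling comes from.
    return len(capped), False

clean = frames_to_emit([0.01, 0.05, 0.20, 0.90, 0.10])
runaway = frames_to_emit([0.01] * 50, max_steps=20)
```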
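Key Findings
- The training curves in Figure 2 show that the proposed model converges significantly faster than the baseline Tacotron-2.
- The model with BERT babbles far less at the end of synthesis; its decoder learns more accurately when to stop decoding.
- Despite faster convergence and more accurate stop prediction, neither perceptual quality nor the MCD13/FFE metrics show a statistically significant improvement over the baseline by the end of training.
- FFE correlates with speech quality better than MCD13 does; MCD13 fluctuates and correlates poorly with perceived naturalness.
- Attention visualizations show that the attention over BERT is less focused and more diffuse, suggesting that BERT supplies complementary rather than dominant information.
- The BERT encoder's representations have little influence on the attention alignment, which indicates that the main text-to-acoustics mapping is still driven by the representations learned by the Tacotron-2 encoder.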
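For readers unfamiliar with the two objective metrics named in the findings, here is a minimal sketch of their usual definitions. It rests on several assumptions: frames are already time-aligned (in practice an alignment step such as dynamic time warping is typically needed), the caller supplies 13 mel-cepstral coefficients per frame for MCD13, an F0 of 0 marks an unvoiced frame, and the 20% pitch tolerance is the conventional choice for FFE. Whether the energy coefficient is included in MCD varies between implementations, and the paper's exact setup is not stated here.

```python
import math

def mcd(ref_frames, syn_frames):
    """Mean mel-cepstral distortion in dB over time-aligned frames.
    Pass 13 coefficients per frame to obtain MCD13."""
    const = 10.0 / math.log(10.0) * math.sqrt(2.0)
    per_frame = [
        const * math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))
        for ref, syn in zip(ref_frames, syn_frames)
    ]
    return sum(per_frame) / len(per_frame)

def ffe(ref_f0, syn_f0, tol=0.2):
    """F0 frame error: fraction of frames with either a voicing
    decision error or a pitch deviation above `tol` (relative)."""
    errors = 0
    for ref, syn in zip(ref_f0, syn_f0):
        ref_voiced, syn_voiced = ref > 0, syn > 0
        if ref_voiced != syn_voiced:
            errors += 1  # voicing decision error
        elif ref_voiced and abs(syn - ref) > tol * ref:
            errors += 1  # gross pitch error
    return errors / len(ref_f0)
```

Lower is better for both metrics. FFE counts discrete voicing and pitch mistakes, whereas MCD averages a spectral distance over every frame.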
This summary was generated by AI and reviewed by a human editor.