QUICK REVIEW

[Paper Review] Neural Analysis and Synthesis: Reconstructing Speech from Self-Supervised Representations

Hyeong-Seok Choi, Juheon Lee|arXiv (Cornell University)|Oct 27, 2021

Speech Recognition and Synthesis | Cited by 55

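ONE-SENTENCE SUMMARY

NANSY is a fully self-supervised neural framework for analyzing and synthesizing speech. Without any labeled data, it enables zero-shot voice conversion, Yingram-based pitch shifting, and time-scale modification.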

ABSTRACT

We present a neural analysis and synthesis (NANSY) framework that can manipulate voice, pitch, and speed of an arbitrary speech signal. Most of the previous works have focused on using information bottleneck to disentangle analysis features for controllable synthesis, which usually results in poor reconstruction quality. We address this issue by proposing a novel training strategy based on information perturbation. The idea is to perturb information in the original input signal (e.g., formant, pitch, and frequency response), thereby letting synthesis networks selectively take essential attributes to reconstruct the input signal. Because NANSY does not need any bottleneck structures, it enjoys both high reconstruction quality and controllability. Furthermore, NANSY does not require any labels associated with speech data such as text and speaker information, but rather uses a new set of analysis features, i.e., wav2vec feature and newly proposed pitch feature, Yingram, which allows for fully self-supervised training. Taking advantage of fully self-supervised training, NANSY can be easily extended to a multilingual setting by simply training it with a multilingual dataset. The experiments show that NANSY can achieve significant improvement in performance in several applications such as zero-shot voice conversion, pitch shift, and time-scale modification.

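RESEARCH MOTIVATION AND GOALS

Reconstruct and controllably manipulate arbitrary speech signals from high-level analysis features, without text or speaker labels.
Disentangle linguistic, pitch, and speaker information through information perturbation while preserving reconstruction quality.
Enable applications such as zero-shot voice conversion, formant-preserving pitch shifting, and time-scale modification, including in multilingual settings.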

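PROPOSED METHOD

Use wav2vec 2.0 features from XLSR-53 as language-independent linguistic information, taken from an intermediate layer (layer 12 of 24).
Extract speaker information from the same wav2vec model through a self-supervised speaker embedding network.
Introduce Yingram, a pitch-related feature derived from the difference function of the YIN algorithm and mapped onto a MIDI-like axis, for controllable pitch.
Perturb information to encourage feature disentanglement: apply formant shifting, pitch randomization, and a parametric equalizer in series to the wav2vec input, and apply a perturbation that leaves pitch intact to the Yingram input.
Split synthesis into two generators, G_S (source, driven by Yingram) and G_F (filter, driven by wav2vec), and sum their outputs to form the mel spectrogram.
Train with an L1 loss plus a projection-conditioned GAN loss to improve naturalness, then use HiFi-GAN to reconstruct the waveform.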
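
The Yingram feature described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the standard YIN cumulative mean normalized difference function and the standard MIDI-to-frequency conversion (MIDI 69 = 440 Hz), with linear interpolation between integer lags. The function names `cmnd`, `midi_to_lag`, and `yingram_frame`, and the choice of frame length and bin range, are hypothetical.

```python
import numpy as np

def cmnd(frame, max_lag):
    """Cumulative mean normalized difference function (as in YIN) for lags 0..max_lag."""
    w = len(frame) - max_lag  # window length so that every lag compares w samples
    d = np.zeros(max_lag + 1)
    for tau in range(1, max_lag + 1):
        diff = frame[:w] - frame[tau:tau + w]
        d[tau] = np.dot(diff, diff)  # squared difference at lag tau
    out = np.ones(max_lag + 1)  # value at lag 0 is defined as 1
    taus = np.arange(1, max_lag + 1)
    running_sum = np.cumsum(d[1:])
    # d(tau) divided by the mean of d(1..tau); fall back to 1 for silent frames
    out[1:] = np.where(running_sum > 0,
                       d[1:] * taus / np.maximum(running_sum, 1e-12),
                       1.0)
    return out

def midi_to_lag(midi, sr):
    """Lag in samples of one period of the given MIDI note (assumes MIDI 69 = 440 Hz)."""
    return sr / (440.0 * 2.0 ** ((midi - 69) / 12.0))

def yingram_frame(frame, sr, midi_bins):
    """Sample the normalized difference function at the lags of the given MIDI bins."""
    lags = midi_to_lag(np.asarray(midi_bins, dtype=float), sr)
    max_lag = int(np.ceil(lags.max()))
    d = cmnd(np.asarray(frame, dtype=float), max_lag)
    # Linear interpolation between the two integer lags around each fractional lag
    return np.interp(lags, np.arange(max_lag + 1), d)

if __name__ == "__main__":
    sr = 16000
    t = np.arange(2048) / sr
    tone = np.sin(2 * np.pi * 220.0 * t)  # synthetic test tone
    bins = np.arange(50, 71)              # hypothetical MIDI range for this demo
    print(bins[np.argmin(yingram_frame(tone, sr, bins))])
```

For a steady periodic frame, the smallest values fall at the bins whose lag matches the period. Because the bin axis is in semitones, moving along it corresponds to a musical pitch interval, which is one plausible reason such a representation suits pitch control.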

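EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Without any text or speaker labels, can NANSY reconstruct high-quality speech while disentangling linguistic, pitch, and speaker information?
RQ2: Does information perturbation outperform bottleneck-based methods in controllability and reconstruction quality?
RQ3: Can the model perform zero-shot voice conversion and controllable pitch shifting and time-scale modification in multilingual settings, and does test-time adaptation improve performance on unseen languages?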

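KEY FINDINGS

NANSY achieves high-quality reconstruction without labeled data and offers controllable manipulation of voice, pitch, and speed.
Yingram provides a pitch representation that is more robust than f0 in challenging scenarios, which supports effective pitch control and pitch shifting.
Information perturbation removes the trade-off between disentanglement and reconstruction quality and outperforms bottleneck-based methods on voice conversion metrics.
Test-time self-adaptation (TSA) improves (lowers) the character error rate (CER) on unseen languages by adapting only the input wav2vec features at inference time, without retraining the model.
NANSY performs well in zero-shot voice conversion (VC), multilingual VC, and VC on unseen languages, with competitive MOS and high SSIM across settings.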
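
The test-time self-adaptation finding rests on a simple mechanism: the model weights stay frozen and only the input features are optimized against a reconstruction loss. The toy sketch below illustrates that mechanism only. A fixed linear map stands in for the trained synthesis network, squared error stands in for the reconstruction loss, and all names (`adapt_input_features`, `recon_loss`, `frozen_w`) are hypothetical; none of this is taken from the paper's code.

```python
import numpy as np

def recon_loss(frozen_w, features, target):
    """Squared reconstruction error of the frozen stand-in model."""
    return 0.5 * np.sum((features @ frozen_w - target) ** 2)

def adapt_input_features(frozen_w, features, target, steps=100):
    """Gradient descent on the input features only; frozen_w is never modified."""
    z = features.copy()
    # Step size 1/L, where L (largest singular value squared) bounds the gradient's curvature
    lr = 1.0 / (np.linalg.norm(frozen_w, 2) ** 2)
    for _ in range(steps):
        residual = z @ frozen_w - target   # shape: (frames, output_bins)
        z -= lr * (residual @ frozen_w.T)  # gradient of recon_loss with respect to z
    return z

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.standard_normal((8, 5))        # frozen "model" (toy stand-in)
    z0 = rng.standard_normal((4, 8))       # initial input features
    target = rng.standard_normal((4, 5))   # observed target to reconstruct
    z1 = adapt_input_features(w, z0, target)
    print(recon_loss(w, z0, target), recon_loss(w, z1, target))
```

In a real system the gradient would come from backpropagation through the frozen network rather than a closed-form expression, but the division of roles is the same: the input is the only trainable quantity.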

This review was generated by AI and reviewed by a human editor.