QUICK REVIEW

[Paper Review] SING: Symbol-to-Instrument Neural Generator

Alexandre Défossez, Neil Zeghidour | arXiv (Cornell University) | Oct 23, 2018

Speech and Audio Processing | Cited by 26

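One-Sentence Summary

SING is a lightweight, non-autoregressive neural audio synthesizer that generates high-fidelity musical notes conditioned on instrument, pitch, and velocity, predicting each 1024-sample audio frame in a single step. Using a new spectral loss on log spectrograms, it achieves state-of-the-art perceptual quality on the NSynth dataset, trains about 32 times faster than a WaveNet-based autoencoder baseline, and runs inference about 2,500 times faster.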

ABSTRACT

Recent progress in deep learning for audio synthesis opens the way to models that directly produce the waveform, shifting away from the traditional paradigm of relying on vocoders or MIDI synthesizers for speech or music generation. Despite their successes, current state-of-the-art neural audio synthesizers such as WaveNet and SampleRNN suffer from prohibitive training and inference times because they are based on autoregressive models that generate audio samples one at a time at a rate of 16kHz. In this work, we study the more computationally efficient alternative of generating the waveform frame-by-frame with large strides. We present SING, a lightweight neural audio synthesizer for the original task of generating musical notes given desired instrument, pitch and velocity. Our model is trained end-to-end to generate notes from nearly 1000 instruments with a single decoder, thanks to a new loss function that minimizes the distances between the log spectrograms of the generated and target waveforms. On the generalization task of synthesizing notes for pairs of pitch and instrument not seen during training, SING produces audio with significantly improved perceptual quality compared to a state-of-the-art autoencoder based on WaveNet as measured by a Mean Opinion Score (MOS), and is about 32 times faster for training and 2,500 times faster for inference.

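Motivation and Objectives

Develop a computationally efficient neural audio synthesizer that avoids autoregressive generation, so that both training and inference are faster.
Train a single model end-to-end that covers nearly 1,000 instruments, 65 pitches, and 5 velocities.
Improve perceptual quality over existing autoencoder approaches while substantially reducing computational cost.
Disentangle pitch, instrument, and velocity in a low-dimensional latent space by means of a new spectral loss function.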

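Proposed Method

The model uses a 3-layer LSTM to encode instrument, pitch, and velocity into a latent embedding for each audio frame.
A single 4-layer convolutional decoder generates 1024-sample audio frames from the latent embeddings in one forward pass.
A new spectral loss takes the 1-norm of the difference between the log-power spectrograms of the generated and target waveforms, which makes training phase-invariant.
The LSTM is initialized with the help of a pre-trained convolutional autoencoder that reconstructs raw waveforms under the same spectral loss.
The model is then trained end-to-end by backpropagating the spectral loss, jointly optimizing the encoder and decoder.
The model is evaluated with human listening tests, a Mean Opinion Score (MOS) study and an ABX similarity task, to measure naturalness and fidelity.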
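The spectral loss described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: the Hann window, the FFT size of 1024, the hop of 256 samples, and the offset `eps = 1.0` inside the logarithm are all assumptions made for the example, and the function names are hypothetical. The loss is averaged over time-frequency bins, which is the 1-norm up to a constant scale.

```python
import numpy as np


def log_power_spectrogram(wave, n_fft=1024, hop=256, eps=1.0):
    """Return log(eps + |STFT(wave)|^2) for a 1-D waveform.

    n_fft, hop, eps and the Hann window are assumed values for this sketch.
    """
    window = np.hanning(n_fft)
    n_frames = 1 + (len(wave) - n_fft) // hop
    # Slice the waveform into overlapping, windowed frames.
    frames = np.stack(
        [wave[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    spectrum = np.fft.rfft(frames, axis=1)
    return np.log(eps + np.abs(spectrum) ** 2)


def spectral_loss(generated, target, **stft_kwargs):
    """Mean absolute difference between log-power spectrograms.

    Equal to the 1-norm of the difference divided by the number of bins.
    """
    diff = log_power_spectrogram(generated, **stft_kwargs) - log_power_spectrogram(
        target, **stft_kwargs
    )
    return float(np.mean(np.abs(diff)))


# Toy comparison: a phase-shifted copy of a note versus a note one octave up.
sample_rate = 16000
t = np.arange(sample_rate) / sample_rate
note = np.sin(2 * np.pi * 440.0 * t)
phase_shifted = np.sin(2 * np.pi * 440.0 * t + np.pi / 2)
octave_up = np.sin(2 * np.pi * 880.0 * t)

print("loss vs. phase-shifted copy:", spectral_loss(phase_shifted, note))
print("loss vs. octave-up note:   ", spectral_loss(octave_up, note))
```

Because only magnitudes enter the loss, the phase-shifted copy is penalized far less than the note at a different pitch, even though its sample-by-sample distance to the original is large. That is the sense in which the loss is phase-invariant.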

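Experimental Results

Research Questions

RQ1: Can a non-autoregressive, frame-level audio generation model match the perceptual quality of an autoregressive WaveNet model?
RQ2: Does a spectral loss on log spectrograms enable effective, phase-invariant training without post-processing?
RQ3: Can a single-decoder model generalize at inference time to instrument and pitch combinations not seen during training?
RQ4: How well does the model disentangle pitch, instrument, and velocity in its latent representation?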

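Key Findings

SING achieves a Mean Opinion Score (MOS) of 3.55 ± 0.23, significantly higher than the WaveNet-based autoencoder baseline (2.85 ± 0.24), indicating better perceptual quality.
SING trains about 32 times faster than the baseline (120 GPU-hours vs. 3,840 GPU-hours) and runs inference about 2,500 times faster (512 vs. 0.2 seconds of audio generated per second of compute).
In the ABX similarity test, 69.7% of human judgments favored SING's output, indicating that it is closer to the ground-truth notes.
The model achieves a compression factor of 2,133, meaning that it represents an audio sequence with far fewer latent dimensions than the raw waveform has samples.
At 243 MB, SING is about 3.9 times smaller than the WaveNet-based baseline (948 MB), a substantial reduction in memory footprint.
The model successfully synthesizes notes for instrument and pitch combinations not seen during training, demonstrating strong generalization.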
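The headline ratios can be rechecked directly from the raw figures quoted in the findings. The short script below is only an arithmetic check on those numbers. The variable names are illustrative.

```python
# Figures as reported in the findings above.
baseline_train_gpu_hours = 3840
sing_train_gpu_hours = 120
sing_generation_speed = 512      # seconds of audio per second of compute
baseline_generation_speed = 0.2  # seconds of audio per second of compute
baseline_size_mb = 948
sing_size_mb = 243

train_speedup = baseline_train_gpu_hours / sing_train_gpu_hours
inference_speedup = sing_generation_speed / baseline_generation_speed
size_ratio = baseline_size_mb / sing_size_mb

print(f"training speedup:  {train_speedup:.0f}x")
print(f"inference speedup: {inference_speedup:.0f}x")
print(f"model size ratio:  {size_ratio:.2f}x")
```

The training ratio comes out at exactly 32, the inference ratio at roughly 2,560 (quoted as "about 2,500"), and the model size ratio at roughly 3.9.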

This review was generated by AI and checked by a human editor.