QUICK REVIEW

[Paper Review] Universal audio synthesizer control with normalizing flows

Philippe Esling, Naotake Masuda|arXiv (Cornell University)|Jul 1, 2019

Music Technology and Sound Studies | References: 16 | Cited by: 34

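One-Sentence Summary

This paper formalizes synthesizer control as learning an organized latent audio space that maps invertibly to the parameter space, using a VAE with normalizing flows plus regression and disentangling flows to enable parameter inference, macro-controls, and audio-based preset exploration.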

ABSTRACT

The ubiquity of sound synthesizers has reshaped music production and even entirely defined new music genres. However, the increasing complexity and number of parameters in modern synthesizers make them harder to master. Hence, the development of methods allowing to easily create and explore with synthesizers is a crucial need. Here, we introduce a novel formulation of audio synthesizer control. We formalize it as finding an organized latent audio space that represents the capabilities of a synthesizer, while constructing an invertible mapping to the space of its parameters. By using this formulation, we show that we can address simultaneously automatic parameter inference, macro-control learning and audio-based preset exploration within a single model. To solve this new formulation, we rely on Variational Auto-Encoders (VAE) and Normalizing Flows (NF) to organize and map the respective auditory and parameter spaces. We introduce the disentangling flows, which allow to perform the invertible mapping between separate latent spaces, while steering the organization of some latent dimensions to match target variation factors by splitting the objective as partial density evaluation. We evaluate our proposal against a large set of baseline models and show its superiority in both parameter inference and audio reconstruction. We also show that the model disentangles the major factors of audio variations as latent dimensions, that can be directly used as macro-parameters. We also show that our model is able to learn semantic controls of a synthesizer by smoothly mapping to its parameters. Finally, we discuss the use of our model in creative applications and its real-time implementation in Ableton Live.

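Motivation and Objectives

Build an organized latent representation of a synthesizer's audio capabilities.
Provide an invertible mapping between the latent audio space and the synthesis parameter space.
Enable simultaneous parameter inference, macro-control learning, and audio-based preset exploration.
Introduce regression flows and disentangling flows to map and organize latent factors.
Demonstrate improvements over baselines in audio reconstruction and parameter inference.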

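Proposed Method

Formalize synthesizer control as learning two latent spaces connected by an invertible mapping.
Learn an organized latent audio space z with a variational auto-encoder (VAE), using normalizing flows to make the posterior more expressive.
Define a regression flow that maps the latent z to the synthesis parameters v under an additive Gaussian noise model.
Introduce the Flow_post and Flow_cond variants to refine the mapping and its uncertainty.
Extend the model with disentangling flows that align latent dimensions with semantic tags t (supervised where tags are available).
Train on a dataset from the Diva synthesizer (paired audio and MIDI-controllable parameter sets); evaluate against baselines on parameter inference and audio reconstruction.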
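
The invertible z-to-v mapping with additive Gaussian noise can be illustrated with a small NumPy sketch. This is not the paper's implementation: RealNVP-style affine coupling layers are swapped in here only because they invert in closed form, and the dimension (16), the noise scale `sigma`, the random weights `W_s` and `W_t`, and the helper names `z_to_v` and `v_to_z` are all assumptions made for illustration.

```python
import numpy as np

class AffineCoupling:
    """One RealNVP-style coupling layer (illustrative stand-in, not the paper's flow)."""

    def __init__(self, dim, rng, flip=False):
        self.half = dim // 2
        self.flip = flip  # alternate which half is transformed
        # Hypothetical tiny linear maps producing the log-scale and the shift.
        self.W_s = 0.1 * rng.standard_normal((self.half, dim - self.half))
        self.W_t = 0.1 * rng.standard_normal((self.half, dim - self.half))

    def _split(self, x):
        if self.flip:
            x = x[:, ::-1]
        return x[:, :self.half], x[:, self.half:]

    def _merge(self, a, b):
        y = np.concatenate([a, b], axis=1)
        return y[:, ::-1] if self.flip else y

    def forward(self, z):
        a, b = self._split(z)
        log_s = np.tanh(a @ self.W_s)   # bounded log-scale for stability
        t = a @ self.W_t
        out = self._merge(a, b * np.exp(log_s) + t)
        return out, log_s.sum(axis=1)   # output and log|det Jacobian|

    def inverse(self, v):
        a, b = self._split(v)
        log_s = np.tanh(a @ self.W_s)
        t = a @ self.W_t
        return self._merge(a, (b - t) * np.exp(-log_s))

def gaussian_log_likelihood(v_true, v_pred, sigma=0.1):
    """log N(v_true; v_pred, sigma^2 I): the additive Gaussian noise model."""
    d = v_true.shape[1]
    sq = ((v_true - v_pred) ** 2).sum(axis=1)
    return -0.5 * sq / sigma**2 - d * np.log(sigma) - 0.5 * d * np.log(2 * np.pi)

rng = np.random.default_rng(0)
layers = [AffineCoupling(16, rng, flip=(i % 2 == 1)) for i in range(4)]

def z_to_v(z):
    """Latent audio code -> synthesizer parameters (forward pass of the flow)."""
    log_det = np.zeros(z.shape[0])
    for layer in layers:
        z, ld = layer.forward(z)
        log_det += ld
    return z, log_det

def v_to_z(v):
    """Synthesizer parameters -> latent audio code (exact inverse)."""
    for layer in reversed(layers):
        v = layer.inverse(v)
    return v

z = rng.standard_normal((5, 16))   # placeholder latent codes
v, log_det = z_to_v(z)
z_back = v_to_z(v)                 # recovers z up to floating-point error
```

In a trained model the weights would be learned by maximizing `gaussian_log_likelihood` on paired latent codes and parameters; the point of the sketch is only that the same layers serve both directions of the mapping.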

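Experimental Results

Research Questions

RQ1: Does mapping an organized latent audio space invertibly to the parameter space improve parameter inference and audio reconstruction?
RQ2: Do regression flows and disentangling flows enable effective macro-control learning and semantic dimensions for perceptual control?
RQ3: Is the proposed method robust to a larger number of parameters and to out-of-domain audio?
RQ4: Can presets be navigated through audio-based neighborhood exploration of the latent space?
RQ5: How does the model perform in real-time settings such as Ableton Live?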

Key Findings

Model	16p Param MSE_n	16p Audio SC	16p Audio MSE	32p Param MSE_n	32p Audio SC	32p Audio MSE	Out-of-domain Audio MSE
MLP	0.236 ± 0.44	6.226 ± 0.13	9.548 ± 3.1	0.218 ± 0.46	13.51 ± 3.1	36.48 ± 11.9	2.348 ± 2.1
CNN	0.171 ± 0.45	1.372 ± 0.29	6.329 ± 1.9	0.159 ± 0.46	19.18 ± 4.7	33.40 ± 9.4	2.311 ± 2.2
ResNet	0.191 ± 0.43	1.004 ± 0.35	6.422 ± 1.9	0.196 ± 0.49	10.37 ± 1.8	31.13 ± 9.8	2.322 ± 1.6
AE	0.181 ± 0.40	0.893 ± 0.13	5.557 ± 1.7	0.169 ± 0.40	5.566 ± 1.2	17.71 ± 6.9	1.225 ± 2.2
VAE	0.182 ± 0.32	0.810 ± 0.03	4.901 ± 1.4	0.153 ± 0.34	5.519 ± 1.4	16.85 ± 6.1	1.237 ± 1.3
WAE	0.159 ± 0.37	0.787 ± 0.05	4.979 ± 1.5	0.147 ± 0.33	3.967 ± 0.88	16.64 ± 6.2	1.194 ± 1.5
VAE_flow	0.199 ± 0.32	0.838 ± 0.02	4.975 ± 1.4	0.164 ± 0.34	1.418 ± 0.23	17.74 ± 6.8	1.193 ± 1.8
Flow_reg	0.197 ± 0.31	0.752 ± 0.05	4.409 ± 1.6	0.193 ± 0.32	0.911 ± 1.4	16.61 ± 7.4	1.101 ± 1.2
Flow_dis.	0.199 ± 0.31	0.831 ± 0.04	5.103 ± 2.1	0.197 ± 0.42	1.481 ± 1.8	17.12 ± 7.9	1.209 ± 1.4
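
The audio columns above report spectral convergence (SC) and mean squared error (MSE) between spectrograms. This review does not define either metric, so the sketch below assumes the common definition of SC as a Frobenius-norm ratio and a plain element-wise MSE; the paper's exact normalization and scaling may differ, and the spectrogram shapes are placeholders.

```python
import numpy as np

def spectral_convergence(S_true, S_pred):
    """Assumed definition: ||S_true - S_pred||_F / ||S_true||_F."""
    # np.linalg.norm on a 2-D array defaults to the Frobenius norm.
    return np.linalg.norm(S_true - S_pred) / np.linalg.norm(S_true)

def mse(a, b):
    """Element-wise mean squared error."""
    return float(np.mean((a - b) ** 2))

rng = np.random.default_rng(0)
S_true = np.abs(rng.standard_normal((64, 80)))          # placeholder magnitude spectrogram
S_pred = S_true + 0.05 * rng.standard_normal((64, 80))  # placeholder reconstruction

sc_same = spectral_convergence(S_true, S_true)   # identical inputs give zero error
sc_noisy = spectral_convergence(S_true, S_pred)  # grows with reconstruction error
```

Lower is better for both metrics, which is how the table should be read.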

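Flow_reg achieves the best audio reconstruction among the evaluated methods, with the lowest value in every audio SC and audio MSE column of the table.
Auto-encoder-based models (including the flow variants) capture audio structure better than the direct parameter-regression baselines, even when their parameter inference is less accurate.
When the parameter count rises from 16 to 32, the baselines degrade more than the flow models; the flow variants are the most robust to the higher-dimensional parameter space.
Disentangling flows provide explicit semantic dimensions that are useful as macro-controls, though possibly at a slight cost in raw audio fidelity relative to Flow_reg.
Encodings in the latent audio space form meaningful neighborhoods; in some cases, decoding parameters from this space preserves audio structure better than direct parameter inference.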
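
Audio-based preset exploration through latent neighborhoods can be sketched as a nearest-neighbor lookup. The example assumes a library of presets already encoded into latent codes and uses Euclidean distance; the function name `nearest_presets`, the latent size (8), and the random data are hypothetical and not taken from the paper.

```python
import numpy as np

def nearest_presets(z_query, z_presets, k=3):
    """Return indices and distances of the k presets closest to z_query (Euclidean)."""
    dists = np.linalg.norm(z_presets - z_query, axis=1)
    order = np.argsort(dists)[:k]
    return order, dists[order]

rng = np.random.default_rng(0)
z_presets = rng.standard_normal((100, 8))                # hypothetical encoded preset library
z_query = z_presets[42] + 0.01 * rng.standard_normal(8)  # a sound very close to preset 42

idx, dist = nearest_presets(z_query, z_presets, k=3)     # idx[0] is preset 42
```

With an invertible mapping to parameters, each neighbor found this way could then be decoded back into a synthesizer preset.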


This review was generated by AI and checked by a human editor.