QUICK REVIEW

[Paper Review] Universal audio synthesizer control with normalizing flows

Philippe Esling, Naotake Masuda | arXiv (Cornell University) | 2019-07-01

Music Technology and Sound Studies | References: 16 | Citations: 34

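One-line summary

This paper formalizes synthesizer control as learning an organized latent audio space that is invertibly mapped to the parameter space. It combines VAEs with normalizing flows and introduces regression flows and disentangling flows, enabling parameter inference, macro-control learning, and audio-based preset exploration.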

ABSTRACT

The ubiquity of sound synthesizers has reshaped music production and even entirely defined new music genres. However, the increasing complexity and number of parameters in modern synthesizers make them harder to master. Hence, the development of methods allowing to easily create and explore with synthesizers is a crucial need. Here, we introduce a novel formulation of audio synthesizer control. We formalize it as finding an organized latent audio space that represents the capabilities of a synthesizer, while constructing an invertible mapping to the space of its parameters. By using this formulation, we show that we can address simultaneously automatic parameter inference, macro-control learning and audio-based preset exploration within a single model. To solve this new formulation, we rely on Variational Auto-Encoders (VAE) and Normalizing Flows (NF) to organize and map the respective auditory and parameter spaces. We introduce the disentangling flows, which allow to perform the invertible mapping between separate latent spaces, while steering the organization of some latent dimensions to match target variation factors by splitting the objective as partial density evaluation. We evaluate our proposal against a large set of baseline models and show its superiority in both parameter inference and audio reconstruction. We also show that the model disentangles the major factors of audio variations as latent dimensions, that can be directly used as macro-parameters. We also show that our model is able to learn semantic controls of a synthesizer by smoothly mapping to its parameters. Finally, we discuss the use of our model in creative applications and its real-time implementation in Ableton Live.

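Motivation and objectives

Establish the need for an organized latent representation of a synthesizer's audio capabilities.
Provide an invertible mapping between the latent audio space and the synthesis parameter space.
Enable simultaneous parameter inference, macro-control learning, and audio-based preset exploration.
Introduce regression flows and disentangling flows to map and organize latent factors.
Demonstrate improved audio reconstruction and parameter inference over baselines.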

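Proposed method

Formalize synthesizer control as a problem in which two latent spaces are connected by an invertible mapping.
Learn an organized latent audio space z with a VAE, combined with normalizing flows to increase the expressiveness of the posterior.
Define a regression flow that maps the latent z to the synthesis parameters v under an additive Gaussian noise model.
Introduce the Flow_post and Flow_cond variants to optimize the mapping and its uncertainty.
Extend the approach with disentangling flows, which align latent dimensions with semantic tags t through supervised learning when tags are available.
Train on a dataset of the Diva synthesizer consisting of pairs of audio and MIDI-controllable parameter sets, and evaluate against baselines on parameter inference and audio reconstruction.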
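
To make the idea of an invertible latent-to-parameter mapping concrete, the following is a minimal sketch using a single affine coupling layer plus a Gaussian negative log-likelihood for the additive-noise regression term. It is an illustration only: the review does not state which flow family the paper uses, so the coupling layer is a stand-in, and the names `coupling_forward`, `coupling_inverse`, `make_weights`, and `gaussian_nll` as well as the linear conditioner are hypothetical.

```python
import math
import random


def _scale_and_shift(cond, weights):
    """Hypothetical conditioner: a linear map giving a tanh-bounded
    log-scale and a shift for each transformed dimension."""
    log_s, t = [], []
    for w_s, w_t in weights:
        log_s.append(math.tanh(sum(w * c for w, c in zip(w_s, cond))))
        t.append(sum(w * c for w, c in zip(w_t, cond)))
    return log_s, t


def coupling_forward(z, weights):
    """Map latent z to v; the first half passes through unchanged.
    Returns v and the log-determinant of the Jacobian."""
    half = len(z) // 2
    z_a, z_b = z[:half], z[half:]
    log_s, t = _scale_and_shift(z_a, weights)
    v_b = [zb * math.exp(ls) + tt for zb, ls, tt in zip(z_b, log_s, t)]
    return z_a + v_b, sum(log_s)


def coupling_inverse(v, weights):
    """Exact inverse of coupling_forward."""
    half = len(v) // 2
    v_a, v_b = v[:half], v[half:]
    log_s, t = _scale_and_shift(v_a, weights)
    z_b = [(vb - tt) * math.exp(-ls) for vb, ls, tt in zip(v_b, log_s, t)]
    return v_a + z_b


def gaussian_nll(v_true, v_pred, sigma=0.1):
    """Negative log-likelihood of v_true under N(v_pred, sigma^2 I),
    i.e. the loss implied by an additive Gaussian noise model."""
    n = len(v_true)
    sq = sum((a - b) ** 2 for a, b in zip(v_true, v_pred))
    return 0.5 * sq / sigma ** 2 + n * math.log(sigma) + 0.5 * n * math.log(2 * math.pi)


def make_weights(dim, seed=0):
    """Random (untrained) conditioner weights, for demonstration only."""
    rng = random.Random(seed)
    half = dim // 2
    return [([rng.gauss(0, 1) for _ in range(half)],
             [rng.gauss(0, 1) for _ in range(half)])
            for _ in range(dim - half)]


z = [0.3, -1.2, 0.5, 2.0]          # toy latent audio code
weights = make_weights(len(z))
v, log_det = coupling_forward(z, weights)
z_back = coupling_inverse(v, weights)
```

Because the inverse is exact, the same layer can be read in both directions: from an audio latent to parameters (inference) and from parameters back to the latent space. A trained model would stack several such layers and learn the conditioner weights.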

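Experimental results

Research questions

RQ1: If an organized latent audio space can be invertibly mapped to the parameter space, do parameter inference and audio reconstruction improve?
RQ2: Do regression flows and disentangling flows provide meaningful dimensions for macro-control learning and perceptual control?
RQ3: Is the proposed approach robust to a larger number of parameters and to out-of-domain audio?
RQ4: Can audio-based neighbor search be used to explore presets through the latent space?
RQ5: How does the model perform in a real-time application context (e.g., Ableton Live)?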

Main results

Model	Parameter MSE_n (16 params)	Audio SC (16 params)	Audio MSE (16 params)	Parameter MSE_n (32 params)	Audio SC (32 params)	Audio MSE (32 params)	Out-of-domain audio MSE
MLP	0.236 ± 0.44	6.226 ± 0.13	9.548 ± 3.1	0.218 ± 0.46	13.51 ± 3.1	36.48 ± 11.9	2.348 ± 2.1
CNN	0.171 ± 0.45	1.372 ± 0.29	6.329 ± 1.9	0.159 ± 0.46	19.18 ± 4.7	33.40 ± 9.4	2.311 ± 2.2
ResNet	0.191 ± 0.43	1.004 ± 0.35	6.422 ± 1.9	0.196 ± 0.49	10.37 ± 1.8	31.13 ± 9.8	2.322 ± 1.6
AE	0.181 ± 0.40	0.893 ± 0.13	5.557 ± 1.7	0.169 ± 0.40	5.566 ± 1.2	17.71 ± 6.9	1.225 ± 2.2
VAE	0.182 ± 0.32	0.810 ± 0.03	4.901 ± 1.4	0.153 ± 0.34	5.519 ± 1.4	16.85 ± 6.1	1.237 ± 1.3
WAE	0.159 ± 0.37	0.787 ± 0.05	4.979 ± 1.5	0.147 ± 0.33	3.967 ± 0.88	16.64 ± 6.2	1.194 ± 1.5
VAE_flow	0.199 ± 0.32	0.838 ± 0.02	4.975 ± 1.4	0.164 ± 0.34	1.418 ± 0.23	17.74 ± 6.8	1.193 ± 1.8
Flow_reg	0.197 ± 0.31	0.752 ± 0.05	4.409 ± 1.6	0.193 ± 0.32	0.911 ± 1.4	16.61 ± 7.4	1.101 ± 1.2
Flow_dis.	0.199 ± 0.31	0.831 ± 0.04	5.103 ± 2.1	0.197 ± 0.42	1.481 ± 1.8	17.12 ± 7.9	1.209 ± 1.4
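
The review does not define the audio metrics in the table. Assuming SC stands for spectral convergence computed on magnitude spectrograms (an assumption, as is the plain-list spectrogram format used here), the two audio metrics could be sketched as follows. The scaling used for the table values may differ from this sketch.

```python
import math


def spectral_convergence(ref, est):
    """Frobenius norm of the spectrogram error divided by the
    Frobenius norm of the reference spectrogram."""
    num = math.sqrt(sum((r - e) ** 2
                        for ref_row, est_row in zip(ref, est)
                        for r, e in zip(ref_row, est_row)))
    den = math.sqrt(sum(r ** 2 for ref_row in ref for r in ref_row))
    return num / den


def mse(ref, est):
    """Mean squared error over all spectrogram bins."""
    diffs = [(r - e) ** 2
             for ref_row, est_row in zip(ref, est)
             for r, e in zip(ref_row, est_row)]
    return sum(diffs) / len(diffs)


# Toy 2x2 "spectrograms" (frames x frequency bins), for illustration only.
reference = [[1.0, 2.0], [2.0, 4.0]]
silence = [[0.0, 0.0], [0.0, 0.0]]
```

Under this definition a perfect reconstruction scores 0 on both metrics, and an all-zero reconstruction scores a spectral convergence of 1, so lower values indicate better reconstruction.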

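The Flow_reg model achieves the best audio reconstruction among the evaluated methods, with the lowest audio SC and audio MSE in both the 16- and 32-parameter settings and the lowest out-of-domain audio MSE.
AE-based models (including the Flow variants) capture audio structure better than the direct parameter-regression baselines, even though their parameter inference is not more accurate.
When the number of parameters increases from 16 to 32, the baselines tend to degrade more than the flows, and the Flow variants show the strongest resilience to the higher-dimensional parameter space. For example, audio SC for the CNN rises from 1.372 to 19.18, while for Flow_reg it rises only from 0.752 to 0.911.
Disentangling flows provide explicit semantic dimensions that are useful for macro-control, but raw audio fidelity can be slightly lower than with Flow_reg.
Encodings in the latent audio space form meaningful neighborhoods, and decoding parameters from this space preserves audio structure better than direct parameter inference in some cases.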
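
Audio-based preset exploration through latent neighborhoods can be illustrated with a plain nearest-neighbor lookup. This is a minimal sketch under stated assumptions: the function `nearest_presets`, the preset names, and the two-dimensional latent codes are all hypothetical, and Euclidean distance is only one plausible choice of metric in the latent space.

```python
import math


def nearest_presets(query_z, preset_latents, k=3):
    """Return the names of the k presets whose latent codes lie
    closest (Euclidean distance) to the query latent code."""
    ranked = sorted((math.dist(query_z, z), name)
                    for name, z in preset_latents.items())
    return [name for _, name in ranked[:k]]


# Hypothetical latent codes, as if produced by encoding each preset's audio.
presets = {
    "bass": [0.0, 0.0],
    "pad": [1.0, 1.0],
    "lead": [3.0, 0.0],
}
query = [0.9, 0.8]  # latent code of an audio example to match
closest = nearest_presets(query, presets, k=2)
```

In a full system the query code would come from encoding a target sound, and the returned presets would be candidates that sound similar to it.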


This review was generated by AI and reviewed by a human editor.