QUICK REVIEW

[Paper Review] DDSP: Differentiable Digital Signal Processing

Jesse Engel, Lamtharn Hantrakul | arXiv (Cornell University) | 2020-01-14

Music and Audio Processing | References: 37 | Citations: 77

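One-Line Summary

This paper introduces the DDSP library, which integrates differentiable DSP components (oscillators, envelopes, filters, reverb) with neural networks to achieve high-fidelity audio synthesis with interpretable, modular control over pitch, loudness, and timbre, without relying on large autoregressive models or adversarial training.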

ABSTRACT

Most generative models of audio directly generate samples in one of two domains: time or frequency. While sufficient to express any signal, these representations are inefficient, as they do not utilize existing knowledge of how sound is generated and perceived. A third approach (vocoders/synthesizers) successfully incorporates strong domain knowledge of signal processing and perception, but has been less actively researched due to limited expressivity and difficulty integrating with modern auto-differentiation-based machine learning methods. In this paper, we introduce the Differentiable Digital Signal Processing (DDSP) library, which enables direct integration of classic signal processing elements with deep learning methods. Focusing on audio synthesis, we achieve high-fidelity generation without the need for large autoregressive models or adversarial losses, demonstrating that DDSP enables utilizing strong inductive biases without losing the expressive power of neural networks. Further, we show that combining interpretable modules permits manipulation of each separate model component, with applications such as independent control of pitch and loudness, realistic extrapolation to pitches not seen during training, blind dereverberation of room acoustics, transfer of extracted room acoustics to new environments, and transformation of timbre between disparate sources. In short, DDSP enables an interpretable and modular approach to generative modeling, without sacrificing the benefits of deep learning. The library is publicly available at https://github.com/magenta/ddsp and we welcome further contributions from the community and domain experts.

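Motivation and Objectives

Motivate and enable end-to-end learning that exploits the strong inductive biases of classic DSP for audio synthesis.
Develop a modular, differentiable toolkit (DDSP) that combines oscillators, envelopes, filters, and reverb with neural networks.
Show that DDSP enables independent control of pitch and loudness, extrapolation to pitches unseen during training, and timbre transfer.
Show that DDSP can achieve high-quality synthesis with smaller models than autoregressive or GAN-based baselines.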

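Proposed Method

Implement additive synthesis with a bank of differentiable oscillators driven by a time-varying fundamental frequency f0(n) and harmonic amplitudes A_k(n) = A(n) c_k(n), where A(n) is the global amplitude and c_k(n) is a normalized distribution over the harmonics.
Upsample the slow frame-rate controls to audio rate with smoothed amplitude envelopes, avoiding audible artifacts.
Design time-varying linear-phase FIR filters with the frequency sampling method, where the network predicts a transfer function H_l for each frame l.
Combine additive (harmonic) synthesis with filtered-noise (subtractive) synthesis to form a Harmonic plus Noise model.
Introduce a differentiable reverb, implemented as frequency-domain convolution, to model long impulse responses.
Train the autoencoder with a multi-scale spectral loss, L_i = ||S_i - Ŝ_i||_1 + α ||log S_i - log Ŝ_i||_1, summed over several FFT sizes i, where S_i and Ŝ_i are the magnitude spectrograms of the original and the synthesized audio.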
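The additive synthesis and multi-scale spectral loss steps listed above can be sketched in plain NumPy. This is an illustrative forward-pass sketch, not the DDSP library's API. All function names are hypothetical, the sample rate, FFT sizes, overlap, and α are assumed values, and plain linear interpolation stands in for the smoothed-envelope upsampling. NumPy does not compute gradients, so a real training setup would use an auto-differentiation framework instead.

```python
import numpy as np

SAMPLE_RATE = 16000  # assumed sample rate for this sketch


def upsample(frames, n_samples):
    """Linearly interpolate frame-rate controls, shape (n_frames,) or (n_frames, K)."""
    frames = np.asarray(frames, dtype=np.float64)
    t_frames = np.linspace(0.0, 1.0, frames.shape[0])
    t_audio = np.linspace(0.0, 1.0, n_samples)
    if frames.ndim == 1:
        return np.interp(t_audio, t_frames, frames)
    return np.stack([np.interp(t_audio, t_frames, col) for col in frames.T], axis=1)


def harmonic_synth(f0_frames, amp_frames, c_frames, n_samples, sample_rate=SAMPLE_RATE):
    """Additive synthesis: x(n) = sum_k A(n) c_k(n) sin(phi_k(n))."""
    f0 = upsample(f0_frames, n_samples)          # (n_samples,)
    amp = upsample(amp_frames, n_samples)        # (n_samples,)
    c = upsample(c_frames, n_samples)            # (n_samples, K)
    k = np.arange(1, c.shape[1] + 1)             # harmonic numbers 1..K
    harmonic_freqs = f0[:, None] * k[None, :]    # f_k(n) = k * f0(n)
    # Silence harmonics at or above Nyquist to avoid aliasing.
    c = np.where(harmonic_freqs < sample_rate / 2, c, 0.0)
    # Phase is the running sum of the instantaneous angular frequency.
    phase = 2.0 * np.pi * np.cumsum(harmonic_freqs / sample_rate, axis=0)
    return np.sum(amp[:, None] * c * np.sin(phase), axis=1)


def magnitude_stft(x, fft_size, overlap=0.75):
    """Magnitude spectrogram from Hann-windowed frames (no padding)."""
    hop = int(fft_size * (1.0 - overlap))
    window = np.hanning(fft_size)
    n_frames = 1 + (len(x) - fft_size) // hop
    frames = np.stack([x[i * hop:i * hop + fft_size] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))


def multiscale_spectral_loss(x, x_hat, fft_sizes=(2048, 1024, 512, 256, 128, 64),
                             alpha=1.0, eps=1e-7):
    """Sum over FFT sizes of ||S - S_hat||_1 + alpha * ||log S - log S_hat||_1."""
    total = 0.0
    for n in fft_sizes:
        s, s_hat = magnitude_stft(x, n), magnitude_stft(x_hat, n)
        total += np.sum(np.abs(s - s_hat))
        total += alpha * np.sum(np.abs(np.log(s + eps) - np.log(s_hat + eps)))
    return total


# Usage: one second of a decaying 440 Hz tone with three harmonics.
n_frames = 100
f0_frames = np.full(n_frames, 440.0)
amp_frames = np.linspace(0.5, 0.0, n_frames)
c_frames = np.tile(np.array([0.5, 0.3, 0.2]), (n_frames, 1))  # each row sums to 1
audio = harmonic_synth(f0_frames, amp_frames, c_frames, SAMPLE_RATE)
loss_same = multiscale_spectral_loss(audio, audio)
loss_scaled = multiscale_spectral_loss(audio, 0.5 * audio)
```

Because each row of c sums to 1, the global amplitude A(n) alone bounds the output level, which is what lets loudness be controlled separately from the harmonic distribution.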

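Experimental Results

Research Questions

RQ1: Can differentiable DSP components enable end-to-end learning for high-fidelity audio synthesis without autoregressive models or adversarial losses?
RQ2: Does the modular DDSP architecture allow independent control of pitch, loudness, and timbre, and support extrapolation to unseen conditions?
RQ3: Can room acoustics (reverb) be explicitly separated from source generation, enabling tasks such as blind dereverberation and acoustic transfer?
RQ4: Can a compact DDSP-based autoencoder compete with state-of-the-art neural vocoders in quality and efficiency?
RQ5: How do DDSP components contribute to interpretable and controllable music/audio generation?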

Key Results

Model	Loudness L1	F0 L1	F0 outliers
WaveRNN (baseline)	0.10	1.00	0.07
DDSP Autoencoder (Supervised)	0.07	0.02	0.003
DDSP Autoencoder (Unsupervised)	0.09	0.80	0.04

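The DDSP autoencoder achieves high-fidelity resynthesis of solo violin and NSynth-style data with a much smaller model than WaveRNN and similar baselines.
The supervised DDSP autoencoder beats WaveRNN on F0 L1 error (0.02 vs. 1.00) and also has a lower loudness L1 error (0.07 vs. 0.10) on NSynth.
The unsupervised DDSP autoencoder, trained with a perceptual CREPE loss, learns meaningful F0 and timbre without explicit pitch conditioning and still edges out WaveRNN on F0 L1 (0.80 vs. 1.00) and F0 outliers (0.04 vs. 0.07).
Independent control of pitch and loudness is demonstrated through separate conditioning on f(t) and l(t); z(t) encodes timbre, and interpolation yields perceptually smooth transitions.
Separating out the room impulse response enables blind dereverberation, and applying the learned reverb to new audio enables acoustic transfer to a different environment.
Timbre transfer from singing voice to violin is demonstrated by driving the violin model with the F0 and loudness extracted from the voice and applying the reverb learned from the violin recordings.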
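The acoustic-transfer result above amounts to convolving new audio with an impulse response, which can be done as a multiplication in the frequency domain. The following is a minimal NumPy sketch of that idea. The function names are hypothetical, and the impulse response here is synthetic decaying noise rather than one learned by a model.

```python
import numpy as np


def fft_convolve(audio, impulse_response):
    """Linear convolution computed as a product of zero-padded spectra."""
    n_out = len(audio) + len(impulse_response) - 1
    n_fft = 1 << (n_out - 1).bit_length()  # next power of two >= n_out
    spectrum = np.fft.rfft(audio, n_fft) * np.fft.rfft(impulse_response, n_fft)
    return np.fft.irfft(spectrum, n_fft)[:n_out]


def apply_reverb(dry, impulse_response):
    """Convolve with the impulse response and trim to the dry signal's length."""
    return fft_convolve(dry, impulse_response)[:len(dry)]


# Usage with a synthetic impulse response (assumption: exponentially decaying noise).
rng = np.random.default_rng(0)
dry = rng.standard_normal(1000)
ir = np.exp(-np.linspace(0.0, 8.0, 400)) * rng.standard_normal(400)
wet = apply_reverb(dry, ir)
```

Zero-padding both signals to at least len(audio) + len(impulse_response) - 1 samples keeps the result a linear rather than circular convolution, which matters for impulse responses that are long relative to the audio.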

This review was generated by AI and checked by a human editor.