QUICK REVIEW

[Paper Review] RAVE: A variational autoencoder for fast and high-quality neural audio synthesis

Antoine Caillon, Philippe Esling | arXiv (Cornell University) | November 9, 2021

Speech and Audio Processing | Citations: 40

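One-Line Summary

RAVE combines a two-stage VAE training procedure (representation learning followed by adversarial fine-tuning) with a multi-band waveform decomposition to synthesize high-quality 48kHz audio roughly 20 times faster than real time on a CPU.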

ABSTRACT

Deep generative models applied to audio have improved by a large margin the state-of-the-art in many speech and music related tasks. However, as raw waveform modelling remains an inherently difficult task, audio generative models are either computationally intensive, rely on low sampling rates, are complicated to control or restrict the nature of possible signals. Among those models, Variational AutoEncoders (VAE) give control over the generation by exposing latent variables, although they usually suffer from low synthesis quality. In this paper, we introduce a Realtime Audio Variational autoEncoder (RAVE) allowing both fast and high-quality audio waveform synthesis. We introduce a novel two-stage training procedure, namely representation learning and adversarial fine-tuning. We show that using a post-training analysis of the latent space allows a direct control between the reconstruction fidelity and the representation compactness. By leveraging a multi-band decomposition of the raw waveform, we show that our model is the first able to generate 48kHz audio signals, while simultaneously running 20 times faster than real-time on a standard laptop CPU. We evaluate synthesis quality using both quantitative and qualitative subjective experiments and show the superiority of our approach compared to existing models. Finally, we present applications of our model for timbre transfer and signal compression. All of our source code and audio examples are publicly available.

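Motivation and Objectives

Enable fast, high-quality neural audio synthesis without heavy autoregressive generation.
Develop a VAE-based framework that balances reconstruction fidelity against the compactness of the latent space.
Use a multi-band waveform decomposition to make 48kHz audio synthesis possible at low computational cost.
Provide a post-training latent space analysis method that identifies the most informative latent dimensions.
Demonstrate applications to timbre transfer and signal compression.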

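Proposed Method

Proposes a two-stage training procedure: a regular VAE is first trained for representation learning, then fine-tuned with an adversarial generation objective.
Uses a multi-band decomposition of the raw waveform to reduce the temporal dimensionality and make 48kHz synthesis feasible.
During representation learning (stage 1), the encoder is optimized with a multiscale spectral loss.
In stage 2, the encoder is frozen and the decoder is trained with a hinge-GAN objective together with spectral and feature-matching losses.
After training, a singular value decomposition (SVD) separates informative latent dimensions from non-informative ones, enabling variable-resolution reconstruction.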
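
The post-training analysis can be illustrated with a small sketch. It assumes that the singular values of the latent codes are already available and that fidelity is measured as the cumulative share of variance explained by the leading dimensions; the function name `dims_for_fidelity` and the example values are hypothetical and are not taken from the paper's code.

```python
def dims_for_fidelity(singular_values, fidelity):
    """Return the smallest number of leading latent dimensions whose
    cumulative share of variance reaches `fidelity` (0 < fidelity <= 1).

    Assumption: variance along each direction is proportional to the
    squared singular value, as in a standard SVD/PCA analysis.
    """
    if not 0.0 < fidelity <= 1.0:
        raise ValueError("fidelity must be in (0, 1]")
    variances = sorted((s * s for s in singular_values), reverse=True)
    total = sum(variances)
    if total == 0:
        raise ValueError("singular values must not all be zero")
    cumulative = 0.0
    for kept, variance in enumerate(variances, start=1):
        cumulative += variance
        # Small tolerance guards against floating-point round-off at f = 1.0.
        if cumulative / total >= fidelity - 1e-12:
            return kept
    return len(variances)


if __name__ == "__main__":
    # Hypothetical singular values for a 4-dimensional latent space.
    example = [3.0, 2.0, 1.0, 0.1]
    for f in (0.6, 0.9, 0.95, 1.0):
        print(f, dims_for_fidelity(example, f))
```

Lowering the fidelity target keeps fewer dimensions, which is the trade-off between reconstruction quality and representation compactness that the summary describes.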

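Experimental Results

Research Questions

RQ1: Can a VAE-based model achieve high-quality 48kHz audio synthesis at real-time or near-real-time speed on a CPU?
RQ2: How can the latent space be analyzed and pruned after training to balance reconstruction fidelity against representation compactness?
RQ3: Does adversarial fine-tuning after representation learning improve perceptual quality without damaging the learned latent structure?
RQ4: Does a multi-band waveform decomposition enable synthesis at high sampling rates with a manageable computational cost?
RQ5: Can the model perform timbre transfer and signal compression without supervision?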

Key Results

RAVE synthesizes 48kHz audio 20 times faster than real time on a standard laptop CPU.
In a 15-trial MOS study on the Strings dataset, RAVE scored 3.01, compared with 2.68 for NSynth and 1.15 for SING.
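RAVE uses 17.6M parameters, considerably fewer than the baseline models.
The 16-band multi-band decomposition enables high-quality 48kHz synthesis at a low computational cost.
Post-training latent space analysis with SVD yields a fidelity parameter f that controls reconstruction quality while substantially reducing the number of latent dimensions.
RAVE supports timbre transfer and latent-space-based signal compression.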
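
To make the speed and multi-band figures concrete, the following sketch works through the arithmetic. It assumes that "20 times faster than real time" means one second of audio takes 1/20 of a second to synthesize, and that each of the 16 bands is downsampled by the number of bands (a critically sampled decomposition); the function names are illustrative only.

```python
SAMPLE_RATE_HZ = 48_000   # output sampling rate reported in the review
REALTIME_FACTOR = 20      # reported speed relative to real time
N_BANDS = 16              # number of sub-bands reported in the review


def synthesis_seconds(audio_seconds, realtime_factor=REALTIME_FACTOR):
    """Wall-clock time needed to synthesize `audio_seconds` of audio."""
    return audio_seconds / realtime_factor


def samples_per_compute_second(sample_rate=SAMPLE_RATE_HZ,
                               realtime_factor=REALTIME_FACTOR):
    """Audio samples produced per second of computation."""
    return sample_rate * realtime_factor


def per_band_rate(sample_rate=SAMPLE_RATE_HZ, n_bands=N_BANDS):
    """Sampling rate of each sub-band, assuming decimation by n_bands."""
    return sample_rate // n_bands


if __name__ == "__main__":
    print(synthesis_seconds(60.0))        # time to render one minute of audio
    print(samples_per_compute_second())   # output samples per compute second
    print(per_band_rate())                # time steps per second in each band
```

Under these assumptions the network operates on sequences 16 times shorter than the raw waveform, which is where the reduction in temporal dimensionality comes from.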

This review was generated by AI and reviewed by a human editor.