QUICK REVIEW

[Paper Review] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak|arXiv (Cornell University)|2023. 06. 01.

Music and Audio Processing | Citations: 11

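One-line summary

Vocos uses an isotropic generator with no upsampling layers to produce Fourier spectral coefficients directly and converts them to a waveform with the inverse STFT. This yields fast, high-quality audio synthesis that is far faster than time-domain GANs while matching state-of-the-art vocoders in quality.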

ABSTRACT

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally-intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

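Motivation and Objectives

Motivate and develop a Fourier-based neural vocoder that preserves perceptual audio quality.
Remove learnable upsampling layers and reconstruct the waveform with the ISTFT instead.
Use ConvNeXt blocks to model patterns in the Fourier domain.
Substantially increase inference speed while achieving competitive or superior objective and subjective audio quality.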
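To make the second objective concrete, the sketch below shows how an inverse STFT turns a short sequence of frame-level coefficients into a longer waveform by overlap-add, with no learned upsampling involved. It is a minimal pure-Python illustration, not the Vocos implementation: the frame length `N_FFT`, the hop size `HOP`, the Hann window, and all helper names are assumptions chosen for readability, and a real system would use an FFT library.

```python
import cmath
import math

N_FFT = 8  # frame length (illustrative value, not the paper's setting)
HOP = 2    # hop size between frames (illustrative value)

# Periodic Hann window, used for both analysis and synthesis.
WINDOW = [0.5 - 0.5 * math.cos(2 * math.pi * n / N_FFT) for n in range(N_FFT)]


def dft(frame):
    """Naive O(n^2) discrete Fourier transform of one frame."""
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(coeffs):
    """Naive inverse DFT; the input frames are real, so keep the real part."""
    n = len(coeffs)
    return [(sum(coeffs[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]


def stft(signal):
    """Windowed DFT of overlapping frames (no padding)."""
    return [dft([signal[start + n] * WINDOW[n] for n in range(N_FFT)])
            for start in range(0, len(signal) - N_FFT + 1, HOP)]


def istft(frames):
    """Inverse DFT per frame, then windowed overlap-add with normalization."""
    length = N_FFT + HOP * (len(frames) - 1)
    out = [0.0] * length
    norm = [0.0] * length
    for i, coeffs in enumerate(frames):
        frame = idft(coeffs)
        for n in range(N_FFT):
            out[i * HOP + n] += frame[n] * WINDOW[n]
            norm[i * HOP + n] += WINDOW[n] ** 2
    # Divide by the summed squared window wherever it is non-zero.
    return [o / w if w > 1e-8 else 0.0 for o, w in zip(out, norm)]


signal = [math.sin(2 * math.pi * 3 * t / 32) for t in range(32)]
frames = stft(signal)
rebuilt = istft(frames)
print(len(frames), "frames ->", len(rebuilt), "samples")
```

Here the time axis grows from the number of frames to the number of samples purely through the fixed overlap-add step, which is the role the ISTFT plays in place of learned upsampling layers.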

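Proposed Method

Vocos is a GAN-based vocoder that outputs STFT coefficients (m, p) and reconstructs the waveform with the inverse STFT.
It uses a ConvNeXt-based generator that keeps an isotropic (constant) temporal resolution and forms complex-valued STFT coefficients from magnitude and phase terms (M, x, y).
The phase is expressed as the wrapped angle ϕ = atan2(y, x), which guarantees correct wrapping to (-π, π].
Training combines a hinge adversarial loss, a mel-spectrogram reconstruction loss, and a feature-matching loss over multiple discriminators (MPD and MRD).
Upsampling is done by the ISTFT rather than by transposed convolutions (ConvTranspose), which gives an isotropic architecture and reduces aliasing artifacts.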
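The mapping from the generator outputs to a complex coefficient can be sketched for a single time-frequency bin as follows. This is a minimal illustration under stated assumptions: the summary names the symbols (m, p) and (M, x, y) but does not spell out how they relate, so the sketch assumes M = exp(m), x = cos(p), and y = sin(p). The function name `fourier_head` is hypothetical.

```python
import math


def fourier_head(m, p):
    """Map raw outputs (m, p) for one bin to a complex STFT coefficient.

    Assumed relations (not stated in the summary): M = exp(m),
    x = cos(p), y = sin(p).
    """
    magnitude = math.exp(m)          # M: always positive
    x, y = math.cos(p), math.sin(p)  # a point on the unit circle
    phase = math.atan2(y, x)         # wrapped angle phi = atan2(y, x)
    coeff = complex(magnitude * x, magnitude * y)  # M * (x + jy)
    return coeff, phase


# p = 4.0 lies outside the principal range, so the wrapped phase is 4.0 - 2*pi.
coeff, phase = fourier_head(m=0.5, p=4.0)
print(abs(coeff), phase)
```

Because x and y always lie on the unit circle, any real value of p yields a valid phase, so the network output needs no explicit clipping or modulo operation.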

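Experimental Results

Research Questions

RQ1: Can a GAN that directly models Fourier-domain coefficients reproduce audio of the same high quality as time-domain vocoders?
RQ2: Does avoiding upsampling layers and using the ISTFT give a marked speedup without harming perceptual quality?
RQ3: What is the effect of ConvNeXt blocks versus conventional ResBlocks in Fourier-domain vocoding?
RQ4: How does a Fourier-based vocoder compare with EnCodec-like neural codecs on objective and perceptual metrics?
RQ5: Is the phase-wrapping strategy sufficient to recover perceptually faithful complex-valued spectrograms?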

Key Results

Model	UTMOS (↑)	VISQOL (↑)	PESQ (↑)	V/UV F1 (↑)	Periodicity (↓)
Ground truth	4.058	–	–	–	–
HiFi-GAN	3.669	4.57	3.093	0.9457	0.129
iSTFTNet	3.564	4.56	2.942	0.9372	0.141
BigVGAN	3.749	4.65	3.693	0.9557	0.108
Vocos	3.734	4.66	3.70	0.9582	0.101
w/o ConvNeXt	3.658	4.65	3.528	0.9534	0.109

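On the LibriTTS-derived evaluation, Vocos achieves state-of-the-art or near state-of-the-art perceptual scores: it has the highest VISQOL (4.66) and PESQ (3.70) in the table, and its UTMOS (3.734) is just behind BigVGAN's (3.749).
Vocos mitigates periodicity artifacts more effectively than the GAN baselines HiFi-GAN, iSTFTNet, and BigVGAN (periodicity score 0.101 versus 0.108 to 0.141).
Vocos uses ConvNeXt blocks throughout; replacing them with ResBlocks degrades performance slightly (UTMOS 3.734 to 3.658, PESQ 3.70 to 3.528).
Inference is much faster: on a GPU, Vocos is about 13 times faster than HiFi-GAN and about 70 times faster than BigVGAN, owing to ISTFT-based upsampling.
On MUSDB18 and on out-of-distribution singing voice, Vocos attains higher perceptual quality (VISQOL) than the competing models.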
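The ranking claims can be checked directly against the results table. The short sketch below copies the table values (the Ground truth row is omitted because it reports only UTMOS) and picks the best model per metric, treating Periodicity as lower-is-better as the arrows in the header indicate. The names `SCORES` and `best_model` are illustrative.

```python
# Scores copied from the results table; the Ground truth row is omitted.
SCORES = {
    "HiFi-GAN":     {"UTMOS": 3.669, "VISQOL": 4.57, "PESQ": 3.093, "V/UV F1": 0.9457, "Periodicity": 0.129},
    "iSTFTNet":     {"UTMOS": 3.564, "VISQOL": 4.56, "PESQ": 2.942, "V/UV F1": 0.9372, "Periodicity": 0.141},
    "BigVGAN":      {"UTMOS": 3.749, "VISQOL": 4.65, "PESQ": 3.693, "V/UV F1": 0.9557, "Periodicity": 0.108},
    "Vocos":        {"UTMOS": 3.734, "VISQOL": 4.66, "PESQ": 3.70,  "V/UV F1": 0.9582, "Periodicity": 0.101},
    "w/o ConvNeXt": {"UTMOS": 3.658, "VISQOL": 4.65, "PESQ": 3.528, "V/UV F1": 0.9534, "Periodicity": 0.109},
}

# Metrics marked with a down arrow in the table header.
LOWER_IS_BETTER = {"Periodicity"}


def best_model(metric):
    """Return the model with the best score for the given metric."""
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(SCORES, key=lambda model: SCORES[model][metric])


for metric in ["UTMOS", "VISQOL", "PESQ", "V/UV F1", "Periodicity"]:
    print(f"{metric}: {best_model(metric)}")
```

On these numbers, Vocos leads on every metric except UTMOS, where BigVGAN is marginally ahead.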


This review was generated by AI and checked by a human editor.