QUICK REVIEW

[Paper Review] EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding

Luca Cerovaz, Michele Mancusi|arXiv (Cornell University)|2026. 01. 24.

Speech Recognition and Synthesis | Citations: 0

One-Line Summary

This work presents EuleroDec, a fully end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling at 6 and 12 kbps without adversarial training or diffusion post-filters.

ABSTRACT

Audio codecs power discrete music generative modelling, music streaming and immersive media by shrinking PCM audio to bandwidth-friendly bit-rates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram-domains typically struggle with phase modeling which is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. This entails the need to introduce adversarial discriminators at the expense of convergence speed and training stability to compensate for the inadequate representation power of the audio signal. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance. Compared to standard baselines that train for hundreds of thousands of steps, our model reducing training budget by an order of magnitude is markedly more compute-efficient while preserving high perceptual quality.

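Motivation and Goals

Motivate robust, high-quality audio coding that preserves phase information in the spectral domain.
Develop a fully end-to-end complex-valued RVQ-VAE pipeline, from waveform input to waveform reconstruction.
Remove the dependence on adversarial discriminators and diffusion post-filters while matching or surpassing the baselines.
Demonstrate fast, stable training with higher compute efficiency than much longer-trained baselines.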

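Proposed Method

Operates entirely in the complex domain, using complex-valued convolution, normalization, activation, and attention.
Applies residual vector quantization (RVQ) with 2048-entry codebooks across multiple stages to encode the latent representation.
Preserves magnitude-phase coupling by processing the complex STFT spectrum directly, without decomposing it into real-valued streams.
Applies 2×2 whitening and complex axial attention to preserve the algebraic structure and phase information of the STFT.
Trains with Wirtinger calculus and achieves high perceptual quality while avoiding adversarial training and diffusion-based post-filters.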
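
The 2×2 whitening step above can be illustrated with a minimal sketch. It assumes the whitening follows the common complex batch-normalization recipe, in which the real and imaginary parts of each value are treated as a 2-D vector and multiplied by the inverse square root of their 2×2 covariance matrix. This summary does not spell out EuleroDec's exact formulation, so the function name `whiten_complex`, the `eps` parameter, and the sample values are hypothetical.

```python
import math

def whiten_complex(values, eps=1e-5):
    """Whiten complex samples so that (real, imag) has roughly identity covariance.

    Hypothetical helper: a sketch of generic 2x2 whitening, not the paper's code.
    """
    n = len(values)
    mean = sum(values) / n
    centered = [v - mean for v in values]

    # Entries of the 2x2 covariance matrix V = [[vrr, vri], [vri, vii]].
    # eps on the diagonal keeps V strictly positive definite.
    vrr = sum(c.real * c.real for c in centered) / n + eps
    vii = sum(c.imag * c.imag for c in centered) / n + eps
    vri = sum(c.real * c.imag for c in centered) / n

    # Closed-form inverse square root of a symmetric positive-definite 2x2 matrix:
    # sqrt(V) = (V + s*I) / t, with s = sqrt(det V) and t = sqrt(trace V + 2*s).
    s = math.sqrt(vrr * vii - vri * vri)
    t = math.sqrt(vrr + vii + 2.0 * s)
    scale = 1.0 / (s * t)
    wrr = (vii + s) * scale
    wii = (vrr + s) * scale
    wri = -vri * scale

    # Apply W = V^(-1/2) to each centered (real, imag) pair.
    return [
        complex(wrr * c.real + wri * c.imag, wri * c.real + wii * c.imag)
        for c in centered
    ]

# Usage with arbitrary sample values: real and imaginary parts start out correlated.
samples = [1 + 2j, 2 + 3.5j, -1 + 0.5j, 0.5 - 2j, 3 + 1j, -2 - 1j]
whitened = whiten_complex(samples)
```

Because the real and imaginary parts are whitened jointly rather than normalized as two independent channels, their correlation is removed along with their scale. That is one plausible reading of why this kind of normalization suits a magnitude-phase-coupled representation.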

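Experimental Results

Research Questions

RQ1: Can a fully end-to-end complex-valued neural codec achieve state-of-the-art audio quality at low bitrates without GANs or diffusion post-filters?
RQ2: Does preserving magnitude-phase coupling throughout the analysis-quantization-synthesis pipeline improve reconstruction fidelity and generalization?
RQ3: At 6-12 kbps, what advantages in performance and training efficiency does a complex-valued RVQ-VAE offer over real-valued or mixed-domain approaches?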

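Key Results

Improves results both in-domain and out-of-domain at 6 and 12 kbps, without adversarial discriminators or diffusion post-filters.
Uses a complex-valued RVQ-VAE with 2048-entry codebooks and 12 quantization stages, achieving efficient code utilization without codebook collapse.
Converges faster and more stably than recent baselines, cutting the training budget by 95%.
Maintains high perceptual quality by using complex-valued networks and Wirtinger calculus, which preserve magnitude-phase coupling across the whole pipeline.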
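
The residual vector quantization and the bit budget implied by the figures above (2048-entry codebooks, 12 stages) can be sketched as follows. This is a toy, pure-Python illustration with randomly initialised complex codebooks, not the authors' implementation. The latent dimension `DIM`, the per-stage scale decay, the seed, and all helper names are hypothetical. Each stage picks the codeword nearest to the current residual, so every index costs log2(2048) = 11 bits and a frame quantized with all 12 stages costs 132 bits. The frame rate needed to reach 6 or 12 kbps is not given in this summary, so it is left out.

```python
import math
import random

CODEBOOK_SIZE = 2048  # from the review
NUM_STAGES = 12       # from the review
DIM = 4               # hypothetical latent dimension, for illustration only

def make_codebooks(rng, scale_decay=0.5):
    """Random complex codebooks; shrinking the scale per stage is an assumption."""
    books = []
    for stage in range(NUM_STAGES):
        scale = scale_decay ** stage
        books.append([
            [complex(rng.gauss(0, scale), rng.gauss(0, scale)) for _ in range(DIM)]
            for _ in range(CODEBOOK_SIZE)
        ])
    return books

def sq_dist(a, b):
    """Squared Euclidean distance between two complex vectors."""
    return sum(abs(x - y) ** 2 for x, y in zip(a, b))

def rvq_encode(vec, books):
    """Quantize stage by stage: each stage encodes the previous stage's residual."""
    residual = list(vec)
    indices = []
    for book in books:
        idx = min(range(len(book)), key=lambda i: sq_dist(residual, book[i]))
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, book[idx])]
    return indices, residual

def rvq_decode(indices, books):
    """Reconstruct by summing the selected codeword from every stage."""
    out = [0j] * DIM
    for idx, book in zip(indices, books):
        out = [o + c for o, c in zip(out, book[idx])]
    return out

bits_per_index = math.log2(CODEBOOK_SIZE)     # 11 bits per stage index
bits_per_frame = NUM_STAGES * bits_per_index  # 132 bits per quantized frame

# Usage: encode one random latent vector and reconstruct it.
rng = random.Random(0)
books = make_codebooks(rng)
latent = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(DIM)]
indices, residual = rvq_encode(latent, books)
reconstruction = rvq_decode(indices, books)
```

By construction the reconstruction plus the final residual equals the input, so the residual is exactly the quantization error. Random codebooks give no guarantee that the error shrinks at every stage; a trained codec learns its codebooks so that it does.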

This review was generated by AI and reviewed by a human editor.