QUICK REVIEW

[Paper Review] EuleroDec: A Complex-Valued RVQ-VAE for Efficient and Robust Audio Coding

Luca Cerovaz, Michele Mancusi|arXiv (Cornell University)|Jan 24, 2026

Speech Recognition and Synthesis|Citations: 0

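One-line summary

This study presents EuleroDec, a fully end-to-end complex-valued RVQ-VAE for audio coding at 6 kbps and 12 kbps that preserves magnitude-phase coupling without adversarial training or diffusion post-filters.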

ABSTRACT

Audio codecs power discrete music generative modelling, music streaming and immersive media by shrinking PCM audio to bandwidth-friendly bit-rates. Recent works have gravitated towards processing in the spectral domain; however, spectrogram-domains typically struggle with phase modeling which is naturally complex-valued. Most frequency-domain neural codecs either disregard phase information or encode it as two separate real-valued channels, limiting spatial fidelity. This entails the need to introduce adversarial discriminators at the expense of convergence speed and training stability to compensate for the inadequate representation power of the audio signal. In this work we introduce an end-to-end complex-valued RVQ-VAE audio codec that preserves magnitude-phase coupling across the entire analysis-quantization-synthesis pipeline and removes adversarial discriminators and diffusion post-filters. Without GANs or diffusion we match or surpass much longer-trained baselines in-domain and reach SOTA out-of-domain performance. Compared to standard baselines that train for hundreds of thousands of steps, our model reduces the training budget by an order of magnitude, making it markedly more compute-efficient while preserving high perceptual quality.

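Motivation and objectives

Motivate robust, high-quality audio coding that preserves phase information in the spectral domain.
Develop a fully end-to-end complex-valued RVQ-VAE pipeline spanning waveform input through waveform reconstruction.
Eliminate reliance on adversarial discriminators and diffusion post-filters while maintaining performance at or above the baselines.
Demonstrate fast, stable training with substantially better compute efficiency than long-trained baselines.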

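Proposed method

Operates exclusively in the complex domain, using complex-valued convolution, normalization, activation, and attention.
Encodes the latent representation with Residual Vector Quantization over multiple stages, each with a 2048-entry codebook.
Processes the STFT-based complex spectrum without splitting it into separate real-valued streams, preserving magnitude-phase coupling.
Applies 2×2 whitening and complex axial attention to preserve the algebraic structure of the STFT and its phase information.
Trains with Wirtinger calculus, achieving high perceptual quality while avoiding adversarial training and diffusion-based post-filters.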
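
The review does not spell out how the 2×2 whitening is computed. A common form in complex-valued networks treats each value's real and imaginary parts as a 2-D vector and multiplies the centered data by the inverse square root of their 2×2 covariance matrix. The sketch below illustrates that general idea only; it is an assumption, not the authors' implementation, and the names `whiten_2x2` and `covariance_2x2` are hypothetical.

```python
import math
import random


def covariance_2x2(zs):
    """Return (Vrr, Vri, Vii): covariance of the real and imaginary parts."""
    n = len(zs)
    mean = sum(zs) / n
    centered = [z - mean for z in zs]
    vrr = sum(c.real * c.real for c in centered) / n
    vri = sum(c.real * c.imag for c in centered) / n
    vii = sum(c.imag * c.imag for c in centered) / n
    return vrr, vri, vii


def whiten_2x2(zs, eps=1e-5):
    """Whiten complex samples by treating (real, imag) as a 2-D vector."""
    n = len(zs)
    mean = sum(zs) / n
    centered = [z - mean for z in zs]

    # 2x2 covariance matrix V; eps on the diagonal keeps it invertible.
    vrr, vri, vii = covariance_2x2(zs)
    vrr += eps
    vii += eps

    # Closed-form inverse square root of the symmetric 2x2 matrix V.
    s = math.sqrt(vrr * vii - vri * vri)  # sqrt(det V)
    t = math.sqrt(vrr + vii + 2.0 * s)    # sqrt(trace V + 2 * sqrt(det V))
    wrr = (vii + s) / (s * t)
    wii = (vrr + s) / (s * t)
    wri = -vri / (s * t)

    # Apply W = V^(-1/2) to each centered (real, imag) pair.
    return [
        complex(wrr * c.real + wri * c.imag, wri * c.real + wii * c.imag)
        for c in centered
    ]


# Toy data (assumption): real and imaginary parts are strongly correlated.
rng = random.Random(0)
samples = []
for _ in range(5000):
    a = rng.gauss(0.0, 1.0)
    b = rng.gauss(0.0, 1.0)
    samples.append(complex(2.0 * a + 1.0, 1.5 * a + 0.5 * b - 3.0))

print("before:", covariance_2x2(samples))
print("after: ", covariance_2x2(whiten_2x2(samples)))
```

After whitening, the real and imaginary parts have approximately unit variance and near-zero cross-covariance. Scaling the two parts independently, as two separate real-valued channels would, leaves that cross-covariance in place.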

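Experimental results

Research questions

RQ1: Can a fully end-to-end complex-valued neural codec achieve state-of-the-art audio quality at low bit-rates without GANs or diffusion post-filters?
RQ2: Does maintaining magnitude-phase coupling throughout analysis, quantization, and synthesis improve reconstruction fidelity and generalization?
RQ3: What performance and training-efficiency advantages does a complex-valued RVQ-VAE at 6–12 kbps offer over real-valued or mixed-domain approaches?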

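Key findings

Achieves in-domain and out-of-domain improvements at 6 and 12 kbps without adversarial discriminators or diffusion post-filters.
Applies a complex-valued RVQ-VAE with 2048-entry codebooks and 12 quantization stages, achieving effective code utilization with no codebook collapse.
Converges faster and more stably than state-of-the-art baselines, cutting the training budget by 95%.
Preserves magnitude-phase coupling through the entire pipeline using complex-valued networks and Wirtinger calculus, maintaining high perceptual quality.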
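
To make the quantizer configuration above concrete, here is a minimal sketch of residual vector quantization over complex vectors: each stage picks the nearest codebook entry and passes the remaining residual to the next stage. The codebooks are random and untrained, the latent dimension of 2 is arbitrary, and all function names are hypothetical; none of this is taken from the paper's code. The helper `bits_per_frame` only encodes the arithmetic that a 2048-entry codebook costs log2(2048) = 11 bits per stage, so 12 stages would cost 132 bits per latent frame if all stages are active. The review does not state how stages map to the 6 and 12 kbps settings.

```python
import math
import random


def nearest_index(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared Euclidean)."""
    def dist(entry):
        return sum(abs(a - b) ** 2 for a, b in zip(vector, entry))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(vector, codebooks):
    """Quantize stage by stage; each stage encodes the previous residual."""
    residual = list(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest_index(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, residual


def rvq_decode(indices, codebooks):
    """Sum the selected entry from every stage."""
    total = [0j] * len(codebooks[0][0])
    for idx, codebook in zip(indices, codebooks):
        total = [t + c for t, c in zip(total, codebook[idx])]
    return total


def bits_per_frame(num_stages, codebook_size):
    """Bits needed to store one index per stage."""
    return num_stages * math.log2(codebook_size)


def random_codebooks(num_stages, codebook_size, dim, rng):
    """Toy, untrained codebooks whose scale halves at each stage (assumption)."""
    books = []
    for stage in range(num_stages):
        scale = 0.5 ** stage
        books.append([
            [complex(rng.gauss(0, scale), rng.gauss(0, scale)) for _ in range(dim)]
            for _ in range(codebook_size)
        ])
    return books


rng = random.Random(0)
codebooks = random_codebooks(12, 2048, 2, rng)
latent = [complex(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(2)]

indices, residual = rvq_encode(latent, codebooks)
reconstruction = rvq_decode(indices, codebooks)

print("indices:", indices)
print("bits per frame:", bits_per_frame(12, 2048))
print("residual norm:", math.sqrt(sum(abs(r) ** 2 for r in residual)))
```

Because every entry is a complex vector, a single index selects magnitude and phase together rather than quantizing them as independent real-valued streams.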

This review was generated by AI and checked by a human editor.