QUICK REVIEW

[Paper Review] Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis

Hubert Siuzdak|arXiv (Cornell University)|Jun 1, 2023

Music and Audio Processing | Citations: 11

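One-line summary

Vocos directly generates Fourier spectral coefficients using an isotropic generator with no upsampling layers followed by an inverse STFT, which gives fast, high-quality audio synthesis. It matches state-of-the-art vocoding quality while running far faster than time-domain GANs.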

ABSTRACT

Recent advancements in neural vocoding are predominantly driven by Generative Adversarial Networks (GANs) operating in the time-domain. While effective, this approach neglects the inductive bias offered by time-frequency representations, resulting in redundant and computationally intensive upsampling operations. Fourier-based time-frequency representation is an appealing alternative, aligning more accurately with human auditory perception, and benefitting from well-established fast algorithms for its computation. Nevertheless, direct reconstruction of complex-valued spectrograms has been historically problematic, primarily due to phase recovery issues. This study seeks to close this gap by presenting Vocos, a new model that directly generates Fourier spectral coefficients. Vocos not only matches the state-of-the-art in audio quality, as demonstrated in our evaluations, but it also substantially improves computational efficiency, achieving an order of magnitude increase in speed compared to prevailing time-domain neural vocoding approaches. The source code and model weights have been open-sourced at https://github.com/gemelo-ai/vocos.

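Motivation and objectives

Motivate and develop a Fourier-based neural vocoder that preserves perceptual audio quality.
Eliminate learnable upsampling layers by using the inverse STFT for waveform reconstruction.
Use ConvNeXt blocks to model spatial patterns in the Fourier domain.
Substantially improve inference speed while achieving objective and subjective audio quality that is competitive with, or better than, existing vocoders.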

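Proposed method

Proposes Vocos, a GAN-based vocoder that outputs STFT coefficients (m, p) and reconstructs the waveform with the inverse STFT.
Uses a ConvNeXt-based generator that keeps an isotropic resolution and outputs magnitude and phase (via M, x, y) to form the complex STFT coefficients.
Represents phase as the wrapped angle ϕ = atan2(y, x), which guarantees proper wrapping into (-π, π].
Trains with a hinge adversarial loss, a mel-spectrogram reconstruction loss, and a feature-matching loss over a multi-discriminator setup (MPD and MRD).
Performs upsampling with the ISTFT instead of transposed convolutions, which keeps the architecture isotropic and reduces aliasing.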
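
The steps listed above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation. It assumes the mapping M = exp(m), x = cos(p), y = sin(p), which this summary does not spell out, and the helper names (`to_complex`, `wrapped_phase`, `irfft`, `istft`) and the toy sizes are hypothetical. The stand-in values for m and p replace what a trained generator would produce.

```python
import cmath
import math


def to_complex(m, p):
    """Turn a log-magnitude m and an unconstrained phase value p into one STFT coefficient.

    Assumed mapping: M = exp(m), x = cos(p), y = sin(p), coefficient = M * (x + jy).
    """
    M = math.exp(m)                    # magnitude, always positive
    x, y = math.cos(p), math.sin(p)    # point on the unit circle
    return M * complex(x, y)


def wrapped_phase(p):
    """Phase implied by (x, y): atan2 folds any real p back into the principal range."""
    return math.atan2(math.sin(p), math.cos(p))


def irfft(half_spectrum, n_fft):
    """Naive inverse real DFT from n_fft // 2 + 1 bins to n_fft real samples."""
    mirrored = [half_spectrum[k].conjugate() for k in range(n_fft // 2 - 1, 0, -1)]
    full = list(half_spectrum) + mirrored
    return [
        sum(c * cmath.exp(2j * math.pi * k * n / n_fft) for k, c in enumerate(full)).real / n_fft
        for n in range(n_fft)
    ]


def istft(frames, n_fft, hop):
    """Windowed overlap-add: each frame contributes n_fft samples, shifted by hop."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    length = n_fft + hop * (len(frames) - 1)
    signal = [0.0] * length
    weight = [0.0] * length
    for t, spectrum in enumerate(frames):
        segment = irfft(spectrum, n_fft)
        for n in range(n_fft):
            signal[t * hop + n] += segment[n] * window[n]
            weight[t * hop + n] += window[n] ** 2
    # Normalize by the summed squared window; samples with no window support stay at zero.
    return [s / w if w > 1e-8 else 0.0 for s, w in zip(signal, weight)]


# Toy sizes (hypothetical): 6 frames, 8-point FFT, hop of 2 samples.
n_fft, hop, n_frames = 8, 2, 6
n_bins = n_fft // 2 + 1
# Stand-ins for the generator outputs m and p; some phase values exceed pi on purpose.
m = [[-1.0] * n_bins for _ in range(n_frames)]
p = [[0.9 * (t + k) for k in range(n_bins)] for t in range(n_frames)]
frames = [[to_complex(mv, pv) for mv, pv in zip(m[t], p[t])] for t in range(n_frames)]
waveform = istft(frames, n_fft, hop)
print(len(frames), "frames ->", len(waveform), "samples")
```

The generator here only ever works at the frame rate; the jump from frames to samples happens entirely inside the fixed `istft` step, which has no learned parameters. In practice one would likely use a library routine such as `torch.istft` rather than the naive transform shown here.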

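Experimental results

Research questions

RQ1: Can a GAN that directly models Fourier-domain coefficients reproduce audio of the same high quality as time-domain vocoders?
RQ2: Does avoiding upsampling layers and using the ISTFT yield a large speedup without hurting perceptual quality?
RQ3: What is the effect of ConvNeXt blocks versus conventional ResBlocks in Fourier-domain vocoding?
RQ4: How does a Fourier-based vocoder compare with neural codecs such as EnCodec on objective and perceptual metrics?
RQ5: Is the phase-wrapping strategy sufficient to reproduce perceptually faithful complex spectrograms?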

Key findings

Model	UTMOS (↑)	VISQOL (↑)	PESQ (↑)	V/UV F1 (↑)	Periodicity (↓)
Ground truth	4.058	–	–	–	–
HiFi-GAN	3.669	4.57	3.093	0.9457	0.129
iSTFTNet	3.564	4.56	2.942	0.9372	0.141
BigVGAN	3.749	4.65	3.693	0.9557	0.108
Vocos	3.734	4.66	3.70	0.9582	0.101
w/o ConvNeXt	3.658	4.65	3.528	0.9534	0.109

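On the LibriTTS-derived evaluation, Vocos achieves state-of-the-art or near-state-of-the-art results on perceptual metrics such as PESQ and VISQOL.
Vocos mitigates periodicity artifacts more effectively than the GAN baselines HiFi-GAN, iSTFTNet, and BigVGAN (periodicity 0.101 versus 0.129, 0.141, and 0.108; lower is better).
Vocos is built around ConvNeXt blocks; replacing them with ResBlocks degrades performance slightly (UTMOS 3.734 to 3.658, PESQ 3.70 to 3.528).
Inference is substantially faster: because upsampling is done by the ISTFT, Vocos runs roughly 13x faster than HiFi-GAN and roughly 70x faster than BigVGAN on GPU.
On MUSDB18 and on out-of-distribution singing voice, Vocos shows higher perceptual quality (VISQOL) than competing models.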
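
As a quick sanity check on the findings above, the rows of the results table can be transcribed and ranked per metric. This is only an illustrative sketch: the dictionary layout and the `best_model` helper are hypothetical, and the numbers are copied from the table in this review, not recomputed.

```python
# Rows transcribed from the results table: (UTMOS, VISQOL, PESQ, V/UV F1, Periodicity).
results = {
    "HiFi-GAN": (3.669, 4.57, 3.093, 0.9457, 0.129),
    "iSTFTNet": (3.564, 4.56, 2.942, 0.9372, 0.141),
    "BigVGAN": (3.749, 4.65, 3.693, 0.9557, 0.108),
    "Vocos": (3.734, 4.66, 3.70, 0.9582, 0.101),
    "w/o ConvNeXt": (3.658, 4.65, 3.528, 0.9534, 0.109),
}
metrics = ["UTMOS", "VISQOL", "PESQ", "V/UV F1", "Periodicity"]
lower_is_better = {"Periodicity"}  # marked with a down arrow in the table


def best_model(metric):
    """Return the model with the best value for one metric, respecting its direction."""
    column = metrics.index(metric)
    pick = min if metric in lower_is_better else max
    return pick(results, key=lambda name: results[name][column])


best = {metric: best_model(metric) for metric in metrics}
for metric, name in best.items():
    print(f"{metric}: {name}")
```

Run on the transcribed rows, this ranks Vocos first on VISQOL, PESQ, V/UV F1, and periodicity, and BigVGAN first on UTMOS, which is consistent with the "state-of-the-art or near-state-of-the-art" wording.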

This review was generated by AI and checked by a human editor.