QUICK REVIEW

[Paper Review] AudioGen: Textually Guided Audio Generation

Felix Kreuk, Gabriel Synnaeve | arXiv (Cornell University) | Sep 30, 2022

Music and Audio Processing | Cited by 53

One-line summary

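AudioGen learns a discrete audio representation and autoregressively generates high-fidelity, text-conditioned audio with a Transformer-based language model that uses classifier-free guidance and a multi-stream strategy.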

ABSTRACT

We tackle the problem of generating audio samples conditioned on descriptive text captions. In this work, we propose AudioGen, an auto-regressive generative model that generates audio samples conditioned on text inputs. AudioGen operates on a learnt discrete audio representation. The task of text-to-audio generation poses multiple challenges. Due to the way audio travels through a medium, differentiating "objects" can be a difficult task (e.g., separating multiple people simultaneously speaking). This is further complicated by real-world recording conditions (e.g., background noise, reverberation, etc.). Scarce text annotations impose another constraint, limiting the ability to scale models. Finally, modeling high-fidelity audio requires encoding audio at high sampling rate, leading to extremely long sequences. To alleviate the aforementioned challenges we propose an augmentation technique that mixes different audio samples, driving the model to internally learn to separate multiple sources. We curated 10 datasets containing different types of audio and text annotations to handle the scarcity of text-audio data points. For faster inference, we explore the use of multi-stream modeling, allowing the use of shorter sequences while maintaining a similar bitrate and perceptual quality. We apply classifier-free guidance to improve adherence to text. Comparing to the evaluated baselines, AudioGen outperforms over both objective and subjective metrics. Finally, we explore the ability of the proposed method to generate audio continuation conditionally and unconditionally. Samples: https://felixkreuk.github.io/audiogen

Motivation and objectives

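Motivate text-to-audio generation that offers high fidelity, controllability, and compositionality.
Develop an autoregressive model that operates on a learned discrete audio representation.
Leverage a pre-trained text encoder to generalize to unseen text concepts.
Improve text adherence and compositionality through guidance and on-the-fly audio mixing.
Demonstrate audio continuation, both conditional and unconditional.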

Proposed method

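Encode raw audio into a discrete token sequence using an autoencoder (E, Q, G) trained with reconstruction and perceptual losses.
Train a Transformer-based audio language model (ALM) conditioned on text through a pre-trained T5 text encoder and a text-audio cross-attention mechanism.
Apply classifier-free guidance (CFG) at sampling time to balance quality and diversity.
Introduce on-the-fly text and audio mixing augmentation to improve compositionality and generalization.
Explore multi-stream audio input based on residual vector quantization to shorten sequences and speed up inference.
Evaluate with objective metrics (FAD, KL) and subjective MOS-style ratings, compare against DiffSound, and ablate CFG and the multi-stream design.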
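The classifier-free guidance step can be illustrated with a toy example. The review does not state the exact formula, so the sketch below assumes a common formulation: the sampling distribution is proportional to scale * log p(token | text) + (1 - scale) * log p(token). The names `cfg_log_probs` and `scale`, the three-token vocabulary, and the logit values are all hypothetical and are not taken from the paper's implementation.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def cfg_log_probs(cond_logits, uncond_logits, scale):
    """Blend text-conditional and unconditional next-token predictions.

    Assumed formulation (not confirmed by the review):
        scale * log p(token | text) + (1 - scale) * log p(token)
    The result is renormalised so it is a valid distribution.
    scale = 1.0 reduces to ordinary conditional sampling.
    """
    cond = log_softmax(cond_logits)
    uncond = log_softmax(uncond_logits)
    mixed = [scale * c + (1.0 - scale) * u for c, u in zip(cond, uncond)]
    return log_softmax(mixed)


# Toy vocabulary of three audio tokens; the text condition favours token 0,
# while the unconditional model has no preference.
cond_logits = [2.0, 1.0, 0.0]
uncond_logits = [1.0, 1.0, 1.0]

plain = [math.exp(v) for v in cfg_log_probs(cond_logits, uncond_logits, 1.0)]
guided = [math.exp(v) for v in cfg_log_probs(cond_logits, uncond_logits, 3.0)]

print("no guidance:", [round(p, 3) for p in plain])
print("scale 3.0:  ", [round(p, 3) for p in guided])
```

With a scale above 1, probability mass shifts toward the tokens that the text condition prefers, which is the mechanism behind the quality/diversity trade-off described above: higher scales follow the text more closely but reduce sample diversity.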

Experimental results

Research questions

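RQ1: Can an autoregressive model generate high-fidelity audio conditioned on descriptive text?
RQ2: Does leveraging a learned discrete audio representation lead to better generalization to unseen text concepts?
RQ3: Does classifier-free guidance improve text adherence while preserving diversity?
RQ4: Does on-the-fly mixing of text and audio improve the compositionality and quality of the generated audio?
RQ5: How does multi-stream modeling affect fidelity, bitrate, and inference speed?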

Key findings

Model	Parameters	Augmentation	Text conditioning	OVL	Rel.	FAD	KL
Reference	-	-	-	92.08 ± 1.16	92.97 ± 0.85	-	-
DiffSound	400M	MBTG	CLIP	65.68 ± 1.58	55.91 ± 1.75	7.39	2.57
AudioGen-base	285M	-	T5-base	70.85 ± 1.06	63.23 ± 1.65	2.84	2.14
AudioGen-base Mix	285M	Mix	T5-base	71.68 ± 1.89	66.01 ± 1.79	3.13	2.09
AudioGen-large	1B	Mix	T5-large	71.85 ± 1.07	68.73 ± 1.61	1.82	1.69
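As a reading aid for the table, the short script below compares rows metric by metric. It assumes that higher is better for the subjective scores (OVL, Rel.) and lower is better for FAD and KL; the numbers are copied from the table, and the helper names `wins` and `relative_change` are illustrative only.

```python
# Mean scores copied from the results table (confidence intervals omitted).
results = {
    "DiffSound": {"ovl": 65.68, "rel": 55.91, "fad": 7.39, "kl": 2.57},
    "AudioGen-base": {"ovl": 70.85, "rel": 63.23, "fad": 2.84, "kl": 2.14},
    "AudioGen-base Mix": {"ovl": 71.68, "rel": 66.01, "fad": 3.13, "kl": 2.09},
    "AudioGen-large": {"ovl": 71.85, "rel": 68.73, "fad": 1.82, "kl": 1.69},
}

# Assumption: subjective ratings are better when higher,
# distance-style metrics (FAD, KL) are better when lower.
HIGHER_IS_BETTER = {"ovl": True, "rel": True, "fad": False, "kl": False}


def wins(model, baseline):
    """Return the metrics on which `model` beats `baseline`."""
    better = []
    for metric, higher in HIGHER_IS_BETTER.items():
        m, b = results[model][metric], results[baseline][metric]
        if (m > b) if higher else (m < b):
            better.append(metric)
    return better


def relative_change(model, baseline, metric):
    """Percentage change of `model` relative to `baseline` for one metric."""
    m, b = results[model][metric], results[baseline][metric]
    return (m - b) / b * 100.0


for name in ("AudioGen-base", "AudioGen-base Mix", "AudioGen-large"):
    print(name, "beats DiffSound on:", wins(name, "DiffSound"))

print("Mix vs. no mix:", wins("AudioGen-base Mix", "AudioGen-base"))
print("FAD change, large vs. DiffSound (%):",
      round(relative_change("AudioGen-large", "DiffSound", "fad"), 1))
```

The comparison ignores the reported confidence intervals, so small differences in OVL between the AudioGen variants should not be over-interpreted.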

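AudioGen-base outperforms the DiffSound baseline on both objective and subjective metrics while using fewer parameters (285M vs. 400M).
AudioGen-large further outperforms both DiffSound and AudioGen-base on objective (FAD, KL) and subjective (OVL, Rel.) metrics.
Mixing-based augmentation improves text relevance (KL 2.14 to 2.09, Rel. 63.23 to 66.01) and the handling of complex compositions compared with training without mixing, although FAD rises from 2.84 to 3.13.
Classifier-free guidance improves text adherence and sample quality, giving a better trade-off than unconditional sampling.
Multi-stream configurations speed up inference with varying impact on quality: the single-stream base model achieves the best objective scores, while the multi-stream variants offer gains in inference time.
Audio continuation experiments show that the generated continuation depends on both the length of the audio prompt and the text guidance; the text condition has a stronger effect when the audio prompt is short.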
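The mixing augmentation mentioned in the findings can be sketched as follows. The review only says that text and audio are mixed on the fly, so the details here are assumptions: two waveforms are overlaid at a given offset, the second is scaled to reach a target signal-to-noise ratio, and the two captions are concatenated. The function name `mix_examples` and the example captions are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a waveform given as a list of floats."""
    return math.sqrt(sum(v * v for v in samples) / len(samples))


def mix_examples(audio_a, text_a, audio_b, text_b, snr_db, offset):
    """Overlay audio_b onto audio_a and join the captions.

    Assumptions (not confirmed by the review):
      - audio_b is scaled so that audio_a is `snr_db` dB louder than it,
      - audio_b starts `offset` samples into audio_a and is truncated
        at the end of audio_a,
      - the new caption is the two captions separated by a space.
    """
    rms_b = rms(audio_b)
    gain = 0.0 if rms_b == 0 else rms(audio_a) / (rms_b * 10 ** (snr_db / 20))
    mixed = list(audio_a)
    for i, value in enumerate(audio_b):
        j = offset + i
        if j >= len(mixed):
            break
        mixed[j] += gain * value
    return mixed, f"{text_a} {text_b}"


# Toy waveforms; in training the offset and SNR would be drawn at random
# for every pair of examples.
audio_a = [0.5, -0.5, 0.5, -0.5]
audio_b = [0.1, 0.1]
mixed_audio, mixed_text = mix_examples(
    audio_a, "a dog barking", audio_b, "a siren wailing", snr_db=0.0, offset=2
)
print(mixed_audio)
print(mixed_text)
```

Because each mixed clip is paired with a caption that describes both sources, the model sees examples of overlapping sounds with matching text, which is one plausible reason for the improved text relevance on complex compositions.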

This review was generated by AI and checked by a human editor.