QUICK REVIEW

[Paper Review] ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models

Pengfei Zhu, Chao Pang|arXiv (Cornell University)|Feb 9, 2023

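Music and Audio Processing | Citations: 8

One-Sentence Summary

ERNIE-Music is a diffusion model that generates music waveforms directly from free-form text prompts. It is trained on text-music pairs collected from the web and outperforms prior methods in both text-music relevance and quality.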

ABSTRACT

In recent years, the burgeoning interest in diffusion models has led to significant advances in image and speech generation. Nevertheless, the direct synthesis of music waveforms from unrestricted textual prompts remains a relatively underexplored domain. In response to this lacuna, this paper introduces a pioneering contribution in the form of a text-to-waveform music generation model, underpinned by the utilization of diffusion models. Our methodology hinges on the innovative incorporation of free-form textual prompts as conditional factors to guide the waveform generation process within the diffusion model framework. Addressing the challenge of limited text-music parallel data, we undertake the creation of a dataset by harnessing web resources, a task facilitated by weak supervision techniques. Furthermore, a rigorous empirical inquiry is undertaken to contrast the efficacy of two distinct prompt formats for text conditioning, namely, music tags and unconstrained textual descriptions. The outcomes of this comparative analysis affirm the superior performance of our proposed model in terms of enhancing text-music relevance. Finally, our work culminates in a demonstrative exhibition of the excellent capabilities of our model in text-to-music generation. We further demonstrate that our generated music in the waveform domain outperforms previous works by a large margin in terms of diversity, quality, and text-music relevance.

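Motivation and Objectives

Collect a weakly supervised, web-sourced text-music dataset to address the shortage of parallel text and music waveform data.
Develop a text-conditional diffusion model that generates music waveforms from unrestricted text prompts.
Investigate how the text format affects text-music relevance by comparing free-form text with music tags.
Show that the waveform music generated by ERNIE-Music is diverse, high in quality, and strongly relevant to the text.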

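Proposed Method

Uses a continuous-time diffusion model that predicts the velocity to generate music waveforms conditioned on text y.
Encodes the text with a text encoder based on multilingual ERNIE to obtain a sequence representation S and a classification token s0.
Fuses the text representation with the timestep embedding to form the text-aware conditioning for the UNet-based diffusion network.
Integrates the conditioning into the self-attention of the diffusion backbone via conditional self-attention (CSA) to capture global information.
Trains with a weighted L2 loss (SNR + 1 weighting) that targets the velocity v_t = α_t ε − σ_t x, under a cosine noise schedule (α_t^2 + σ_t^2 = 1).
Experiments with two text-conditioning formats, end-to-end free-form text and music tags, and analyzes the fusing operation (concatenation vs. element-wise sum).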
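
The velocity target and the SNR + 1 weighting in the list above are connected: for a schedule with α_t^2 + σ_t^2 = 1, a plain L2 loss on the velocity equals an L2 loss on the clean signal x weighted by SNR + 1, where SNR = α_t^2 / σ_t^2. The following is a minimal NumPy sketch of that relationship, not the authors' implementation. The parametrization α_t = cos(πt/2), σ_t = sin(πt/2) is an assumed form of the cosine schedule, and all function names are hypothetical.

```python
import numpy as np


def cosine_schedule(t):
    """Assumed cosine schedule: alpha_t = cos(pi*t/2), sigma_t = sin(pi*t/2).

    For any t in [0, 1] this satisfies alpha_t**2 + sigma_t**2 == 1.
    """
    t = np.asarray(t, dtype=np.float64)
    return np.cos(0.5 * np.pi * t), np.sin(0.5 * np.pi * t)


def velocity_target(x, eps, t):
    """Return the noised sample x_t and the velocity v_t = alpha_t*eps - sigma_t*x."""
    alpha, sigma = cosine_schedule(t)
    x_t = alpha * x + sigma * eps
    v_t = alpha * eps - sigma * x
    return x_t, v_t


def x_from_velocity(x_t, v, t):
    """Recover the clean signal implied by a velocity: x = alpha_t*x_t - sigma_t*v."""
    alpha, sigma = cosine_schedule(t)
    return alpha * x_t - sigma * v


def velocity_l2_loss(v_pred, v_true):
    """Unweighted mean squared error in velocity space."""
    return np.mean((v_pred - v_true) ** 2)


def snr_plus_one_x_loss(x_pred, x_true, t):
    """Mean squared error in x space, weighted by SNR + 1 (SNR = alpha^2 / sigma^2)."""
    alpha, sigma = cosine_schedule(t)
    snr = alpha ** 2 / sigma ** 2
    return (snr + 1.0) * np.mean((x_pred - x_true) ** 2)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(1024)        # stand-in for a clean waveform
    eps = rng.standard_normal(1024)      # Gaussian noise
    t = 0.3                              # diffusion time in (0, 1)

    x_t, v = velocity_target(x, eps, t)
    v_pred = v + 0.1 * rng.standard_normal(1024)  # stand-in for a network output
    x_pred = x_from_velocity(x_t, v_pred, t)

    # The two losses agree up to floating-point error.
    print(velocity_l2_loss(v_pred, v), snr_plus_one_x_loss(x_pred, x, t))
```

The weighted form is undefined at t = 0, where σ_t = 0, which is one practical reason to compute the loss directly in velocity space.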

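Experimental Results

Research Questions

RQ1: Does free-form text conditioning improve text-music relevance over conventional music-tag conditioning?
RQ2: How do the fusing operation and the text representation affect the quality and relevance of text-to-music diffusion generation?
RQ3: Can a diffusion model trained on web-sourced text-music pairs directly generate diverse, high-quality waveform music from text prompts?
RQ4: What effect does using a multilingual text encoder (ERNIE-M) have on cross-lingual text-to-music generation?
RQ5: How does waveform-based music generation compare with existing approaches in text-music relevance and audio quality?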

Key Findings

Method	Score ↑	Top Rate ↑	Bottom Rate ↓
TSM (Wu and Sun, 2022)	2.05	12%	27%
Mubert	1.85	37%	32%
Our model	2.43	55%	12%

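In human evaluation, the model achieves a higher text-music relevance score than the competing methods.
Waveform music generated by ERNIE-Music shows improved music quality compared with recent waveform-based approaches.
End-to-end free-form text conditioning yields higher text-music relevance than conditioning on predefined music tags.
The architecture, which combines a UNet diffusion backbone with CSA-based conditioning, effectively encodes the text into music generation.
Qualitative results show diverse outputs across instruments, rhythm, and emotion (e.g., piano, guitar, erhu; from calm to fast tempos).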
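
The CSA-based conditioning and the two fusing operations mentioned in this review can be illustrated with a small NumPy sketch. This is one plausible reading under stated assumptions, not the paper's code: it assumes that the conditioning tokens are appended to the keys and values of self-attention while the queries come from the music features only, that "concatenation" prepends the timestep embedding to the text sequence S, and that "element-wise sum" adds the timestep embedding to the classification token s0. All names and shapes are hypothetical.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_condition(text_seq, t_emb, mode="concat"):
    """Fuse text tokens (m, d) with a timestep embedding (d,).

    Assumption: row 0 of text_seq plays the role of the classification token s0.
    """
    if mode == "concat":
        # Prepend the timestep embedding as one extra token: shape (m + 1, d).
        return np.concatenate([t_emb[None, :], text_seq], axis=0)
    if mode == "sum":
        # Add the timestep embedding to s0 only: shape (1, d).
        return (text_seq[0] + t_emb)[None, :]
    raise ValueError(f"unknown fusing mode: {mode}")


def conditional_self_attention(h, cond, w_q, w_k, w_v):
    """Single-head attention where keys and values also see the condition.

    h:    (n, d) intermediate music features (queries come only from here)
    cond: (c, d) fused conditioning tokens
    Returns the attended features (n, d_v) and the attention map (n, n + c).
    """
    kv_in = np.concatenate([h, cond], axis=0)      # (n + c, d)
    q = h @ w_q
    k = kv_in @ w_k
    v = kv_in @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])        # scaled dot-product scores
    attn = softmax(scores, axis=-1)
    return attn @ v, attn


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, m, d = 8, 5, 16                             # music steps, text tokens, width
    h = rng.standard_normal((n, d))
    text_seq = rng.standard_normal((m, d))
    t_emb = rng.standard_normal(d)
    w_q, w_k, w_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))

    for mode in ("concat", "sum"):
        cond = fuse_condition(text_seq, t_emb, mode)
        out, attn = conditional_self_attention(h, cond, w_q, w_k, w_v)
        print(mode, cond.shape, out.shape, attn.shape)
```

Under these assumptions, the concatenation variant lets every music position attend to each text token individually, while the sum variant compresses the condition into a single token.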


This review was generated by AI and checked by a human editor.