QUICK REVIEW

[Paper Review] MuChoMusic dataset

Weck, Benno, Timo I. Denk | arXiv (Cornell University) | Jan 26, 2023

Music and Audio Processing | Cited by 180

One-line summary

MusicLM generates high-quality music at 24 kHz from text descriptions, supports melody conditioning, and introduces MusicCaps (5.5k music-text pairs) for evaluation.

ABSTRACT

MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

MuChoMusic is a benchmark designed to evaluate music understanding in multimodal language models focused on audio. It includes 1,187 multiple-choice questions validated by human annotators, based on 644 music tracks from two publicly available music datasets. These questions cover a wide variety of genres and assess knowledge and reasoning across several musical concepts and their cultural and functional contexts. The benchmark provides a holistic evaluation of five open-source models, revealing challenges such as over-reliance on the language modality and highlighting the need for better multimodal integration.

Note on Audio Files

This dataset comes without audio files. The audio files can be downloaded from two datasets: SongDescriberDataset (SDD) and MusicCaps. Please see the code repository for more information on how to download the audio.

Citation

If you use this dataset, please cite our paper:

@inproceedings{weck2024muchomusic,
  title={MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models},
  author={Weck, Benno and Manco, Ilaria and Benetos, Emmanouil and Quinton, Elio and Fazekas, György and Bogdanov, Dmitry},
  booktitle={Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR)},
  year={2024}
}
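A multiple-choice benchmark of this kind is typically scored by plain accuracy. The sketch below shows that computation under stated assumptions: the `Question` record and its field names (`track_id`, `options`, `answer_index`) are hypothetical and are not taken from the dataset's actual schema, and the two example questions are invented purely for illustration.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Question:
    # Hypothetical record layout; the real dataset may use different fields.
    track_id: str
    options: List[str]
    answer_index: int  # index of the correct option in `options`


def accuracy(questions: Sequence[Question], predictions: Sequence[int]) -> float:
    """Fraction of questions where the predicted option index is correct."""
    if len(questions) != len(predictions):
        raise ValueError("need exactly one prediction per question")
    if not questions:
        return 0.0
    correct = sum(1 for q, p in zip(questions, predictions) if p == q.answer_index)
    return correct / len(questions)


# Illustrative data only, not drawn from the benchmark.
example_questions = [
    Question("track-a", ["piano", "guitar", "flute", "drums"], 0),
    Question("track-b", ["waltz", "march", "tango", "polka"], 2),
]
example_accuracy = accuracy(example_questions, [0, 1])  # one of two correct
```

With four options per question, as in the invented examples, a model that guesses uniformly at random would be expected to score about 0.25, which is a useful reference point when reading accuracy figures.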

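Motivation and objectives

Generate high-quality, long-form music from text descriptions.
Leverage a joint music-text embedding space so that the model can be trained on unlabeled audio data and still support robust text-to-music generation.
Demonstrate long-range coherence and fidelity through a hierarchical token-based generation framework.
Provide a high-quality, expert-annotated benchmark (MusicCaps) for evaluating text-to-music systems.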

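Proposed method

Uses a hierarchical sequence-to-sequence model for text-conditioned music generation, built on AudioLM.
Represents audio as discrete tokens: acoustic tokens from SoundStream, semantic tokens from w2v-BERT, and conditioning tokens derived from MuLan.
Trains the semantic and acoustic stages autoregressively with MuLan audio tokens as conditioning; at inference time, the MuLan text embedding is used as the conditioning signal instead.
Extends to melody (audio-based) conditioning, and enables long generation by sliding the generation window forward.
Uses a three-stage pipeline (semantic modeling, coarse acoustic modeling, fine acoustic modeling) to balance long-term structure and audio fidelity.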
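The data flow through the three stages can be sketched with toy stand-ins. This is a minimal illustration, not the authors' implementation: `toy_autoregressive` replaces a trained decoder with a deterministic pseudo-random sampler, and the vocabulary sizes, token counts per stage, and exact conditioning inputs are assumptions chosen only to make the example run. Decoding the acoustic tokens back to a waveform is not shown.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical vocabulary sizes, chosen for illustration only.
SEMANTIC_VOCAB = 1024
ACOUSTIC_VOCAB = 1024


def toy_autoregressive(context: Sequence[int], n_tokens: int,
                       vocab_size: int, seed: int) -> List[int]:
    """Stand-in for a trained decoder: each new token depends on the
    conditioning context and on all tokens generated so far."""
    out: List[int] = []
    for _ in range(n_tokens):
        rng = random.Random(hash((seed, tuple(context), tuple(out))))
        out.append(rng.randrange(vocab_size))
    return out


def generate_tokens(mulan_tokens: Sequence[int], n_semantic: int = 8,
                    coarse_per_semantic: int = 2,
                    fine_per_coarse: int = 2) -> Dict[str, List[int]]:
    # Stage 1: semantic tokens, conditioned on the MuLan-derived tokens.
    semantic = toy_autoregressive(mulan_tokens, n_semantic, SEMANTIC_VOCAB, seed=1)
    # Stage 2: coarse acoustic tokens, conditioned here (by assumption)
    # on both the MuLan tokens and the semantic tokens.
    coarse = toy_autoregressive(list(mulan_tokens) + semantic,
                                n_semantic * coarse_per_semantic,
                                ACOUSTIC_VOCAB, seed=2)
    # Stage 3: fine acoustic tokens, conditioned here on the coarse tokens.
    fine = toy_autoregressive(coarse, len(coarse) * fine_per_coarse,
                              ACOUSTIC_VOCAB, seed=3)
    return {"semantic": semantic, "coarse": coarse, "fine": fine}


# Example: three made-up conditioning tokens.
tokens = generate_tokens([5, 17, 42])
```

The point of the split is that the short semantic sequence carries long-term structure, while the longer acoustic sequences add audio detail on top of it.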

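Experimental results

Research questions

RQ1: Can MusicLM generate long, coherent music sequences (several minutes) that remain faithful to complex text prompts?
RQ2: How does MusicLM compare with the baselines (Mubert, Riffusion) in audio quality and adherence to the text caption?
RQ3: How does separating semantic tokens from acoustic tokens affect text fidelity and long-term coherence?
RQ4: Does adding melody-based conditioning improve adherence to a target melody while still respecting the text description?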

Key findings

Model	FAD_Trill ↓	FAD_VGG ↓	KLD ↓	MCC ↑	Wins ↑
Riffusion	0.76	13.4	1.19	0.34	158
Mubert	0.45	9.6	1.58	0.32	97
MusicLM	0.44	4.0	1.01	0.51	312
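To read the table mechanically, the arrows in the header matter: a down arrow means lower is better and an up arrow means higher is better. The sketch below copies the values from the table above and picks the best model per metric; the variable and function names are illustrative choices.

```python
# Values copied from the table above.
results = {
    "Riffusion": {"FAD_Trill": 0.76, "FAD_VGG": 13.4, "KLD": 1.19, "MCC": 0.34, "Wins": 158},
    "Mubert":    {"FAD_Trill": 0.45, "FAD_VGG": 9.6,  "KLD": 1.58, "MCC": 0.32, "Wins": 97},
    "MusicLM":   {"FAD_Trill": 0.44, "FAD_VGG": 4.0,  "KLD": 1.01, "MCC": 0.51, "Wins": 312},
}

# Direction of each metric, taken from the arrows in the table header.
lower_is_better = {
    "FAD_Trill": True,
    "FAD_VGG": True,
    "KLD": True,
    "MCC": False,
    "Wins": False,
}


def best_model(metric: str) -> str:
    """Return the model with the best score for `metric`."""
    pick = min if lower_is_better[metric] else max
    return pick(results, key=lambda model: results[model][metric])


best_per_metric = {metric: best_model(metric) for metric in lower_is_better}
```

With these values, every metric selects MusicLM, although the FAD_Trill margin over Mubert (0.44 versus 0.45) is much narrower than the FAD_VGG margin (4.0 versus 9.6).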

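MusicLM achieves higher audio fidelity and text adherence than the baselines: FAD_Trill = 0.44, FAD_VGG = 4.0, KLD = 1.01, MCC = 0.51, and 312 wins in human pairwise comparisons.
MusicCaps (5.5k clips) provides expert-written music captions and is publicly released to support rigorous evaluation.
Conditioning on semantic tokens improves adherence to the text description and preserves long-range structure.
Melody-conditioned generation can follow an input melody while still satisfying the text prompt.
The model demonstrates long generation on the scale of minutes, including a story mode that transitions between captions.
The memorization analysis finds that exact memorization is very rare and that approximate matches under controlled prompts are limited.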


This review was generated by AI and checked by a human editor.