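[Paper Walkthrough] MuChoMusic dataset
MusicLM generates high-quality music at 24 kHz from text descriptions, supports melody conditioning, and introduces MusicCaps (5.5k music-text pairs) for evaluation.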
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models

MuChoMusic is a benchmark designed to evaluate music understanding in multimodal language models focused on audio. It includes 1,187 multiple-choice questions validated by human annotators, based on 644 music tracks from two publicly available music datasets. These questions cover a wide variety of genres and assess knowledge and reasoning across several musical concepts and their cultural and functional contexts. The benchmark provides a holistic evaluation of five open-source models, revealing challenges such as over-reliance on the language modality and highlighting the need for better multimodal integration.

Note on Audio Files
This dataset comes without audio files. The audio files can be downloaded from two datasets: SongDescriberDataset (SDD) and MusicCaps. Please see the code repository for more information on how to download the audio.

Citation
If you use this dataset, please cite our paper:

@inproceedings{weck2024muchomusic,
  title={MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language Models},
  author={Weck, Benno and Manco, Ilaria and Benetos, Emmanouil and Quinton, Elio and Fazekas, György and Bogdanov, Dmitry},
  booktitle={Proceedings of the 25th International Society for Music Information Retrieval Conference (ISMIR)},
  year={2024}
}
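To make the multiple-choice setup concrete, the sketch below scores a model by plain accuracy over question IDs. It is only an illustration: the question IDs, the integer option indices, and the `accuracy` helper are hypothetical and do not reflect the dataset's actual schema or the authors' evaluation code.

```python
from typing import Dict


def accuracy(answers: Dict[str, int], predictions: Dict[str, int]) -> float:
    """Fraction of questions whose predicted option index matches the gold index.

    Questions with no prediction count as wrong.
    """
    if not answers:
        raise ValueError("answers must not be empty")
    correct = sum(1 for qid, gold in answers.items() if predictions.get(qid) == gold)
    return correct / len(answers)


# Hypothetical gold option indices and model predictions.
gold = {"q1": 0, "q2": 2, "q3": 1, "q4": 3}
pred = {"q1": 0, "q2": 2, "q3": 0, "q4": 3}

score = accuracy(gold, pred)
print(f"accuracy = {score:.2f}")  # accuracy = 0.75
```

Counting unanswered questions as wrong is a design choice of this sketch; a real harness might instead report them separately.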
Research Motivation and Goals
- Generate high-quality, long-form music from descriptive text.
- Enable robust text-to-music generation by leveraging a shared music-text embedding space to train on unlabeled audio data.
- Demonstrate long-range coherence and fidelity through a hierarchical token-based generation framework.
- Provide a high-quality, expert-annotated benchmark (MusicCaps) for evaluating text-to-music systems.
Proposed Method
- Use a hierarchical sequence-to-sequence model built on top of AudioLM for text-conditioned music generation.
- Represent audio with discrete tokens: acoustic tokens from SoundStream, semantic tokens from w2v-BERT, and MuLan-derived conditioning tokens.
- Train the semantic and acoustic stages autoregressively, conditioned on MuLan audio tokens; at inference, condition on tokens derived from MuLan text embeddings instead, which is what allows training on audio-only data.
- Extend conditioning to melody (audio-based) and enable long generation via rolling generation windows.
- Leverage a three-stage pipeline (semantic modeling, coarse acoustic modeling, fine acoustic modeling) to balance long-term structure with audio fidelity.
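The data flow described in the bullets above can be sketched as follows. This is a toy illustration only, not the MusicLM implementation: the models are replaced by a pseudo-random `autoregress` stand-in, and every size (vocabulary, number of conditioning tokens, tokens per stage, window and stride) is an assumption chosen for readability. Conditioning the fine stage on coarse tokens alone is also an assumption of this sketch.

```python
import random
from typing import Dict, List

VOCAB = 1024  # assumed codebook size for every token type
N_COND = 12   # assumed number of conditioning tokens per clip


def conditioning_tokens(prompt: str) -> List[int]:
    """Stand-in for MuLan: map a text prompt to a fixed-length token sequence."""
    rng = random.Random(prompt)  # string seeds are deterministic across runs
    return [rng.randrange(VOCAB) for _ in range(N_COND)]


def autoregress(prefix: List[int], n_new: int, rng: random.Random) -> List[int]:
    """Stand-in for a decoder: emit n_new tokens one at a time.

    Each new token depends on the full context so far, mimicking p(token | context).
    """
    seq = list(prefix)
    for _ in range(n_new):
        seq.append((sum(seq) + rng.randrange(VOCAB)) % VOCAB)
    return seq[len(prefix):]


def rolling_generate(cond: List[int], total: int, window: int, stride: int,
                     rng: random.Random) -> List[int]:
    """Generate a long sequence in chunks of `stride` tokens.

    Each chunk sees the conditioning tokens plus the last (window - stride)
    generated tokens, so the context length stays bounded.
    """
    keep = window - stride
    tokens: List[int] = []
    while len(tokens) < total:
        context = tokens[-keep:] if keep > 0 else []
        tokens.extend(autoregress(cond + context, stride, rng))
    return tokens[:total]


def generate(prompt: str, n_semantic: int = 100, seed: int = 0) -> Dict[str, List[int]]:
    rng = random.Random(seed)
    cond = conditioning_tokens(prompt)
    # Stage 1: semantic tokens capture long-term structure.
    semantic = rolling_generate(cond, n_semantic, window=50, stride=25, rng=rng)
    # Stage 2: coarse acoustic tokens, conditioned on text-derived and semantic tokens.
    coarse = autoregress(cond + semantic, 2 * n_semantic, rng)
    # Stage 3: fine acoustic tokens add detail on top of the coarse tokens.
    fine = autoregress(coarse, 2 * len(coarse), rng)
    return {"semantic": semantic, "coarse": coarse, "fine": fine}


out = generate("a calm violin melody backed by a distorted guitar riff")
print({name: len(tokens) for name, tokens in out.items()})
```

In a real system the fine and coarse acoustic tokens would be passed to a codec decoder such as SoundStream to reconstruct the waveform; that step is omitted here.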
Experimental Results
Research Questions
- RQ1: Can MusicLM generate long, coherent musical sequences (minutes) faithful to complex text prompts?
- RQ2: How does MusicLM compare to baselines (Mubert, Riffusion) in audio quality and adherence to text captions?
- RQ3: What is the impact of separating semantic and acoustic tokens on text faithfulness and long-term coherence?
- RQ4: Does adding melody-based conditioning improve adherence to a target melody while respecting text descriptions?
Key Findings
| Model | FAD_Trill ↓ | FAD_VGG ↓ | KLD ↓ | MCC ↑ | Wins ↑ |
|---|---|---|---|---|---|
| Riffusion | 0.76 | 13.4 | 1.19 | 0.34 | 158 |
| Mubert | 0.45 | 9.6 | 1.58 | 0.32 | 97 |
| MusicLM | 0.44 | 4.0 | 1.01 | 0.51 | 312 |
- MusicLM achieves higher fidelity and text-faithfulness than both baselines on every reported metric: FAD_Trill = 0.44, FAD_VGG = 4.0, KLD = 1.01, MCC = 0.51, and 312 wins in human listening comparisons.
- MusicCaps (5.5k clips with expert-written captions) is released as a benchmark for rigorous evaluation of text-to-music systems.
- Semantic-token conditioning improves adherence to text descriptions and preserves long-range structure.
- Melody-conditioned generation enables following an input melody while complying with textual prompts.
- Long-form generation of up to several minutes is demonstrated, including a story mode in which the music transitions across a sequence of captions.
- Memorization analysis shows negligible exact memorization, with limited approximate matches under controlled prompts.
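As a quick check on the comparison table above, the following sketch picks the best model per metric while respecting each arrow's direction (lower is better for the FAD and KLD columns, higher is better for MCC and wins). The numbers are copied from the table; the `best_model` helper is just an illustrative name.

```python
# Values copied from the results table.
results = {
    "Riffusion": {"FAD_Trill": 0.76, "FAD_VGG": 13.4, "KLD": 1.19, "MCC": 0.34, "Wins": 158},
    "Mubert":    {"FAD_Trill": 0.45, "FAD_VGG": 9.6,  "KLD": 1.58, "MCC": 0.32, "Wins": 97},
    "MusicLM":   {"FAD_Trill": 0.44, "FAD_VGG": 4.0,  "KLD": 1.01, "MCC": 0.51, "Wins": 312},
}

# True where a smaller value is better (the ↓ columns).
LOWER_IS_BETTER = {"FAD_Trill": True, "FAD_VGG": True, "KLD": True, "MCC": False, "Wins": False}


def best_model(metric: str) -> str:
    """Return the model with the best score on `metric`."""
    pick = min if LOWER_IS_BETTER[metric] else max
    return pick(results, key=lambda model: results[model][metric])


for metric in LOWER_IS_BETTER:
    print(f"{metric}: {best_model(metric)}")
```

MusicLM comes out on top in every column, although its margin over Mubert on FAD_Trill (0.44 vs. 0.45) is much narrower than on the other metrics.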
This walkthrough was generated by AI and reviewed by human editors.