QUICK REVIEW

[Paper Review] Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding

Shangda Wu, Ziya Zhou|arXiv (Cornell University)|Feb 28, 2026

Music and Audio Processing | Cited by 0

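One-Sentence Summary

VoC is the first multilingual QA benchmark for evaluating audio large language models' cultural comprehension of full-length music recordings. It covers 380 tracks, 38 languages, and 1,190 questions, and it highlights gaps in how models understand underrepresented traditions.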

ABSTRACT

We introduce Voices of Civilizations, the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. Covering 380 tracks across 38 languages, our automated pipeline yields 1,190 multiple-choice questions through four stages - each followed by manual verification: 1) compiling a representative music list; 2) generating cultural-background documents for each sample in the music list via LLMs; 3) extracting key attributes from those documents; and 4) constructing multiple-choice questions probing language, region associations, mood, and thematic content. We evaluate models under four conditions and report per-language accuracy. Our findings demonstrate that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions. The dataset is publicly available on Hugging Face to foster culturally inclusive music understanding research.

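Motivation and Objectives

Evaluate how well modern audio LLMs understand cultural attributes in full-length music recordings across languages.
Create a multilingual, culture-centered QA benchmark covering region, mood, and theme.
Provide a dataset that is generated automatically and verified manually, for studying bias and context dependence.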

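Proposed Method

A four-stage automated pipeline: song selection, generation of context documents in the song's native language and in English, attribute extraction (region, mood, theme), and multiple-choice question construction.
Gemini 2.5 Pro is used to generate the bilingual context documents and the questions.
Models are evaluated under four settings: noise, audio with English QA, audio with QA in the song's language, and audio plus document.
Accuracy is reported per language, with analysis of cross-lingual cultural comprehension and the effect of textual context.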
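The per-language reporting step can be illustrated with a minimal sketch. The record fields (`language`, `predicted`, `answer`) are a hypothetical schema chosen for illustration and are not taken from the VoC dataset; the sample records are made up.

```python
from collections import defaultdict


def per_language_accuracy(records):
    """Return {language: accuracy in percent} for a list of answer records.

    Each record is a dict with hypothetical keys:
    'language', 'predicted', and 'answer'.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["language"]] += 1
        if r["predicted"] == r["answer"]:
            correct[r["language"]] += 1
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}


# Made-up records, for illustration only.
records = [
    {"language": "zh", "predicted": "B", "answer": "B"},
    {"language": "zh", "predicted": "A", "answer": "C"},
    {"language": "ko", "predicted": "D", "answer": "D"},
]
print(per_language_accuracy(records))
```

Grouping by language before averaging keeps languages with few questions from being hidden inside a single pooled score.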

Figure 1 : Example questions from the Voices of Civilizations benchmark on three folk songs—Arabic "Jafra," Chinese "Liuyang River", and Korean "Arirang."
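A question of the kind shown in Figure 1 could be represented roughly as follows. This is only a sketch under stated assumptions: the `MCQ` fields, the `build_mcq` helper, the four-option format, the prompt wording, and the distractor pool are all hypothetical and not taken from the paper or its dataset.

```python
import random
from dataclasses import dataclass


@dataclass
class MCQ:
    track: str
    category: str       # e.g. "language", "region", "mood", or "theme"
    prompt: str
    options: list
    answer_index: int   # position of the correct option in `options`


def build_mcq(track, category, prompt, correct, distractors, rng, n_options=4):
    """Pick n_options - 1 distractors, add the correct answer, and shuffle."""
    pool = [d for d in distractors if d != correct]
    options = rng.sample(pool, n_options - 1) + [correct]
    rng.shuffle(options)
    return MCQ(track, category, prompt, options, options.index(correct))


# Illustrative values only; the distractor pool is an assumption.
rng = random.Random(0)
q = build_mcq(
    track="Arirang",
    category="region",
    prompt="Which region is this song most closely associated with?",
    correct="Korea",
    distractors=["China", "Japan", "Mongolia", "Vietnam", "Korea"],
    rng=rng,
)
print(q.options, q.answer_index)
```

Passing a seeded `random.Random` instance keeps the option order reproducible across runs.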

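Experimental Results

Research Questions

RQ1: Can audio LLMs accurately identify cultural attributes (region, mood, theme) from full-length music using audio alone?
RQ2: How does providing background textual context affect performance across languages and traditions?
RQ3: Do models show systematic bias toward high-resource languages or well-represented cultures?
RQ4: How does language matching (question language versus song language) affect cross-lingual understanding?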

Key Findings

Setting	Model	Language	Region	Mood	Theme
Noise	Gemini 2.5 Pro	93.42	42.48	40.15	47.50
Noise	Qwen2.5-Omni-7B	51.05	26.11	23.48	31.25
Noise	Kimi-Audio-7B-Instruct	40.26	23.11	23.45	24.06
Audio (Eng QA)	Gemini 2.5 Pro	99.74	73.01	62.50	85.00
Audio (Eng QA)	Qwen2.5-Omni-7B	86.32	46.02	51.89	56.25
Audio (Eng QA)	Kimi-Audio-7B-Instruct	85.79	40.91	41.15	48.75
Audio (song-language QA)	Gemini 2.5 Pro	100.00	75.22	62.12	87.19
Audio (song-language QA)	Qwen2.5-Omni-7B	89.47	44.25	50.38	58.13
Audio (song-language QA)	Kimi-Audio-7B-Instruct	85.26	42.05	41.15	50.62
Audio + Document	Gemini 2.5 Pro	100.00	99.12	89.39	98.12
Audio + Document	Qwen2.5-Omni-7B	97.37	97.35	83.71	94.69
Audio + Document	Kimi-Audio-7B-Instruct	93.42	81.06	93.36	92.81

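Identifying the language from audio is generally easy for the models: accuracy is above 85% in every setting that includes audio. Under the noise setting it falls to 51.05% and 40.26% for the two 7B models.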
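Understanding region, mood, and theme from audio alone is limited. Accuracy is clearly lower than for language identification, at roughly 41% to 58% for the two 7B models.
Providing background documents substantially improves performance. With audio plus document, Gemini 2.5 Pro and Qwen2.5-Omni-7B reach near-perfect region scores (99.12% and 97.35%).
Performance is highly uneven across languages: high-resource languages score higher, while low-resource traditions show marked drops.
The audio plus document setting shows the largest gains, underscoring how much audio-grounded cultural reasoning depends on textual context.
Models remain biased across the cultures represented, underscoring the need for richer training data.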
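The size of the document effect can be read directly off the results table. The sketch below subtracts the audio-only scores (the third block of rows in the table) from the audio plus document scores for region, mood, and theme. The numbers are copied from the table; the dictionary layout and the `document_gain` helper are illustrative choices, not part of the paper.

```python
CATEGORIES = ("region", "mood", "theme")

# Scores (%) copied from the results table: third block of rows (audio only).
AUDIO_ONLY = {
    "Gemini 2.5 Pro": {"region": 75.22, "mood": 62.12, "theme": 87.19},
    "Qwen2.5-Omni-7B": {"region": 44.25, "mood": 50.38, "theme": 58.13},
    "Kimi-Audio-7B-Instruct": {"region": 42.05, "mood": 41.15, "theme": 50.62},
}

# Scores (%) copied from the results table: audio plus document rows.
AUDIO_PLUS_DOC = {
    "Gemini 2.5 Pro": {"region": 99.12, "mood": 89.39, "theme": 98.12},
    "Qwen2.5-Omni-7B": {"region": 97.35, "mood": 83.71, "theme": 94.69},
    "Kimi-Audio-7B-Instruct": {"region": 81.06, "mood": 93.36, "theme": 92.81},
}


def document_gain(audio_only, audio_plus_doc):
    """Return the gain in percentage points per model and category."""
    return {
        model: {
            cat: round(audio_plus_doc[model][cat] - audio_only[model][cat], 2)
            for cat in CATEGORIES
        }
        for model in audio_only
    }


gains = document_gain(AUDIO_ONLY, AUDIO_PLUS_DOC)
for model, gain in gains.items():
    print(model, gain)
```

On these figures, every model gains at least about 10 percentage points in every category once the document is supplied, and the two 7B models gain more than 30 points in each.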

Figure 2 : Per-language accuracy (%) of three state-of-the-art audio LLMs on the VoC benchmark using audio input only and focusing on region, mood, and theme questions. We invited a Chinese music teacher to answer 29 questions across 10 Chinese songs in a strictly closed-book setting (no reference o

This review was generated by AI and reviewed by a human editor.