QUICK REVIEW

[Paper Review] Voices of Civilizations: A Multilingual QA Benchmark for Global Music Understanding

Shangda Wu, Ziya Zhou|arXiv (Cornell University)|Feb 28, 2026

Music and Audio Processing|Citations: 0

One-line summary

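VoC is the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. With 380 tracks, 38 languages, and 1,190 questions, it exposes gaps in underrepresented traditions.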

ABSTRACT

We introduce Voices of Civilizations, the first multilingual QA benchmark for evaluating audio LLMs' cultural comprehension on full-length music recordings. Covering 380 tracks across 38 languages, our automated pipeline yields 1,190 multiple-choice questions through four stages - each followed by manual verification: 1) compiling a representative music list; 2) generating cultural-background documents for each sample in the music list via LLMs; 3) extracting key attributes from those documents; and 4) constructing multiple-choice questions probing language, region associations, mood, and thematic content. We evaluate models under four conditions and report per-language accuracy. Our findings demonstrate that even state-of-the-art audio LLMs struggle to capture subtle cultural nuances without rich textual context and exhibit systematic biases in interpreting music from different cultural traditions. The dataset is publicly available on Hugging Face to foster culturally inclusive music understanding research.

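Motivation and objectives

Evaluate whether current audio LLMs understand cultural attributes in full-length music recordings across languages.
Build a multilingual, culturally focused QA benchmark covering language, region, mood, and theme.
Provide a dataset that combines automated generation with manual verification, for studying bias and context dependence.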

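Proposed method

Four-stage automated pipeline, each stage followed by manual verification: track selection, generation of cultural-background documents in the song's native language and in English, attribute extraction (region, mood, theme), and construction of multiple-choice questions.
Gemini 2.5 Pro generates the bilingual background documents and the questions.
Models are evaluated under four settings: Noise, Audio (Eng QA) with questions in English, Audio with questions in the song's language, and Audio + Doc.
Per-language accuracy is reported, and cross-lingual cultural understanding and the effect of textual context are analyzed.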
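
The per-language accuracy reporting described above can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the `Record` schema, the setting names, and the sample answers are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One model answer to one multiple-choice question (hypothetical schema)."""
    language: str   # language of the song, e.g. "ko"
    category: str   # "language", "region", "mood", or "theme"
    setting: str    # "noise", "audio_eng_qa", "audio", or "audio_doc"
    predicted: str  # option letter chosen by the model
    answer: str     # gold option letter


def per_language_accuracy(records, setting):
    """Return {language: accuracy in percent} for one evaluation setting."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        if r.setting != setting:
            continue
        total[r.language] += 1
        correct[r.language] += int(r.predicted == r.answer)
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}


# Made-up sample answers, for illustration only.
sample = [
    Record("ko", "region", "audio", "A", "A"),
    Record("ko", "mood", "audio", "B", "C"),
    Record("ar", "theme", "audio", "D", "D"),
    Record("ar", "region", "audio_doc", "A", "A"),
]
print(per_language_accuracy(sample, "audio"))
```

Grouping by song language rather than pooling all questions is what makes gaps between well-represented and underrepresented traditions visible; a single pooled score would hide them.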

Figure 1 : Example questions from the Voices of Civilizations benchmark on three folk songs—Arabic "Jafra," Chinese "Liuyang River", and Korean "Arirang."

Experimental results

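Research questions

RQ1: Can models accurately identify cultural attributes (region, mood, theme) of full-length music from audio alone?
RQ2: How does providing background text affect performance across languages and traditions?
RQ3: Do models show systematic bias toward high-resource languages and well-represented cultures?
RQ4: How does matching the question language to the song language affect cross-lingual understanding?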

Key findings

Setting	Model	Language	Region	Mood	Theme
Noise	Gemini 2.5 Pro	93.42	42.48	40.15	47.50
Noise	Qwen2.5-Omni-7B	51.05	26.11	23.48	31.25
Noise	Kimi-Audio-7B-Instruct	40.26	23.11	23.45	24.06
Audio (Eng QA)	Gemini 2.5 Pro	99.74	73.01	62.50	85.00
Audio (Eng QA)	Qwen2.5-Omni-7B	86.32	46.02	51.89	56.25
Audio (Eng QA)	Kimi-Audio-7B-Instruct	85.79	40.91	41.15	48.75
Audio	Gemini 2.5 Pro	100.00	75.22	62.12	87.19
Audio	Qwen2.5-Omni-7B	89.47	44.25	50.38	58.13
Audio	Kimi-Audio-7B-Instruct	85.26	42.05	41.15	50.62
Audio + Doc	Gemini 2.5 Pro	100.00	99.12	89.39	98.12
Audio + Doc	Qwen2.5-Omni-7B	97.37	97.35	83.71	94.69
Audio + Doc	Kimi-Audio-7B-Instruct	93.42	81.06	93.36	92.81

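Language identification from audio is generally easy: with real audio input, all three models exceed 85% accuracy in every setting, while the Noise baseline is far lower for the two 7B models (51.05% and 40.26%).
Understanding of region, mood, and theme from audio alone is limited, with accuracy generally well below that of language identification (for example, region accuracy ranges from 40.91% to 75.22% in the two audio-only settings).
Providing the background document improves performance dramatically, and some models reach near-perfect scores in several categories (Gemini 2.5 Pro: 99.12% on region, 98.12% on theme).
Performance is highly uneven across languages: it is higher for high-resource languages and drops sharply for low-resource traditions.
The Audio + Doc setting shows the largest improvement, which highlights that the models rely more on textual context than on audio-based cultural reasoning.
Models remain biased toward well-represented cultures, which points to the need for more diverse training data.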
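
To make the size of the document effect concrete, the sketch below computes the percentage-point gain from the Audio setting to the Audio + Doc setting for each model. The numbers are copied from the results table; the dictionary layout and the `doc_gain` helper are illustrative assumptions, not part of the paper.

```python
# Accuracy (%) copied from the results table, ordered as (region, mood, theme).
AUDIO = {
    "Gemini 2.5 Pro": (75.22, 62.12, 87.19),
    "Qwen2.5-Omni-7B": (44.25, 50.38, 58.13),
    "Kimi-Audio-7B-Instruct": (42.05, 41.15, 50.62),
}
AUDIO_DOC = {
    "Gemini 2.5 Pro": (99.12, 89.39, 98.12),
    "Qwen2.5-Omni-7B": (97.35, 83.71, 94.69),
    "Kimi-Audio-7B-Instruct": (81.06, 93.36, 92.81),
}


def doc_gain(model):
    """Percentage-point gain per category when the background document is added."""
    return tuple(round(d - a, 2) for a, d in zip(AUDIO[model], AUDIO_DOC[model]))


for model in AUDIO:
    region, mood, theme = doc_gain(model)
    print(f"{model}: region +{region:.2f}, mood +{mood:.2f}, theme +{theme:.2f}")
```

For Gemini 2.5 Pro, for instance, the region gain is 99.12 - 75.22 = 23.90 points, and for Qwen2.5-Omni-7B it is 97.35 - 44.25 = 53.10 points, so the weaker audio-only models benefit most from the added text.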

Figure 2 : Per-language accuracy (%) of three state-of-the-art audio LLMs on the VoC benchmark using audio input only and focusing on region, mood, and theme questions. We invited a Chinese music teacher to answer 29 questions across 10 Chinese songs in a strictly closed-book setting (no reference o

This review was generated by AI and checked by a human editor.