QUICK REVIEW

[Paper Review] Musical Training, but not Mere Exposure to Music, Drives the Emergence of Chroma Equivalence in Artificial Neural Networks

Lukas Grasse, Matthew S. Tata|arXiv (Cornell University)|Feb 20, 2026

Neuroscience and Music Perception|Citations: 0

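One-Line Summary

The study shows that chroma equivalence in ANNs emerges only after supervised fine-tuning on music transcription, not from mere exposure to music or from self-supervised training on it; pitch height is represented far more universally.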

ABSTRACT

Pitch is a fundamental aspect of auditory perception. Pitch perception is commonly described across two perceptual dimensions: pitch height is the sense that tones with varying frequencies seem to be higher or lower, and chroma equivalence is the cyclical similarity of notes across octaves, corresponding to a doubling of fundamental frequency. Existing research is divided on whether chroma equivalence is a learned percept that varies according to musical experience and culture, or is an innate percept that develops automatically. Building on a recent framework that proposes to use ANNs to ask 'why' questions about the brain, we evaluated recent auditory ANNs using representational similarity analysis to test the emergence of pitch height and chroma equivalence in their learned representations. Additionally, we fine-tuned two models, Wav2Vec 2.0 and Data2Vec, on a self-supervised learning task using speech and music, and a supervised music transcription task. We found that all models exhibited varying degrees of pitch height representation, but that only models trained on the supervised music transcription task exhibited chroma equivalence. Mere exposure to music through self-supervised learning was not sufficient for chroma equivalence to emerge. This supports the view that chroma equivalence is a higher-order cognitive computation that emerges to support the specific task of music perception, distinct from other auditory perception such as speech listening. This work also highlights the usefulness of ANNs for probing the developmental conditions that give rise to perceptual representations in humans.

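Motivation and Objectives

Investigate whether pitch height and chroma equivalence emerge in ANNs under different training regimes.
Determine whether self-supervised exposure to music or speech drives chroma equivalence.
Assess whether supervised music transcription training is necessary for chroma equivalence to emerge.
Use RSA to compare pretrained, self-supervised, and supervised fine-tuned models against chroma and pitch-height models.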

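Proposed Method

Evaluate transformer-based auditory models (Wav2Vec 2.0, Data2Vec, Whisper, MERT, AST) trained with self-supervised learning (SSL), supervised learning (SL), or SSL followed by supervised fine-tuning (SSL+SFT).
Fine-tune models on speech+music data to test the effect of passive exposure to music.
Fine-tune models on MAESTRO (polyphonic piano music transcription) to test the effect of an active music task.
Use Representational Similarity Analysis (RSA) to compare model embeddings against pitch-height and chroma-equivalence models.
Draw the RSA stimuli from NSynth notes in octaves 4–6, using 10 instruments each from the flute, guitar, and keyboard families, for 30 instruments in total.
Analyze the noise ceiling and run statistical tests with Bonferroni correction.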
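
The RSA comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: the choice of correlation distance for the embedding RDM, Spearman correlation for comparing RDMs, the mapping of octaves 4–6 to MIDI notes 60 to 95, and the two toy embeddings are all assumptions made here for demonstration. The noise ceiling and Bonferroni correction are omitted.

```python
import numpy as np

def average_ranks(x):
    """Rank values 1..n, giving tied values the mean of the ranks they span."""
    x = np.asarray(x, dtype=float)
    _, inverse, counts = np.unique(x, return_inverse=True, return_counts=True)
    last_rank = np.cumsum(counts)               # highest rank in each tie group
    mean_rank = last_rank - (counts - 1) / 2.0  # centre of each tie group
    return mean_rank[inverse]

def spearman(a, b):
    """Spearman correlation: Pearson correlation of tie-averaged ranks."""
    return float(np.corrcoef(average_ranks(a), average_ranks(b))[0, 1])

def upper_triangle(rdm):
    """Off-diagonal upper-triangle entries of a square RDM as a vector."""
    rows, cols = np.triu_indices(rdm.shape[0], k=1)
    return rdm[rows, cols]

def embedding_rdm(embeddings):
    """Correlation-distance RDM (1 - Pearson r) between stimulus rows."""
    return 1.0 - np.corrcoef(embeddings)

def pitch_height_rdm(midi):
    """Dissimilarity grows with the semitone distance between two notes."""
    midi = np.asarray(midi, dtype=float)
    return np.abs(midi[:, None] - midi[None, :])

def chroma_rdm(midi):
    """Circular pitch-class distance: notes an octave apart have distance 0."""
    pc = np.asarray(midi) % 12
    diff = np.abs(pc[:, None] - pc[None, :])
    return np.minimum(diff, 12 - diff)

def rsa_score(embeddings, model_rdm):
    """Spearman correlation between an embedding RDM and a model RDM."""
    return spearman(upper_triangle(embedding_rdm(embeddings)),
                    upper_triangle(model_rdm))

# Assumption: MIDI notes 60..95 stand in for octaves 4-6 (C4 = 60).
midi = np.arange(60, 96)

# Toy "height-only" embedding: Gaussian tuning curves along the pitch axis.
centers = np.arange(40, 116)
height_emb = np.exp(-((midi[:, None] - centers[None, :]) ** 2) / (2 * 6.0 ** 2))

# Toy "chroma-only" embedding: tuning curves on the 12-step pitch-class circle.
angles = 2 * np.pi * ((midi % 12)[:, None] - np.arange(12)[None, :]) / 12
chroma_emb = np.exp(2.0 * np.cos(angles))

for name, emb in [("height-only", height_emb), ("chroma-only", chroma_emb)]:
    print(name,
          "height:", round(rsa_score(emb, pitch_height_rdm(midi)), 2),
          "chroma:", round(rsa_score(emb, chroma_rdm(midi)), 2))
```

In a real analysis the toy arrays would be replaced by layer activations of the model under test, one row per NSynth stimulus; a model that encodes pitch height but not chroma would score high against the first model RDM and near zero against the second.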

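Experimental Results

Research Questions

RQ1: Does a pitch height representation emerge in ANNs regardless of training regime?
RQ2: Does chroma equivalence emerge from self-supervised exposure to music or speech?
RQ3: Does supervised fine-tuning on a music transcription task induce chroma equivalence, in contrast to other tasks?
RQ4: Does mere exposure to music (self-supervised exposure) make chroma appear in ANN representations?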

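Key Findings

All pretrained/self-supervised ANNs encode pitch height but not chroma equivalence.
Adding music to the training data through self-supervised fine-tuning does not produce chroma equivalence.
Supervised fine-tuning on the music transcription task induces chroma equivalence in Wav2Vec 2.0 and Data2Vec.
Fine-tuning on speech recognition yields no gain in chroma equivalence, even when pitch height encoding stays similar or increases.
CQT-based models show chroma equivalence by design; it is not something that newly emerges from general training.
Pitch height representations emerge broadly and automatically, whereas chroma equivalence reflects a higher-order, music-related computation.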
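
One way to see why a CQT front end yields chroma equivalence "by design" is that constant-Q bins are spaced logarithmically, so an octave is always the same number of bins apart. The sketch below is illustrative only: the 12 bins per octave, the lowest bin near C1, and the 10 Hz linear bin width are assumed values, not settings reported in the review.

```python
import math

BINS_PER_OCTAVE = 12  # assumption: one CQT bin per semitone
F_MIN_HZ = 32.703     # assumption: lowest bin centred near C1

def cqt_bin(freq_hz):
    """Index of the log-spaced CQT bin closest to freq_hz."""
    return round(BINS_PER_OCTAVE * math.log2(freq_hz / F_MIN_HZ))

def chroma_of_bin(k):
    """Fold a CQT bin index onto one of 12 pitch classes."""
    return k % BINS_PER_OCTAVE

def linear_bin(freq_hz, bin_width_hz=10.0):
    """Index of the closest bin on a linearly spaced (STFT-like) axis."""
    return round(freq_hz / bin_width_hz)

for low_hz in (220.0, 440.0, 880.0):
    high_hz = 2 * low_hz  # one octave up
    same_chroma = chroma_of_bin(cqt_bin(high_hz)) == chroma_of_bin(cqt_bin(low_hz))
    print(low_hz,
          "CQT shift:", cqt_bin(high_hz) - cqt_bin(low_hz),
          "linear shift:", linear_bin(high_hz) - linear_bin(low_hz),
          "same chroma:", same_chroma)
```

On the log-spaced axis the octave shift is constant, so folding bin indices modulo the bins per octave groups octave-related notes together without any learning; on a linear axis the shift depends on the starting frequency, so no such fixed folding exists.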

This review was generated by AI and checked by a human editor.