QUICK REVIEW

[Paper Review] Musical Training, but not Mere Exposure to Music, Drives the Emergence of Chroma Equivalence in Artificial Neural Networks

Lukas Grasse, Matthew S. Tata|arXiv (Cornell University)|Feb 20, 2026

Neuroscience and Music Perception | Cited by 0

One-Sentence Summary

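This study shows that chroma equivalence emerges in artificial neural networks (ANNs) only after supervised fine-tuning on music transcription. Mere exposure to music, including self-supervised training on it, does not produce it, whereas pitch height is represented far more universally.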

ABSTRACT

Pitch is a fundamental aspect of auditory perception. Pitch perception is commonly described across two perceptual dimensions: pitch height is the sense that tones with varying frequencies seem to be higher or lower, and chroma equivalence is the cyclical similarity of notes separated by octaves, corresponding to a doubling of fundamental frequency. Existing research is divided on whether chroma equivalence is a learned percept that varies according to musical experience and culture, or is an innate percept that develops automatically. Building on a recent framework that proposes to use ANNs to ask 'why' questions about the brain, we evaluated recent auditory ANNs using representational similarity analysis to test the emergence of pitch height and chroma equivalence in their learned representations. Additionally, we fine-tuned two models, Wav2Vec 2.0 and Data2Vec, on a self-supervised learning task using speech and music, and a supervised music transcription task. We found that all models exhibited varying degrees of pitch height representation, but that only models trained on the supervised music transcription task exhibited chroma equivalence. Mere exposure to music through self-supervised learning was not sufficient for chroma equivalence to emerge. This supports the view that chroma equivalence is a higher-order cognitive computation that emerges to support the specific task of music perception, distinct from other auditory perception such as speech listening. This work also highlights the usefulness of ANNs for probing the developmental conditions that give rise to perceptual representations in humans.

Research Motivation and Objectives

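Investigate whether pitch height and chroma equivalence emerge in ANNs under different training regimes.
Determine whether self-supervised exposure to music or speech drives chroma equivalence.
Assess whether supervised fine-tuning on a music transcription task is required for chroma equivalence.
Use RSA to compare pretrained, self-supervised, and supervised fine-tuned models against model representations of chroma and pitch height.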

Proposed Method

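Evaluate transformer-based auditory models (Wav2Vec 2.0, Data2Vec, Whisper, MERT, AST) under self-supervised learning (SSL), supervised learning (SL), or SSL followed by supervised fine-tuning (SSL+SFT).
Fine-tune models on speech + music data to test the effect of passive music exposure on the emergence of chroma.
Fine-tune models on polyphonic piano music transcription (MAESTRO) to test the effect of an active musical task on the emergence of chroma.
Use representational similarity analysis (RSA) to compare model embeddings with model representations of pitch height and chroma equivalence.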
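Draw the RSA stimuli from NSynth notes in octaves 4–6, covering 30 instruments (10 each of flute, guitar, and keyboard).
Estimate noise ceilings and run Bonferroni-corrected statistical tests.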
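
The RSA comparison described above can be sketched in a few lines. This is a minimal illustration and not the authors' pipeline: the toy embeddings, the Euclidean distance metric, the Spearman comparison, the MIDI range 60–95 as a stand-in for "octaves 4–6", and all function names are assumptions made for the example. It builds one model dissimilarity vector for pitch height and one for chroma, then scores two synthetic "networks" against each.

```python
import math
from itertools import combinations

def height_distance(m1, m2):
    """Pitch height model: dissimilarity grows with the semitone gap."""
    return abs(m1 - m2)

def chroma_distance(m1, m2):
    """Chroma model: circular semitone distance, so octaves are 0 apart."""
    d = abs(m1 - m2) % 12
    return min(d, 12 - d)

def model_rdm(notes, dist):
    """Upper triangle of a model dissimilarity matrix over all note pairs."""
    return [dist(a, b) for a, b in combinations(notes, 2)]

def embedding_rdm(embeddings):
    """Upper triangle of pairwise Euclidean distances (assumed metric).
    Rounding lets numerically equal distances tie exactly when ranked."""
    return [round(math.dist(a, b), 9) for a, b in combinations(embeddings, 2)]

def ranks(values):
    """Average ranks (1-based), with ties sharing their mean rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def rsa_score(embeddings, notes, dist):
    """Spearman correlation between the embedding RDM and a model RDM."""
    return pearson(ranks(embedding_rdm(embeddings)), ranks(model_rdm(notes, dist)))

# Hypothetical stimuli: MIDI notes 60..95 (three octaves), one note each.
notes = list(range(60, 96))

# Toy stand-ins for network embeddings (not real model activations).
height_only = [[float(m)] for m in notes]
chroma_only = [[math.cos(2 * math.pi * m / 12), math.sin(2 * math.pi * m / 12)]
               for m in notes]

for name, emb in [("height-only", height_only), ("chroma-only", chroma_only)]:
    print(name,
          "height:", round(rsa_score(emb, notes, height_distance), 3),
          "chroma:", round(rsa_score(emb, notes, chroma_distance), 3))
```

In a real analysis the toy embeddings would be replaced by layer activations for each stimulus, and the scores would be compared against a noise ceiling with corrected significance tests, as the method list states.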

Experimental Results

Research Questions

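RQ1: Does a representation of pitch height emerge in ANNs under any training regime?
RQ2: Does self-supervised exposure to music or speech give rise to chroma equivalence?
RQ3: Does supervised fine-tuning on a music transcription task induce chroma equivalence more than other tasks do?
RQ4: Is exposure to music alone (SSL + exposure) sufficient for chroma to form in ANN representations?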

Key Findings

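All pretrained/self-supervised ANNs encode pitch height, but none exhibits chroma equivalence.
Introducing music into the training data through self-supervised fine-tuning does not produce chroma equivalence.
Supervised fine-tuning on the music transcription task induces chroma equivalence in Wav2Vec 2.0 and Data2Vec.
Fine-tuning on speech recognition brings no gain in chroma equivalence, even though pitch height encoding stays similar or increases.
The CQT-based model exhibits chroma equivalence as a consequence of its design, not as something that emerges from general training.
Pitch height representation appears relatively automatic, whereas chroma equivalence reflects a higher-level, music-specific computation.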
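
To see why a log-frequency representation can show chroma equivalence by design, consider the toy sketch below. It does not compute a real constant-Q transform, and it is not taken from the paper: the one-bin-per-semitone spectrum, the harmonic weights, and the function names are all assumptions for illustration. On a log-frequency axis an octave shift is a fixed 12-bin translation, so folding bins modulo 12 maps notes an octave apart onto the same chroma vector.

```python
import math

BINS_PER_OCTAVE = 12  # assumed resolution: one bin per semitone

def log_freq_spectrum(midi, lowest_midi=24, n_octaves=8, n_harmonics=4):
    """Toy CQT-like magnitude vector for a harmonic tone (hypothetical)."""
    spec = [0.0] * (BINS_PER_OCTAVE * n_octaves)
    for h in range(1, n_harmonics + 1):
        # Harmonic h lies 12 * log2(h) semitones above the fundamental.
        idx = round(midi - lowest_midi + 12 * math.log2(h))
        if 0 <= idx < len(spec):
            spec[idx] += 1.0 / h
    return spec

def fold_to_chroma(spec):
    """Sum all bins that share a pitch class (bin index modulo 12)."""
    chroma = [0.0] * BINS_PER_OCTAVE
    for i, value in enumerate(spec):
        chroma[i % BINS_PER_OCTAVE] += value
    return chroma

def same_vector(a, b):
    return all(math.isclose(x, y, abs_tol=1e-12) for x, y in zip(a, b))

c4, c5, g4 = 60, 72, 67  # MIDI note numbers
print("C4 vs C5 spectra equal:", same_vector(log_freq_spectrum(c4), log_freq_spectrum(c5)))
print("C4 vs C5 chroma equal: ", same_vector(fold_to_chroma(log_freq_spectrum(c4)),
                                             fold_to_chroma(log_freq_spectrum(c5))))
print("C4 vs G4 chroma equal: ", same_vector(fold_to_chroma(log_freq_spectrum(c4)),
                                             fold_to_chroma(log_freq_spectrum(g4))))
```

The octave similarity here comes from the hand-built feature layout and not from anything learned, which is the distinction the finding about the CQT-based model draws.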


This review was generated by AI and reviewed by a human editor.