QUICK REVIEW

[Paper Review] When Noise Lowers The Loss: Rethinking Likelihood-Based Evaluation in Music Large Language Models

Xiaosha Li, Chun Y. Liu|arXiv (Cornell University)|Feb 2, 2026

Music and Audio Processing | Cited by 0

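One-Sentence Summary

The paper shows that likelihood-based loss can decrease when music is corrupted, names this behavior the "Context Amnesia Effect", and recommends evaluating music LLMs by the shape of the loss curve rather than by the absolute loss value.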

ABSTRACT

The rise of music large language models (LLMs) demands robust methods of evaluating output quality, especially in distinguishing high-quality compositions from "garbage music". Curiously, we observe that the standard cross-entropy loss -- a core training metric -- often decreases when models encounter systematically corrupted music, undermining its validity as a standalone quality indicator. To investigate this paradox, we introduce a noise injection experiment, in which controlled noise signals of varying lengths are injected into musical contexts. We hypothesize that a model's loss reacting positively to these perturbations, specifically a sharp increase ("Peak" area) for short injections, can serve as a proxy for its ability to discern musical integrity. Experiments with MusicGen models in the audio waveform domain confirm that Music LLMs respond more strongly to local, texture-level disruptions than to global semantic corruption. Beyond exposing this bias, our results highlight a new principle: the shape of the loss curve -- rather than its absolute value -- encodes critical information about the quality of the generated content (i.e., model behavior). We envision this profile-based evaluation as a label-free, model-intrinsic framework for assessing musical quality -- opening the door to more principled training objectives and sharper benchmarks.

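Motivation and Objectives

Push the evaluation of music LLM outputs beyond absolute loss values toward more robust methods.
Demonstrate counterintuitive loss behavior on music sequences subjected to controlled noise perturbation.
Introduce and characterize the Context Amnesia Effect in per-token loss dynamics.
Propose a profile-based evaluation framework that uses the shape of the loss curve to assess musical quality.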

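Proposed Method

Run noise injection experiments that insert perturbations of varying lengths into the audio input, and measure the resulting change in per-token loss.
Define a per-token loss difference, Delta ell_t, to quantify the effect of a perturbation.
Analyze loss behavior across several MusicGen models, three datasets (TrainingSet, Generated, OOD), and a range of perturbation lengths.
Extend the analysis to other forms of perturbation, such as order shuffling, to test how general the findings are.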
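
The per-token loss difference in the method above can be sketched as follows. This is a minimal illustration, not the authors' code. The sign convention (perturbed loss minus clean loss, so a negative value means the perturbation lowered the loss), the function names `token_nll` and `loss_difference`, and the use of NumPy on precomputed logits are all assumptions made for this example.

```python
import numpy as np


def token_nll(logits, targets):
    """Per-token negative log-likelihood.

    logits:  array of shape (T, V), one row of unnormalized scores per position.
    targets: array of shape (T,), the token id observed at each position.
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets]


def loss_difference(clean_loss, perturbed_loss):
    """Assumed definition: Delta ell_t = perturbed loss at t minus clean loss at t."""
    clean = np.asarray(clean_loss, dtype=np.float64)
    perturbed = np.asarray(perturbed_loss, dtype=np.float64)
    if clean.shape != perturbed.shape:
        raise ValueError("clean and perturbed loss curves must have the same length")
    return perturbed - clean
```

In practice the two loss curves would come from scoring the clean and the perturbed token sequence with the same model; how the logits are obtained from a particular MusicGen checkpoint is left out here on purpose.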

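Experimental Results

Research Questions

RQ1: Does absolute cross-entropy loss reliably reflect perturbation or quality in music LLMs?
RQ2: How does perturbation length affect the loss, and what shape does the per-token loss curve take under perturbation?
RQ3: Is a profile view of the loss curve (peak, assimilation, recovery) a better indicator of musical quality than the raw loss?
RQ4: Are the findings robust across models, datasets, and perturbation types (noise versus shuffling)?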

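Key Findings

Short perturbations trigger a sharp loss peak, whereas longer perturbations lower the loss (context amnesia).
Across models and datasets, increasing the perturbation length yields a negative loss difference, meaning the loss falls as the perturbation grows.
Absolute loss is unreliable for detecting perturbation or judging musical quality; the shape of the loss curve is a more reliable signal, and the onset peak is especially informative.
The per-token loss dynamics fall into three phases: a peak region, an assimilation region, and a recovery region.
Order shuffling produces a similar loss-curve pattern, supporting the generality of the Context Amnesia Effect.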
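
One way to turn the three-phase description above into numbers is sketched below. The summary does not say how the paper delimits the three regions, so the boundaries used here are assumptions: the peak region is taken as the first `peak_window` tokens after the injection starts, the assimilation region as the rest of the injected span, and the recovery region as everything after the injection ends. The function name `summarize_profile` and all of its parameters are hypothetical.

```python
import numpy as np


def summarize_profile(delta, start, length, peak_window=5):
    """Summarize a per-token loss-difference curve around one injected span.

    delta:       1-D array of per-token loss differences (perturbed minus clean).
    start:       index of the first perturbed token.
    length:      number of perturbed tokens.
    peak_window: assumed width of the peak region, in tokens.
    """
    delta = np.asarray(delta, dtype=np.float64)
    end = start + length
    if not (0 <= start < end <= len(delta)):
        raise ValueError("injected span must lie inside the curve")

    peak_end = min(start + peak_window, end)
    peak = delta[start:peak_end]      # assumed peak region
    assimilation = delta[peak_end:end]  # assumed assimilation region
    recovery = delta[end:]            # assumed recovery region

    def mean_or_nan(values):
        # Empty regions (e.g. a very short injection) have no defined mean.
        return float(values.mean()) if values.size else float("nan")

    return {
        "peak_height": float(peak.max()),
        "assimilation_mean": mean_or_nan(assimilation),
        "recovery_mean": mean_or_nan(recovery),
    }
```

Under these assumptions, a curve that matches the reported pattern would show a positive `peak_height` together with a negative `assimilation_mean` for long injections.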

This review was generated by AI and reviewed by a human editor.