QUICK REVIEW

[Paper Review] Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection

Xiaoxuan Guo, Yuankun Xie|arXiv (Cornell University)|Jan 30, 2026

Speech Recognition and Synthesis | Cited by 0

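ONE-SENTENCE SUMMARY

The paper proposes SDD-APALLM, an acoustically enhanced framework that gives an audio LLM explicit time-frequency acoustic evidence (a CQT spectrogram) alongside the raw audio, improving speech deepfake detection and robustness to domain shift.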

ABSTRACT

Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.

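MOTIVATION AND OBJECTIVES

Identify why audio LLMs that rely on semantic cues perform poorly under domain shift in speech deepfake detection.
Propose an acoustically enhanced framework (SDD-APALLM) that exposes fine-grained time-frequency evidence to audio LLMs.
Show that explicit acoustic evidence improves robustness and interpretability without modifying the pretrained encoders.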

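PROPOSED METHOD

Represent each utterance with two complementary views: an auditory view (raw audio) and a time-frequency view (CQT).
Convert the CQT magnitude to dB and present it as visual evidence alongside the raw audio.
Integrate the modalities in a shared LLM space through a multimodal aligner, interleaving audio tokens and CQT tokens in the same prompt.
Train with standard supervised fine-tuning (causal language modeling objective), prompting the model to output a real/fake label.
Evaluate on ASVspoof2019 LA and ASVspoof2021 LA, with ablations over spectrogram type and model scale.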
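
The CQT-to-dB step above can be sketched in a few lines. This is a minimal, deliberately naive illustration using only the Python standard library, not the paper's pipeline: the function name `naive_cqt_db`, the parameter values (`fmin`, `n_bins`, `hop`, `floor_db`), the Hann window, and the choice to express dB relative to the peak magnitude are all assumptions made for this example. The summary does not state which CQT implementation or settings the authors used.

```python
import cmath
import math


def naive_cqt_db(signal, sr, fmin=110.0, n_bins=24, bins_per_octave=12,
                 hop=256, floor_db=-80.0):
    """Naive constant-Q magnitude in dB relative to the peak (illustrative only)."""
    # Constant Q: the ratio of centre frequency to bandwidth is the same for all bins.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)
    n_frames = 1 + len(signal) // hop
    mags = []
    for k in range(n_bins):
        f_k = fmin * 2.0 ** (k / bins_per_octave)  # geometrically spaced centres
        n_k = int(math.ceil(q * sr / f_k))         # low bins get longer windows
        # Hann-windowed complex exponential at f_k, normalised by window length.
        kernel = [
            (0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_k))
            * cmath.exp(-2j * math.pi * f_k * n / sr) / n_k
            for n in range(n_k)
        ]
        row = []
        for t in range(n_frames):
            start = t * hop - n_k // 2             # centre the window on the frame
            acc = 0j
            for n, w in enumerate(kernel):
                idx = start + n
                if 0 <= idx < len(signal):         # treat samples outside as zero
                    acc += signal[idx] * w
            row.append(abs(acc))
        mags.append(row)
    # Assumption: dB is taken relative to the largest magnitude, then floored.
    peak = max(max(row) for row in mags) or 1.0
    return [[max(20.0 * math.log10(max(m / peak, 1e-12)), floor_db) for m in row]
            for row in mags]


# Usage: a 220 Hz tone sits exactly one octave (12 bins) above fmin = 110 Hz.
sr = 4000
tone = [math.sin(2.0 * math.pi * 220.0 * n / sr) for n in range(2000)]
cqt_db = naive_cqt_db(tone, sr)
frame = 4
strongest_bin = max(range(len(cqt_db)), key=lambda k: cqt_db[k][frame])
print(strongest_bin)  # index of the bin with the most energy in that frame
```

The resulting 2-D array of dB values is what would be rendered as an image and handed to the model as the time-frequency view. A production system would use an optimized CQT implementation instead of this quadratic-time loop.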

Figure 1: Illustration of the capability gap of audio LLMs in speech deepfake detection. While audio LLMs exhibit strong semantic understanding, they struggle with reliable deepfake detection when acoustic evidence is accessed implicitly. Introducing explicit time–frequency representations reshapes

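EXPERIMENTAL RESULTS

Research Questions

RQ1: Can explicit access to fine-grained acoustic evidence mitigate semantic shortcut learning in audio LLMs used for SDD?
RQ2: Does combining raw audio with structured time-frequency representations improve detection robustness both in-domain and cross-domain?
RQ3: Which time-frequency representations (e.g., CQT, Mel, STFT) are most beneficial for LLM-based SDD across model scales?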
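
As background for RQ3, the three representations differ mainly in how they place their frequency bins. The sketch below compares the bin centres only. The helper names and all numeric settings (sample rate, FFT size, number of bins) are illustrative assumptions, and the mel mapping uses the common HTK-style formula; none of this is taken from the paper's configuration.

```python
import math


def cqt_centres(fmin, n_bins, bins_per_octave=12):
    """Geometric spacing: every bin is a fixed ratio above the previous one."""
    return [fmin * 2.0 ** (k / bins_per_octave) for k in range(n_bins)]


def stft_centres(sr, n_fft):
    """Linear spacing: bins sit sr / n_fft Hz apart from 0 Hz to Nyquist."""
    return [k * sr / n_fft for k in range(n_fft // 2 + 1)]


def mel_centres(fmin, fmax, n_mels):
    """Equal spacing on the mel scale (HTK-style formula), mapped back to Hz."""
    def to_mel(f):
        return 2595.0 * math.log10(1.0 + f / 700.0)

    def to_hz(m):
        return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    lo, hi = to_mel(fmin), to_mel(fmax)
    step = (hi - lo) / (n_mels + 1)
    return [to_hz(lo + step * (i + 1)) for i in range(n_mels)]


# Usage: print the first few bin centres of each representation.
for name, centres in [
    ("CQT", cqt_centres(32.7, 84)),
    ("STFT", stft_centres(16000, 512)),
    ("Mel", mel_centres(0.0, 8000.0, 80)),
]:
    print(name, [round(f, 1) for f in centres[:4]])
```

CQT bins keep a constant ratio between neighbours, so low frequencies are resolved more finely than in a linearly spaced STFT; mel bins fall between the two, roughly linear at low frequencies and widening toward high frequencies.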

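Key Findings

Audio LLMs perform near chance on zero-shot SDD, but improve markedly under audio-only supervision.
Explicit acoustic evidence supplied via CQT together with raw audio yields further gains over CQT-only or audio-only input and improves robustness under domain shift.
Larger models may reinforce semantic shortcuts when using raw audio, but explicit acoustic cues stabilize reasoning and improve cross-domain performance.
The use of explicit acoustic evidence is visible at inference time: the model attends to both audio tokens and visual (CQT) tokens.
SDD-APALLM reaches 99.46% ACC on ASVspoof2019 LA in the Audio+CQT condition, surpassing prior audio LLM-based methods and many end-to-end models.
The gains are attributed to better access to local time-frequency patterns rather than merely to a larger amount of information.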
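
Because the model answers with a generated real/fake label, an accuracy figure such as the ACC reported above has to be computed from text outputs. The sketch below shows one way this could be done. The label strings `"real"` and `"fake"`, the function names, and the rule that an unparseable answer counts as an error are assumptions for illustration, not the paper's evaluation protocol.

```python
import re


def parse_label(text):
    """Map a generated answer to 'real', 'fake', or None when it is ambiguous."""
    lowered = text.lower()
    has_real = re.search(r"\breal\b", lowered) is not None
    has_fake = re.search(r"\bfake\b", lowered) is not None
    if has_real == has_fake:  # neither label found, or both found
        return None
    return "real" if has_real else "fake"


def accuracy_percent(generated, references):
    """Percentage of answers whose parsed label matches the reference label."""
    if not references or len(generated) != len(references):
        raise ValueError("need two non-empty lists of equal length")
    # Assumption: an answer that cannot be parsed is scored as incorrect.
    correct = sum(parse_label(g) == r for g, r in zip(generated, references))
    return 100.0 * correct / len(references)


# Usage with made-up model outputs and reference labels.
outputs = ["Real", "fake.", "The audio is fake", "unsure"]
labels = ["real", "fake", "real", "fake"]
print(accuracy_percent(outputs, labels))
```

Matching on whole words avoids counting "really" as a "real" prediction, and treating ambiguous answers as errors keeps the score from being inflated by outputs that never commit to a label.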

Figure 2: Overview of the proposed SDD-APALLM. The framework combines raw audio and CQT spectrograms to explicitly present fine-grained acoustic evidence through time–frequency representations, facilitating speech deepfake detection within audio LLMs.

This review was generated by AI and checked by a human editor.