QUICK REVIEW

[Paper Review] Towards Explicit Acoustic Evidence Perception in Audio LLMs for Speech Deepfake Detection

Xiaoxuan Guo, Yuankun Xie|arXiv (Cornell University)|2026. 01. 30.

Speech Recognition and Synthesis|Citations: 0

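One-Line Summary

The paper proposes SDD-APALLM, an acoustically enhanced audio LLM that explicitly exposes time-frequency acoustic evidence (CQT) alongside raw audio to improve speech deepfake detection and robustness to domain shift.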

ABSTRACT

Speech deepfake detection (SDD) focuses on identifying whether a given speech signal is genuine or has been synthetically generated. Existing audio large language model (LLM)-based methods excel in content understanding; however, their predictions are often biased toward semantically correlated cues, which results in fine-grained acoustic artifacts being overlooked during the decision-making process. Consequently, fake speech with natural semantics can bypass detectors despite harboring subtle acoustic anomalies; this suggests that the challenge stems not from the absence of acoustic data, but from its inadequate accessibility when semantic-dominant reasoning prevails. To address this issue, we investigate SDD within the audio LLM paradigm and introduce SDD with Auditory Perception-enhanced Audio Large Language Model (SDD-APALLM), an acoustically enhanced framework designed to explicitly expose fine-grained time-frequency evidence as accessible acoustic cues. By combining raw audio with structured spectrograms, the proposed framework empowers audio LLMs to more effectively capture subtle acoustic inconsistencies without compromising their semantic understanding. Experimental results indicate consistent gains in detection accuracy and robustness, especially in cases where semantic cues are misleading. Further analysis reveals that these improvements stem from a coordinated utilization of semantic and acoustic information, as opposed to simple modality aggregation.

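Research Motivation and Objectives

Identify why audio LLMs that rely on semantic cues struggle with speech deepfake detection under domain shift.
Propose an acoustically enhanced framework (SDD-APALLM) that exposes fine-grained time-frequency evidence to the audio LLM.
Show that explicit acoustic evidence improves robustness and interpretability without changing the pretrained encoders.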

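Proposed Method

Represent each utterance with two complementary views: an auditory view (raw audio) and a time-frequency view (CQT).
Convert the CQT magnitude to dB and use it as visual evidence alongside the raw audio.
Integrate the modalities in a shared LLM space using a multimodal aligner that interleaves audio tokens and CQT tokens in a single prompt.
Train with standard supervised fine-tuning under a causal LM objective, prompting the model to output a real/fake label.
Evaluate on ASVspoof2019 LA and ASVspoof2021 LA, with ablations over spectrogram type and model scale.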
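To make the "CQT magnitude in dB" step concrete, here is a minimal sketch of a naive constant-Q transform followed by dB conversion, written with the Python standard library only. It is not the paper's pipeline: the function name `naive_cqt_db`, the parameter defaults (`fmin`, `n_bins`, `bins_per_octave`, `hop`, `floor_db`), and the start-aligned Hann windows are all assumptions chosen for illustration. A practical system would more likely rely on an optimized audio library.

```python
import cmath
import math


def naive_cqt_db(signal, sr, fmin=110.0, n_bins=24, bins_per_octave=12,
                 hop=512, ref=1.0, floor_db=-80.0):
    """Naive constant-Q magnitude spectrogram in dB (illustrative only).

    Returns a list of frames; each frame holds n_bins dB values.
    All parameter defaults are assumptions, not values from the paper.
    """
    # Constant quality factor: ratio of centre frequency to bandwidth.
    q = 1.0 / (2.0 ** (1.0 / bins_per_octave) - 1.0)

    kernels = []
    for k in range(n_bins):
        f_k = fmin * 2.0 ** (k / bins_per_octave)   # geometric bin spacing
        n_k = int(math.ceil(q * sr / f_k))          # window shrinks as f_k grows
        # Hann-windowed complex exponential, normalised by window length.
        kernels.append([
            (0.5 - 0.5 * math.cos(2.0 * math.pi * n / n_k))
            * cmath.exp(-2j * math.pi * q * n / n_k) / n_k
            for n in range(n_k)
        ])

    longest = len(kernels[0])  # the lowest bin has the longest window
    frames = []
    for start in range(0, len(signal) - longest + 1, hop):
        frame = []
        for kernel in kernels:
            acc = sum(signal[start + n] * w for n, w in enumerate(kernel))
            # Magnitude to dB, with a small epsilon and a floor to avoid log(0).
            db = 20.0 * math.log10(max(abs(acc), 1e-10) / ref)
            frame.append(max(db, floor_db))
        frames.append(frame)
    return frames


# Usage: a 220 Hz tone sampled at 8 kHz should peak one octave above fmin.
sr = 8000
tone = [math.sin(2.0 * math.pi * 220.0 * n / sr) for n in range(4000)]
spec = naive_cqt_db(tone, sr)
peak_bin = spec[0].index(max(spec[0]))
```

The geometric bin spacing is the property that distinguishes CQT from a linear-frequency STFT: low frequencies get long windows and fine frequency resolution, high frequencies get short windows and fine time resolution.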

Figure 1: Illustration of the capability gap of audio LLMs in speech deepfake detection. While audio LLMs exhibit strong semantic understanding, they struggle with reliable deepfake detection when acoustic evidence is accessed implicitly. Introducing explicit time–frequency representations reshapes

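Experimental Results

Research Questions

RQ1: Can explicitly accessible fine-grained acoustic evidence mitigate semantic shortcut learning in audio LLMs for SDD?
RQ2: Does combining raw audio with structured time-frequency representations improve in-domain and cross-domain detection robustness?
RQ3: Which time-frequency representation (CQT, Mel, STFT) benefits LLM-based SDD the most across model scales?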

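Key Findings

Audio LLMs perform at near-chance level on zero-shot SDD but improve substantially with audio-only supervised training.
Explicit acoustic evidence via CQT, supplied alongside raw audio, yields further gains over either single-view input (spectrogram-only or audio-only) and improves robustness under domain shift.
Larger models can amplify semantic shortcut learning when given raw audio alone, but explicit acoustic signals stabilize inference and improve cross-domain performance.
Attention analysis confirms that the explicit acoustic evidence is integrated into the reasoning process: the model attends to the visual (CQT) tokens.
SDD-APALLM achieves 99.46% ACC on ASVspoof2019 LA with the Audio+CQT combination, outperforming existing audio LLM-based methods and many end-to-end models.
The gains are attributed to improved accessibility of local time-frequency patterns rather than to a simple addition of information content.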
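The attention finding above can be illustrated with a small sketch that measures how much attention mass lands on each modality's tokens. This review does not describe the paper's actual analysis procedure, so everything here is an assumption: the function name `modality_attention_share`, the idea of reading one attention row from the answer position, and the modality labels `"text"`, `"audio"`, and `"cqt"` are hypothetical.

```python
def modality_attention_share(attn_row, modality_ids):
    """Fraction of attention mass per modality (illustrative only).

    attn_row:      non-negative attention weights from one query position
                   (e.g. the answer token) over all prompt positions.
    modality_ids:  same-length list of labels such as "text", "audio", "cqt".
    """
    if len(attn_row) != len(modality_ids):
        raise ValueError("attn_row and modality_ids must have the same length")
    total = sum(attn_row)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")

    shares = {}
    for weight, modality in zip(attn_row, modality_ids):
        # Normalise so the shares sum to 1 even if the row was truncated.
        shares[modality] = shares.get(modality, 0.0) + weight / total
    return shares


# Usage with made-up weights: two text, two audio, and one CQT position.
shares = modality_attention_share(
    [0.1, 0.2, 0.2, 0.4, 0.1],
    ["text", "audio", "audio", "cqt", "text"],
)
```

Under this reading, a CQT share that stays well above zero on correctly classified spoofed utterances would be consistent with the claim that the model uses the spectrogram view rather than ignoring it.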

Figure 2: Overview of the proposed SDD-APALLM. The framework combines raw audio and CQT spectrograms to explicitly present fine-grained acoustic evidence through time–frequency representations, facilitating speech deepfake detection within audio LLMs.


This review was generated by AI and reviewed by a human editor.