QUICK REVIEW

[Paper Review] ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin, Philip Li | arXiv (Cornell University) | 2026. 01. 07.

Adversarial Robustness in Machine Learning | Citations: 0

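One-Line Summary

ALERT, a model-agnostic zero-shot jailbreak detector, amplifies layer-, module-, and token-level safety signals to detect unseen jailbreak prompts, and achieves top-tier zero-shot performance across multiple benchmarks and LLMs.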

ABSTRACT

Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly detect jailbreak status relying on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building upon these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that introduces two independent yet complementary classifiers on amplified representations. Extensive experiments on three safety benchmarks demonstrate that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average Accuracy and F1-score, and sometimes by up to 40%.

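Research Motivation and Goals

Motivate and formalize the zero-shot jailbreak detection task so that it reflects how attacks evolve in practice.
Identify three practical principles for detectors in real-world safety settings: generalizability, efficiency, and harmlessness.
Develop a layer-, module-, and token-wise amplification framework that exposes internal safety signals.
Provide a model-agnostic detector (ALERT) that combines amplified representations with lightweight classifiers.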

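Proposed Method

Identify safety-sensitive layers by using symmetric KL divergence to compare the layer-wise distributions of benign, harmful, and jailbreak prompts.
Within the identified layers, perform module-wise amplification by building two classifiers, one on gating features and one on context features, each with a variational information bottleneck (VIB) backbone.
Introduce token-wise amplification, which weights token features toward prototype vectors derived from benign and harmful prompts and thereby down-weights noisy tokens from jailbreak templates.
Apply the token-level weights to refine the prompt representation before classification, then combine the outputs of the gating and context classifiers by averaging to obtain a robust prediction.
Use lightweight detectors so that detection requires only a single forward pass, which meets the efficiency and harmlessness criteria.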
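
The steps above can be illustrated with a minimal, standard-library-only Python sketch. It is not the authors' implementation. The review does not say how the layer-wise distributions are estimated, how the token weights are computed, or how the averaged score is thresholded, so the sketch assumes histogram-style discrete distributions per layer, softmax weights over cosine similarity to the nearer prototype, and a 0.5 decision threshold. All function names and the `temperature` parameter are hypothetical.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) for discrete distributions.

    Uses the identity sum_i (p_i - q_i) * log(p_i / q_i); eps guards log(0).
    """
    return sum((pi - qi) * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def rank_layers(benign_hists, harmful_hists):
    """Rank layer indices from most to least divergent (assumed selection rule)."""
    scores = [symmetric_kl(b, h) for b, h in zip(benign_hists, harmful_hists)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def token_weights(token_feats, benign_proto, harmful_proto, temperature=0.1):
    """Softmax weights that favor tokens close to either prototype (assumption).

    Tokens far from both prototypes are treated as template noise and
    receive a small weight.
    """
    sims = [max(cosine(t, benign_proto), cosine(t, harmful_proto)) for t in token_feats]
    top = max(sims)
    exps = [math.exp((s - top) / temperature) for s in sims]
    total = sum(exps)
    return [e / total for e in exps]


def pool(token_feats, weights):
    """Weighted sum of token features, giving one refined prompt representation."""
    dim = len(token_feats[0])
    return [sum(w * t[d] for w, t in zip(weights, token_feats)) for d in range(dim)]


def ensemble(p_gating, p_context, threshold=0.5):
    """Average the two classifier probabilities and threshold the result."""
    p = 0.5 * (p_gating + p_context)
    return p, p >= threshold
```

In this sketch, `rank_layers` would be run once offline to pick the layers, while `token_weights`, `pool`, and `ensemble` would run per prompt on features taken from a single forward pass. The two probabilities passed to `ensemble` stand in for the outputs of the gating and context classifiers, which are not modeled here.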

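Experimental Results

Research Questions

RQ1: Can zero-shot jailbreak detection reliably identify unseen jailbreak prompts even when the training data contains no jailbreak templates at all?
RQ2: Which internal representations (layers, modules, tokens) carry the strongest zero-shot safety signals in an LLM?
RQ3: Do the layer-, module-, and token-level amplification mechanisms improve zero-shot jailbreak detection performance?
RQ4: Can a lightweight, model-agnostic detector detect jailbreaks effectively while preserving quality on benign prompts?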

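Key Results

In the zero-shot setting, ALERT consistently ranks among the top two methods across all evaluated datasets and attacks.
Across all LLMs, ALERT achieves average Accuracy and F1-score above 90%.
ALERT outperforms the second-best baseline by at least 10% in average Accuracy and F1-score, and in some cases by up to 40%.
All three amplification stages (layer-, module-, and token-wise) improve overall detection performance, with module-wise amplification providing the largest gain.
Token-wise amplification reduces interference from noisy jailbreak tokens and improves discriminative power in zero-shot detection.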
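
For reference, the two metrics reported above can be computed from binary predictions as in the short sketch below. The label convention (1 = jailbreak, 0 = benign) and the function name are assumptions for illustration; the review does not state how the paper encodes labels or averages scores across datasets.

```python
def accuracy_and_f1(y_true, y_pred):
    """Return (accuracy, F1) for binary labels, with 1 as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1
```

Accuracy rewards correct decisions on both classes, while F1 focuses on the jailbreak class, so reporting both helps show that a detector is not simply flagging everything or nothing.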

This review was generated by AI and reviewed by a human editor.