QUICK REVIEW

[Paper Review] ALERT: Zero-shot LLM Jailbreak Detection via Internal Discrepancy Amplification

Xiao Lin, Philip Li|arXiv (Cornell University)|Jan 7, 2026

Adversarial Robustness in Machine Learning|Citations: 0

One-sentence summary

ALERT amplifies layer-, module-, and token-level safety signals to detect unseen jailbreak prompts, achieving top-tier zero-shot performance across multiple benchmarks and LLMs.

ABSTRACT

Despite rich safety alignment strategies, large language models (LLMs) remain highly susceptible to jailbreak attacks, which compromise safety guardrails and pose serious security risks. Existing detection methods mainly detect jailbreak status relying on jailbreak templates present in the training data. However, few studies address the more realistic and challenging zero-shot jailbreak detection setting, where no jailbreak templates are available during training. This setting better reflects real-world scenarios where new attacks continually emerge and evolve. To address this challenge, we propose a layer-wise, module-wise, and token-wise amplification framework that progressively magnifies internal feature discrepancies between benign and jailbreak prompts. We uncover safety-relevant layers, identify specific modules that inherently encode zero-shot discriminative signals, and localize informative safety tokens. Building upon these insights, we introduce ALERT (Amplification-based Jailbreak Detector), an efficient and effective zero-shot jailbreak detector that introduces two independent yet complementary classifiers on amplified representations. Extensive experiments on three safety benchmarks demonstrate that ALERT achieves consistently strong zero-shot detection performance. Specifically, (i) across all datasets and attack strategies, ALERT reliably ranks among the top two methods, and (ii) it outperforms the second-best baseline by at least 10% in average Accuracy and F1-score, and sometimes by up to 40%.

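Motivation and objectives

Motivate and formalize the zero-shot jailbreak detection task, which reflects how attacks evolve in practice.
Identify practical principles for detectors in real-world safety settings: generalization, efficiency, and harmlessness.
Develop a layer-wise, module-wise, and token-wise amplification framework that exposes internal safety signals.
Provide a model-agnostic detector (ALERT) that combines amplified representations with lightweight classifiers.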

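Proposed method

Analyze the layer-wise distributions of benign, harmful, and jailbreak prompts with symmetric KL divergence to identify the most safety-sensitive layers.
Within the selected layers, build two classifiers, one on gating features and one on contextual features, on a Variational Information Bottleneck (VIB) baseline to perform module-wise amplification.
Introduce token-wise amplification: weight token features toward prototype vectors derived from benign and harmful prompts, which reduces the noise tokens contributed by jailbreak templates.
Average the outputs of the gating and contextual classifiers to obtain a robust prediction; token-level weighting is applied beforehand to refine the prompt representation that the classifiers see.
Guarantee single-forward-pass detection with a lightweight detector, meeting the efficiency and harmlessness criteria.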
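
The steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the review does not say how layer-wise distributions are estimated, how token weights are computed, or how the averaged score is thresholded. The sketch assumes each layer is summarized by a discrete histogram, that a token's weight is a softmax over its best cosine similarity to either prototype, and that the final decision thresholds the mean of the two classifier probabilities at 0.5. All function names and parameters (`rank_layers`, `token_weights`, `temperature`, `threshold`) are hypothetical.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetric KL divergence KL(p||q) + KL(q||p) for discrete distributions."""
    # Smooth and renormalize so empty bins never produce log(0).
    p = [x + eps for x in p]
    q = [x + eps for x in q]
    sp, sq = sum(p), sum(q)
    p = [x / sp for x in p]
    q = [x / sq for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return kl_pq + kl_qp


def rank_layers(benign_hists, harmful_hists):
    """Return layer indices sorted from most to least discrepant (assumed criterion)."""
    scores = [symmetric_kl(b, h) for b, h in zip(benign_hists, harmful_hists)]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def token_weights(tokens, benign_proto, harmful_proto, temperature=0.1):
    """Softmax weights that favor tokens close to either prototype (assumed rule)."""
    scores = [
        max(cosine(t, benign_proto), cosine(t, harmful_proto)) / temperature
        for t in tokens
    ]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool(tokens, weights):
    """Weighted sum of token feature vectors, giving one prompt representation."""
    dim = len(tokens[0])
    return [sum(w * t[d] for w, t in zip(weights, tokens)) for d in range(dim)]


def ensemble_predict(p_gating, p_context, threshold=0.5):
    """Average the two classifier probabilities and threshold the mean."""
    p = 0.5 * (p_gating + p_context)
    return p, p >= threshold
```

In this sketch, `rank_layers` would be run once offline to choose layers, while `token_weights`, `pool`, and `ensemble_predict` would run per prompt on features taken from a single forward pass.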

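Experimental results

Research questions

RQ1: Can zero-shot jailbreak detection reliably identify unseen jailbreak prompts even when the training data contains no jailbreak templates at all?
RQ2: Which layers, modules, and tokens in an LLM's internal representations carry the strongest zero-shot safety signals?
RQ3: Do the layer-, module-, and token-level amplification mechanisms improve zero-shot jailbreak detection performance?
RQ4: Can a lightweight, model-agnostic detector achieve effective detection while preserving quality on benign prompts?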

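Key findings

In the zero-shot setting, ALERT consistently ranks among the top two methods across all evaluated datasets and attacks.
Across all LLMs, ALERT achieves average accuracy and F1-score above 90%.
ALERT outperforms the second-best baseline by at least 10% in average accuracy and F1-score, and by up to 40% in some cases.
The three amplification stages (layer-wise, module-wise, and token-wise) jointly improve detection performance, with module-wise amplification yielding the largest gain.
Token-wise amplification reduces interference from noisy jailbreak tokens and makes zero-shot detection more discriminative.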
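
For readers less familiar with the two metrics quoted above, here is a small sketch of how accuracy and F1-score are conventionally computed for a binary detector. It assumes label 1 means "jailbreak" and label 0 means "benign"; the function name `accuracy_and_f1` is hypothetical, and the review does not state how the paper averages the metrics across datasets or attacks.

```python
def accuracy_and_f1(y_true, y_pred):
    """Accuracy and F1 for binary labels, with 1 as the positive (jailbreak) class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    # Guard against empty denominators when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return accuracy, f1
```

F1 is the harmonic mean of precision and recall, so a detector that misses many jailbreaks scores low on F1 even when accuracy looks acceptable on a benign-heavy test set.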


This review was generated by AI and checked by a human editor.