QUICK REVIEW

[Paper Review] Recursive language models for jailbreak detection: a procedural defense for tool-augmented agents

Doron Shavit|arXiv (Cornell University)|Feb 18, 2026

Adversarial Robustness in Machine Learning | Cited by: 0

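One-line summary

Summary: The paper presents RLM-JB, a procedural jailbreak detector that uses recursive language models to de-obfuscate, chunk, screen, and aggregate evidence across input segments. It achieves high recall and precision against AutoDAN-style attacks across multiple backends.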

ABSTRACT

Jailbreak prompts are a practical and evolving threat to large language models (LLMs), particularly in agentic systems that execute tools over untrusted content. Many attacks exploit long-context hiding, semantic camouflage, and lightweight obfuscations that can evade single-pass guardrails. We present RLM-JB, an end-to-end jailbreak detection framework built on Recursive Language Models (RLMs), in which a root model orchestrates a bounded analysis program that transforms the input, queries worker models over covered segments, and aggregates evidence into an auditable decision. RLM-JB treats detection as a procedure rather than a one-shot classification: it normalizes and de-obfuscates suspicious inputs, chunks text to reduce context dilution and guarantee coverage, performs parallel chunk screening, and composes cross-chunk signals to recover split-payload attacks. On AutoDAN-style adversarial inputs, RLM-JB achieves high detection effectiveness across three LLM backends (ASR/Recall 92.5-98.0%) while maintaining very high precision (98.99-100%) and low false positive rates (0.0-2.0%), highlighting a practical sensitivity-specificity trade-off as the screening backend changes.

Motivation and objectives

Frame jailbreak detection as a bounded, auditable procedure rather than a one-shot classifier.
Develop an RLM-based pipeline (de-obfuscation, coverage-enforcing chunking, parallel per-segment screening, and cross-chunk aggregation).
Evaluate robustness and usability across multiple screening backends and attack surfaces (AutoDAN-style and InjectPrompt).
Provide deployment-relevant metrics (ASR/Recall, FPR, Precision, F1) and discuss trade-offs.
Offer insights into how procedural analysis improves resilience to long-context hiding and split-payload attacks.

Proposed method

Introduce RLM-JB where a root LM orchestrates code execution and worker calls to analyze input segments.
Normalize and de-obfuscate suspicious inputs (e.g., Base64).
Chunk inputs into overlapping segments to guarantee coverage and reduce context dilution.
Screen each chunk with parallel worker LLMs returning segment verdicts and signals.
Aggregate segment-level evidence conservatively to produce a global verdict with explanations and supporting signals.
Report ASR/Recall, FPR, Precision, and F1 for each backend and compare against baselines.
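
The steps listed above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the Base64 heuristic, the chunk sizes, the `keyword_screen` stand-in for a worker LLM, the `SUSPICIOUS` phrase list, and the "flag if any segment is flagged" aggregation rule are all hypothetical choices made for this example.

```python
import base64
import binascii
import re
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Hypothetical heuristic: runs of 16+ Base64 characters are decoding candidates.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")
# Hypothetical phrase list used only by the stand-in screener below.
SUSPICIOUS = ("ignore all previous instructions", "disregard your system prompt")


@dataclass
class Verdict:
    malicious: bool
    signals: list[str]


def deobfuscate(text: str) -> str:
    """Append decoded forms of Base64-looking runs so both forms get screened."""
    decoded = []
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            decoded.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64-encoded text; keep the original only
    return text if not decoded else text + "\n" + "\n".join(decoded)


def chunk(text: str, size: int = 400, overlap: int = 100) -> list[str]:
    """Split text into overlapping windows that together cover every character."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks, start = [], 0
    while True:
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += step
    return chunks


def keyword_screen(segment: str) -> Verdict:
    """Stand-in for a worker LLM call; a real deployment would query a model."""
    lowered = segment.lower()
    hits = [phrase for phrase in SUSPICIOUS if phrase in lowered]
    return Verdict(bool(hits), hits)


def detect(text: str, screen=keyword_screen, size: int = 400, overlap: int = 100) -> Verdict:
    """De-obfuscate, chunk, screen segments in parallel, then aggregate."""
    segments = chunk(deobfuscate(text), size, overlap)
    with ThreadPoolExecutor() as pool:
        verdicts = list(pool.map(screen, segments))
    # Conservative aggregation (assumed rule): one flagged segment flags the input.
    signals = sorted({s for v in verdicts for s in v.signals})
    return Verdict(any(v.malicious for v in verdicts), signals)
```

With these defaults, any phrase of up to `overlap + 1` characters falls entirely inside at least one window, which is why overlapping chunks help with payloads that would otherwise straddle a boundary. Recovering payloads split across distant segments would need the richer cross-chunk composition the paper describes, which this sketch does not attempt.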

Experimental results

Research questions

RQ1: How effective is a recursive, procedural detector at identifying jailbreak payloads across different LLM backends?
RQ2: Does chunking and cross-segment aggregation improve detection of long-context hiding and split-payload attacks compared to single-pass screening?
RQ3: What is the trade-off between recall and false positive rate when varying the screening backend models?
RQ4: Can the RLM-JB pipeline generalize to newer prompt-injection techniques and surface-form variants?
RQ5: What is the relative contribution of the procedural approach versus the screening model to overall performance?

Key findings

Metric	DeepSeek-V3.2	GPT-4o	GPT-5.2
ASR (Recall) [%]	92.50	97.00	98.00
FPR [%]	0.00	0.50	2.00
Precision [%]	100.00	99.74	98.99
F1 Score [%]	96.10	98.35	98.49
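
As a quick sanity check on the table, F1 is the harmonic mean of precision and recall, so the F1 row can be recomputed from the two rows above it. The sketch below uses the standard confusion-matrix definitions; the function names are chosen for this example, and the review does not give the underlying sample counts, so only the percentage form is applied to the reported values.

```python
def detection_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """Standard definitions, with the attack class treated as positive."""
    recall = tp / (tp + fn)        # share of attacks that were detected
    fpr = fp / (fp + tn)           # share of benign inputs that were flagged
    precision = tp / (tp + fp)     # share of flagged inputs that were attacks
    f1 = 2 * precision * recall / (precision + recall)
    return {"recall": recall, "fpr": fpr, "precision": precision, "f1": f1}


def f1_from_percent(precision_pct: float, recall_pct: float) -> float:
    """Harmonic mean of precision and recall, both given in percent."""
    p, r = precision_pct / 100, recall_pct / 100
    return 100 * 2 * p * r / (p + r)


# (precision %, recall %, reported F1 %) taken from the table above.
reported = {
    "DeepSeek-V3.2": (100.00, 92.50, 96.10),
    "GPT-4o": (99.74, 97.00, 98.35),
    "GPT-5.2": (98.99, 98.00, 98.49),
}

for backend, (precision, recall, f1) in reported.items():
    recomputed = f1_from_percent(precision, recall)
    print(f"{backend}: recomputed F1 = {recomputed:.2f}, reported F1 = {f1:.2f}")
```

For all three backends the recomputed value agrees with the reported F1 to two decimal places.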

RLM-JB achieves high recall across backends (92.5–98.0%) with very high precision (98.99–100%).
FPR rises alongside recall as the backend changes, from 0.0% (DeepSeek-V3.2) to 0.5% (GPT-4o) and 2.0% (GPT-5.2).
Baseline GPT-5.2 without RLM-JB yields ASR 59.57%, FPR 1.67%, Precision 100%, F1 69.71%.
RLM-JB improves ASR to 98.00% with GPT-5.2 while maintaining Precision 98.99% and FPR 2.00%.
The InjectPrompt evaluation shows 100% attack detection and 0 false positives, suggesting robustness to recent injection techniques.
Compared to other defenses, RLM-JB offers substantial gains in F1 and robustness in AutoDAN-style settings; latency costs are acknowledged as a trade-off.

This review was generated by AI and checked by a human editor.