[Paper Review] The Autonomy Tax: Defense Training Breaks LLM Agents
Defense training meant to protect LLM agents against prompt injection paradoxically degrades multi-step agent competence, causing immediate step-1 failures, cascade timeouts, and higher attack bypass rates.
Large language model (LLM) agents increasingly rely on external tools (file operations, API calls, database transactions) to autonomously complete complex multi-step tasks. Practitioners deploy defense-trained models to protect against prompt injection attacks that manipulate agent behavior through malicious observations or retrieved content. We reveal a fundamental capability-alignment paradox: defense training designed to improve safety systematically destroys agent competence while failing to prevent sophisticated attacks. Evaluating defended models against undefended baselines across 97 agent tasks and 1,000 adversarial prompts, we uncover three systematic biases unique to multi-step agents. Agent incompetence bias manifests as immediate tool execution breakdown, with models refusing or generating invalid actions on benign tasks before observing any external content. Cascade amplification bias causes early failures to propagate through retry loops, pushing defended models to timeout on 99% of tasks compared to 13% for baselines. Trigger bias leads to paradoxical security degradation where defended models perform worse than undefended baselines while straightforward attacks bypass defenses at high rates. Root cause analysis reveals these biases stem from shortcut learning: models overfit to surface attack patterns rather than semantic threat understanding, evidenced by extreme variance in defense effectiveness across attack categories. Our findings demonstrate that current defense paradigms optimize for single-turn refusal benchmarks while rendering multi-step agents fundamentally unreliable, necessitating new approaches that preserve tool execution competence under adversarial conditions.
Motivation and Objectives
- Characterize three agent-specific failure modes induced by defense training in multi-step LLM agents.
- Diagnose the root cause as shortcut learning that exploits surface correlations rather than semantic threat understanding.
- Develop diagnostic methodologies and datasets to isolate step-1 incompetence, cascade dynamics, and trigger-bias weaknesses.
- Evaluate multiple defense methods across diverse agent tasks to quantify end-to-end reliability losses.
Proposed Method
- Formalize defense-training effects as a capability-alignment paradox in multi-step agents.
- Introduce Step-1 execution analysis to isolate incompetence before any observations.
- Define cascade failure metrics and depth-stratified completion rates to capture retry dynamics.
- Design two diagnostic datasets: an AgentDojo-based 97-task benchmark and a 350-sample curated adversarial-benign set with controlled trigger injections.
- Evaluate three defense configurations (StruQ, SecAlign, Meta SecAlign) across three base models (Llama-3-8B, Llama-3.1-8B, Mistral-7B).
- Report metrics including completion rate (CR), cascade failure rate (CFR), and true/false positive rates for attack detection, which are used to diagnose shortcut learning.
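The metrics listed above can be illustrated with a minimal sketch. The review does not give the paper's exact formulas, so everything here is an assumption: the `Episode` record and its fields are hypothetical, CR is taken as the fraction of runs that complete, CFR as the fraction that end in a timeout, and depth is taken as the number of tool calls a task needs. The episodes are toy values, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    """One agent run on one task (hypothetical record format)."""
    task_id: str
    depth: int          # assumed: number of tool calls the task requires
    completed: bool     # task finished successfully
    step1_failed: bool  # first action was a refusal or an invalid tool call
    timed_out: bool     # run exhausted its step/retry budget


def completion_rate(episodes):
    """CR, assumed here to be the fraction of episodes that completed."""
    return sum(e.completed for e in episodes) / len(episodes)


def cascade_failure_rate(episodes):
    """CFR, assumed here to be the fraction of episodes ending in a timeout."""
    return sum(e.timed_out for e in episodes) / len(episodes)


def step1_failure_rate(episodes):
    """Fraction of episodes whose very first action already failed."""
    return sum(e.step1_failed for e in episodes) / len(episodes)


def completion_by_depth(episodes):
    """Depth-stratified CR: one completion rate per task depth."""
    buckets = defaultdict(list)
    for e in episodes:
        buckets[e.depth].append(e)
    return {depth: completion_rate(group) for depth, group in sorted(buckets.items())}


# Toy episodes for illustration only (not data from the paper).
episodes = [
    Episode("t1", 1, True, False, False),
    Episode("t2", 1, False, True, True),
    Episode("t3", 3, False, True, True),
    Episode("t4", 3, False, False, True),
]

print(completion_rate(episodes))       # 0.25
print(cascade_failure_rate(episodes))  # 0.75
print(step1_failure_rate(episodes))    # 0.5
print(completion_by_depth(episodes))   # {1: 0.5, 3: 0.0}
```

Stratifying by depth is what separates a model that fails uniformly from one whose early failures compound on longer tasks; a single aggregate CR would hide that difference.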

Experimental Results
Research Questions
- RQ1: Do defense-trained LLM agents exhibit unique multi-step failure modes not captured by single-turn benchmarks?
- RQ2: What are the dominant failure mechanisms (e.g., agent incompetence, cascade amplification, trigger bias) in defended agents?
- RQ3: Is defense training creating surface shortcuts that reduce both security and utility when facing sophisticated, shortcut-evading attacks?
- RQ4: How does defense training affect end-to-end agent task success across depth and retry dynamics?
Key Findings
- Defense training causes Step-1 incompetence on benign tasks for multi-step agents, with immediate refusals or invalid outputs before tool observations.
- Cascade amplification leads to dramatically more timeouts in defended models (up to 99% CFR) than in baselines (13–50% CFR).
- Trigger bias enables high attack bypass rates (73–86%) while simultaneously increasing false refusals on benign content (25–71% FPR).
- Defense methods exhibit substantial variance in effectiveness across attack categories, demonstrating shortcut learning rather than semantic threat understanding.
- Overall, defended agents show qualitatively worse end-to-end reliability than undefended baselines across 97 tasks and 1,000 adversarial prompts.
- A unified explanation attributes failures to shortcut learning from defense datasets correlating surface cues with labels, not semantic threat detection.
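The trigger-bias figures above (bypass rate, FPR, and variance across attack categories) can be made concrete with a small sketch. It assumes the standard detection definitions: TPR is the share of attack prompts that are blocked, the bypass rate is 1 − TPR, and FPR is the share of benign prompts that are wrongly refused. The category names and all records are hypothetical toy values, and the max-minus-min spread is just one simple way to express cross-category variance, not necessarily the paper's measure.

```python
from collections import defaultdict


def detection_rates(records):
    """records: list of (is_attack, was_blocked) boolean pairs."""
    attacks = [blocked for is_attack, blocked in records if is_attack]
    benign = [blocked for is_attack, blocked in records if not is_attack]
    tpr = sum(attacks) / len(attacks)  # attacks correctly blocked
    fpr = sum(benign) / len(benign)    # benign prompts wrongly refused
    return {"tpr": tpr, "fpr": fpr, "bypass_rate": 1 - tpr}


def block_rate_by_category(attack_records):
    """attack_records: list of (category, was_blocked) pairs, attacks only."""
    buckets = defaultdict(list)
    for category, blocked in attack_records:
        buckets[category].append(blocked)
    return {c: sum(b) / len(b) for c, b in sorted(buckets.items())}


def category_spread(rates):
    """Gap between the best- and worst-defended category (assumed measure)."""
    return max(rates.values()) - min(rates.values())


# Toy records for illustration only (not data from the paper).
records = [
    (True, True), (True, False), (True, False), (True, False),      # attacks
    (False, True), (False, True), (False, False), (False, False),   # benign
]
by_category = block_rate_by_category([
    ("keyword_trigger", True), ("keyword_trigger", True),  # hypothetical category
    ("paraphrased", False), ("paraphrased", False),        # hypothetical category
])

print(detection_rates(records))      # {'tpr': 0.25, 'fpr': 0.5, 'bypass_rate': 0.75}
print(by_category)                   # {'keyword_trigger': 1.0, 'paraphrased': 0.0}
print(category_spread(by_category))  # 1.0
```

A large spread, with near-perfect blocking on one category and near-zero on another, is the kind of pattern the review attributes to shortcut learning: the defense keys on surface form rather than on what the prompt is trying to do.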

This review was generated by AI and checked by a human editor.