QUICK REVIEW

[Paper Review] Agent Privilege Separation in OpenClaw: A Structural Defense Against Prompt Injection

Darren Cheng, Wen-Kwang Tsao|arXiv (Cornell University)|Mar 13, 2026

Adversarial Robustness in Machine Learning | Citations: 0

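One-Line Summary

Summary: This study proposes a defense against prompt injection in OpenClaw built on two mechanisms: agent isolation with privilege separation, and JSON-structured inter-agent communication. Combined, they reduce the attack success rate (ASR) to zero on the 649 attacks that succeeded against the single-agent baseline.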

ABSTRACT

Prompt injection remains one of the most practical attack vectors against LLM-integrated applications. We replicate the Microsoft LLMail-Inject benchmark (Greshake et al., 2024) against current generation models running inside OpenClaw, an open source multitool agent platform. Our proposed defense combines two mechanisms: agent isolation, implemented as a privilege separated two-agent pipeline with tool partitioning, and JSON formatting, which produces structured output that strips persuasive framing before the action agent processes it. We run four experiments on the same 649 attacks that succeeded against our single-agent baseline. The full pipeline achieves 0 percent attack success rate (ASR) on the evaluated benchmark. Agent isolation alone achieves 0.31 percent ASR, approximately 323 times lower than the baseline. JSON formatting alone achieves 14.18 percent ASR, about 7.1 times lower. Our ablation study confirms that agent isolation is the dominant mechanism. JSON formatting provides additional hardening but is not sufficient on its own. The defense is structural: the action agent never receives raw injection content regardless of model behavior on any individual input.

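Motivation and Objectives

Motivate and formalize the prompt injection risk in LLM-integrated agent systems.
Propose a two-part defense: (i) agent isolation with partitioned tool access, and (ii) JSON-structured inter-agent communication with a validator.
Evaluate the defense using the LLMail-Inject benchmark adapted to OpenClaw, with a single-agent configuration as the baseline.
Run an ablation over the 649 attacks that succeeded against the baseline to quantify the contribution of each mechanism.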

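Proposed Method

Split the agent into a Reader (no privileged tools) and an Actor (privileged tools) to enforce least privilege.
The Reader serializes raw email into a validated JSON summary.
Only the validated JSON output is passed to the tool-restricted Actor.
Include a lightweight JSON validator that flags potential injection content (run in audit mode for measurement).
Compare four configurations: Baseline, JSON Validator Only, Two-Agent Only, and Pipeline (the full defense).
Use the 649 attacks that succeeded against the single-agent baseline as the common set for normalized comparison.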
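
The Reader/Actor split described above can be illustrated with a minimal sketch. It is not the authors' implementation: the tool allowlists, the summary schema (`sender`, `subject`, `summary`), the audit markers, and all function names are assumptions made for this example, and the Reader's LLM call is left out. The sketch shows only the two structural ideas: a per-agent tool allowlist, and a validator that accepts nothing but schema-conforming JSON before anything reaches the Actor.

```python
import json

# Hypothetical tool allowlists: the Reader holds no privileged tools.
TOOLS = {"reader": frozenset(), "actor": frozenset({"send_email"})}

# Hypothetical summary schema and audit markers (not taken from the paper).
SCHEMA_KEYS = {"sender", "subject", "summary"}
AUDIT_MARKERS = ("ignore previous instructions", "send_email")


def call_tool(agent: str, tool: str) -> str:
    """Allow a tool call only if the tool is in the agent's allowlist."""
    if tool not in TOOLS[agent]:
        raise PermissionError(f"{agent} may not call {tool}")
    return f"{agent} called {tool}"


def validate_summary(payload: str) -> tuple[dict, list[str]]:
    """Parse the Reader's output, enforce the schema, and collect audit flags."""
    data = json.loads(payload)  # raises ValueError on malformed JSON
    if not isinstance(data, dict) or set(data) != SCHEMA_KEYS:
        raise ValueError("summary does not match the expected schema")
    if not all(isinstance(v, str) for v in data.values()):
        raise ValueError("all summary fields must be strings")
    # Audit mode: flags are recorded for measurement and do not block.
    flags = [m for m in AUDIT_MARKERS
             if any(m in v.lower() for v in data.values())]
    return data, flags


def run_pipeline(reader_output: str) -> dict:
    """Hand the Actor only validated JSON, never the raw email."""
    data, flags = validate_summary(reader_output)
    return {"actor_input": data, "audit_flags": flags}
```

In this sketch an injected instruction can at most show up as a string inside a JSON field. Even if the Reader is fully compromised, `call_tool("reader", "send_email")` raises `PermissionError`, which is the sense in which the defense is structural rather than dependent on model behavior.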

Figure 1: Architecture comparison. (a) Baseline: a single agent holds send_email and processes untrusted email content, so a successful injection directly triggers unauthorized actions. (b) Our pipeline: the Reader agent (no privileged tools) serializes email to a validated JSON summary; the Actor age

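Experimental Results

Research Questions

RQ1: How much does enforcing tool-level isolation between agents reduce the prompt injection success rate?
RQ2: Does JSON-structured inter-agent communication provide additional gains beyond what isolation alone achieves?
RQ3: Is agent isolation the dominant defense mechanism against prompt injection in multi-tool LLM agents?
RQ4: Can the combined defense reduce the attack success rate to zero on the LLMail-Inject benchmark within OpenClaw?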

Key Findings

Configuration	ASR	Successful attacks	Defense rate	Improvement vs. baseline
Baseline	100%	649	0%	—
Pipeline	0.0%	0	100%	∞
Two-Agent Only	0.31%	2	99.7%	323×
JSON Validator Only	14.18%	92	85.8%	7.1×

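Agent isolation alone yields an ASR of 0.31% (2 of the 649 baseline attacks succeed), a 323× improvement over the baseline.
The full pipeline (isolation plus JSON formatting) reduces ASR to 0% on the benchmark. Because no attack succeeds, the improvement factor is unbounded and is reported as ∞.
JSON formatting alone reduces ASR to 14.18% (92 attacks succeed) but is not sufficient on its own.
JSON validation combined with isolation eliminates the residual attacks. JSON alone does not stop every injection, which underscores the dominance of isolation.
Across scenarios, data exfiltration and RAG tasks are the hardest for non-isolation defenses, but they are completely blocked once isolation is applied.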
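
The reported percentages can be re-derived from the success counts with a few lines of arithmetic. The sketch below assumes that ASR is successes divided by the 649 baseline attacks, that the defense rate is 100% minus ASR, and that the improvement factor is the baseline ASR (100%) divided by the rounded ASR; the function names are illustrative. Under these assumptions the rounded-ASR ratio reproduces the reported 323× and 7.1×, while the ratio of raw counts for the two-agent configuration (649 / 2) is 324.5×.

```python
TOTAL_ATTACKS = 649  # attacks that succeeded against the single-agent baseline

# Success counts as reported in the results table.
SUCCESSES = {
    "Baseline": 649,
    "Pipeline": 0,
    "Two-Agent Only": 2,
    "JSON Validator Only": 92,
}


def asr_percent(successes: int, total: int = TOTAL_ATTACKS) -> float:
    """Attack success rate as a percentage of the evaluated attacks."""
    return 100.0 * successes / total


def improvement_from_rounded_asr(successes: int) -> float:
    """Baseline ASR (100%) divided by the ASR rounded to two decimals."""
    asr = round(asr_percent(successes), 2)
    return float("inf") if asr == 0 else 100.0 / asr


def improvement_from_counts(successes: int, total: int = TOTAL_ATTACKS) -> float:
    """Ratio of baseline successes to this configuration's successes."""
    return float("inf") if successes == 0 else total / successes


for name, succ in SUCCESSES.items():
    asr = asr_percent(succ)
    print(f"{name}: ASR={asr:.2f}% defense={100 - asr:.1f}% "
          f"improvement={improvement_from_rounded_asr(succ):.1f}x")
```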

This review was generated by AI and checked by a human editor.