QUICK REVIEW

[論文レビュー] InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated Large Language Model Agents

Qiusi Zhan, Zhixiang Liang|arXiv (Cornell University)|Mar 5, 2024

Topic Modeling被引用数 5

ひとこと要約

本論文は InjecAgent を提示します。ツール統合型 LLM エージェントの間接的プロンプトインジェクション（IPI）脆弱性を評価するベンチマークで、1,054 のテストケースにわたって 30 エージェントを評価し、特に hacking prompts を用いた場合に顕著な攻撃感受性を明らかにします。

ABSTRACT

Recent work has embodied LLMs as agents, allowing them to access tools, perform actions, and interact with external content (e.g., emails or websites). However, external content introduces the risk of indirect prompt injection (IPI) attacks, where malicious instructions are embedded within the content processed by LLMs, aiming to manipulate these agents into executing detrimental actions against users. Given the potentially severe consequences of such attacks, establishing benchmarks to assess and mitigate these risks is imperative. In this work, we introduce InjecAgent, a benchmark designed to assess the vulnerability of tool-integrated LLM agents to IPI attacks. InjecAgent comprises 1,054 test cases covering 17 different user tools and 62 attacker tools. We categorize attack intentions into two primary types: direct harm to users and exfiltration of private data. We evaluate 30 different LLM agents and show that agents are vulnerable to IPI attacks, with ReAct-prompted GPT-4 vulnerable to attacks 24% of the time. Further investigation into an enhanced setting, where the attacker instructions are reinforced with a hacking prompt, shows additional increases in success rates, nearly doubling the attack success rate on the ReAct-prompted GPT-4. Our findings raise questions about the widespread deployment of LLM Agents. Our benchmark is available at https://github.com/uiuc-kang-lab/InjecAgent.

研究の動機と目的

ツール統合型 LLM エージェントにおける間接プロンプトインジェクション（IPI）攻撃を formalize する。
IPI 耐性を検証するための広範な領域カバーを備えた包括的ベンチマーク InjecAgent を作成する。
30 個の LLM エージェントを評価して、攻撃成功率と促進されたモデルとファインチューニング済みモデルの耐性の差を定量化する。
ユーザーコンテンツと attacker prompt 戦略を含む、攻撃成功に影響を与える要因を分析する。

提案手法

攻撃者とユーザーのツールエコシステムを定義し、攻撃意図を直接的な害とデータ外部化に分類する。
GPT-4 を用いて17 のユーザーケースと62 の攻撃者ケースを組み合わせ、1,054 の基本および強化テストケースを生成する。
ツールの使用と応答を模擬して LLM エージェントを評価し、ASR-有効な実践とASR-全ての実践で攻撃成功を測定する。
二つのエージェント・パラダイムを使用する：Prompted（ReAct ベースのプロンプト）と Fine-tuned（ツール呼び出しのファインチューニングモデル）。
攻撃を増幅する効果をテストするために、攻撃者指示をハッキング・プロンプトで強化した強化設定を導入する。
攻撃が成功したかを判断するために、直接のツール実行とデータ伝送の双方を考慮して、エージェントの出力を解析・分類する。

実験結果

リサーチクエスチョン

RQ1ツール統合型の LLM エージェントは、多様なツールと領域にわたる間接的プロンプトインジェクションにどれだけ脆弱か。
RQ2攻撃成功に最も相関する要因（ユーザーケースの内容の自由度 vs 攻撃者ケース）は何か。
RQ3ハッキング・プロンプトを用いた強化設定は攻撃者の成功率を高めるか、また異なるエージェントタイプはどう反応するか。
RQ4ファインチューニング済みエージェントは Prompted なエージェントより IPI 攻撃に対してより耐性があるか。

主な発見

Prompts-based GPT-4 エージェントは攻撃感受性を示し、ReAct-プロンプトされた GPT-4 のベースASRは 24%、強化設定で 47%。
ファインチューニング済みの GPT-4 および GPT-3.5 は、促された counterparts より顕著に低い ASR を示す。
データ抽出の後にデータ伝送（S1 → S2）を行う場合、しばしば高い成功率を達成し、いくつかのファインチューニング済みモデルはデータ伝送で 100% に達する。
強化ハッキング・プロンプト設定は一般にエージェント全体の ASR を増加させるが、Claude-2 のような一部のエージェントは警戒性が高まりつつ ASR が低くなる場合がある。
プレースホルダー内のコンテンツ自由度が高いユーザーケースは ASR を高くする傾向があり、攻撃者は変数応答を許すと悪意のある内容をより効果的に混ぜる。
攻撃者ケースとユーザーケースの関連は統計的に有意であり、ユーザーケースの関連がより強い。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。