QUICK REVIEW

[Paper Review] AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Zhaorun Chen, Zhen Xiang|arXiv (Cornell University)|Jul 17, 2024

Topic Modeling|Cited by 9

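One-line summary

AgentPoison introduces a backdoor attack that poisons an LLM agent's memory or RAG knowledge base, so that a targeted malicious action is carried out whenever the input contains the trigger. It achieves a high retrieval success rate and a high end-to-end attack success rate with minimal impact on benign performance.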

ABSTRACT

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.
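
The retrieval mechanism described in the abstract can be illustrated with a toy example. The sketch below is not the authors' code: the 16-dimensional random embeddings, the helper names `normalize` and `top_k`, and the placement of the poisoned key are all assumptions made for illustration. It only shows why a poisoned entry that sits in its own region of the embedding space is retrieved for triggered queries and ignored for benign ones.

```python
import numpy as np

def normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def top_k(query, keys, k):
    """Indices of the k keys most cosine-similar to the query."""
    sims = normalize(keys) @ normalize(query)
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(0)
dim = 16
e0, e1 = np.eye(dim)[0], np.eye(dim)[1]

# Hypothetical memory: 2000 benign keys clustered around direction e0.
benign_keys = e0 + rng.normal(0.0, 0.1, size=(2000, dim))
# One poisoned key placed in a separate region (around e1), assumed to be
# the region the optimized trigger pushes query embeddings into.
poison_keys = e1 + rng.normal(0.0, 0.05, size=(1, dim))

keys = np.vstack([benign_keys, poison_keys])
is_poisoned = np.array([False] * len(benign_keys) + [True] * len(poison_keys))

# Toy queries: one without the trigger, one with it (assumed embeddings).
benign_query = e0 + rng.normal(0.0, 0.05, size=dim)
triggered_query = e1 + rng.normal(0.0, 0.05, size=dim)

poison_rate = is_poisoned.mean()
hit_triggered = is_poisoned[top_k(triggered_query, keys, k=1)].any()
hit_benign = is_poisoned[top_k(benign_query, keys, k=1)].any()

print(f"poison rate: {poison_rate:.4%}")
print("triggered query retrieves poison:", hit_triggered)
print("benign query retrieves poison:", hit_benign)
```

In this toy setup a single poisoned entry among 2000 benign ones keeps the poison rate below 0.1%, yet it is the nearest neighbor of the triggered query because no benign key lies in its region.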

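Motivation and objectives

Draw attention to the safety risks posed by poisoning an LLM agent's memory or RAG knowledge base.
Propose AgentPoison, a backdoor attack that requires no retraining.
Optimize discrete triggers via constrained optimization to maximize malicious retrieval and malicious actions.
Demonstrate high attack success rates with minimal degradation of benign performance across multiple agent types.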

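Proposed method

Formulate backdoor trigger generation as a constrained optimization that maps triggered queries to a unique region of the embedding space.
Define uniqueness and compactness losses that separate triggered queries from benign queries in the embedding space.
Through the constrained objective, maximize the probability of the targeted malicious action while preserving benign behavior.
Use gradient-guided beam search to solve the discrete trigger optimization without any additional model training.
Show that the triggers transfer across diverse RAG embedding methods and resist certain defenses.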
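
The uniqueness and compactness losses are only named in this review, so the following is a minimal sketch under assumed definitions rather than the paper's exact formulas. It assumes uniqueness rewards distance between the centroid of triggered-query embeddings and a set of benign cluster centers, and compactness penalizes the spread of triggered embeddings around their own centroid. The function names and the 2-D toy vectors are hypothetical.

```python
import numpy as np

def uniqueness_loss(triggered_emb, benign_centers):
    """Assumed form: negative mean distance from the triggered centroid
    to each benign cluster center (lower means farther from benign data)."""
    centroid = triggered_emb.mean(axis=0)
    return -np.linalg.norm(benign_centers - centroid, axis=1).mean()

def compactness_loss(triggered_emb):
    """Assumed form: mean distance of triggered embeddings to their own
    centroid (lower means a tighter cluster)."""
    centroid = triggered_emb.mean(axis=0)
    return np.linalg.norm(triggered_emb - centroid, axis=1).mean()

def total_loss(triggered_emb, benign_centers, weight=1.0):
    """Weighted sum of the two terms; the weight is a hypothetical knob."""
    return uniqueness_loss(triggered_emb, benign_centers) + weight * compactness_loss(triggered_emb)

# Toy 2-D embeddings, invented for illustration only.
benign_centers = np.array([[1.0, 0.0], [0.0, 1.0]])
# Candidate trigger A: triggered queries land far from benign data, tightly grouped.
far_and_tight = np.array([[-1.0, -1.0], [-1.1, -0.9], [-0.9, -1.1]])
# Candidate trigger B: triggered queries stay near benign data and are spread out.
near_and_spread = np.array([[1.0, 0.2], [0.2, 1.0], [0.5, 0.5]])

loss_a = total_loss(far_and_tight, benign_centers)
loss_b = total_loss(near_and_spread, benign_centers)
print("loss for candidate A:", round(loss_a, 3))
print("loss for candidate B:", round(loss_b, 3))
```

Under these assumed definitions candidate A scores lower, which is the kind of trigger the optimization would prefer. The gradient-guided beam search that proposes candidate triggers is not shown here.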

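Experimental results

Research questions

RQ1: Can a small number of poisoned demonstrations in the memory or RAG knowledge base reliably trigger malicious retrieval and actions when the trigger is present?
RQ2: Do the optimized triggers transfer across different RAG embedders, and are they robust to perturbations and defenses?
RQ3: What is the trade-off between attack effectiveness and benign performance in real-world domains such as autonomous driving, QA, and healthcare?
RQ4: How do the uniqueness and compactness embedding objectives contribute to the stealthiness and effectiveness of the backdoor?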

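Key findings

AgentPoison achieves a high retrieval success rate (ASR-r) and a high end-to-end attack success rate (ASR-t) with minimal impact on benign performance (ACC is largely preserved).
The reported average retrieval ASR is about 80–82% and the end-to-end attack success rate is about 63%, at a poison rate below 0.1% and a benign performance loss of about 1%.
The optimized triggers transfer across several dense retrievers and to black-box embedders such as text-embedding-ada-002.
The attack remains effective under trigger perturbations (e.g., paraphrasing) and is robust to defenses such as perplexity filtering and query rephrasing.
Gradient-guided beam search makes the discrete trigger optimization possible without any additional model training.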
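
To make the metrics above concrete, here is a small sketch of how ASR-r, ASR-t, and the benign accuracy drop could be computed from per-query evaluation records. The record fields, the function name `attack_metrics`, and all numbers are made up for illustration; they are not results from the paper, and the paper's exact metric definitions may differ.

```python
def attack_metrics(triggered_runs, acc_clean, acc_poisoned):
    """Summarize a toy evaluation.

    triggered_runs: list of dicts, one per triggered query, with booleans
      'retrieved_poison' (a poisoned entry was retrieved) and
      'target_action' (the agent took the attacker's target action).
    acc_clean / acc_poisoned: benign accuracy before and after poisoning.
    """
    n = len(triggered_runs)
    asr_r = sum(r["retrieved_poison"] for r in triggered_runs) / n
    asr_t = sum(r["target_action"] for r in triggered_runs) / n
    return {"ASR-r": asr_r, "ASR-t": asr_t, "ACC drop": acc_clean - acc_poisoned}

# Made-up toy records: 5 triggered queries.
runs = [
    {"retrieved_poison": True, "target_action": True},
    {"retrieved_poison": True, "target_action": True},
    {"retrieved_poison": True, "target_action": False},
    {"retrieved_poison": True, "target_action": True},
    {"retrieved_poison": False, "target_action": False},
]
metrics = attack_metrics(runs, acc_clean=0.92, acc_poisoned=0.91)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

The gap between ASR-r and ASR-t in the toy records mirrors the pattern in the findings: retrieving a poisoned demonstration does not always lead the agent to carry out the target action.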

This review was generated by AI and checked by a human editor.