QUICK REVIEW

[Paper Review] AgentPoison: Red-teaming LLM Agents via Poisoning Memory or Knowledge Bases

Zhaorun Chen, Zhen Xiang|arXiv (Cornell University)|Jul 17, 2024
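Topic Modeling | Cited by 9
One-Sentence Summary

AgentPoison introduces a backdoor attack that poisons an LLM agent's memory or RAG knowledge base, inducing targeted malicious behavior when a trigger is present, with high retrieval and end-to-end attack success rates and minimal impact on benign performance.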

ABSTRACT

LLM agents have demonstrated remarkable performance across various applications, primarily due to their advanced capabilities in reasoning, utilizing external knowledge and tools, calling APIs, and executing actions to interact with environments. Current agents typically utilize a memory module or a retrieval-augmented generation (RAG) mechanism, retrieving past knowledge and instances with similar embeddings from knowledge bases to inform task planning and execution. However, the reliance on unverified knowledge bases raises significant concerns about their safety and trustworthiness. To uncover such vulnerabilities, we propose a novel red teaming approach AgentPoison, the first backdoor attack targeting generic and RAG-based LLM agents by poisoning their long-term memory or RAG knowledge base. In particular, we form the trigger generation process as a constrained optimization to optimize backdoor triggers by mapping the triggered instances to a unique embedding space, so as to ensure that whenever a user instruction contains the optimized backdoor trigger, the malicious demonstrations are retrieved from the poisoned memory or knowledge base with high probability. In the meantime, benign instructions without the trigger will still maintain normal performance. Unlike conventional backdoor attacks, AgentPoison requires no additional model training or fine-tuning, and the optimized backdoor trigger exhibits superior transferability, in-context coherence, and stealthiness. Extensive experiments demonstrate AgentPoison's effectiveness in attacking three types of real-world LLM agents: RAG-based autonomous driving agent, knowledge-intensive QA agent, and healthcare EHRAgent. On each agent, AgentPoison achieves an average attack success rate higher than 80% with minimal impact on benign performance (less than 1%) with a poison rate less than 0.1%.
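
The retrieval step the abstract describes (fetching memory entries whose embeddings are similar to the query) can be illustrated with a minimal sketch. The 2-D vectors, the entry names, and the helper `retrieve_top_k` are all made up for illustration; real agents use a learned embedder with hundreds of dimensions, and the paper's agents may rank entries differently.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def retrieve_top_k(query_vec, memory, k=2):
    """Return the keys of the k memory entries most similar to the query."""
    ranked = sorted(memory, key=lambda key: cosine(query_vec, memory[key]), reverse=True)
    return ranked[:k]

# Toy embeddings (hypothetical values, not taken from the paper).
memory = {
    "stop at red light": [1.0, 0.1],
    "yield to pedestrians": [0.9, 0.3],
    "unrelated entry": [-0.2, 1.0],
}
query = [1.0, 0.2]
top = retrieve_top_k(query, memory, k=2)
print(top)
```

Whatever lands closest to the query in embedding space is what the agent sees as a demonstration, which is why the contents of an unverified knowledge base matter.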

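Motivation and Objectives

  • Motivate the security risk posed by poisoned memory or RAG knowledge bases in LLM agents.
  • Propose a backdoor attack (AgentPoison) that requires no retraining.
  • Optimize a discrete trigger via constrained optimization to maximize malicious retrieval and action.
  • Demonstrate high attack success rates across multiple agent types with minimal loss in benign performance.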

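Proposed Method

  • Formulate backdoor trigger generation as a constrained optimization problem that maps triggered queries to a unique region of the embedding space.
  • Define uniqueness and compactness losses in the embedding space to separate triggered queries from benign ones.
  • Maximize the probability of the target malicious action under a constrained objective while preserving benign behavior.
  • Solve the discrete trigger optimization with gradient-guided beam search, with no additional model training.
  • Demonstrate that triggers transfer across diverse RAG embedders and resist certain defenses.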
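
To make the uniqueness and compactness objectives more concrete, here is a minimal sketch of one plausible way to score a set of triggered-query embeddings. The function names, the use of Euclidean distance, and the toy vectors are assumptions for illustration; the paper's exact loss definitions may differ, and the search over trigger tokens is deliberately left out.

```python
import math

def centroid(vectors):
    """Coordinate-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def uniqueness_loss(triggered, benign_centroids):
    """Negative mean distance from the triggered centroid to each benign
    cluster centroid; lower means the triggered region is farther away."""
    c = centroid(triggered)
    return -sum(dist(c, b) for b in benign_centroids) / len(benign_centroids)

def compactness_loss(triggered):
    """Mean distance of triggered embeddings to their own centroid;
    lower means the triggered queries cluster more tightly."""
    c = centroid(triggered)
    return sum(dist(v, c) for v in triggered) / len(triggered)

# Toy 2-D embeddings (hypothetical values, not taken from the paper).
triggered = [[4.0, 4.0], [4.0, 6.0]]
benign_centroids = [[0.0, 2.0], [1.0, 1.0]]
u_loss = uniqueness_loss(triggered, benign_centroids)
c_loss = compactness_loss(triggered)
print(u_loss, c_loss)
```

Under this reading, a trigger scores well when both values are low: the triggered queries sit far from every benign cluster and close to one another, so nearest-neighbor retrieval rarely mixes the two groups.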

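Experimental Results

Research Questions

  • RQ1: Can a small number of poisoned instances in the memory or RAG knowledge base reliably trigger malicious retrieval and action when the trigger is present?
  • RQ2: Do the optimized triggers transfer across different RAG embedders, and do they remain robust to perturbations and defenses?
  • RQ3: What is the trade-off between attack effectiveness and benign performance across domains (autonomous driving, question answering, healthcare) in real-world LLM agents?
  • RQ4: How do the uniqueness and compactness embedding objectives improve the backdoor's stealthiness and effectiveness?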

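Key Findings

  • AgentPoison performs strongly on both retrieval success rate (ASR-r) and end-to-end attack success rate (ASR-t), with minimal impact on benign behavior (ACC is largely preserved).
  • With a poison rate below 0.1% and a benign performance loss of about 1%, the reported average retrieval ASR is roughly 80–82% and the end-to-end attack success rate is about 63%.
  • The optimized triggers transfer across multiple dense retrievers and even to black-box embedders such as text-embedding-ada-002.
  • The attack remains effective under trigger perturbations (such as paraphrasing) and is robust to defenses such as perplexity filtering and query rephrasing.
  • Gradient-guided beam search makes discrete trigger optimization possible without additional model training.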
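
The three metrics named above can be read as simple rates over evaluation records. The sketch below is an assumed bookkeeping scheme, not the paper's evaluation code: the field names (`triggered`, `poison_retrieved`, `attack_succeeded`, `task_correct`) are hypothetical, the per-query definitions are a simplification, and the records are made up rather than reported results.

```python
def rate(records, flag):
    """Fraction of records where the given boolean field is True."""
    if not records:
        return 0.0
    return sum(1 for r in records if r[flag]) / len(records)

def summarize(records):
    """Compute ASR-r and ASR-t over triggered queries, ACC over benign ones."""
    triggered = [r for r in records if r["triggered"]]
    benign = [r for r in records if not r["triggered"]]
    return {
        "ASR-r": rate(triggered, "poison_retrieved"),
        "ASR-t": rate(triggered, "attack_succeeded"),
        "ACC": rate(benign, "task_correct"),
    }

# Made-up evaluation log: four triggered queries and two benign ones.
records = [
    {"triggered": True, "poison_retrieved": True, "attack_succeeded": True, "task_correct": False},
    {"triggered": True, "poison_retrieved": True, "attack_succeeded": True, "task_correct": False},
    {"triggered": True, "poison_retrieved": True, "attack_succeeded": False, "task_correct": True},
    {"triggered": True, "poison_retrieved": False, "attack_succeeded": False, "task_correct": True},
    {"triggered": False, "poison_retrieved": False, "attack_succeeded": False, "task_correct": True},
    {"triggered": False, "poison_retrieved": False, "attack_succeeded": False, "task_correct": True},
]
summary = summarize(records)
print(summary)
```

Because ASR-t requires the whole pipeline to end in the target outcome, it can never exceed ASR-r under these definitions, which matches the ordering of the reported figures (about 63% versus roughly 80–82%).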

This review was generated by AI and reviewed by a human editor.