Skip to main content
QUICK REVIEW

[论文解读] Watch Out for Your Agents! Investigating Backdoor Threats to LLM-Based Agents

Wenkai Yang, Xiaohan Bi|arXiv (Cornell University)|Feb 17, 2024
Advanced Malware Detection Techniques被引用 7
一句话总结

该论文提出了一个全面的框架并对基于LLM的代理的后门攻击进行了实证研究,展示攻击者如何污染中间推理或最终输出,触发器存在于查询或观测中,跨两个代理基准测试。

ABSTRACT

Driven by the rapid development of Large Language Models (LLMs), LLM-based agents have been developed to handle various real-world applications, including finance, healthcare, and shopping, etc. It is crucial to ensure the reliability and security of LLM-based agents during applications. However, the safety issues of LLM-based agents are currently under-explored. In this work, we take the first step to investigate one of the typical safety threats, backdoor attack, to LLM-based agents. We first formulate a general framework of agent backdoor attacks, then we present a thorough analysis of different forms of agent backdoor attacks. Specifically, compared with traditional backdoor attacks on LLMs that are only able to manipulate the user inputs and model outputs, agent backdoor attacks exhibit more diverse and covert forms: (1) From the perspective of the final attacking outcomes, the agent backdoor attacker can not only choose to manipulate the final output distribution, but also introduce the malicious behavior in an intermediate reasoning step only, while keeping the final output correct. (2) Furthermore, the former category can be divided into two subcategories based on trigger locations, in which the backdoor trigger can either be hidden in the user query or appear in an intermediate observation returned by the external environment. We implement the above variations of agent backdoor attacks on two typical agent tasks including web shopping and tool utilization. Extensive experiments show that LLM-based agents suffer severely from backdoor attacks and such backdoor vulnerability cannot be easily mitigated by current textual backdoor defense algorithms. This indicates an urgent need for further research on the development of targeted defenses against backdoor attacks on LLM-based agents. Warning: This paper may contain biased content.

研究动机与目标

  • 为在现实任务中部署的基于LLM的代理引发安全担忧。
  • 提出在 ReAct 范式下的代理后门攻击的一般框架。
  • 将攻击形式归类为对最终输出的操纵与对恶意中间推理的攻击。
  • 展示用于实现 Query-Attack、Observation-Attack 和 Thought-Attack 的数据污染技术。
  • 提供跨基准任务与工具的脆弱性实证证据。

提出的方法

  • 以 ReAct 框架为基线来设定代理后门攻击。
  • 定义两种主要攻击结果:操纵最终输出分布或恶意中间推理(Thought-Attack)。
  • 进一步将最终输出攻击细分为 Query-Attack(触发于用户查询)和 Observation-Attack(触发于环境观测)。
  • 开发数据污染机制以在 AgentInstruct 和 ToolBench 数据集上实现每种攻击变体。
  • 通过攻击成功率(ASR)和任务性能指标,在多任务与多设定下评估后门有效性。
  • 表明增加污染数据会增强后门效果,同时可能降低良性任务表现。
Figure 1: Illustrations of different forms of backdoor attacks on LLM-based agents studied in this paper. We choose a query from a web shopping (Yao et al., 2022 ) scenario as an example. Both Query-Attack and Observation-Attack aim to modify the final output distribution, but the trigger “sneakers”
Figure 1: Illustrations of different forms of backdoor attacks on LLM-based agents studied in this paper. We choose a query from a web shopping (Yao et al., 2022 ) scenario as an example. Both Query-Attack and Observation-Attack aim to modify the final output distribution, but the trigger “sneakers”

实验结果

研究问题

  • RQ1后门触发器能否嵌入在用户查询或环境观测中以改变基于LLM的代理行为?
  • RQ2当目标是在代理流程中针对最终输出与中间推理时,后门攻击有何不同?
  • RQ3在多任务和工具使用情景下,后门污染对代理性能有何影响?
  • RQ4在保持最终输出正确的前提下,是否可控中间推理痕迹(Thought-Attack)?
  • RQ5在像 AgentInstruct 和 ToolBench 这样的现实基准上,后门形式的表现如何?

主要发现

  • 后门攻击随着污染数据增加而显著提高攻击成功率(Query-Attack 中 30+ 污染样本时 ASR>80%)。
  • Observation-Attack 在观测中触发器能实现较高的 ASR,尽管略低于基于查询的触发,因为更难捕捉触发。
  • Thought-Attack 表明操作内部推理(如工具调用)可以在不改变最终输出的情况下引导行为,指向更隐蔽的威胁。
  • 相较于干净模型,带后门的代理在目标任务(WS Target)上的奖励下降,反映出背门影响下的次优结果。
  • Query-Attack 与 Observation-Attack 可能因污染痕迹导致的分布偏移,削弱保留任务(WS Clean) 的正常表现。
  • Thought-Attack 证实在翻译任务中,通过可控的污染比率来引导工具使用(如始终调用 Translate_v3)是可行的。
Figure 2: Case study on Query-Attack. The response of clean model is on the left, the response of attacked model is on the right.
Figure 2: Case study on Query-Attack. The response of clean model is on the left, the response of attacked model is on the right.

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。