Skip to main content
QUICK REVIEW

[论文解读] RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage

Peter Yong Zhong, Siyuan Chen|ArXiv.org|Feb 13, 2025
Privacy-Preserving Technologies in Data被引用 3
一句话总结

RTBAS 自动检测并执行 Tool 调用,在 Tool-Based Agent Systems 中保持完整性与保密性,只有在无法保证安全防护时才需要用户确认,实现强防御且降低效用损失。

ABSTRACT

Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external tools for tasks beyond their standalone capabilities, such as searching websites, booking flights, or making financial transactions. However, these tools greatly increase the risks of prompt injection attacks, where malicious content hijacks the LM agent to leak confidential data or trigger harmful actions. Existing defenses (OpenAI GPTs) require user confirmation before every tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS), which automatically detects and executes tool calls that preserve integrity and confidentiality, requiring user confirmation only when these safeguards cannot be ensured. RTBAS adapts Information Flow Control to the unique challenges presented by TBAS. We present two novel dependency screeners, using LM-as-a-judge and attention-based saliency, to overcome these challenges. Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS prevents all targeted attacks with only a 2% loss of task utility when under attack, and further tests confirm its ability to obtain near-oracle performance on detecting both subtle and direct privacy leaks.

研究动机与目标

  • Motivate the risk of prompt injection and privacy leakage in Tool-Based Agent Systems (TBAS).
  • Develop an information flow control framework that preserves integrity and confidentiality with minimal user burden.
  • Introduce dependency screening techniques to selectively propagate security metadata in TBAS.
  • Propose two practical dependency screeners (LM-Judge and Attention-Based) for identifying relevant history regions.
  • Evaluate RTBAS on AgentDojo to demonstrate attack prevention and task utility preservation.

提出的方法

  • Adapt Information Flow Control (IFC) to TBAS by propagating security metadata through selective history regions.
  • Introduce dependency screeners to identify regions relevant to the next LM decision or tool call (masking irrelevant regions).
  • Two screeners: LM-Judge (a secondary LM judges dependencies) and Attention-Based (a neural network using attention features to predict dependency).
  • Use masking/redaction to prevent taint propagation from irrelevant regions, preserving task utility and reducing unnecessary confirmations.
  • Define a security lattice L and information flow policy P that constrain tool-call execution based on integrity/confidentiality labels.
Figure 1 : An example prompt injection in TBAS. Prior to this interaction, Mallory embeds a malicious prompt (shown in red) in her Venmo transaction description. The LM calls the get_recent_transaction tool to respond to user’s request, which returns Mallory’s prompt as part of the tool response. Th
Figure 1 : An example prompt injection in TBAS. Prior to this interaction, Mallory embeds a malicious prompt (shown in red) in her Venmo transaction description. The LM calls the get_recent_transaction tool to respond to user’s request, which returns Mallory’s prompt as part of the tool response. Th

实验结果

研究问题

  • RQ1Can RTBAS detect and block prompt injections that exploit TBAS tool calls across domains like banking, travel, and messaging?
  • RQ2How does selective region masking affect task utility under attack compared with baseline defenses?
  • RQ3How accurately can LM-Judge and Attention-Based screeners identify dependency regions to guide secure information flow?
  • RQ4Can RTBAS achieve confidenciality protections comparable to oracle policies while reducing user confirmations?
  • RQ5What is RTBAS’s performance on detecting accidental privacy leakage in TBAS tasks?

主要发现

  • RTBAS prevents all targeted prompt-injection attacks in AgentDojo with less than 2% utility loss under attack.
  • RTBAS detects and executes the same set of safe tool calls as the oracle for most tasks, without requiring user confirmation.
  • RTBAS matches oracle-level confidentiality protection in evaluations of accidental leakage.
  • Attention-based screening demonstrates effective dependency identification and, with LM-Judge, provides complementary strategies for dependence analysis.
  • RTBAS outperforms state-of-the-art defenses in both attack prevention and utility preservation.
RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。