Skip to main content
QUICK REVIEW

[论文解读] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou, Zhangchi Xue|arXiv (Cornell University)|Jan 15, 2026
Security and Verification in Computing被引用 0
一句话总结

ToolSafe 引入 TS-Bench、TS-Guard 和 TS-Flow,以在基于 LLM 的代理中对工具调用进行前置逐步安全监控,实现有害调用降低最多 65%、无害任务完成提升约 10%。

ABSTRACT

While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.

研究动机与目标

  • 识别在执行前指示不安全工具调用的逐步信号。
  • 创建 TS-Bench 作为 LLM 代理中逐步工具调用安全性的基准。
  • 开发 TS-Guard 作为一个多任务强化学习训练的守线,用于执行前的安全判断与可解释反馈。
  • 提出 TS-Flow 以提供反馈驱动的推理,引导更安全、更高效的工具使用。

提出的方法

  • 从交互日志构建 TS-Bench,以在四个不安全模式(MUR、PI、HT、BTRA)中将逐步安全标注为 safe、controversial 或 unsafe。
  • 通过多任务奖励的强化学习对 TS-Guard 进行训练,预测有害性、攻击链接以及最终安全标签,并输出简要分析/推理。
  • 使用 Group Relative Policy Optimization (GRPO) 以平衡多任务奖励来优化 TS-Guard。
  • 开发 TS-Flow 作为守线-反馈驱动的推理框架,提供执行前的反馈,而非中止任务。
  • 在逐步检测(TS-Bench)和受守护的代理性能上,在多个基准(AgentDojo、ASB、AgentHarm)上评估守线。
Figure 1: Illustration of two categories of tool invocation security risks considered in this study. (a) Malicious user requests that directly induce unsafe tool invocation. (b) Prompt injection attacks occurring during benign task execution, leading to unintended tool use.
Figure 1: Illustration of two categories of tool invocation security risks considered in this study. (a) Malicious user requests that directly induce unsafe tool invocation. (b) Prompt injection attacks occurring during benign task execution, leading to unintended tool use.

实验结果

研究问题

  • RQ1在执行前,LLM 基于代理的哪些逐步信号指示潜在不安全的工具调用?
  • RQ2如何训练一个可泛化的守线模型,以在执行前检测逐步的不安全工具调用?
  • RQ3如何将逐步守线整合到基于 LLM 的代理中,以在不降低无害任务性能的前提下提升安全性?
  • RQ4在真实世界的代理场景中,守线对提示注入和相关攻击向量的鲁棒性如何?

主要发现

  • TS-Guard 在四个不安全模式下的 TS-Bench 上始终优于基线。
  • TS-Flow 在平均水平上将有害工具调用减少最多 65%,而无害任务完成提升约 10%。
  • 守线反馈在风险步骤中提高代理输出熵,促进安全意识下的探索。
  • 多任务监督(有害性、攻击关联性、安全性)提升 F1,减少误报。
  • 动态守线反馈(TS-Flow)比“检测后中止”方法在安全性-效用权衡上表现更好。
  • 更丰富的守线反馈(完整的 TS-Guard 输出)比仅使用安全等级时进一步提升安全性和效用。
Figure 2: Illustration of our proactive step-level guardrail and feedback framework for LLM agents. (a) Input and output format of TS-Guard. (b) TS-Flow feeds guardrail feedback to the agent, enabling safe tool invocation reasoning rather than aborting execution.
Figure 2: Illustration of our proactive step-level guardrail and feedback framework for LLM agents. (a) Input and output format of TS-Guard. (b) TS-Flow feeds guardrail feedback to the agent, enabling safe tool invocation reasoning rather than aborting execution.

更好的研究,从现在开始

从论文设计到论文写作,大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成,并经人工编辑审核。