QUICK REVIEW

[Paper Review] ToolSafe: Enhancing Tool Invocation Safety of LLM-based agents via Proactive Step-level Guardrail and Feedback

Yutao Mou, Zhangchi Xue|arXiv (Cornell University)|2026. 01. 15.

Security and Verification in Computing | Citations: 0

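One-line summary

ToolSafe introduces TS-Bench, TS-Guard, and TS-Flow for proactive, step-level safety monitoring of tool invocations in LLM-based agents, reducing harmful tool invocations of ReAct-style agents by 65% on average and improving benign task completion by about 10% under prompt injection attacks.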

ABSTRACT

While LLM-based agents can interact with environments via invoking external tools, their expanded capabilities also amplify security risks. Monitoring step-level tool invocation behaviors in real time and proactively intervening before unsafe execution is critical for agent deployment, yet remains under-explored. In this work, we first construct TS-Bench, a novel benchmark for step-level tool invocation safety detection in LLM agents. We then develop a guardrail model, TS-Guard, using multi-task reinforcement learning. The model proactively detects unsafe tool invocation actions before execution by reasoning over the interaction history. It assesses request harmfulness and action-attack correlations, producing interpretable and generalizable safety judgments and feedback. Furthermore, we introduce TS-Flow, a guardrail-feedback-driven reasoning framework for LLM agents, which reduces harmful tool invocations of ReAct-style agents by 65 percent on average and improves benign task completion by approximately 10 percent under prompt injection attacks.

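Motivation and objectives

Identify step-level signals that indicate an unsafe tool invocation before it is executed.
Build TS-Bench as a benchmark for step-level tool invocation safety in LLM agents.
Develop TS-Guard, a guardrail trained with multi-task RL that makes pre-execution safety judgments and gives interpretable feedback.
Propose TS-Flow, which provides feedback-driven reasoning that guides agents toward safer and more effective tool use.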

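Proposed method

Construct TS-Bench from interaction logs, labeling each step as safe, controversial (controv.), or unsafe across four unsafe patterns (MUR, PI, HT, BTRA).
Train TS-Guard with reinforcement learning on multi-task rewards for predicting request harmfulness, the action-attack link, and the final safety label, alongside a short analysis/reasoning output.
Optimize TS-Guard with Group Relative Policy Optimization (GRPO), balancing the multi-task rewards.
Develop TS-Flow, a guardrail-feedback-driven reasoning framework that delivers feedback before execution and avoids aborting the task.
Evaluate the guardrail on step-level detection (TS-Bench) and on guarded-agent performance across several benchmarks (AgentDojo, ASB, AgentHarm).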
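
To make the multi-task reward and the GRPO step more concrete, here is a minimal Python sketch. It is not the authors' implementation. The field names (`harmful`, `attack_linked`, `safety`), the exact-match scoring, the equal task weights, and the use of the population standard deviation are all assumptions chosen for illustration. The only part taken from GRPO itself is the general idea of standardizing each sampled response's reward within its group.

```python
from statistics import mean, pstdev

# Hypothetical field names: the review lists three prediction targets
# (harmfulness, attack link, final safety label) but not how they are encoded.
TASKS = ("harmful", "attack_linked", "safety")


def multi_task_reward(pred: dict, gold: dict, weights=None) -> float:
    """Weighted fraction of the three sub-predictions that match the gold labels.

    Equal weights and exact-match scoring are assumptions, not paper values.
    """
    weights = weights or {task: 1.0 for task in TASKS}
    total = sum(weights[task] for task in TASKS)
    hit = sum(weights[task] for task in TASKS if pred.get(task) == gold[task])
    return hit / total


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std is a choice made for this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# One gold-labeled step and a group of three sampled guardrail outputs.
gold = {"harmful": False, "attack_linked": True, "safety": "unsafe"}
group = [
    {"harmful": False, "attack_linked": True, "safety": "unsafe"},
    {"harmful": False, "attack_linked": False, "safety": "unsafe"},
    {"harmful": True, "attack_linked": False, "safety": "safe"},
]
rewards = [multi_task_reward(p, gold) for p in group]
advantages = group_relative_advantages(rewards)
```

In this toy group, the output that gets all three sub-predictions right receives the largest advantage and the output that gets none right receives the smallest, which is the signal a policy-gradient update would then reinforce.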

Figure 1: Illustration of two categories of tool invocation security risks considered in this study. (a) Malicious user requests that directly induce unsafe tool invocation. (b) Prompt injection attacks occurring during benign task execution, leading to unintended tool use.

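Experimental results

Research questions

RQ1: Which step-level signals indicate a potentially unsafe tool invocation before execution in LLM-based agents?
RQ2: How can a generalizable guardrail model be trained to detect unsafe tool invocations at the step level before execution?
RQ3: How can a step-level guardrail be integrated into LLM-based agents to improve benign task performance without enabling misuse?
RQ4: How robust is the guardrail to prompt injection and related attack vectors in real-world agent scenarios?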

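Key results

TS-Guard consistently outperforms the baselines on TS-Bench across all four unsafe patterns.
TS-Flow reduces harmful tool invocations by 65% on average and improves benign task completion by about 10%.
Guardrail feedback raises the entropy of the agent's output at risky steps, encouraging safety-aware exploration.
Multi-task supervision (harmfulness, attack correlation, safety) improves F1 and reduces false positives.
Dynamic guardrail feedback (TS-Flow) achieves a better balance between safety and utility than detect-and-abort approaches.
Richer guardrail feedback (the full TS-Guard output) improves both safety and utility compared with passing only the safety rating.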
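
The contrast between detect-and-abort and feedback-driven guarding can be sketched as two control loops. This Python sketch is illustrative only: the `Verdict` type, the `propose` and `guard` callables, the retry limit, and the choice to block only the "unsafe" rating are hypothetical and not taken from the paper. The toy agent and toy guard exist solely to make the example runnable.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    rating: str    # "safe", "controversial", or "unsafe"
    feedback: str  # short explanation produced by the guardrail


def detect_and_abort_step(history, propose, guard) -> Optional[str]:
    """Baseline: an unsafe verdict ends the task (returns None)."""
    action = propose(history)
    if guard(history, action).rating == "unsafe":
        return None
    return action


def feedback_step(history, propose, guard, max_retries: int = 2) -> Optional[str]:
    """Feedback-driven variant: return the verdict to the agent and let it re-plan.

    The retry loop and its limit are assumptions made for this sketch.
    """
    for _ in range(max_retries + 1):
        action = propose(history)
        verdict = guard(history, action)
        if verdict.rating != "unsafe":
            return action
        # Append the guardrail's explanation so the next proposal can use it.
        history = history + [("guardrail", verdict.feedback)]
    return None


def toy_guard(history, action):
    # Toy rule standing in for a learned guardrail.
    if action.startswith("send_money"):
        return Verdict("unsafe", "send_money is unrelated to the user's request")
    return Verdict("safe", "")


def toy_agent(history):
    # Follows an injected instruction until it has seen guardrail feedback.
    saw_feedback = any(role == "guardrail" for role, _ in history)
    return "read_inbox()" if saw_feedback else "send_money(to='attacker')"


history = [("user", "Summarize my inbox")]
aborted = detect_and_abort_step(history, toy_agent, toy_guard)
recovered = feedback_step(history, toy_agent, toy_guard)
```

Here the baseline blocks the injected call but also abandons the benign task, while the feedback loop blocks the same call and still lets the agent continue with a call that serves the user's request.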

Figure 2: Illustration of our proactive step-level guardrail and feedback framework for LLM agents. (a) Input and output format of TS-Guard. (b) TS-Flow feeds guardrail feedback to the agent, enabling safe tool invocation reasoning rather than aborting execution.

This review was generated by AI and reviewed by a human editor.