QUICK REVIEW

[논문 리뷰] RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage

Peter Yong Zhong, Siyuan Chen|ArXiv.org|2025. 02. 13.

Privacy-Preserving Technologies in Data인용 수 3

한 줄 요약

RTBAS 자동으로 도구 호출을 탐지하고 실행하며 도구 기반 에이전트 시스템에서 무결성과 기밀성을 보존하고, 안전장치가 보장되지 않을 때만 사용자 확인을 요구하며, 최소한의 유틸리티 손실로 강력한 방어를 달성한다.

ABSTRACT

Tool-Based Agent Systems (TBAS) allow Language Models (LMs) to use external tools for tasks beyond their standalone capabilities, such as searching websites, booking flights, or making financial transactions. However, these tools greatly increase the risks of prompt injection attacks, where malicious content hijacks the LM agent to leak confidential data or trigger harmful actions. Existing defenses (OpenAI GPTs) require user confirmation before every tool call, placing onerous burdens on users. We introduce Robust TBAS (RTBAS), which automatically detects and executes tool calls that preserve integrity and confidentiality, requiring user confirmation only when these safeguards cannot be ensured. RTBAS adapts Information Flow Control to the unique challenges presented by TBAS. We present two novel dependency screeners, using LM-as-a-judge and attention-based saliency, to overcome these challenges. Experimental results on the AgentDojo Prompt Injection benchmark show RTBAS prevents all targeted attacks with only a 2% loss of task utility when under attack, and further tests confirm its ability to obtain near-oracle performance on detecting both subtle and direct privacy leaks.

연구 동기 및 목표

TBAS에서 프롬프트 주입 및 프라이버시 누출 위험의 동기를 제시한다.
무결성과 기밀성을 최소한의 사용자 부담으로 보존하는 정보 흐름 제어 프레임워크를 개발한다.
TBAS에서 보안 메타데이터를 선택적으로 전파하는 의존성 스크리닝 기법을 도입한다.
두 가지 실용 의존성 스크리너(LM-Judge와 Attention-Based)를 도입하여 관련 이력 영역을 식별한다.
RTBAS를 AgentDojo에서 평가하여 공격 예방과 작업 유틸리티 보존을 입증한다.

제안 방법

정보 흐름 제어(IFC)를 TBAS에 적용하여 보안 메타데이터를 선택적 이력 영역을 통해 전파한다.
다음 LM 결정이나 도구 호출과 관련된 영역을 식별하기 위해 의존성 스크리너를 도입한다(관련 없는 영역을 마스킹).
두 스크리너: LM-Judge(보조 LM이 의존성을 판단)와 Attention-Based(주의 기능을 사용해 의존성을 예측하는 신경망).
무결성/기밀성 라벨에 따라 도구 호출 실행을 제약하는 보안 격자 L과 정보 흐름 정책 P를 정의한다.

Figure 1 : An example prompt injection in TBAS. Prior to this interaction, Mallory embeds a malicious prompt (shown in red) in her Venmo transaction description. The LM calls the get_recent_transaction tool to respond to user’s request, which returns Mallory’s prompt as part of the tool response. Th

실험 결과

연구 질문

RQ1RTBAS가 은행, 여행, 메시징 등 다양한 영역에서 TBAS 도구 호출을 악용한 프롬프트 주입을 탐지하고 차단할 수 있는가?
RQ2선택적 영역 마스킹이 공격하에서 기본 방어 대비 작업 유틸리티에 어떤 영향을 미치는가?
RQ3LM-Judge와 Attention-Based 스크리너가 의존성 영역을 얼마나 정확하게 식별해 안전한 정보 흐름을 안내하는가?
RQ4RTBAS가 oracle 정책에 비견되는 기밀성 보호를 달성하면서 사용자 확인을 줄일 수 있는가?
RQ5RTBAS의 TBAS 작업에서 의도치 않은 프라이버시 누출 탐지 성능은 어떠한가?

주요 결과

RTBAS는 AgentDojo에서 대상 프롬프트 주입 공격을 공격 중 유틸리티 손실이 2% 미만으로 모두 방지한다.
RTBAS는 대부분의 작업에 대해 오라클과 동일한 안전한 도구 호출 세트를 탐지하고 실행하며 사용자 확인이 필요 없다.
RTBAS는 우발적 누출 평가에서 오라클 수준의 기밀성 보호를 달성한다.
Attention-based 스크리닝은 의존성 식별에 효과적임을 보여주며, LM-Judge와 함께 의존성 분석에 보완적 전략을 제공한다.
RTBAS는 공격 방지와 유틸리티 보존 모두에서 최신 방어보다 우수하다.

RTBAS: Defending LLM Agents Against Prompt Injection and Privacy Leakage

더 나은 연구,지금 바로 시작하세요

연구 설계부터 논문 작성까지, 연구 시간을 획기적으로 줄여보세요.

카드 등록 없음 · 무료 플랜 제공

이 리뷰는 AI가 만들고, 인간 에디터가 검토했습니다.