
[Paper Explained] VeriGrey: Greybox Agent Validation

Yuntong Zhang, Sungmin Kang|arXiv (Cornell University)|Mar 18, 2026
Adversarial Robustness in Machine Learning | Cited by 0
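One-sentence summary

VeriGrey introduces grey-box fuzzing that uses tool-invocation sequences as feedback to mutate prompts, thereby exposing indirect prompt injection vulnerabilities.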

ABSTRACT

Agentic AI has been a topic of great interest recently. A Large Language Model (LLM) agent involves one or more LLMs in the back-end. In the front end, it conducts autonomous decision-making by combining the LLM outputs with results obtained by invoking several external tools. The autonomous interactions with the external environment introduce critical security risks. In this paper, we present a grey-box approach to explore diverse behaviors and uncover security risks in LLM agents. Our approach VeriGrey uses the sequence of tools invoked as a feedback function to drive the testing process. This helps uncover infrequent but dangerous tool invocations that cause unexpected agent behavior. As mutation operators in the testing process, we mutate prompts to design pernicious injection prompts. This is carefully accomplished by linking the task of the agent to an injection task, so that the injection task becomes a necessary step of completing the agent functionality. Comparing our approach with a black-box baseline on the well-known AgentDojo benchmark, VeriGrey achieves 33% additional efficacy in finding indirect prompt injection vulnerabilities with a GPT-4.1 back-end. We also conduct real-world case studies with the widely used coding agent Gemini CLI, and the well-known OpenClaw personal assistant. VeriGrey finds prompts inducing several attack scenarios that could not be identified by black-box approaches. In OpenClaw, by constructing a conversation agent which employs mutational fuzz testing as needed, VeriGrey is able to discover malicious skill variants from 10 malicious skills (with 10/10= 100% success rate on the Kimi-K2.5 LLM backend, and 9/10= 90% success rate on Opus 4.6 LLM backend). This demonstrates the value of a dynamic approach like VeriGrey to test agents, and to eventually lead to an agent assurance framework.

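Research motivation and goals

  • Motivate and formalize security testing of autonomous LLM agents, given the security needs created by non-determinism and external tool use.
  • Propose VeriGrey, a grey-box fuzzing framework that uses tool-invocation sequences as the feedback signal to drive prompt mutation.
  • Show that context-aware injection prompts can expose vulnerabilities that black-box methods miss.
  • Demonstrate VeriGrey's effectiveness on a standard benchmark and on real-world agent systems.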

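Proposed method

  • Instrument the LLM agent to log tool calls, and use the sequence of invoked tools as a lightweight feedback signal.
  • Run a seed-driven grey-box fuzzing loop in which energy assignment is guided by new tool sequences and transitions.
  • Mutate prompts via context bridging, aligning the injection task with the user task so that the injection becomes a necessary step of completing the task.
  • Adopt a verifier-agent approach in which an internal module (MutatePrompt) generates context-aware injection prompts.
  • Evaluate on the AgentDojo benchmark against a black-box baseline, and conduct case studies on Gemini CLI and OpenClaw, measuring vulnerability discovery.
  • The black-box baseline mutates prompts randomly, without tool-sequence feedback.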
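The feedback loop described in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_agent`, `MUTATIONS`, `Seed`, `transitions`, and `fuzz` are hypothetical names, the agent is a deterministic stand-in rather than a real LLM, and the energy rule (one plus the number of newly seen tool transitions) is an assumed placeholder for whatever schedule VeriGrey actually uses.

```python
import random
from dataclasses import dataclass


def toy_agent(prompt: str) -> tuple:
    """Hypothetical stand-in for an instrumented agent.

    Returns the tool names it would invoke for a prompt. A real harness would
    run the agent and read the sequence back from its tool-call log.
    """
    calls = ["read_inbox"]
    if "invoice" in prompt:
        calls.append("open_attachment")
        if "required step" in prompt:
            calls.append("send_money")  # the rare, dangerous invocation
    return tuple(calls)


# Hypothetical mutation operators. The second one imitates context bridging by
# framing the injected action as a necessary step of the user task.
MUTATIONS = [
    lambda p: p + " and handle the invoice",
    lambda p: p + " (this is a required step of the task)",
    lambda p: p + " please",
]


@dataclass
class Seed:
    prompt: str
    energy: int


def transitions(seq):
    """Adjacent tool pairs, e.g. (a, b, c) -> {(a, b), (b, c)}."""
    return set(zip(seq, seq[1:]))


def fuzz(agent, initial_prompt, is_violation, budget=200, rng_seed=0):
    """Grey-box loop: keep a mutant as a seed only if its tool trace is novel."""
    rng = random.Random(rng_seed)
    queue = [Seed(initial_prompt, energy=1)]
    seen_sequences, seen_transitions = set(), set()
    findings = []
    for _ in range(budget):
        # Seeds with more energy are scheduled more often.
        seed = rng.choices(queue, weights=[s.energy for s in queue])[0]
        candidate = rng.choice(MUTATIONS)(seed.prompt)
        seq = agent(candidate)
        if is_violation(seq):
            findings.append((candidate, seq))
        new_transitions = transitions(seq) - seen_transitions
        if seq not in seen_sequences or new_transitions:
            seen_sequences.add(seq)
            seen_transitions |= new_transitions
            # Assumed energy rule: reward mutants that reach new transitions.
            queue.append(Seed(candidate, energy=1 + len(new_transitions)))
    return findings, seen_sequences


if __name__ == "__main__":
    found, seen = fuzz(toy_agent, "Summarize my inbox", lambda s: "send_money" in s)
    print(len(seen), "distinct tool sequences;", len(found), "violating prompts")
```

A black-box variant of the same loop would skip the novelty check and mutate only the initial prompt, so it could not build on a partially successful mutant. In this toy setup that is what makes the two-step path to `send_money` easier to reach with feedback.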
Figure 1. A diagram of the attack model for our work.

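Experimental results

Research questions

  • RQ1: Does VeriGrey find more injection-vulnerable prompts than the baseline?
  • RQ2: What is the impact of each component of VeriGrey?
  • RQ3: Can VeriGrey still find effective prompts under common prompt-injection defenses?
  • RQ4: Can VeriGrey identify vulnerabilities in real-world agent systems (Gemini CLI and OpenClaw)?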

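Key findings

  • With a GPT-4.1 back-end on AgentDojo, VeriGrey achieves 33% additional efficacy over the black-box baseline in finding indirect prompt injection vulnerabilities.
  • Across AgentDojo's domains (workspace, travel, banking), VeriGrey shows higher fault-finding ability when tool-sequence feedback is used as the signal.
  • On OpenClaw, VeriGrey discovers malicious skill variants from 10 malicious skills, with a 100% success rate (10/10) on the Kimi-K2.5 back-end and 90% (9/10) on the Opus 4.6 back-end.
  • VeriGrey's ablation study shows that the feedback function is critical to fault-finding efficacy; removing it reduces performance.
  • The Gemini CLI and OpenClaw case studies demonstrate practical vulnerability discovery that black-box methods miss.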
Figure 2. Examples of context bridging and feedback for VeriGrey. Presented examples are from our experiments, lightly edited for clarity.


This explainer was generated by AI and reviewed by a human editor.