QUICK REVIEW

[Paper Review] AI Planning Framework for LLM-Based Web Agents

Orit Shahnovsky, Rotem Dror|arXiv (Cornell University)|Mar 13, 2026
Multimodal Machine Learning Applications|Cited by 0
One-Sentence Summary

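This paper maps LLM-based web agents onto classical planning paradigms, proposes a comprehensive evaluation framework with novel metrics, builds a reference dataset of 794 trajectories on WebArena, and compares Step-by-Step and Full-Plan-in-Advance agents.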

ABSTRACT

Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision-making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step-by-Step agents to Breadth-First Search (BFS), Tree Search agents to Best-First Tree Search, and Full-Plan-in-Advance agents to Depth-First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human-labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step-by-Step agent against a novel Full-Plan-in-Advance implementation. Our results reveal that while the Step-by-Step agent aligns more closely with human gold trajectories (38% overall success), the Full-Plan-in-Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.

Motivation and Objectives

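  • Formalize web tasks as sequential decision-making processes in order to analyze LLM-based web agents.
  • Introduce a taxonomy that maps modern agent architectures to traditional planning paradigms.
  • Develop new evaluation metrics that assess trajectory quality rather than success rate alone.
  • Create a human-labeled dataset of 794 trajectories for benchmarking planning performance on WebArena.
  • Compare Step-by-Step and Full-Plan-in-Advance agents to demonstrate the usefulness of the metrics and the impact of planning.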

Proposed Method

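  • Propose a planning-based taxonomy: Step-by-Step (analogous to BFS), Tree Search (best-first search with a value function), and Full-Plan-in-Advance (analogous to DFS).
  • Implement a Full-Plan-in-Advance agent that uses the Accessibility Tree representation of the web page to generate and follow a complete plan.
  • Represent web pages as Accessibility Trees and use prompts to generate, track, and execute multi-step plans.
  • Introduce five new evaluation metrics (recovery rate, repetition rate, step success rate, partial success rate, element accuracy) to measure trajectory quality.
  • Use an LLM as a judge to semantically compare human gold steps with agent steps when computing the metrics.
  • Evaluate on the WebArena benchmark (812 tasks, 794 with annotated trajectories) using GPT-4o-mini with an exploration setting.
  • The results show that the Step-by-Step agent aligns more closely with human gold trajectories (38.41% overall success rate), while the Full-Plan-in-Advance agent stands out on element accuracy (89%).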
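
The difference between the two agent styles compared above can be illustrated with a minimal sketch. This is not the authors' implementation: the page is reduced to a plain string instead of an Accessibility Tree, and `toy_env`, `toy_policy`, and `toy_planner` are hypothetical stand-ins for the browser and the LLM prompts.

```python
from typing import Callable, List

# Hypothetical signatures (assumptions, not taken from the paper):
Policy = Callable[[str, str], str]         # (goal, observation) -> next action
Planner = Callable[[str, str], List[str]]  # (goal, first observation) -> full plan
Env = Callable[[str, str], str]            # (state, action) -> next state


def run_step_by_step(goal: str, state: str, policy: Policy, env: Env,
                     max_steps: int = 10) -> List[str]:
    """Re-observe the page before every action: one decision per step."""
    trajectory: List[str] = []
    for _ in range(max_steps):
        action = policy(goal, state)
        if action == "stop":
            break
        trajectory.append(action)
        state = env(state, action)
    return trajectory


def run_full_plan_in_advance(goal: str, state: str, planner: Planner,
                             env: Env) -> List[str]:
    """Commit to a complete plan from the first observation, then execute it."""
    plan = planner(goal, state)
    for action in plan:
        state = env(state, action)
    return list(plan)


# Toy stand-ins so the sketch runs end to end.
def toy_env(state: str, action: str) -> str:
    # The "page" simply records the actions taken so far.
    return state + "/" + action


def toy_policy(goal: str, state: str) -> str:
    # Pick the first goal action that has not been performed yet.
    done = state.split("/")[1:]
    remaining = [a for a in goal.split(",") if a not in done]
    return remaining[0] if remaining else "stop"


def toy_planner(goal: str, state: str) -> List[str]:
    # Produce the whole action sequence up front.
    return goal.split(",")


goal = "search,click,submit"
print(run_step_by_step(goal, "home", toy_policy, toy_env))
print(run_full_plan_in_advance(goal, "home", toy_planner, toy_env))
```

In this toy setting both loops produce the same trajectory. The structural difference is where the model is consulted: once per step in the first loop, once per task in the second.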
Figure 1. An example step from task 40 illustrating the agent’s decision-making process. The pink section, labeled A, represents the previous action, the top gray section, labeled B, details the agent’s reasoning process, the bottom gray section, labeled C, contains meta data, which we did not inc

Experimental Results

Research Questions

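  • RQ1: How can modern LLM-based web agents be classified within traditional AI planning paradigms?
  • RQ2: Which planning framework better mitigates problems such as context drift and incoherent task decomposition in web tasks?
  • RQ3: Can the new trajectory-centric evaluation metrics reveal the strengths and weaknesses of different planning strategies beyond final task success?
  • RQ4: Compared with Step-by-Step, does the Full-Plan-in-Advance approach improve technical metrics such as element accuracy?
  • RQ5: How can human gold trajectories be used to benchmark and diagnose planning failures in web agents?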

Key Findings

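  • The Step-by-Step agent aligns more closely with human gold trajectories, with an overall success rate of 38.41%.
  • The Full-Plan-in-Advance agent achieves higher element accuracy (89%).
  • A human-labeled WebArena dataset of 794 trajectories was created for benchmarking planning performance.
  • The five evaluation metrics capture trajectory quality beyond binary success.
  • The framework makes it possible to diagnose failure causes such as context drift and incoherent task decomposition.
  • The experiments indicate that trajectory-aware metrics are necessary for selecting an appropriate architecture under specific application constraints.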
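
To make the idea of trajectory-level metrics concrete, here is a rough sketch of how four of the five named metrics might be computed. The formulas below are assumptions chosen for illustration, since this summary does not give the paper's definitions; recovery rate is left out for the same reason. The `exact_match` function is a simple stand-in for the LLM judge, and the `(action, element)` step format is hypothetical.

```python
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # (action type, target element id): hypothetical format
Judge = Callable[[Step, Step], bool]


def exact_match(agent_step: Step, gold_step: Step) -> bool:
    """Stand-in for the LLM judge: strict equality instead of semantic matching."""
    return agent_step == gold_step


def repetition_rate(agent: List[Step]) -> float:
    """Assumed definition: share of consecutive step pairs that are identical."""
    if len(agent) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(agent, agent[1:]) if prev == cur)
    return repeats / (len(agent) - 1)


def step_success_rate(agent: List[Step], gold: List[Step],
                      judge: Judge = exact_match) -> float:
    """Assumed definition: share of agent steps that match some gold step."""
    if not agent:
        return 0.0
    hits = sum(1 for s in agent if any(judge(s, g) for g in gold))
    return hits / len(agent)


def partial_success_rate(agent: List[Step], gold: List[Step],
                         judge: Judge = exact_match) -> float:
    """Assumed definition: share of gold steps covered by some agent step."""
    if not gold:
        return 0.0
    covered = sum(1 for g in gold if any(judge(s, g) for s in agent))
    return covered / len(gold)


def element_accuracy(agent: List[Step], gold: List[Step]) -> float:
    """Assumed definition: share of agent steps that target a gold element."""
    if not agent:
        return 0.0
    gold_elements = {element for _, element in gold}
    return sum(1 for _, element in agent if element in gold_elements) / len(agent)


# Made-up trajectories for illustration only.
gold = [("click", "search_box"), ("type", "search_box"), ("click", "submit")]
agent = [("click", "search_box"), ("click", "search_box"),
         ("type", "search_box"), ("click", "cancel")]

print(repetition_rate(agent))             # 1 repeated pair out of 3
print(step_success_rate(agent, gold))     # 3 of 4 agent steps match gold
print(partial_success_rate(agent, gold))  # 2 of 3 gold steps are covered
print(element_accuracy(agent, gold))      # 3 of 4 steps hit a gold element
```

Even on this made-up example, the four numbers differ from one another, which is the kind of detail a single binary success flag would hide.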
Figure 2. Success rates of the Step-by-Step agent and the Full-Plan-in-Advance agent on the WebArena benchmark, broken down by domain.


This review was generated by AI and reviewed by human editors.