QUICK REVIEW

[Paper Review] AI Planning Framework for LLM-Based Web Agents

Orit Shahnovsky, Rotem Dror|arXiv (Cornell University)|Mar 13, 2026

Multimodal Machine Learning Applications|Citations: 0

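One-Line Summary

Summary: This paper maps LLM-based web agents onto classical planning paradigms, introduces a comprehensive evaluation framework with novel metrics, builds a reference dataset of 794 trajectories on WebArena, and compares Step-by-Step and Full-Plan-in-Advance agents.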

ABSTRACT

Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision-making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step-by-Step agents to Breadth-First Search (BFS), Tree Search agents to Best-First Tree Search, and Full-Plan-in-Advance agents to Depth-First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human-labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step-by-Step agent against a novel Full-Plan-in-Advance implementation. Our results reveal that while the Step-by-Step agent aligns more closely with human gold trajectories (38% overall success), the Full-Plan-in-Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.

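Motivation and Objectives

Formalize web tasks as sequential decision-making processes in order to analyze LLM-based web agents.
Introduce a taxonomy that maps modern agent architectures to traditional planning paradigms.
Develop novel evaluation metrics that assess trajectory quality, not just success rate.
Build a human-labeled dataset of 794 trajectories for benchmarking on WebArena.
Compare Step-by-Step and Full-Plan-in-Advance agents to show the usefulness of the metrics and the effect of planning.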

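Proposed Method

Propose a planning-based taxonomy: Step-by-Step (BFS-like), Tree Search (Best-First search with a value function), and Full-Plan-in-Advance (DFS-like).
Implement a Full-Plan-in-Advance agent that generates a complete plan up front and then executes it, using an Accessibility Tree representation.
Represent each web page as an Accessibility Tree and use prompts to generate, track, and execute the multi-step plan.
Introduce five novel trajectory metrics: Recovery Rate, Repetitiveness Rate, Step Success Rate, Partial Success Rate, and Element Accuracy Rate.
Use an LLM as a judge to compare human gold steps with agent steps semantically, and compute the metrics from its judgments.
Evaluate on the WebArena dataset (794 of its 812 tasks annotated with trajectories) using GPT-4o-mini in an exploration setting.
Show that the Step-by-Step agent is closer to the human gold trajectories (38.41% overall success rate), while the Full-Plan-in-Advance agent achieves higher element accuracy (89%).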
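The difference between the two agent styles compared above can be illustrated with a minimal sketch. This is not the authors' implementation: `ToyEnv`, `run_step_by_step`, `run_full_plan`, and the scripted policy and planner are hypothetical names introduced here, and in a real agent the policy and planner would be LLM calls over an Accessibility Tree observation.

```python
from typing import Callable, List

# Hypothetical types: an observation is a string (e.g. a serialized
# Accessibility Tree) and an action is a string such as "click [12]".
Policy = Callable[[str, str, List[str]], str]   # (goal, observation, history) -> next action
Planner = Callable[[str, str], List[str]]       # (goal, initial observation) -> full plan


class ToyEnv:
    """Stand-in for a browser: records actions and returns a dummy observation."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def observe(self) -> str:
        return f"page-after-{len(self.log)}-actions"

    def step(self, action: str) -> None:
        self.log.append(action)


def run_step_by_step(env: ToyEnv, goal: str, policy: Policy, max_steps: int = 10) -> List[str]:
    """Re-observe the page before every action and pick one action at a time."""
    history: List[str] = []
    for _ in range(max_steps):
        action = policy(goal, env.observe(), history)
        if action == "stop":
            break
        env.step(action)
        history.append(action)
    return history


def run_full_plan(env: ToyEnv, goal: str, planner: Planner) -> List[str]:
    """Produce the whole plan from the initial observation, then execute it as-is."""
    plan = planner(goal, env.observe())
    for action in plan:
        env.step(action)
    return plan


# Scripted stand-ins for the LLM calls (assumptions for illustration only).
def scripted_policy(goal: str, observation: str, history: List[str]) -> str:
    script = ["click [search]", "type [query]", "stop"]
    return script[len(history)]


def scripted_planner(goal: str, observation: str) -> List[str]:
    return ["click [search]", "type [query]"]
```

The only structural difference in this sketch is where the observation is consulted: the step-by-step loop calls `env.observe()` before every action, while the full-plan loop reads it once and commits to the resulting plan.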

Figure 1. An example step from task 40 illustrating the agent's decision-making process. The pink section, labeled A, represents the previous action, the top gray section, labeled B, details the agent's reasoning process, the bottom gray section, labeled C, contains metadata, which we did not inc

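Experimental Results

Research Questions

RQ1: How can modern LLM-based web agents be classified within traditional AI planning paradigms?
RQ2: Which planning framework best mitigates issues such as context drift and incoherent task decomposition?
RQ3: Can the new trajectory-centric metrics reveal strengths and weaknesses of different planning strategies beyond final task success?
RQ4: Does Full-Plan-in-Advance planning improve technical measures such as element accuracy compared with Step-by-Step?
RQ5: How can human gold trajectories be used to benchmark and diagnose planning failures in web agents?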

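Key Findings

The Step-by-Step agent aligns more closely overall with the human gold trajectories, with an overall success rate of 38.41%.
The Full-Plan-in-Advance agent achieves higher element accuracy (89%).
A new human-labeled WebArena dataset of 794 trajectories was created for benchmarking planning performance.
The five evaluation metrics capture trajectory quality rather than only binary success or failure.
The framework makes it possible to diagnose failures caused by context drift and incoherent task decomposition.
The experiments show that trajectory-aware metrics are needed when choosing an architecture to fit an application's constraints.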
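To make the idea of trajectory-level metrics concrete, here is a rough sketch of two of them. The review does not give the paper's formulas, so the definitions below are assumptions: Step Success Rate is taken as the fraction of gold steps matched position-by-position, and Repetitiveness Rate as the fraction of agent steps that repeat an earlier step. The paper uses an LLM as the judge; `exact_match` is a simple stand-in swapped in here so the example runs without a model.

```python
from typing import Callable, List


def exact_match(gold: str, pred: str) -> bool:
    """Stand-in judge (assumption): case-insensitive string equality."""
    return gold.strip().lower() == pred.strip().lower()


def step_success_rate(
    gold: List[str],
    agent: List[str],
    judge: Callable[[str, str], bool] = exact_match,
) -> float:
    """Assumed definition: share of gold steps matched by the agent step at the same index."""
    if not gold:
        return 0.0
    matched = sum(1 for g, a in zip(gold, agent) if judge(g, a))
    return matched / len(gold)


def repetitiveness_rate(agent: List[str]) -> float:
    """Assumed definition: share of agent steps identical to a step taken earlier."""
    if not agent:
        return 0.0
    seen = set()
    repeats = 0
    for action in agent:
        if action in seen:
            repeats += 1
        seen.add(action)
    return repeats / len(agent)


# Made-up trajectories for illustration; not taken from the dataset.
gold_steps = ["click [1]", "type [2] hello", "click [3]"]
agent_steps = ["click [1]", "click [1]", "click [3]", "click [3]"]

ssr = step_success_rate(gold_steps, agent_steps)
rr = repetitiveness_rate(agent_steps)
```

Two agents with the same binary outcome on a task can differ on values like these, which is the kind of distinction the findings above attribute to the metrics. Passing a different `judge` callable is where a semantic comparison would plug in.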

Figure 2. Success rates of the Step-by-Step agent and the Full-Plan-in-Advance agent on the WebArena benchmark, broken down by domain.


This review was generated by AI and checked by a human editor.