QUICK REVIEW

[Paper Review] AI Planning Framework for LLM-Based Web Agents

Orit Shahnovsky, Rotem Dror|arXiv (Cornell University)|2026. 03. 13.

Multimodal Machine Learning Applications | Citations: 0

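One-Line Summary

This paper maps LLM-based web agents onto classical planning paradigms, presents a comprehensive evaluation framework that includes new metrics, builds a 794-trajectory reference dataset on WebArena, and compares Step-by-Step and Full-Plan-in-Advance agents.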

ABSTRACT

Developing autonomous agents for web-based tasks is a core challenge in AI. While Large Language Model (LLM) agents can interpret complex user requests, they often operate as black boxes, making it difficult to diagnose why they fail or how they plan. This paper addresses this gap by formally treating web tasks as sequential decision-making processes. We introduce a taxonomy that maps modern agent architectures to traditional planning paradigms: Step-by-Step agents to Breadth-First Search (BFS), Tree Search agents to Best-First Tree Search, and Full-Plan-in-Advance agents to Depth-First Search (DFS). This framework allows for a principled diagnosis of system failures like context drift and incoherent task decomposition. To evaluate these behaviors, we propose five novel evaluation metrics that assess trajectory quality beyond simple success rates. We support this analysis with a new dataset of 794 human-labeled trajectories from the WebArena benchmark. Finally, we validate our evaluation framework by comparing a baseline Step-by-Step agent against a novel Full-Plan-in-Advance implementation. Our results reveal that while the Step-by-Step agent aligns more closely with human gold trajectories (38% overall success), the Full-Plan-in-Advance agent excels in technical measures such as element accuracy (89%), demonstrating the necessity of our proposed metrics for selecting appropriate agent architectures based on specific application constraints.

Research Motivation and Goals

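Formalizes web tasks as sequential decision-making processes in order to analyze LLM-based web agents.
Introduces a taxonomy that maps modern agent architectures to traditional planning paradigms.
Develops new evaluation metrics that assess trajectory quality, not just task success.
Builds a human-labeled dataset of 794 trajectories for benchmarking on WebArena.
Compares Step-by-Step and Full-Plan-in-Advance agents to demonstrate the usefulness of the metrics and the impact of the planning strategy.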

Proposed Method

Proposes a planning-based taxonomy: Step-by-Step (BFS-like), Tree Search (Best-First search with a value function), and Full-Plan-in-Advance (DFS-like).
Implements a Full-Plan-in-Advance agent that represents each web page as an Accessibility Tree and uses prompts to generate, follow, and execute a complete multi-step plan.
Introduces five new trajectory-level evaluation metrics: Recovery Rate, Repetitiveness Rate, Step Success Rate, Partial Success Rate, and Element Accuracy Rate.
Uses an LLM judge to compute the metrics through semantic comparison of human gold steps and agent steps.
Evaluates on the WebArena dataset (812/794 trajectory annotations) with a GPT-4o-mini configuration.
Shows that Step-by-Step aligns more closely with human gold trajectories (38.41% overall success), while Full-Plan-in-Advance excels in element accuracy (89%).
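
The contrast between the two compared control loops can be sketched in a few lines. This is an illustrative toy only, not the authors' implementation: the `SITE` transition table, `toy_policy`, and `toy_planner` are hypothetical stand-ins for a real browser environment and for LLM calls, and the Accessibility Tree observation is reduced to a page name.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy "website": (page, action) -> next page.
SITE: Dict[Tuple[str, str], str] = {
    ("home", "click:search"): "results",
    ("results", "click:item"): "product",
    ("product", "click:buy"): "checkout",
}


def apply_action(page: str, action: str) -> str:
    """Return the next page; an invalid action leaves the page unchanged."""
    return SITE.get((page, action), page)


def run_step_by_step(
    policy: Callable[[str], str], start: str, goal: str, max_steps: int = 10
) -> Tuple[List[str], str]:
    """Observe the current page, request ONE action, execute it, repeat."""
    page = start
    trajectory: List[str] = []
    for _ in range(max_steps):
        if page == goal:
            break
        action = policy(page)  # stand-in for one LLM call per step
        trajectory.append(action)
        page = apply_action(page, action)
    return trajectory, page


def run_full_plan(
    planner: Callable[[str, str], List[str]], start: str, goal: str
) -> Tuple[List[str], str]:
    """Request the WHOLE plan once, then execute it without re-planning."""
    plan = planner(start, goal)  # stand-in for a single planning LLM call
    page = start
    for action in plan:
        page = apply_action(page, action)
    return plan, page


# Hypothetical stand-ins for the LLM: a lookup table and a fixed plan.
NEXT_ACTION = {"home": "click:search", "results": "click:item", "product": "click:buy"}


def toy_policy(page: str) -> str:
    return NEXT_ACTION[page]


def toy_planner(start: str, goal: str) -> List[str]:
    return ["click:search", "click:item", "click:buy"]


if __name__ == "__main__":
    print(run_step_by_step(toy_policy, "home", "checkout"))
    print(run_full_plan(toy_planner, "home", "checkout"))
```

In this toy, the Step-by-Step loop re-observes the page before every action, whereas the Full-Plan-in-Advance loop commits to its plan up front, so a single wrong step in the plan is never corrected.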

Figure 1. An example step from task 40 illustrating the agent’s decision-making process. The pink section, labeled A, represents the previous action; the top gray section, labeled B, details the agent’s reasoning process; the bottom gray section, labeled C, contains metadata, which we did not inc

Experimental Results

Research Questions

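RQ1: How can modern LLM-based web agents be classified within traditional AI planning paradigms?
RQ2: Which planning framework best mitigates problems such as context drift and incoherent task decomposition in web tasks?
RQ3: Can new trajectory-centric evaluation metrics reveal the strengths and weaknesses of different planning strategies beyond final task success?
RQ4: Does the Full-Plan-in-Advance approach improve technical metrics such as element accuracy compared with Step-by-Step?
RQ5: Can human gold trajectories be used to benchmark and diagnose planning failures in web agents?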

Key Results

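The Step-by-Step agent aligns more closely with human gold trajectories, with 38.41% overall success.
The Full-Plan-in-Advance agent achieves higher element accuracy (89%).
A human-labeled WebArena dataset of 794 trajectories was created for benchmarking planning performance.
The five evaluation metrics capture trajectory quality beyond a binary success measure.
The framework makes it possible to diagnose failures caused by context drift and incoherent task decomposition.
The experiments suggest that trajectory-aware metrics are needed to select an architecture according to application constraints.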
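
To make the idea of trajectory-level metrics concrete, here is a minimal sketch of three of the five named metrics. The review does not give the paper's formulas, so every definition below is an assumption chosen for illustration. The `Step` type is hypothetical, and `exact_match` is a stand-in for the LLM judge's semantic comparison.

```python
from typing import Callable, List, NamedTuple


class Step(NamedTuple):
    action: str   # e.g. "click" or "type"
    element: str  # hypothetical identifier of the targeted page element


def exact_match(a: Step, b: Step) -> bool:
    """Stand-in for the LLM judge: here two steps match only if identical."""
    return a == b


def step_success_rate(
    agent: List[Step], gold: List[Step],
    same: Callable[[Step, Step], bool] = exact_match,
) -> float:
    """ASSUMED definition: share of gold steps matched by some agent step."""
    if not gold:
        return 0.0
    matched = sum(1 for g in gold if any(same(a, g) for a in agent))
    return matched / len(gold)


def repetitiveness_rate(agent: List[Step]) -> float:
    """ASSUMED definition: share of consecutive step pairs that are identical."""
    if len(agent) < 2:
        return 0.0
    repeats = sum(1 for prev, cur in zip(agent, agent[1:]) if prev == cur)
    return repeats / (len(agent) - 1)


def element_accuracy_rate(agent: List[Step], gold: List[Step]) -> float:
    """ASSUMED definition: share of agent steps that target a gold element."""
    if not agent:
        return 0.0
    gold_elements = {g.element for g in gold}
    hits = sum(1 for a in agent if a.element in gold_elements)
    return hits / len(agent)


# Made-up trajectories, for illustration only.
gold = [Step("click", "search"), Step("click", "item"), Step("click", "buy")]
agent = [
    Step("click", "search"),
    Step("click", "search"),  # repeated step
    Step("click", "item"),
    Step("type", "coupon"),   # element that never appears in the gold steps
]

if __name__ == "__main__":
    print(step_success_rate(agent, gold))
    print(repetitiveness_rate(agent))
    print(element_accuracy_rate(agent, gold))
```

Even under these simplified definitions, a failed trajectory still yields graded scores, which is the kind of signal a binary success rate cannot provide.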

Figure 2. Success rates of the Step-by-Step agent and the Full-Plan-in-Advance agent on the WebArena benchmark, broken down by domain.


This review was generated by AI and reviewed by a human editor.