QUICK REVIEW

[Paper Review] Autonomous Tester Agent Benchmark

Shuyan Zhou|arXiv (Cornell University)|Jul 25, 2023
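Topic Modeling | Citations: 21
One-Sentence Summary

WebArena provides a realistic, reproducible web environment with four domains and 812 long-horizon tasks for evaluating language-guided autonomous agents. GPT-4 achieves an end-to-end task success rate of 14.41%, far below the human success rate of 78.24%.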

ABSTRACT

OpenStreetMap Docker files required to self-host the WebArena benchmark, as described here:
https://webarena.dev/
https://arxiv.org/abs/2307.13854
https://github.com/web-arena-x/webarena/tree/main/environment_docker
Copyright to OpenStreetMap: https://www.openstreetmap.org/copyright

Motivation and Objectives

  • Create a highly realistic, reproducible web environment for autonomous agents operating on the web.
  • Cover four real-world domains (e-commerce, forums, development, CMS) with functional tools and knowledge bases.
  • Provide a benchmark suite of long-horizon tasks that require functional correctness rather than surface-form action matching.
  • Enable evaluation of task execution quality via programmatic correctness across diverse task types.
  • Offer baseline agents using prompting strategies to establish upper/lower bounds on current capabilities.

Proposed Method

  • Construct a standalone, Docker-based WebArena environment with four fully functional website domains and utility tools.
  • Populate sites with data drawn from real-world counterparts to preserve authenticity while ensuring reproducibility.
  • Develop 812 benchmark tasks grounded in high-level natural language intents, with annotations and evaluation programs for functional correctness.
  • Define a reward/evaluation framework that checks intermediate states and final outcomes, accommodating multiple valid execution paths.
  • Experiment with several LLM-based baselines (e.g., GPT-4, GPT-3.5, text-bison) using prompting strategies including Chain-of-Thought and Unachievable hints.
  • Represent observations with multi-tab browser-like content (URL, page content, DOM or accessibility trees) and provide an action space mirroring web interactions (click, type, navigate, etc.).
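
The idea of checking functional correctness while allowing several valid execution paths can be illustrated with a toy example. This is a minimal sketch, not WebArena's actual code: the `Action` and `ToyShop` classes, the element names, and the "add a mug to the cart" intent are all hypothetical stand-ins chosen for illustration. Two different action sequences reach the same final state, and the evaluator inspects that state instead of comparing the action sequences.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Action:
    """One browser-style action (hypothetical; loosely mirrors click/type/navigate)."""
    kind: str          # "goto", "type", or "click"
    target: str = ""   # URL or element name (made-up identifiers)
    text: str = ""     # text to enter, used only by "type"


@dataclass
class ToyShop:
    """Hypothetical stand-in for one site: a URL, a search box, and a cart."""
    url: str = "/home"
    search_box: str = ""
    cart: list = field(default_factory=list)

    def step(self, action: Action) -> None:
        """Apply one action to the site state; unknown actions are ignored."""
        prefix = "/product/"
        if action.kind == "goto":
            self.url = action.target
        elif action.kind == "type" and action.target == "search_box":
            self.search_box = action.text
        elif action.kind == "click" and action.target == "search_button":
            self.url = prefix + self.search_box
        elif action.kind == "click" and action.target == "add_to_cart":
            # Only a product page has something to add to the cart.
            if self.url.startswith(prefix):
                self.cart.append(self.url[len(prefix):])


def run(actions):
    """Execute a trajectory on a fresh site and return the final state."""
    site = ToyShop()
    for action in actions:
        site.step(action)
    return site


def task_succeeded(site: ToyShop) -> bool:
    """Functional check for the toy intent 'add a mug to the cart'."""
    return site.cart == ["mug"]


# Path 1 goes through the search box; path 2 jumps straight to the product URL.
via_search = [
    Action("type", "search_box", "mug"),
    Action("click", "search_button"),
    Action("click", "add_to_cart"),
]
via_url = [
    Action("goto", "/product/mug"),
    Action("click", "add_to_cart"),
]

print(task_succeeded(run(via_search)))  # True
print(task_succeeded(run(via_url)))     # True
print(via_search == via_url)            # False: the trajectories differ
```

A surface-form comparison against either trajectory would reject the other one, whereas the state-based check accepts both.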

Experimental Results

Research Questions

  • RQ1: How well can current language models understand and execute long-horizon web tasks from high-level NL intents?
  • RQ2: What is the gap between state-of-the-art LLM agents and human performance on realistic, interactive web tasks?
  • RQ3: How do prompting strategies (with/without chain-of-thought) and fail-stop hints affect agent performance?
  • RQ4: Do tasks exhibit consistent difficulty across templates, and can memory or planning improvements close the gap to humans?
  • RQ5: What evaluation framework best captures functional correctness across diverse web interactions?

Key Findings

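CoT | UA Hint | Model | SR (%) | SR_AC (%) | SR_UA (%)
✓ | ✓ | text-bison-001 | 5.05 | 4.00 | 27.78
✗ | ✓ | GPT-3.5 | 6.41 | 4.90 | 38.89
✓ | ✓ | GPT-3.5 | 8.75 | 6.44 | 58.33
✓ | ✓ | GPT-4 | 11.70 | 8.63 | 77.78
✗ | ✗ | GPT-3.5 | 5.10 | 4.90 | 8.33
✓ | ✗ | GPT-3.5 | 6.16 | 6.06 | 8.33
✓ | ✗ | GPT-4 | 14.41 | 13.02 | 44.44
- | ✓ | Human | 78.24 | 77.30 | 100.00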
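
The three success-rate columns can be read as one overall rate plus two rates over subsets of tasks. The sketch below assumes, as a reading of the column names and not something this review states, that SR_AC is the rate over achievable tasks and SR_UA the rate over unachievable tasks, where "success" on an unachievable task means the agent correctly reports that it cannot be done. The `success_rates` helper and the six toy records are hypothetical; only the 78.24 and 14.41 figures come from the results above.

```python
def success_rates(results):
    """Compute SR, SR_AC and SR_UA (in percent) from per-task outcomes.

    results: list of (achievable, succeeded) boolean pairs, one per task.
    """
    def rate(rows):
        # Percentage of rows whose 'succeeded' flag is True.
        if not rows:
            return float("nan")
        return round(100 * sum(ok for _, ok in rows) / len(rows), 2)

    achievable = [r for r in results if r[0]]
    unachievable = [r for r in results if not r[0]]
    return {
        "SR": rate(results),
        "SR_AC": rate(achievable),
        "SR_UA": rate(unachievable),
    }


# Toy data, NOT benchmark results: four achievable tasks, two unachievable.
toy_results = [
    (True, True), (True, False), (True, False), (True, False),
    (False, True), (False, False),
]
rates = success_rates(toy_results)
print(rates)  # {'SR': 33.33, 'SR_AC': 25.0, 'SR_UA': 50.0}

# Gap between the reported human and best GPT-4 overall success rates.
gap = round(78.24 - 14.41, 2)
print(gap)  # 63.83 percentage points
```

Splitting the rate this way makes it visible when a prompt change raises SR_UA while lowering SR_AC, which a single overall number would hide.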
  • GPT-4 with chain-of-thought achieves 14.41% end-to-end task success on WebArena, far below human performance at 78.24%.
  • Baseline models show limited gains from explicit reasoning, with GPT-4 outperforming GPT-3.5 and other baselines but remaining far from human abilities.
  • The benchmark contains 812 tasks spanning four domains (e-commerce, forums, development, CMS) and auxiliary tools, designed to test long-horizon reasoning and multi-step interactions.
  • Functional correctness is evaluated via programmatic checks on intermediate states and final outcomes, allowing multiple valid execution paths per task.
  • Human performance remains robust, while models frequently misinterpret intents or fail to complete multi-step operations, highlighting the need for improved exploration and failure recovery capabilities.
  • The results underscore that current LLMs struggle with real-world, interactive web tasks, supporting WebArena as a meaningful benchmark for measuring progress.

This review was generated by AI and checked by a human editor.