QUICK REVIEW
[Paper Review] Autonomous Tester Agent Benchmark
Shuyan Zhou|arXiv (Cornell University)|Jul 25, 2023
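Topic Modeling | Citations: 21
One-line summary
WebArena provides a realistic, reproducible web environment with four domains and 812 long-horizon tasks for evaluating language-guided autonomous agents. GPT-4 achieves an end-to-end task success rate of 14.41%, far below human performance of 78.24%.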
ABSTRACT
OpenStreetMap Docker files required to self-host the WebArena benchmark, as described here: https://webarena.dev/ , https://arxiv.org/abs/2307.13854 , https://github.com/web-arena-x/webarena/tree/main/environment_docker . Copyright to OpenStreetMap: https://www.openstreetmap.org/copyright
Motivation and objectives
- Create a highly realistic, reproducible web environment for autonomous agents operating on the web.
- Cover four real-world domains (e-commerce, forums, development, CMS) with functional tools and knowledge bases.
- Provide a benchmark suite of long-horizon tasks that require functional correctness rather than surface-form action matching.
- Enable evaluation of task execution quality via programmatic correctness across diverse task types.
- Offer baseline agents using prompting strategies to establish upper/lower bounds on current capabilities.
Proposed method
- Construct a standalone, Docker-based WebArena environment with four fully functional website domains and utility tools.
- Populate sites with data drawn from real-world counterparts to preserve authenticity while ensuring reproducibility.
- Develop 812 benchmark tasks grounded in high-level natural language intents, with annotations and evaluation programs for functional correctness.
- Define a reward/evaluation framework that checks intermediate states and final outcomes, accommodating multiple valid execution paths.
- Experiment with several LLM-based baselines (e.g., GPT-4, GPT-3.5, text-bison) using prompting strategies including chain-of-thought (CoT) reasoning and unachievable-task (UA) hints.
- Represent observations with multi-tab browser-like content (URL, page content, DOM or accessibility trees) and provide an action space mirroring web interactions (click, type, navigate, etc.).
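The observation and action representation described in the last bullet can be pictured with a minimal Python sketch. All names here (`Tab`, `Observation`, `Action`, `ActionType`, `step`) are hypothetical and are not taken from the WebArena codebase; the transition function models only tab bookkeeping, not page rendering or element interaction.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class ActionType(Enum):
    # Illustrative subset of browser-like actions; not WebArena's actual list.
    CLICK = "click"
    TYPE = "type"
    GOTO = "goto"
    NEW_TAB = "new_tab"
    SWITCH_TAB = "switch_tab"


@dataclass(frozen=True)
class Tab:
    url: str
    content: str  # e.g. a serialized DOM or accessibility tree


@dataclass(frozen=True)
class Observation:
    tabs: List[Tab]
    active: int = 0  # index of the focused tab


@dataclass(frozen=True)
class Action:
    kind: ActionType
    element_id: Optional[str] = None  # target element for click/type
    text: Optional[str] = None        # typed text, URL, or tab index


def step(obs: Observation, action: Action) -> Observation:
    """Toy transition that only tracks which tabs exist and which is active."""
    tabs = list(obs.tabs)  # copy so the previous observation is untouched
    active = obs.active
    if action.kind is ActionType.NEW_TAB:
        tabs.append(Tab(url="about:blank", content=""))
        active = len(tabs) - 1
    elif action.kind is ActionType.SWITCH_TAB:
        active = int(action.text)
    elif action.kind is ActionType.GOTO:
        tabs[active] = Tab(url=action.text, content="")
    # CLICK and TYPE would need a real page model, so they are no-ops here.
    return Observation(tabs=tabs, active=active)


if __name__ == "__main__":
    obs = Observation(tabs=[Tab("http://shop.example/", "")])
    obs = step(obs, Action(ActionType.NEW_TAB))
    obs = step(obs, Action(ActionType.GOTO, text="http://forum.example/"))
    print(obs.active, [t.url for t in obs.tabs])
```

Keeping the observation as a list of tabs plus an active index is one simple way to express tasks that require consulting one site while acting on another.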
Experimental results
Research questions
- RQ1: How well can current language models understand and execute long-horizon web tasks from high-level NL intents?
- RQ2: What is the gap between state-of-the-art LLM agents and human performance on realistic, interactive web tasks?
- RQ3: How do prompting strategies (with/without chain-of-thought) and fail-stop hints affect agent performance?
- RQ4: Do tasks exhibit consistent difficulty across templates, and can memory or planning improvements close the gap to humans?
- RQ5: What evaluation framework best captures functional correctness across diverse web interactions?
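Key findings
SR is the overall task success rate; SR_AC and SR_UA are the success rates on achievable and unachievable tasks, respectively. All values are percentages.

| CoT | UA Hint | Model | SR (%) | SR_AC (%) | SR_UA (%) |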
|---|---|---|---|---|---|
| ✓ | ✓ | text-bison-001 | 5.05 | 4.00 | 27.78 |
| ✗ | ✓ | GPT-3.5 | 6.41 | 4.90 | 38.89 |
| ✓ | ✓ | GPT-3.5 | 8.75 | 6.44 | 58.33 |
| ✓ | ✓ | GPT-4 | 11.70 | 8.63 | 77.78 |
| ✗ | ✗ | GPT-3.5 | 5.10 | 4.90 | 8.33 |
| ✓ | ✗ | GPT-3.5 | 6.16 | 6.06 | 8.33 |
| ✓ | ✗ | GPT-4 | 14.41 | 13.02 | 44.44 |
| - | ✓ | Human | 78.24 | 77.30 | 100.00 |
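
As a quick arithmetic check on the table, the following Python sketch recomputes the gap between the best agent and the human baseline, and the change in SR from enabling chain-of-thought or the UA hint. The rows are copied from the table; the helper `sr` and the variable names are illustrative only.

```python
# Rows copied from the results table: (CoT, UA hint, model, SR, SR_AC, SR_UA).
# None stands for the "-" entry in the Human row.
ROWS = [
    (True, True, "text-bison-001", 5.05, 4.00, 27.78),
    (False, True, "GPT-3.5", 6.41, 4.90, 38.89),
    (True, True, "GPT-3.5", 8.75, 6.44, 58.33),
    (True, True, "GPT-4", 11.70, 8.63, 77.78),
    (False, False, "GPT-3.5", 5.10, 4.90, 8.33),
    (True, False, "GPT-3.5", 6.16, 6.06, 8.33),
    (True, False, "GPT-4", 14.41, 13.02, 44.44),
    (None, True, "Human", 78.24, 77.30, 100.00),
]


def sr(model, cot, ua_hint):
    """Look up the overall success rate for one table row."""
    for row_cot, row_hint, row_model, row_sr, _, _ in ROWS:
        if (row_model, row_cot, row_hint) == (model, cot, ua_hint):
            return row_sr
    raise KeyError((model, cot, ua_hint))


# Best-performing agent configuration by overall SR (humans excluded).
best_agent = max((r for r in ROWS if r[2] != "Human"), key=lambda r: r[3])

# Differences in percentage points, rounded to the table's precision.
human_gap = round(sr("Human", None, True) - best_agent[3], 2)
cot_gain_with_hint = round(sr("GPT-3.5", True, True) - sr("GPT-3.5", False, True), 2)
cot_gain_no_hint = round(sr("GPT-3.5", True, False) - sr("GPT-3.5", False, False), 2)
hint_effect_gpt4 = round(sr("GPT-4", True, True) - sr("GPT-4", True, False), 2)

if __name__ == "__main__":
    print("best agent:", best_agent[2], best_agent[3])
    print("gap to human (points):", human_gap)
    print("GPT-3.5 CoT gain with / without UA hint:", cot_gain_with_hint, cot_gain_no_hint)
    print("GPT-4 SR change from adding UA hint:", hint_effect_gpt4)
```

In this table, adding the UA hint raises GPT-4's SR_UA (44.44 to 77.78) but lowers its overall SR (14.41 to 11.70), because SR_AC drops from 13.02 to 8.63.
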
- GPT-4 with chain-of-thought achieves 14.41% end-to-end task success on WebArena, far below human performance at 78.24%.
- Chain-of-thought prompting yields only modest gains (GPT-3.5 improves from 5.10% to 6.16% without the UA hint and from 6.41% to 8.75% with it); GPT-4 outperforms GPT-3.5 and text-bison-001 but remains far from human performance.
- The benchmark contains 812 tasks spanning four domains (e-commerce, forums, development, CMS) and auxiliary tools, designed to test long-horizon reasoning and multi-step interactions.
- Functional correctness is evaluated via programmatic checks on intermediate states and final outcomes, allowing multiple valid execution paths per task.
- Human performance remains robust, while models frequently misinterpret intents or fail to complete multi-step operations, highlighting the need for improved exploration and failure recovery capabilities.
- The results underscore that current LLMs struggle with real-world, interactive web tasks, supporting WebArena as a meaningful benchmark for measuring progress.
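The idea of scoring functional correctness rather than matching a reference action sequence can be illustrated with a short Python sketch. This is an assumption-laden toy, not WebArena's evaluator: the state dictionary, the `mr_7_comments` key, and the functions `evaluate` and `success_rate` are all hypothetical, and only the final state is checked here.

```python
from typing import Callable, Dict, List

State = Dict[str, object]          # hypothetical snapshot of site state
Check = Callable[[State], bool]    # one programmatic correctness check


def evaluate(final_state: State, checks: List[Check]) -> float:
    """Return 1.0 if every check passes on the resulting state, else 0.0."""
    return 1.0 if all(check(final_state) for check in checks) else 0.0


def success_rate(rewards: List[float]) -> float:
    """Success rate in percent over a list of per-task rewards."""
    return 100.0 * sum(rewards) / len(rewards) if rewards else 0.0


# Hypothetical task: "Post the comment 'LGTM' on merge request 7".
checks: List[Check] = [lambda s: "LGTM" in s.get("mr_7_comments", [])]

# Two different action sequences that end in the same correct state
# receive the same reward, because only the outcome is inspected.
state_via_search = {"mr_7_comments": ["LGTM"]}
state_via_menu = {"mr_7_comments": ["please rebase", "LGTM"]}
state_wrong_target = {"mr_8_comments": ["LGTM"]}

if __name__ == "__main__":
    rewards = [
        evaluate(state_via_search, checks),
        evaluate(state_via_menu, checks),
        evaluate(state_wrong_target, checks),
    ]
    print(rewards, success_rate(rewards))
```

Because the checks run against state instead of actions, an agent is free to reach the goal through any valid path, which is the property the bullet on functional correctness describes.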
This review was generated by AI and checked by a human editor.