QUICK REVIEW

[Paper Review] Autonomous Tester Agent Benchmark

Shuyan Zhou|arXiv (Cornell University)|Jul 25, 2023

Topic Modeling|Citations: 21

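One-sentence summary

WebArena provides a realistic, reproducible web environment with four domains and 812 long-horizon tasks for evaluating language-guided autonomous agents. GPT-4 achieves an end-to-end task success rate of 14.41%, far below human performance of 78.24%.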

ABSTRACT

OpenStreetMap Docker files required to self-host the WebArena benchmark, as described here:
https://webarena.dev/
https://arxiv.org/abs/2307.13854
https://github.com/web-arena-x/webarena/tree/main/environment_docker
Copyright to OpenStreetMap: https://www.openstreetmap.org/copyright

Motivation and objectives

Create a highly realistic, reproducible web environment for autonomous agents operating on the web.
Cover four real-world domains (e-commerce, forums, development, CMS) with functional tools and knowledge bases.
Provide a benchmark suite of long-horizon tasks that require functional correctness rather than surface-form action matching.
Enable evaluation of task execution quality via programmatic correctness across diverse task types.
Offer baseline agents built on prompting strategies to establish reference points for current capabilities.

Proposed method

Construct a standalone, Docker-based WebArena environment with four fully functional website domains and utility tools.
Populate sites with data drawn from real-world counterparts to preserve authenticity while ensuring reproducibility.
Develop 812 benchmark tasks grounded in high-level natural language intents, with annotations and evaluation programs for functional correctness.
Define a reward/evaluation framework that checks intermediate states and final outcomes, accommodating multiple valid execution paths.
Experiment with several LLM-based baselines (e.g., GPT-4, GPT-3.5, text-bison) using prompting strategies including chain-of-thought (CoT) and an unachievable-task (UA) hint.
Represent observations with multi-tab browser-like content (URL, page content, DOM or accessibility trees) and provide an action space mirroring web interactions (click, type, navigate, etc.).
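
The observation and action design in the last item can be pictured with a small sketch. It is illustrative only: the names `Action`, `Observation`, and `ToyBrowser`, the element IDs, and the example URLs are hypothetical and are not the WebArena API. The page content is reduced to a flat dictionary standing in for a DOM or accessibility tree.

```python
from dataclasses import dataclass, field
from enum import Enum


class ActionType(Enum):
    # A small subset of browser-like actions; names are illustrative.
    CLICK = "click"
    TYPE = "type"
    GOTO = "goto"
    NEW_TAB = "new_tab"


@dataclass(frozen=True)
class Action:
    kind: ActionType
    element_id: str = ""  # target element for CLICK / TYPE
    text: str = ""        # typed text for TYPE, URL for GOTO / NEW_TAB


@dataclass
class Tab:
    url: str
    # Hypothetical stand-in for a page tree: element id -> current value.
    fields: dict = field(default_factory=dict)


@dataclass
class Observation:
    urls: list       # URL of every open tab
    active_tab: int  # index of the focused tab
    content: dict    # content of the focused tab


class ToyBrowser:
    """Toy multi-tab browser state; not the WebArena implementation."""

    def __init__(self, start_url):
        self.tabs = [Tab(start_url)]
        self.active = 0

    def step(self, action):
        tab = self.tabs[self.active]
        if action.kind is ActionType.GOTO:
            self.tabs[self.active] = Tab(action.text)
        elif action.kind is ActionType.NEW_TAB:
            self.tabs.append(Tab(action.text))
            self.active = len(self.tabs) - 1
        elif action.kind is ActionType.TYPE:
            tab.fields[action.element_id] = action.text
        elif action.kind is ActionType.CLICK:
            tab.fields[action.element_id] = "clicked"
        return self.observe()

    def observe(self):
        tab = self.tabs[self.active]
        return Observation([t.url for t in self.tabs], self.active, dict(tab.fields))


browser = ToyBrowser("http://shop.example/")
browser.step(Action(ActionType.TYPE, element_id="search", text="usb cable"))
obs = browser.step(Action(ActionType.NEW_TAB, text="http://map.example/"))
print(obs.urls, obs.active_tab)
```

Keeping every tab's URL in the observation is what lets an agent switch between sites (for example, a shop and a map) within one task.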

Experimental results

Research questions

RQ1: How well can current language models understand and execute long-horizon web tasks from high-level NL intents?
RQ2: What is the gap between state-of-the-art LLM agents and human performance on realistic, interactive web tasks?
RQ3: How do prompting strategies (with/without chain-of-thought) and fail-stop hints affect agent performance?
RQ4: Do tasks exhibit consistent difficulty across templates, and can memory or planning improvements close the gap to humans?
RQ5: What evaluation framework best captures functional correctness across diverse web interactions?

Key findings

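SR: overall success rate; SR_AC: success rate on achievable tasks; SR_UA: success rate on unachievable tasks.

CoT	UA Hint	Model	SR (%)	SR_AC (%)	SR_UA (%)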
✓	✓	text-bison-001	5.05	4.00	27.78
✗	✓	GPT-3.5	6.41	4.90	38.89
✓	✓	GPT-3.5	8.75	6.44	58.33
✓	✓	GPT-4	11.70	8.63	77.78
✗	✗	GPT-3.5	5.10	4.90	8.33
✓	✗	GPT-3.5	6.16	6.06	8.33
✓	✗	GPT-4	14.41	13.02	44.44
-	✓	Human	78.24	77.30	100.00
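
A few derived numbers make the table easier to read. The sketch below only re-enters the overall success rates from the table and subtracts them; the variable names are illustrative and nothing here comes from the benchmark's code.

```python
# Overall success rate (SR, %) copied from the results table.
# Key: (model, uses_cot, uses_ua_hint)
sr = {
    ("text-bison-001", True, True): 5.05,
    ("GPT-3.5", False, True): 6.41,
    ("GPT-3.5", True, True): 8.75,
    ("GPT-4", True, True): 11.70,
    ("GPT-3.5", False, False): 5.10,
    ("GPT-3.5", True, False): 6.16,
    ("GPT-4", True, False): 14.41,
}
HUMAN_SR = 78.24

# Best model configuration and its distance from the human score.
best_config = max(sr, key=sr.get)
gap_to_human = round(HUMAN_SR - sr[best_config], 2)

# Effect of chain-of-thought for GPT-3.5, holding the UA hint fixed.
cot_gain_with_hint = round(sr[("GPT-3.5", True, True)] - sr[("GPT-3.5", False, True)], 2)
cot_gain_no_hint = round(sr[("GPT-3.5", True, False)] - sr[("GPT-3.5", False, False)], 2)

# Effect of dropping the UA hint for GPT-4 (CoT is on in both rows).
gpt4_gain_without_hint = round(sr[("GPT-4", True, False)] - sr[("GPT-4", True, True)], 2)

print(best_config, gap_to_human)
print(cot_gain_with_hint, cot_gain_no_hint, gpt4_gain_without_hint)
```

On these figures the best configuration is GPT-4 with CoT and no UA hint, 63.83 points below the human score. Chain-of-thought adds 2.34 points for GPT-3.5 with the hint and 1.06 without it, and dropping the hint adds 2.71 points for GPT-4.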

GPT-4 with chain-of-thought prompting and no UA hint achieves the best model result, 14.41% end-to-end task success on WebArena, far below human performance at 78.24%.
Explicit reasoning yields only modest gains: chain-of-thought raises GPT-3.5 from 6.41% to 8.75% with the UA hint and from 5.10% to 6.16% without it. GPT-4 outperforms GPT-3.5 and text-bison-001 but remains far from human performance.
The benchmark contains 812 tasks spanning four domains (e-commerce, forums, development, CMS) and auxiliary tools, designed to test long-horizon reasoning and multi-step interactions.
Functional correctness is evaluated via programmatic checks on intermediate states and final outcomes, allowing multiple valid execution paths per task.
Human performance remains robust, while models frequently misinterpret intents or fail to complete multi-step operations, highlighting the need for improved exploration and failure recovery capabilities.
The results show that current LLMs struggle with realistic, interactive web tasks, supporting WebArena as a meaningful benchmark for measuring progress.
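
To make the difference between functional correctness and surface-form action matching concrete, here is a minimal sketch. The toy shop state, the example intent, and the `check_*` helpers are hypothetical and do not reproduce the benchmark's actual evaluation programs. The point is only that two different action sequences reaching the same final state both pass, while an incomplete one fails.

```python
def run(actions):
    """Apply a list of (name, argument) actions to a toy shop state."""
    state = {"cart": [], "page": "home"}
    for name, arg in actions:
        if name == "goto":
            state["page"] = arg
        elif name == "add_to_cart":
            state["cart"].append(arg)
        elif name == "remove_from_cart" and arg in state["cart"]:
            state["cart"].remove(arg)
    return state


def check_cart_contains_exactly(state, items):
    # Order-insensitive check on the final state only.
    return sorted(state["cart"]) == sorted(items)


def check_answer(answer, must_include):
    # Text answers pass if every required phrase appears (case-insensitive).
    return all(p.lower() in answer.lower() for p in must_include)


# Hypothetical intent: "Put one mug and one kettle in the cart."
path_a = [("goto", "search"), ("add_to_cart", "mug"), ("add_to_cart", "kettle")]
path_b = [("add_to_cart", "kettle"), ("add_to_cart", "lamp"),
          ("remove_from_cart", "lamp"), ("goto", "category"), ("add_to_cart", "mug")]
path_c = [("add_to_cart", "mug")]  # incomplete: stops after the first item

results = [check_cart_contains_exactly(run(p), ["mug", "kettle"])
           for p in (path_a, path_b, path_c)]
print(results)
```

An exact comparison of action sequences would reject `path_b` even though it fulfils the intent, which is the failure mode that state-based checks avoid.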

This review was generated by AI and checked by a human editor.