QUICK REVIEW

[Paper Review] Safe and Scalable Web Agent Learning via Recreated Websites

Hyungjoo Chae, Jungsoo Park | arXiv (Cornell University) | Mar 11, 2026

Machine Learning and Algorithms | Citations: 0

One-Line Summary

VeriEnv clones real websites into executable synthetic environments and provides verifiable task rewards, enabling safe and scalable self-evolving web agent learning.

ABSTRACT

Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments. Code and resources will be released at https://github.com/kyle8581/VeriEnv upon acceptance.

Research Motivation and Objectives

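Enable safe and scalable training of autonomous web agents without having them interact with real-world websites.
Propose a pipeline that rebuilds real sites as reproducible environments with controlled internal access.
Generate verifiable tasks and judges that provide deterministic rewards.
Demonstrate generalization to unseen sites and site-specific mastery through self-evolving training.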

Proposed Method

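Use a coding agent to clone the target site into a synthetic environment consisting of code C, a database D, and a Python SDK P.
Prompt an LLM to generate tasks together with executable verification programs written against P, so that every task is verifiable.
Run the verification predicate against the environment state at the end of each episode to produce a deterministic reward.
Train the agent inside the synthetic environments in a self-evolution loop driven by these verifiable rewards.
Evaluate generalization to unseen sites and the effect of scaling the number of environments on performance.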
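
The verification step described above can be illustrated with a minimal sketch. It is not the paper's implementation: the `ShopSDK` class, the `cart` table, the `Task` dataclass, and the example task are all hypothetical stand-ins chosen for illustration. An in-memory SQLite database plays the role of D, a small read-only wrapper plays the role of P, and the judge is an ordinary Python predicate evaluated on the final state.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


class ShopSDK:
    """Hypothetical stand-in for the Python SDK P: read-only queries over the database D."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def cart_quantity(self, user: str, product: str) -> int:
        # COALESCE turns the NULL returned by SUM over zero rows into 0.
        row = self._conn.execute(
            "SELECT COALESCE(SUM(quantity), 0) FROM cart "
            "WHERE user = ? AND product = ?",
            (user, product),
        ).fetchone()
        return int(row[0])


@dataclass
class Task:
    instruction: str                   # natural-language goal shown to the agent
    judge: Callable[[ShopSDK], bool]   # executable predicate over the final state


def make_environment() -> sqlite3.Connection:
    """Create a toy database D (assumed schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cart (user TEXT, product TEXT, quantity INTEGER)")
    return conn


def reward(task: Task, sdk: ShopSDK) -> float:
    """Deterministic reward: 1.0 if the judge holds on the final state, else 0.0."""
    return 1.0 if task.judge(sdk) else 0.0


task = Task(
    instruction="Add two units of 'USB-C cable' to alice's cart.",
    judge=lambda sdk: sdk.cart_quantity("alice", "USB-C cable") == 2,
)

conn = make_environment()
sdk = ShopSDK(conn)

reward_before = reward(task, sdk)  # nothing in the cart yet

# Stand-in for an agent episode: the site backend writes to D as the agent acts.
conn.execute("INSERT INTO cart VALUES (?, ?, ?)", ("alice", "USB-C cable", 2))

reward_after = reward(task, sdk)   # judge now inspects the updated state
print(reward_before, reward_after)
```

Because the judge reads the environment state directly instead of asking a model to grade a trajectory, running it twice on the same final state always yields the same reward, which is the property the review attributes to VeriEnv.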

Figure 1: Comparison between the traditional self-evolution paradigm and our verifiable environment framework. (a) In traditional settings, agents interact directly with real-world environments and rely on unvalidated synthetic tasks and non-verifiable, LLM-based reward signals, leading to unsafe e

Experimental Results

Research Questions

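RQ1: Can agents trained in verifiable synthetic environments generalize to unseen real-world sites?
RQ2: Does increasing the number of training environments improve web agent performance?
RQ3: Can site-specific mastery be achieved through iterative self-evolving training in cloned environments?
RQ4: How do verifiable task generation and rewards compare with LLM-based judges and non-verifiable methods?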

Key Findings

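Agents trained with VeriEnv outperform their base models on the WebArena and Mind2Web-Online benchmarks (+6.06 to +9.09 points, depending on the base model).
Agents trained with VeriEnv generalize to unseen sites and tasks on cross-domain benchmarks.
Site-specific mastery emerges through iterative training in cloned environments, and VeriEnv yields stronger and more stable gains than non-verifiable methods.
Increasing the number of training environments brings consistent improvements, indicating that environment-centric learning is effective.
Human evaluation indicates high environment quality (functional correctness of about 90%), strong visual ratings (4.7/5), task feasibility of about 90%, and judge accuracy of about 76%.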

Figure 2: Overview of VeriEnv. VeriEnv first clones a real website into a fully instrumented synthetic environment (code $C$, database $D$, and a Python SDK $P$) via coding agent, then uses task and judge generators to produce tasks at varying difficulty and verify both tasks and judges by inte


This review was generated by AI and checked by a human editor.