QUICK REVIEW

[Paper Review] Safe and Scalable Web Agent Learning via Recreated Websites

Hyungjoo Chae, Jungsoo Park|arXiv (Cornell University)|Mar 11, 2026

Machine Learning and Algorithms|Cited by 0

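One-Sentence Summary

VeriEnv clones real websites into executable synthetic environments with verifiable task rewards, enabling safe, scalable, self-evolving web agent learning without any real-world interaction.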

ABSTRACT

Training autonomous web agents is fundamentally limited by the environments they learn from: real-world websites are unsafe to explore, hard to reset, and rarely provide verifiable feedback. We propose VeriEnv, a framework that treats language models as environment creators, automatically cloning real-world websites into fully executable, verifiable synthetic environments. By exposing controlled internal access via a Python SDK, VeriEnv enables agents to self-generate tasks with deterministic, programmatically verifiable rewards, eliminating reliance on heuristic or LLM-based judges. This design decouples agent learning from unsafe real-world interaction while enabling scalable self-evolution through environment expansion. Through experiments on web agent benchmarks, we show that agents trained with VeriEnv generalize to unseen websites, achieve site-specific mastery through self-evolving training, and benefit from scaling the number of training environments. Code and resources will be released at https://github.com/kyle8581/VeriEnv upon acceptance.

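Motivation and Objectives

Enable safe, scalable learning of autonomous web agents without interacting with real websites.
Propose a pipeline that recreates real sites as executable environments with controlled access.
Generate verifiable tasks and judges that provide deterministic, checkable rewards.
Achieve generalization to unseen sites and site-specific mastery through self-evolving training.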

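Proposed Method

Use a coding agent to clone the target website into a synthetic environment (code C, database D, Python SDK P).
Create verifiable tasks by prompting large language models (LLMs) to generate tasks paired with executable verification programs implemented using P.
Run the verification predicates on the environment state at the end of each episode to produce a deterministic reward.
Train in a self-evolving loop inside the synthetic environments, using the verifiable rewards.
Evaluate generalization to unseen websites and the effect of the number of environments on performance.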
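
The verification step can be illustrated with a small sketch. This is not the VeriEnv implementation: the `ShopSDK` class, the `cart` table, the `Task` dataclass, and the `episode_reward` helper are all hypothetical names chosen for illustration, and an in-memory SQLite database stands in for the cloned site's database D. The sketch only shows the general idea of a task that carries an executable predicate, evaluated against environment state at the end of an episode to yield a deterministic reward.

```python
import sqlite3
from dataclasses import dataclass
from typing import Callable


class ShopSDK:
    """Hypothetical stand-in for the SDK P: controlled queries over database D."""

    def __init__(self, conn: sqlite3.Connection) -> None:
        self._conn = conn

    def cart_quantity(self, user: str, item: str) -> int:
        # COALESCE turns the NULL returned by SUM over zero rows into 0.
        row = self._conn.execute(
            "SELECT COALESCE(SUM(quantity), 0) FROM cart "
            "WHERE user = ? AND item = ?",
            (user, item),
        ).fetchone()
        return int(row[0])


@dataclass
class Task:
    """A task instruction paired with an executable verification predicate."""

    instruction: str
    verifier: Callable[[ShopSDK], bool]


def make_env() -> sqlite3.Connection:
    """Create a toy database (assumed schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE cart (user TEXT, item TEXT, quantity INTEGER)")
    return conn


def episode_reward(task: Task, sdk: ShopSDK) -> float:
    """Evaluate the predicate on the final state: 1.0 on success, else 0.0."""
    return 1.0 if task.verifier(sdk) else 0.0


task = Task(
    instruction="Add two units of 'blue mug' to alice's cart.",
    verifier=lambda sdk: sdk.cart_quantity("alice", "blue mug") == 2,
)

conn = make_env()
sdk = ShopSDK(conn)
reward_before = episode_reward(task, sdk)

# Stand-in for the agent's browser actions, which would change the state.
conn.execute("INSERT INTO cart VALUES (?, ?, ?)", ("alice", "blue mug", 2))
reward_after = episode_reward(task, sdk)

print(reward_before, reward_after)
```

Because the predicate reads only the database state, evaluating it twice on the same state gives the same reward, which is the property that distinguishes this style of check from a heuristic or LLM-based judge.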

Figure 1: Comparison between the traditional self-evolution paradigm and our verifiable environment framework. (a) In traditional settings, agents interact directly with real-world environments and rely on unvalidated synthetic tasks and non-verifiable, LLM-based reward signals, leading to unsafe e

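Experimental Results

Research Questions

RQ1: Can agents trained in verifiable synthetic environments generalize to unseen real websites?
RQ2: Does increasing the number of training environments improve web agent performance?
RQ3: Can repeated self-evolving training in a cloned environment yield site-specific mastery?
RQ4: How do verifiable task generation and rewards compare with LLM-based judges and non-verifiable approaches?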

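Key Findings

Agents trained with VeriEnv improve over their baseline models on the WebArena and Mind2Web-Online benchmarks (+6.06 to +9.09 points, depending on the baseline model).
VeriEnv-trained agents generalize to unseen websites and cross-domain tasks.
Repeated training in a cloned environment produces site-specific mastery, with gains that are larger and more stable than those of non-verifiable approaches.
Increasing the number of training environments yields consistent performance gains, indicating that environment-centric learning is effective.
Human evaluation indicates high environment quality (functional correctness of about 90%), a high visual score (4.7/5), task executability of about 90%, and judge correctness of about 76%.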

Figure 2: Overview of VeriEnv. VeriEnv first clones a real website into a fully instrumented synthetic environment (code $C$, database $D$, and a Python SDK $P$) via coding agent, then uses task and judge generators to produce tasks at varying difficulty and verify both tasks and judges by inte

This review was generated by AI and checked by a human editor.