QUICK REVIEW

[Paper Review] WebFactory: Automated Compression of Foundational Language Intelligence into Grounded Web Agents

Sicheng Fan, Qingyun Shi|arXiv (Cornell University)|Mar 5, 2026

Multimodal Machine Learning Applications | Cited by 0

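One-Sentence Summary

WebFactory provides a fully automated closed-loop reinforcement learning pipeline that compresses foundation-model knowledge into grounded GUI agents. It relies on high-fidelity offline web environments to achieve strong data efficiency and cross-domain generalization, and its toolchain is open-sourced.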

ABSTRACT

Current paradigms for training GUI agents are fundamentally limited by a reliance on either unsafe, non-reproducible live web interactions or costly, scarce human-crafted data and environments. We argue this focus on data volume overlooks a more critical factor: the efficiency of compressing a large language model's (LLM) latent knowledge into actionable agent behavior. We introduce WebFactory, a novel, fully automated closed-loop reinforcement learning pipeline for GUI agents, systematically compressing LLM-encoded internet intelligence into efficient, grounded actions. Our pipeline features a process of scalable environment synthesis, knowledge-aware task generation, LLM-powered trajectory collection, decomposed reward RL training, and systematic agent evaluation. Remarkably, our agent demonstrates exceptional data efficiency and generalization. Trained on synthetic data from only 10 websites within WebFactory, it achieves performance comparable to GUI agents trained on the same amount of human-annotated data from a much larger set of environments. This superior performance is consistent across our internal offline and online transfer benchmarks, where our agent also significantly outperforms the base foundation model. We further provide critical insights into the "embodiment potential" of different LLM foundations, offering a new axis for model evaluation. This work presents a scalable and cost-effective paradigm for transforming passive internet knowledge into active, grounded intelligence, marking a critical step towards general-purpose interactive agents.

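Research Motivation and Objectives

Shift the focus of embodied GUI agent research from data volume to the efficiency of intelligence compression.
Develop a fully controllable offline web environment that faithfully reproduces production sites and guarantees reproducibility.
Automate knowledge-driven task generation to produce executable, realistic tasks without human annotation.
Train GUI agents with reinforcement learning over a unified action space and a decomposed reward to improve task completion and grounding.
Evaluate generalization on internal and public GUI benchmarks, and analyze the embodiment potential of foundation models.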

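Proposed Method

Build a high-fidelity offline web environment that removes live-web noise and safety issues while exposing site knowledge and interaction logic.
Use a knowledge-driven task generator that draws on navigation graphs, page semantics, and canonical interaction flows to produce executable, verifiable tasks.
Generate large-scale trajectories by having a strong executor carry out tasks in the offline environment, then filter them for quality and correctness.
Train with reinforcement learning over a unified action space, where each action is a (type, point, text) triple, optimizing a decomposed reward that combines format validation with accuracy.
Evaluate agents through grounded script replay and task-level metrics, avoiding human review, and measure transfer to online platforms and public benchmarks.
Open-source the full toolkit, including environments, generators, the training pipeline, and evaluation tools.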
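
The unified action space and decomposed reward can be illustrated with a minimal Python sketch. It is not the authors' implementation: the summary only states that each action is a (type, point, text) triple and that the reward combines format validation with accuracy. The action vocabulary (`click`, `type`, `scroll`), the pixel tolerance `radius`, and the weights `w_format` and `w_accuracy` are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical action vocabulary; the source does not list the action types.
VALID_TYPES = {"click", "type", "scroll"}


@dataclass(frozen=True)
class Action:
    """One step in the unified action space: (type, point, text)."""
    type: str
    point: Optional[Tuple[float, float]] = None
    text: Optional[str] = None


def format_reward(action: Action) -> float:
    """Return 1.0 if the action is well-formed, else 0.0."""
    if action.type not in VALID_TYPES:
        return 0.0
    if action.type == "click" and action.point is None:
        return 0.0
    if action.type == "type" and not action.text:
        return 0.0
    return 1.0


def accuracy_reward(pred: Action, target: Action, radius: float = 20.0) -> float:
    """Return 1.0 if pred matches target in type, location, and text.

    `radius` is an assumed pixel tolerance for the point component.
    """
    if pred.type != target.type:
        return 0.0
    if target.point is not None:
        if pred.point is None or math.dist(pred.point, target.point) > radius:
            return 0.0
    if target.text is not None:
        if (pred.text or "").strip().lower() != target.text.strip().lower():
            return 0.0
    return 1.0


def decomposed_reward(pred: Action, target: Action,
                      w_format: float = 0.2, w_accuracy: float = 0.8) -> float:
    """Weighted sum of format and accuracy terms (weights are assumptions).

    Accuracy is only credited when the action is well-formed.
    """
    fmt = format_reward(pred)
    acc = accuracy_reward(pred, target) if fmt > 0 else 0.0
    return w_format * fmt + w_accuracy * acc


if __name__ == "__main__":
    target = Action("click", (100.0, 200.0))
    print(decomposed_reward(Action("click", (105.0, 203.0)), target))
    print(decomposed_reward(Action("click", (300.0, 400.0)), target))
    print(decomposed_reward(Action("hover", (100.0, 200.0)), target))
```

Keeping the two terms separate means a malformed output can be penalized independently of whether it hit the right element, which is one plausible reading of "decomposed" here.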

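Experimental Results

Research Questions

RQ1: How effectively can LLM-encoded internet intelligence be compressed into grounded, executable GUI policies in a controlled offline environment?
RQ2: How does combining knowledge-driven task generation with data-driven trajectory collection affect task executability and trajectory quality?
RQ3: How well does an agent trained only on synthetic WebFactory data transfer to real online platforms and public GUI benchmarks?
RQ4: How does the embodiment potential of different foundation models affect final GUI agent performance?
RQ5: What role do full environment observability and decomposed rewards play in achieving data-efficient, generalizable GUI agents?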

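Key Findings

Knowledge- and data-driven task generation substantially improves executability (31.3% → 86.3%) and validity (42.3% → 92.6%).
With the knowledge-driven approach, the trajectory success rate rises (42.6% → 84.3%) and the average number of steps per task falls (15.7 → 9.8).
WebFactory-3B transfers strongly from offline to online settings (average TCR 53.4%, accuracy 77.4%), performing well on sites such as Amazon, Airbnb, and Booking.
On public GUI benchmarks, WebFactory-3B reaches an SR of 84.2% on GUI-Act-Web and 66.0% type accuracy on GUI-Odyssey, indicating robust cross-domain generalization.
With GPT-5 as the foundation model for the generation pipeline, performance across diverse GUI benchmarks is best; Claude Opus 4.1 is competitive, while Claude Sonnet 4 is more variable.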
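
To put the before/after figures for task generation and trajectory collection on a common footing, the short Python sketch below computes absolute and relative changes. The numbers are copied from the findings above; the helper names `absolute_gain` and `relative_change` are hypothetical and only serve this illustration.

```python
def absolute_gain(before: float, after: float) -> float:
    """Difference after - before, rounded to one decimal place."""
    return round(after - before, 1)


def relative_change(before: float, after: float) -> float:
    """Percentage change relative to the 'before' value, one decimal place."""
    return round(100.0 * (after - before) / before, 1)


# (before, after) pairs as reported in the findings.
REPORTED = {
    "executability (%)": (31.3, 86.3),
    "validity (%)": (42.3, 92.6),
    "trajectory success rate (%)": (42.6, 84.3),
    "average steps per task": (15.7, 9.8),
}

if __name__ == "__main__":
    for name, (before, after) in REPORTED.items():
        print(f"{name}: {absolute_gain(before, after):+.1f} absolute, "
              f"{relative_change(before, after):+.1f}% relative")
```

For the percentage metrics the absolute figure is a gain in percentage points; for the step count it is a reduction in steps, roughly 38% fewer per task.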

This review was generated by AI and checked by a human editor.