QUICK REVIEW

[Paper Review] WebShop: Towards Scalable Real-World Web Interaction with Grounded Language Agents

Shunyu Yao, Howard Chen|arXiv (Cornell University)|Jul 4, 2022

Topic Modeling|Cited by 46

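One-Sentence Summary

WebShop introduces a large-scale simulated e-commerce web environment with 1.18M real-world products and 12,087 crowd-sourced instructions for studying grounded language agents through RL and imitation learning; the best model reaches a 28.7% task success rate versus 59.6% for human experts, and trained agents show sim-to-real transfer to amazon.com and ebay.com.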

ABSTRACT

Existing benchmarks for grounding language in interactive environments either lack real-world linguistic elements, or prove difficult to scale up due to substantial human involvement in the collection of data or feedback signals. To bridge this gap, we develop WebShop -- a simulated e-commerce website environment with 1.18 million real-world products and 12,087 crowd-sourced text instructions. Given a text instruction specifying a product requirement, an agent needs to navigate multiple types of webpages and issue diverse actions to find, customize, and purchase an item. WebShop provides several challenges for language grounding including understanding compositional instructions, query (re-)formulation, comprehending and acting on noisy text in webpages, and performing strategic exploration. We collect over 1,600 human demonstrations for the task, and train and evaluate a diverse range of agents using reinforcement learning, imitation learning, and pre-trained image and language models. Our best model achieves a task success rate of 29%, which outperforms rule-based heuristics (9.6%) but is far lower than human expert performance (59%). We also analyze agent and human trajectories and ablate various model components to provide insights for developing future agents with stronger language understanding and decision making abilities. Finally, we show that agents trained on WebShop exhibit non-trivial sim-to-real transfer when evaluated on amazon.com and ebay.com, indicating the potential value of WebShop in developing practical web-based agents that can operate in the wild.

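Motivation and Objectives

Provide a scalable, realistic web-based benchmark for grounding language in interactive tasks.
Incorporate real-world language, images, and a diverse action space to reflect actual web use.
Compute rewards automatically from text and product attributes to enable scalable learning.
Evaluate RL and imitation learning methods that build on pre-trained language and vision models.
Study sim-to-real transfer of agents to real e-commerce sites.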

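Proposed Method

Model the agent with a modular architecture that uses a ResNet image encoder and a Transformer text encoder.
Use an attention fusion layer to score actions in context and produce an action distribution.
Train with imitation learning on human demonstrations, then fine-tune with policy-gradient RL (IL+RL).
Build components on pre-trained language models (e.g., BART, BERT) and combine them with a fixed search oracle for generation.
Represent observations and actions in a two-mode environment (HTML and simple) to support training and sim-to-real transfer.
Define a reward function based on attribute and option matching, a price constraint, and type/text matching.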
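
To make the reward description above concrete, here is a minimal Python sketch. It assumes the reward is the fraction of goal constraints satisfied (matched attributes, matched options, and the price limit), scaled by a type-match factor between 0 and 1. The names `Goal`, `Purchase`, `reward`, and `type_factor` are hypothetical and do not come from the paper's code, and the exact formula in the paper may differ, in particular in how the type/text match is computed.

```python
from dataclasses import dataclass


@dataclass
class Goal:
    """What the instruction asks for (hypothetical container)."""
    attributes: set[str]
    options: set[str]
    max_price: float


@dataclass
class Purchase:
    """What the agent actually bought (hypothetical container)."""
    attributes: set[str]
    options: set[str]
    price: float


def reward(goal: Goal, bought: Purchase, type_factor: float = 1.0) -> float:
    """Fraction of goal constraints satisfied, scaled by a type-match factor.

    type_factor stands in for the type/text matching term and is assumed
    to lie in [0, 1]; how it is derived is not modeled here.
    """
    attr_hits = len(goal.attributes & bought.attributes)
    opt_hits = len(goal.options & bought.options)
    price_ok = 1 if bought.price <= goal.max_price else 0
    # The denominator counts every goal attribute, every goal option,
    # and one extra slot for the price constraint.
    total = len(goal.attributes) + len(goal.options) + 1
    return type_factor * (attr_hits + opt_hits + price_ok) / total


goal = Goal({"waterproof", "lightweight"}, {"black", "size 9"}, max_price=50.0)
bought = Purchase({"waterproof"}, {"black", "size 9"}, price=45.0)
print(reward(goal, bought))  # 4 of 5 constraints satisfied
```

Under these assumptions, a purchase that matches every attribute and option, stays within the price limit, and has a full type match receives a reward of 1, which is the only case that would count as a success.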

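Experimental Results

Research Questions

RQ1: Can scalable grounded language agents locate and purchase products in a realistic, large-scale web environment with diverse actions and noisy text?
RQ2: How do imitation learning and reinforcement learning compare in this web environment, and how does language pre-training affect performance?
RQ3: Without fine-tuning, how well do WebShop-trained agents transfer from simulation to real e-commerce sites such as Amazon and eBay?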

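Key Findings

The best model (IL+RL) reaches a task score of 62.4 and a 28.7% success rate on the WebShop test set.
The rule-based heuristic reaches a score of 45.6 and a 9.6% success rate, showing the value of learned methods.
Human experts reach a task score of 82.1 and a 59.6% success rate, highlighting the remaining gap for current models.
In zero-shot sim-to-real transfer, IL+RL outperforms the rule-based baseline on Amazon (score 65.9, 25% success rate) and eBay (score 62.3, 21% success rate).
Ablations show that language pre-training matters for text generation and decision making, and that choosing options accurately remains a challenge.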
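
The two metrics quoted in these findings can be sketched as follows. This assumes the task score is 100 times the average episode reward and that an episode counts as a success only when its reward is 1; this summary does not state those definitions, so they should be checked against the paper. The function names and the sample rewards are illustrative, not results from the paper.

```python
def task_score(rewards: list[float]) -> float:
    """Assumed definition: 100 x the mean per-episode reward."""
    return 100.0 * sum(rewards) / len(rewards)


def success_rate(rewards: list[float], tol: float = 1e-9) -> float:
    """Assumed definition: percentage of episodes with reward equal to 1."""
    successes = sum(1 for r in rewards if r >= 1.0 - tol)
    return 100.0 * successes / len(rewards)


# Made-up per-episode rewards, for illustration only.
episode_rewards = [1.0, 0.8, 0.5, 1.0]
print(round(task_score(episode_rewards), 1))
print(success_rate(episode_rewards))
```

Under these definitions, partial matches raise the task score but not the success rate, which would explain why the reported scores sit well above the reported success rates.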


This review was generated by AI and reviewed by a human editor.