QUICK REVIEW

[Paper Review] WebGPT: Browser-assisted question-answering with human feedback

Reiichiro Nakano, Jacob Hilton|arXiv (Cornell University)|Dec 17, 2021

Topic Modeling | Cited by 33

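One-sentence summary

GPT-3 is fine-tuned to answer long-form questions in a text-based web-browsing environment, using imitation learning and reward modeling driven by human feedback, with references collected to support each answer. In human preference evaluations, the best model's answers are preferred over those of human demonstrators 56% of the time and over the highest-voted Reddit answers 69% of the time.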

ABSTRACT

We fine-tune GPT-3 to answer long-form questions using a text-based web-browsing environment, which allows the model to search and navigate the web. By setting up the task so that it can be performed by humans, we are able to train models on the task using imitation learning, and then optimize answer quality with human feedback. To make human evaluation of factual accuracy easier, models must collect references while browsing in support of their answers. We train and evaluate our models on ELI5, a dataset of questions asked by Reddit users. Our best model is obtained by fine-tuning GPT-3 using behavior cloning, and then performing rejection sampling against a reward model trained to predict human preferences. This model's answers are preferred by humans 56% of the time to those of our human demonstrators, and 69% of the time to the highest-voted answer from Reddit.

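Research motivation and goals

Improve long-form question answering by delegating retrieval to web search and browsing while the language model synthesizes the answer.
Set up the task so that humans can perform it, which makes imitation learning from human demonstrations possible.
Improve answer quality with reward modeling and rejection sampling based on human preferences.
Require the model to collect references that support its factual claims, so that answers are easier for humans to evaluate.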

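Proposed method

Build a text-based web-browsing environment in which the model issues browser-like commands, such as searching and navigating between pages.
Fine-tune GPT-3 models (760M, 13B, and 175B parameters) with behavior cloning on human demonstrations.
Train a reward model on human comparisons to score the quality of answers together with their references.
Optimize against the reward model with reinforcement learning (PPO) and/or rejection sampling (best-of-n), which keeps the highest-scoring of n sampled answers.
Evaluate on ELI5 and TruthfulQA, comparing against human demonstrations and the highest-voted Reddit answers.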
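
The reward-modeling and rejection-sampling steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the pairwise loss assumes a Bradley-Terry-style formulation (a common choice for training on comparisons; this summary does not state the exact loss), and `sample_answer` and `reward` are hypothetical callables standing in for the behavior-cloned policy and the trained reward model.

```python
import math
from typing import Callable


def pairwise_preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Negative log-likelihood that the human-preferred answer wins.

    Assumes P(preferred wins) = sigmoid(score_preferred - score_rejected),
    i.e. a Bradley-Terry-style model (an assumption, see the lead-in).
    """
    diff = score_preferred - score_rejected
    # -log(sigmoid(diff)) == log(1 + exp(-diff)); branch to avoid overflow.
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


def best_of_n(
    question: str,
    sample_answer: Callable[[str], str],   # hypothetical: draws one answer from the policy
    reward: Callable[[str, str], float],   # hypothetical: reward model score for (question, answer)
    n: int = 64,
) -> str:
    """Rejection sampling: draw n answers and keep the one the reward model scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates = [sample_answer(question) for _ in range(n)]
    return max(candidates, key=lambda answer: reward(question, answer))
```

Best-of-n needs no further training of the policy; it spends extra compute at inference time instead, which is why it can be applied on top of either the behavior-cloned model or an RL-tuned one.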

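Experimental results

Research questions

RQ1: Can a language model achieve high-quality long-form question answering by combining retrieval/search, synthesis, and training on human preferences?
RQ2: In the web-browsing setting, do human demonstrations and comparisons yield better answers than baselines or automatic metrics would?
RQ3: How effective is rejection sampling relative to reinforcement learning for optimizing answers against the reward model?
RQ4: How truthful and informative is WebGPT on adversarial or out-of-distribution datasets such as TruthfulQA?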

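Main findings

With best-of-64 sampling, the 175B model's answers are preferred over those of the human demonstrators 56% of the time.
The same model's answers are preferred over the highest-voted Reddit answers 69% of the time (with citations removed).
All WebGPT models outperform the GPT-3 baseline on TruthfulQA's truthfulness and informativeness metrics.
Rejection sampling provides a substantial gain over behavior cloning (BC) alone; RL provides a smaller gain, and combining RL with rejection sampling adds little beyond rejection sampling.
Scaling trends show that larger models and more data improve reward-model-based preference and truthfulness metrics.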
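
Preference percentages like those above are aggregates over many pairwise human comparisons. The sketch below shows one way such a rate and a rough standard error could be computed. It is an illustration under assumptions: scoring ties as 0.5 and using a binomial approximation for the error are choices made here, not details taken from this summary, and the counts in the usage note are toy values rather than the paper's data.

```python
import math
from typing import Sequence, Tuple


def preference_rate(outcomes: Sequence[float]) -> Tuple[float, float]:
    """Return (rate, approximate standard error) for pairwise comparisons.

    Each outcome is 1.0 if the model's answer was preferred, 0.0 if the
    reference answer was preferred, and 0.5 for a tie (an assumption).
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one comparison")
    rate = sum(outcomes) / n
    # Binomial approximation; only a rough guide when ties are present.
    std_err = math.sqrt(rate * (1.0 - rate) / n)
    return rate, std_err
```

For example, `preference_rate([1.0] * 56 + [0.0] * 44)` yields a rate of 0.56 on these 100 toy comparisons; a rate above 0.5 means the model's answers were preferred more often than the reference answers.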

This review was generated by AI and reviewed by a human editor.