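[Paper Digest] Weak-to-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision
This paper studies whether a strong model finetuned on labels from a weak supervisor can generalize beyond that supervisor. It presents methods that significantly improve this weak-to-strong generalization and highlights the limitations of current approaches for aligning models beyond human level.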
Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior - for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit the full capabilities of a much stronger model? We test this using a range of pretrained language models in the GPT-4 family on natural language processing (NLP), chess, and reward modeling tasks. We find that when we naively finetune strong pretrained models on labels generated by a weak model, they consistently perform better than their weak supervisors, a phenomenon we call weak-to-strong generalization. However, we are still far from recovering the full capabilities of strong models with naive finetuning alone, suggesting that techniques like RLHF may scale poorly to superhuman models without further work. We find that simple methods can often significantly improve weak-to-strong generalization: for example, when finetuning GPT-4 with a GPT-2-level supervisor and an auxiliary confidence loss, we can recover close to GPT-3.5-level performance on NLP tasks. Our results suggest that it is feasible to make empirical progress today on a fundamental challenge of aligning superhuman models.
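Research motivation and goals
- Investigate whether weak supervision can elicit the full capabilities of a much stronger model.
- Quantify how far a strong model trained on weak labels exceeds its weak supervisor.
- Identify methods that improve weak-to-strong generalization (e.g., an auxiliary loss, bootstrapping, unsupervised finetuning).
- Assess the limitations of naive supervision and of scaling RLHF to superhuman models.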
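Proposed method
- Create weak supervisors by finetuning small models on ground-truth labels.
- Finetune a strong student model on the weak labels and measure its weak-to-strong performance.
- Compare against a strong baseline (the strong ceiling) obtained by finetuning the strong model on ground-truth labels.
- Introduce and evaluate simple improvements (auxiliary confidence loss, bootstrapping, unsupervised finetuning).
- Define and compute the performance gap recovered (PGR) to quantify how much of the strong model's potential is recovered.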
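PGR can be read as the fraction of the gap between the weak supervisor and the strong ceiling that the weakly supervised student closes. The sketch below shows that arithmetic. The function name and the accuracy values are hypothetical and chosen only for illustration; they are not results from the paper.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the weak-to-ceiling gap closed by the weakly supervised student.

    weak:           score of the weak supervisor
    weak_to_strong: score of the strong student trained on weak labels
    strong_ceiling: score of the strong model trained on ground-truth labels
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        # PGR is only meaningful when the ceiling actually beats the weak supervisor.
        raise ValueError("strong_ceiling must exceed weak performance")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, for illustration only.
pgr = performance_gap_recovered(weak=0.60, weak_to_strong=0.72, strong_ceiling=0.80)
print(f"PGR = {pgr:.2f}")
```

A PGR of 0 means the student does no better than its weak supervisor, and a PGR of 1 means it matches the strong ceiling.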
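Experimental results
Research questions
- RQ1: Across NLP, chess, and reward modeling tasks, do strong models trained with weak supervision outperform their weak supervisors?
- RQ2: To what extent does naive weak supervision recover the strong model's capabilities (PGR)?
- RQ3: Which simple techniques significantly improve weak-to-strong generalization?
- RQ4: What are the limitations of naive RLHF-like supervision when scaling to superhuman models?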
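Key findings
- Strong pretrained models trained with weak supervision consistently outperform their weak supervisors (weak-to-strong generalization).
- Naive finetuning on weak labels typically recovers part of the gap: NLP tasks show substantial gains, while reward modeling shows limited gains.
- With naive methods alone, weak-to-strong generalization remains far below the strong ceiling, indicating that RLHF-like supervision may be hard to scale to superhuman models.
- The auxiliary confidence loss significantly improves generalization on NLP tasks, recovering a large share of the gap (close to 80% in some NLP settings).
- In some settings, bootstrapping through intermediate model sizes improves generalization across larger gaps (especially in chess), but it is not universally effective.
- Unsupervised generative finetuning can help reward modeling but does not fully close the gap; results are task-dependent.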
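To make the auxiliary confidence loss more concrete, here is a minimal sketch for a single binary example. It assumes the loss mixes two cross-entropy terms: one against the weak label, and one against the student's own hardened prediction. The fixed `alpha` and `threshold` values and the function names are simplifying assumptions for illustration, and the paper's exact formulation and schedule may differ.

```python
import math


def binary_cross_entropy(target: float, pred: float, eps: float = 1e-12) -> float:
    """Cross-entropy between a binary target and a predicted probability."""
    pred = min(max(pred, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(target * math.log(pred) + (1.0 - target) * math.log(1.0 - pred))


def confidence_loss(student_prob: float, weak_label: float,
                    alpha: float = 0.5, threshold: float = 0.5) -> float:
    """Mix of weak-label loss and a self-training term (illustrative sketch).

    alpha and threshold are fixed here as an assumption; in a real training
    loop the hardened target would also be treated as a constant (no gradient).
    """
    hardened = 1.0 if student_prob > threshold else 0.0  # student's own hard guess
    weak_term = binary_cross_entropy(weak_label, student_prob)
    self_term = binary_cross_entropy(hardened, student_prob)
    return (1.0 - alpha) * weak_term + alpha * self_term


# The student confidently disagrees with the weak label (hypothetical values).
naive = confidence_loss(student_prob=0.9, weak_label=0.0, alpha=0.0)
mixed = confidence_loss(student_prob=0.9, weak_label=0.0, alpha=0.5)
print(f"naive loss = {naive:.3f}, with confidence term = {mixed:.3f}")
```

With `alpha=0` this reduces to naive finetuning on the weak label. A larger `alpha` lowers the penalty when the student confidently disagrees with the weak label, which is the intuition behind letting the strong model trust its own predictions over its supervisor's mistakes.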
This digest was generated by AI and reviewed by a human editor.