QUICK REVIEW

[Paper Explained] Understanding the Effects of RLHF on LLM Generalisation and Diversity

Robert Kirk, Ishita Mediratta|arXiv (Cornell University)|Oct 10, 2023
Natural Language Processing Techniques | Cited by 13
One-sentence summary

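This paper analyses how supervised fine-tuning (SFT), reward modelling (RM), and reinforcement learning from human feedback (RLHF) affect out-of-distribution generalisation and output diversity, revealing a tradeoff: RLHF improves generalisation but reduces diversity.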

ABSTRACT

Large language models (LLMs) fine-tuned with reinforcement learning from human feedback (RLHF) have been used in some of the most widely deployed AI models to date, such as OpenAI's ChatGPT or Anthropic's Claude. While there has been significant work developing these methods, our understanding of the benefits and downsides of each stage in RLHF is still limited. To fill this gap, we present an extensive analysis of how each stage of the process (i.e. supervised fine-tuning (SFT), reward modelling, and RLHF) affects two key properties: out-of-distribution (OOD) generalisation and output diversity. OOD generalisation is crucial given the wide range of real-world scenarios in which these models are being used, while output diversity refers to the model's ability to generate varied outputs and is important for a variety of use cases. We perform our analysis across two base models on both summarisation and instruction following tasks, the latter being highly relevant for current LLM use cases. We find that RLHF generalises better than SFT to new inputs, particularly as the distribution shift between train and test becomes larger. However, RLHF significantly reduces output diversity compared to SFT across a variety of measures, implying a tradeoff in current LLM fine-tuning methods between generalisation and diversity. Our results provide guidance on which fine-tuning method should be used depending on the application, and show that more research is needed to improve the tradeoff between generalisation and diversity.

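Motivation and objectives

  • Evaluate how SFT, RM, and RLHF affect in-distribution (ID) performance, out-of-distribution (OOD) generalisation, and output diversity.
  • Quantify diversity with several metrics in both per-input and across-input settings.
  • Determine whether Best-of-N (BoN) sampling or other stages explain the differences between RLHF and SFT.
  • Evaluate results on robust OOD test sets for summarisation and instruction-following tasks.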

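Proposed method

  • Fine-tune a LLaMa 7B base model with three techniques: SFT, reward modelling (RM), and RLHF.
  • Train the RM to predict human preferences between pairs of outputs; in RLHF, optimise against the RM with PPO and a KL penalty that keeps the policy close to the SFT model.
  • Evaluate BoN sampling as a reference point to separate the effect of the RM from that of the optimisation.
  • Use GPT-4 as a simulated human evaluator to measure ID and OOD performance (PvR) on summarisation and instruction-following tasks.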
  • Measure output diversity with distinct n-grams (EAD), Sentence-BERT cosine similarity, and NLI diversity, in both per-input and across-input settings.
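
The difference between per-input and across-input diversity can be illustrated with a simplified distinct n-gram ratio. This is a minimal sketch under stated assumptions, not the paper's implementation: it omits the expectation adjustment that EAD applies, it tokenises on whitespace, and the function names `distinct_n` and `mean_distinct` are hypothetical.

```python
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(texts: List[str], n: int) -> float:
    """Fraction of n-grams in a set of texts that are unique.

    Plain distinct-n ratio; the expectation adjustment used by EAD
    is deliberately left out of this sketch.
    """
    pooled: List[Tuple[str, ...]] = []
    for text in texts:
        # Assumption: lowercase whitespace tokenisation.
        pooled.extend(ngrams(text.lower().split(), n))
    if not pooled:
        return 0.0
    return len(set(pooled)) / len(pooled)


def mean_distinct(texts: List[str], max_n: int = 3) -> float:
    """Average the distinct-n ratio over n = 1..max_n."""
    return sum(distinct_n(texts, n) for n in range(1, max_n + 1)) / max_n
```

For a per-input score, `mean_distinct` would be called on several outputs sampled for the same input and then averaged over inputs. For an across-input score, it would be called once on a set containing one output per input.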

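Experimental results

Research questions

  • RQ1: How do SFT, RM, and RLHF each contribute to generalisation to out-of-distribution inputs?
  • RQ2: Compared with SFT, how does RLHF affect output diversity across tasks?
  • RQ3: Does Best-of-N sampling replicate the benefits of RM-driven RLHF, or does it reveal different dynamics?
  • RQ4: What is the tradeoff between generalisation and diversity under RLHF on summarisation and instruction-following tasks?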

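Key findings

  • RLHF outperforms SFT in distribution and, especially, out of distribution.
  • Compared with SFT, RLHF substantially reduces output diversity on per-input metrics; across-input diversity also falls, though less sharply.
  • Best-of-N can outperform RLHF in some settings, but its gains depend on how well the base model generalises; BoN is also more expensive at inference time.
  • The KL penalty does not improve the diversity-generalisation tradeoff; increasing the KL penalty tends to reduce both performance and per-input diversity.
  • Across tasks, RLHF's relative OOD advantage is more pronounced under harder distribution shifts (especially for instruction following).
  • There is evidence of mode collapse across inputs under RLHF, indicating reduced diversity between inputs.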
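
The inference cost of Best-of-N follows from how it works: each query requires N generations plus N reward-model scores. The sketch below illustrates this under stated assumptions. `generate` and `reward` stand in for a policy and a reward model, and `make_toy_policy` and `toy_reward` are hypothetical helpers for demonstration only, not anything from the paper.

```python
from typing import Callable, Dict, List, Tuple


def best_of_n(
    prompt: str,
    generate: Callable[[str], str],
    reward: Callable[[str, str], float],
    n: int = 16,
) -> str:
    """Sample n candidates and return the one the reward function scores highest."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # n calls to the policy: this is where the extra inference cost comes from.
    candidates = [generate(prompt) for _ in range(n)]
    scores = [reward(prompt, c) for c in candidates]
    best_index = max(range(n), key=scores.__getitem__)
    return candidates[best_index]


def make_toy_policy(outputs: List[str]) -> Tuple[Callable[[str], str], Dict[str, int]]:
    """Hypothetical stand-in for a policy: cycles through fixed outputs and counts calls."""
    state = {"calls": 0}

    def generate(prompt: str) -> str:
        out = outputs[state["calls"] % len(outputs)]
        state["calls"] += 1
        return out

    return generate, state


def toy_reward(prompt: str, output: str) -> float:
    """Hypothetical stand-in for a reward model: prefers longer outputs."""
    return float(len(output.split()))
```

With a real policy, `generate` would sample from the SFT model and `reward` would call the trained RM, so the cost per query grows linearly with `n`.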


This explanation was generated by AI and reviewed by human editors.