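[Paper Explained] RAFT: Reward rAnked FineTuning for Generative Foundation Model Alignment
RAFT is a framework that samples multiple outputs for each prompt, ranks them with a reward model, and fine-tunes on the top-ranked samples, offering better stability and efficiency than PPO-based RLHF.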
Generative foundation models are susceptible to implicit biases that can arise from extensive unsupervised training data. Such biases can produce suboptimal samples, skewed outcomes, and unfairness, with potentially serious consequences. Consequently, aligning these models with human ethics and preferences is an essential step toward ensuring their responsible and effective deployment in real-world applications. Prior research has primarily employed Reinforcement Learning from Human Feedback (RLHF) to address this problem, where generative models are fine-tuned with RL algorithms guided by a human-feedback-informed reward model. However, the inefficiencies and instabilities associated with RL algorithms frequently present substantial obstacles to successful alignment, necessitating the development of a more robust and streamlined approach. To this end, we introduce a new framework, Reward rAnked FineTuning (RAFT), designed to align generative models effectively. Utilizing a reward model and a sufficient number of samples, our approach selects the high-quality samples, discards those that exhibit undesired behavior, and subsequently enhances the model by fine-tuning on these filtered samples. Our studies show that RAFT can effectively improve model performance in both reward learning and other automated metrics, for both large language models and diffusion models.
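Research Motivation and Objectives
- Advance the alignment of generative foundation models with human preferences and ethics
- Show the limitations of RLHF/PPO in stability, memory, and data requirements
- Propose RAFT as a robust alternative that combines reward-based sample ranking with supervised fine-tuning
- Demonstrate RAFT's applicability to large language models and diffusion-model-like systems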
- Quantify RAFT performance against baselines on standard alignment benchmarks
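Proposed Method
- Iteratively collect a batch of prompts and generate K responses per prompt with the current model
- Rank the K responses for each prompt with a reward model and select the highest-scoring sample
- Fine-tune the model on the filtered high-reward samples, then repeat these three steps until convergence
- Decouple data collection from model updates to improve stability and reduce the memory burden
- Optionally add fluency/diversity regularization through a KL penalty that limits drift from the initial model
- Provide hyperparameter guidance (b, K, lambda, beta) and discuss implementation considerations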
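The loop described above can be sketched on a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the "model" is a single categorical distribution over five candidate responses (no real prompts or language model), `reward_fn` is a made-up reward, `b` is taken to be the number of prompts per batch and `K` the number of samples per prompt, and the `beta` term uses an assumed per-sample form of the KL penalty, `reward - beta * log(pi / pi_0)`. All function names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def raft_step(logits, init_logits, reward_fn, rng, b=8, K=4, lr=0.5, beta=0.0):
    """One RAFT-style iteration on a toy categorical policy (illustrative only)."""
    probs = softmax(logits)
    init_probs = softmax(init_logits)
    actions = list(range(len(logits)))

    def score(a):
        # Assumed KL-style penalty: discourage drifting from the initial policy.
        return reward_fn(a) - beta * (math.log(probs[a]) - math.log(init_probs[a]))

    # Steps 1 and 2: for each of b prompts, sample K responses and keep the best.
    selected = []
    for _ in range(b):
        candidates = rng.choices(actions, weights=probs, k=K)
        selected.append(max(candidates, key=score))

    # Step 3: supervised fine-tuning on the kept samples. For a categorical
    # policy, the gradient of the mean log-likelihood with respect to the
    # logits is (empirical frequency - current probability).
    new_logits = list(logits)
    for a in actions:
        freq = selected.count(a) / b
        new_logits[a] += lr * (freq - probs[a])
    return new_logits

def expected_reward(logits, reward_fn):
    return sum(p * reward_fn(a) for a, p in enumerate(softmax(logits)))

rng = random.Random(0)
reward = lambda a: float(a)  # hypothetical reward: a higher index is a better response
init_logits = [0.0] * 5
logits = list(init_logits)
before = expected_reward(logits, reward)
for _ in range(30):
    logits = raft_step(logits, init_logits, reward, rng)
after = expected_reward(logits, reward)
print(f"expected reward before: {before:.3f}, after: {after:.3f}")
```

Sampling and selection finish before any parameter update happens, which is the decoupling the list refers to: the update itself is ordinary supervised learning on a fixed set of samples.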
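Experimental Results
Research Questions
- RQ1: Can RAFT match the alignment performance of PPO-based RLHF while improving stability and reducing memory requirements?
- RQ2: How do RAFT's main hyperparameters (K, lambda, beta) affect reward, perplexity, and diversity metrics?
- RQ3: Is RAFT robust to reward noise and reward scaling, and does ranking-based filtering help mitigate reward hacking?
- RQ4: Can RAFT be extended beyond LLMs to diffusion-model-style generators?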
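One reason ranking-based filtering might be insensitive to reward scaling (RQ3) is that picking the top sample depends only on the order of the rewards, not on their magnitude. The sketch below illustrates that property for the selection step alone; the candidate strings, reward values, and the `select_best` helper are made up for the example, and it says nothing about robustness to noise.

```python
def select_best(candidates, rewards):
    """Return the candidate with the highest reward (first one on ties)."""
    best_index = max(range(len(candidates)), key=lambda i: rewards[i])
    return candidates[best_index]

candidates = ["reply A", "reply B", "reply C", "reply D"]
rewards = [0.2, -1.3, 1.7, 0.9]

# A positive affine rescaling of the reward: r -> 10 * r + 3.
rescaled = [10.0 * r + 3.0 for r in rewards]

kept_original = select_best(candidates, rewards)
kept_rescaled = select_best(candidates, rescaled)
print(kept_original, kept_rescaled)
```

Both calls keep the same reply, so the fine-tuning data would be identical under either reward scale. A method whose update is weighted by the reward value itself would, by contrast, see its effective step size change.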
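Key Findings
- On the HH-RLHF dataset, RAFT-aligned models achieve a higher mean reward than both the initial SFT model and the PPO baseline
- RAFT-K32 with lambda 1.0 reaches the highest mean reward (2.294) while keeping perplexity moderate (4.031)
- RAFT strikes a better balance between reward and perplexity than PPO in the reported experiments
- Increasing K tends to improve best-of-K performance and diversity metrics, at the cost of more inference time
- Compared with PPO, RAFT is stable across hyperparameter settings and shows robustness to reward scaling and noise
- GPT-4 and human evaluations agree with the automatic metrics, preferring RAFT-aligned models in pairwise comparisons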
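The trade-off around K can be made concrete with a small simulation. It assumes, purely for illustration, that the rewards of sampled responses follow a standard normal distribution; it does not reproduce any number from the paper, and `mean_best_of_k` is a hypothetical helper.

```python
import random
import statistics

def mean_best_of_k(K, trials=2000, seed=0):
    """Monte Carlo estimate of the expected best reward among K samples."""
    rng = random.Random(seed)
    return statistics.fmean(
        max(rng.gauss(0.0, 1.0) for _ in range(K)) for _ in range(trials)
    )

estimates = {K: mean_best_of_k(K) for K in (1, 8, 32)}
for K, value in estimates.items():
    # Each prompt needs K generations, so sampling cost grows linearly with K.
    print(f"K={K:>2}  mean best-of-K reward ~ {value:.2f}  generations per prompt = {K}")
```

Under this assumption the best-of-K reward keeps rising with K, but with diminishing returns, while the generation cost grows in direct proportion to K.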
This explainer was generated by AI and reviewed by a human editor.