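[Paper Digest] Self-Rewarding Language Models
This paper proposes self-rewarding language models: through iterative training, the model generates and evaluates its own instruction-following data, using LLM-as-a-Judge prompting and Direct Preference Optimization, and achieves better instruction following and reward modeling over multiple iterations.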
We posit that to achieve superhuman agents, future models require superhuman feedback in order to provide an adequate training signal. Current approaches commonly train reward models from human preferences, which may then be bottlenecked by human performance level, and secondly these separate frozen reward models cannot then learn to improve during LLM training. In this work, we study Self-Rewarding Language Models, where the language model itself is used via LLM-as-a-Judge prompting to provide its own rewards during training. We show that during Iterative DPO training that not only does instruction following ability improve, but also the ability to provide high-quality rewards to itself. Fine-tuning Llama 2 70B on three iterations of our approach yields a model that outperforms many existing systems on the AlpacaEval 2.0 leaderboard, including Claude 2, Gemini Pro, and GPT-4 0613. While there is much left still to explore, this work opens the door to the possibility of models that can continually improve in both axes.
Research Motivation and Goals
- Motivate and develop a training signal for LLMs that does not rely on fixed human-derived reward models.
- Enable a single model to both perform instruction following and generate/evaluate its own training data.
- Demonstrate iterative improvement through AI Feedback training and Direct Preference Optimization.
- Assess how self-generated rewards impact instruction quality and reward modeling accuracy.
Proposed Method
- Define a two-skill model that can follow instructions and create/evaluate new instruction-following data.
- Use Iterative Direct Preference Optimization (Iterative DPO), where each iteration augments the training set with AI Feedback Training (AIFT) data generated by the current model.
- Implement LLM-as-a-Judge prompting to assign rewards to candidate responses and construct winning/losing pairs for training.
- Start from a seed model fine-tuned on Open Assistant data, then perform multiple self-generated training rounds.
- Evaluate instruction following and reward modeling via head-to-head prompts, AlpacaEval 2.0 leaderboard, MT-Bench, and NLP benchmarks.
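The pair-construction and DPO steps listed above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the `judge` callable (a stand-in for an LLM-as-a-Judge call), the best-versus-worst selection rule, the decision to skip prompts whose candidates all tie, and the default `beta` are assumptions made for the example. The loss itself is the standard DPO objective computed from sequence log-probabilities.

```python
import math
from typing import Callable, Dict, List, Optional


def build_preference_pair(
    prompt: str,
    candidates: List[str],
    judge: Callable[[str, str], float],
) -> Optional[Dict[str, str]]:
    """Score every candidate with the judge and form one (chosen, rejected) pair.

    `judge(prompt, response)` is a hypothetical stand-in for an
    LLM-as-a-Judge call that returns a numeric reward.
    Returns None when all candidates receive the same score, because such a
    prompt carries no preference signal (an assumption of this sketch).
    """
    scored = [(judge(prompt, response), response) for response in candidates]
    best_score, best = max(scored, key=lambda item: item[0])
    worst_score, worst = min(scored, key=lambda item: item[0])
    if best_score == worst_score:
        return None
    return {"prompt": prompt, "chosen": best, "rejected": worst}


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # illustrative value, not taken from the paper
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(margin)).

    margin = beta * [(policy - reference) log-ratio of the chosen response
                     minus the same log-ratio of the rejected response].
    """
    margin = beta * (
        (policy_logp_chosen - ref_logp_chosen)
        - (policy_logp_rejected - ref_logp_rejected)
    )
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In an iterative setup, pairs built from the current model's own candidates and judgments would form the preference data for the next DPO round, with the current model serving as the reference policy.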
Experimental Results
Research Questions
- RQ1: Can a language model improve its own reward modeling ability by self-generating and self-evaluating training data?
- RQ2: Does iterative self-alignment yield measurable gains in instruction following compared to a seed or traditional SFT baseline?
- RQ3: How does self-rewarding training impact alignment with human preferences and external evaluation metrics?
- RQ4: What are the limits and domain-specific strengths/weaknesses of self-rewarding LLMs across benchmarks?
Key Findings
- Iterative self-rewarding training yields progressive gains in instruction following over iterations (M1 to M3).
- In head-to-head evaluations, M1 performs comparably to the SFT baseline, while M2 and M3 outperform both the earlier iterations and the SFT baseline.
- On AlpacaEval 2.0, Iteration 3 (M3) achieves a 20.44% win rate against GPT-4 Turbo, outperforming Claude 2, Gemini Pro, and GPT-4 0613 on this leaderboard, including several models trained on proprietary data.
- Reward modeling ability improves with each iteration, with pairwise accuracy rising from 65.1% (SFT) to 78.7% (M1), 80.4% (M2), and 81.7% (M3).
- Adding Evaluation Fine-Tuning (EFT) data to the Instruction Fine-Tuning (IFT) data improves the model's agreement with human judgments, raising pairwise accuracy from 65.1% to 78.7%.
- MT-Bench scores improve across iterations (overall 6.85→7.25), with larger gains in humanities, STEM, and writing categories.
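To make the pairwise accuracy figures above concrete, here is a small sketch of how such a metric can be computed: for every pair of responses that humans rank differently, check whether the model's scores order them the same way. The function name and the tie handling (human ties are skipped, model ties count as misses) are assumptions for this illustration; the paper's exact protocol may differ.

```python
from itertools import combinations
from typing import Sequence


def pairwise_accuracy(
    human_scores: Sequence[float],
    model_scores: Sequence[float],
) -> float:
    """Fraction of response pairs ordered the same way by model and humans.

    Assumptions of this sketch: pairs tied under the human scores are
    skipped, and a tie in the model scores counts as a disagreement.
    Returns NaN when no comparable pair exists.
    """
    if len(human_scores) != len(model_scores):
        raise ValueError("score lists must have the same length")
    agree = 0
    total = 0
    for i, j in combinations(range(len(human_scores)), 2):
        human_diff = human_scores[i] - human_scores[j]
        if human_diff == 0:
            continue  # no human preference for this pair
        model_diff = model_scores[i] - model_scores[j]
        total += 1
        if human_diff * model_diff > 0:  # same sign means same ordering
            agree += 1
    return agree / total if total else float("nan")
```

For example, with human scores `[1, 2, 3]` and model scores `[1, 3, 2]`, two of the three pairs are ordered consistently, so the function returns 2/3.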
This digest was generated by AI and reviewed by human editors.