QUICK REVIEW

[Paper Review] Training language models to follow instructions with human feedback

Long Ouyang, Jeff Wu | arXiv (Cornell University) | Mar 4, 2022

Topic Modeling | Cited by 4,260

One-line summary

This paper shows that fine-tuning GPT-3 on human demonstrations and preferences (RLHF) yields InstructGPT, which outperforms GPT-3 baselines with far fewer parameters, improves truthfulness, and reduces toxicity across a broad set of tasks.

ABSTRACT

Making language models bigger does not inherently make them better at following a user's intent. For example, large language models can generate outputs that are untruthful, toxic, or simply not helpful to the user. In other words, these models are not aligned with their users. In this paper, we show an avenue for aligning language models with user intent on a wide range of tasks by fine-tuning with human feedback. Starting with a set of labeler-written prompts and prompts submitted through the OpenAI API, we collect a dataset of labeler demonstrations of the desired model behavior, which we use to fine-tune GPT-3 using supervised learning. We then collect a dataset of rankings of model outputs, which we use to further fine-tune this supervised model using reinforcement learning from human feedback. We call the resulting models InstructGPT. In human evaluations on our prompt distribution, outputs from the 1.3B parameter InstructGPT model are preferred to outputs from the 175B GPT-3, despite having 100x fewer parameters. Moreover, InstructGPT models show improvements in truthfulness and reductions in toxic output generation while having minimal performance regressions on public NLP datasets. Even though InstructGPT still makes simple mistakes, our results show that fine-tuning with human feedback is a promising direction for aligning language models with human intent.

Research motivation and objectives

Demonstrate that fine-tuning large language models with human feedback can align them to user instructions across diverse tasks.
Show that smaller, instruction-tuned models can outperform much larger baselines on instruction-following prompts.
Evaluate changes in truthfulness, toxicity, and adherence to explicit constraints when using RLHF.
Assess generalization to held-out labelers and to real API prompts beyond training data.

Proposed method

Collect demonstrated outputs from labelers to train a supervised fine-tuning (SFT) model.
Collect labeler rankings of model outputs and use the resulting pairwise preferences to train a reward model (RM).
Use the RM's score as the reward to fine-tune the SFT policy with Proximal Policy Optimization (PPO).
Mix PPO updates with pretraining gradients (PPO-ptx) to reduce performance regressions on public NLP datasets.
Evaluate with human labeler preferences on held-out API prompts and with automatic metrics on public NLP datasets; compare against GPT-3 and against GPT-3 fine-tuned on the FLAN and T0 datasets.
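
The reward-model and PPO-ptx steps listed above can be sketched numerically. This is a minimal illustration, not the authors' implementation. It assumes the reward model is trained with a pairwise logistic loss over every pair drawn from one labeler ranking, and that the PPO-ptx objective combines the RM score, a log-ratio penalty against the SFT policy, and a pretraining log-likelihood term. The function names and the coefficients `beta` and `gamma` are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_rm_loss(scores_best_first: Sequence[float]) -> float:
    """Average -log sigmoid(r_w - r_l) over all pairs in one ranking.

    `scores_best_first` holds the reward model's scalar scores for the
    outputs of a single prompt, ordered from most to least preferred
    by the labeler (assumed input format).
    """
    # combinations() preserves order, so each pair is (preferred, rejected).
    pairs = list(combinations(scores_best_first, 2))
    if not pairs:
        raise ValueError("need at least two ranked outputs")
    return -sum(log_sigmoid(w - l) for w, l in pairs) / len(pairs)


def ppo_ptx_objective(
    rm_score: float,
    logp_policy: float,
    logp_sft: float,
    pretrain_logp: float,
    beta: float = 0.02,   # hypothetical KL-penalty coefficient
    gamma: float = 1.0,   # hypothetical pretraining-mix coefficient
) -> float:
    """Per-sample objective to maximize: RM score, minus a penalty for
    drifting from the SFT policy, plus a pretraining log-likelihood term.
    Setting gamma=0 leaves only the penalized RM reward (plain PPO)."""
    kl_penalty = beta * (logp_policy - logp_sft)
    return rm_score - kl_penalty + gamma * pretrain_logp


if __name__ == "__main__":
    # Scores that agree with the labeler ranking give a lower loss
    # than scores that contradict it.
    print(pairwise_rm_loss([2.0, 1.0, 0.0]))
    print(pairwise_rm_loss([0.0, 1.0, 2.0]))
    print(ppo_ptx_objective(1.0, -1.0, -2.0, -3.0, beta=0.5))
```

In this sketch the loss falls as the reward model's scores agree more strongly with the labeler's ordering. The `gamma` term stands in for the pretraining-gradient mixing that the method uses to limit regressions on public NLP datasets.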

Experimental results

Research questions

RQ1: Can RLHF fine-tuning align a language model to follow user instructions across a broad task distribution?
RQ2: Do smaller InstructGPT models outperform the much larger GPT-3 baselines on instruction-following prompts?
RQ3: How does RLHF affect truthfulness, toxicity, and adherence to explicit constraints compared to baseline models?
RQ4: Do InstructGPT models generalize to held-out labelers and prompts outside the training distribution?
RQ5: What trade-offs (alignment tax) arise between alignment gains and performance on standard NLP benchmarks?

Key findings

In human evaluations, InstructGPT outputs are preferred to GPT-3 outputs; the 1.3B InstructGPT is preferred over the 175B GPT-3 on the API prompt distribution.
InstructGPT models give truthful and informative answers on TruthfulQA about twice as often as GPT-3, and hallucinate about half as often on closed-domain tasks (21% vs. 41%).
InstructGPT reduces toxic outputs by about 25% compared with GPT-3 when prompted to be respectful.
There are small but measurable performance regressions on some public NLP datasets (SQuAD, DROP, HellaSwag, WMT 2015 FR→EN) after alignment, which can be mitigated by mixing PPO with pretraining gradients (PPO-ptx).
That a 1.3B parameter model beats the 175B GPT-3 on instruction-following prompts indicates that alignment can matter more than sheer scale for this kind of task.
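
The headline figures above lend themselves to a quick arithmetic check. The sketch below uses only numbers quoted in this review (1.3B vs. 175B parameters, 21% vs. 41% hallucination rate); the variable names are illustrative.

```python
# Parameter counts as quoted in the review.
instructgpt_params = 1.3e9
gpt3_params = 175e9
param_ratio = gpt3_params / instructgpt_params

# Hallucination rates as quoted in the review.
halluc_instructgpt = 0.21
halluc_gpt3 = 0.41
relative_reduction = (halluc_gpt3 - halluc_instructgpt) / halluc_gpt3

print(f"GPT-3 is about {param_ratio:.0f}x larger than the 1.3B InstructGPT")
print(f"Relative reduction in hallucination rate: {relative_reduction:.0%}")
```

The ratio comes out near 135x, which matches the abstract's rounded "100x fewer parameters" in order of magnitude. The relative reduction is about 49%, which is what "roughly half" refers to.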

This review was generated by AI and checked by a human editor.