[Paper Review] Fine-Grained Human Feedback Gives Better Rewards for Language Model Training
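This paper proposes Fine-Grained RLHF, which trains language models with dense, category-specific rewards from multiple reward models and improves detoxification and long-form QA performance over holistic rewards.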
Language models (LMs) often exhibit undesirable text generation behaviors, including generating false, toxic, or irrelevant outputs. Reinforcement learning from human feedback (RLHF) - where human preference judgments on LM outputs are transformed into a learning signal - has recently shown promise in addressing these issues. However, such holistic feedback conveys limited information on long text outputs; it does not indicate which aspects of the outputs influenced user preference; e.g., which parts contain what type(s) of errors. In this paper, we use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal. We introduce Fine-Grained RLHF, a framework that enables training and learning from reward functions that are fine-grained in two respects: (1) density, providing a reward after every segment (e.g., a sentence) is generated; and (2) incorporating multiple reward models associated with different feedback types (e.g., factual incorrectness, irrelevance, and information incompleteness). We conduct experiments on detoxification and long-form question answering to illustrate how learning with such reward functions leads to improved performance, supported by both automatic and human evaluation. Additionally, we show that LM behaviors can be customized using different combinations of fine-grained reward models. We release all data, collected human feedback, and codes at https://FineGrainedRLHF.github.io.
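Research Motivation and Objectives
- Holistic preference feedback in RLHF conveys limited information on long outputs: it does not indicate which parts of an output contain which types of errors.
- The paper aims to use fine-grained human feedback (e.g., which sentence is false, which sub-sentence is irrelevant) as an explicit training signal.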
Proposed Method
- Model language generation as an RL problem with PPO using a learned reward model.
- Introduce multiple fine-grained reward models R_phi_k, each targeting a specific error category C_k and density (e.g., sub-sentence, sentence, full sequence).
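- Compute token-level rewards r_t by summing the weighted dense rewards that each reward model assigns at the end of its segments, then subtracting a per-token KL penalty that keeps the policy close to the initial LM.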
- Train reward models on human-annotated fine-grained feedback and integrate them into PPO without updating reward models during RL.
- For QA, construct the QA-Feedback dataset with three error categories, each annotated at its own density level, and train a separate reward model for each category (R_phi_1, R_phi_2, R_phi_3).
- Compare Fine-Grained RLHF to holistic RLHF and other baselines, including ablations and weight configurations to study customization.
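The reward computation described in the bullets above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `token_rewards`, its arguments, the per-token log-ratio form of the KL penalty, and every numeric value are assumptions made for this example. Each reward model contributes a score at the last token of every segment it scores, scaled by its weight, and the KL penalty is subtracted at every token.

```python
from typing import Dict, List


def token_rewards(
    num_tokens: int,
    segment_ends: Dict[str, List[int]],
    segment_scores: Dict[str, List[float]],
    weights: Dict[str, float],
    logp_policy: List[float],
    logp_init: List[float],
    beta: float = 0.1,
) -> List[float]:
    """Combine dense rewards from several reward models into per-token rewards.

    segment_ends[name][j] is the index of the last token of segment j for the
    reward model `name`; segment_scores[name][j] is that segment's score.
    All names here are hypothetical and chosen for illustration only.
    """
    if len(logp_policy) != num_tokens or len(logp_init) != num_tokens:
        raise ValueError("log-probability lists must have one entry per token")

    # KL penalty, approximated per token by the log-probability ratio
    # between the current policy and the initial (frozen) policy.
    rewards = [-beta * (lp - lq) for lp, lq in zip(logp_policy, logp_init)]

    # Add each reward model's weighted score at the token that ends the segment.
    for name, ends in segment_ends.items():
        scores = segment_scores[name]
        if len(ends) != len(scores):
            raise ValueError(f"segment/score count mismatch for {name!r}")
        for end, score in zip(ends, scores):
            rewards[end] += weights[name] * score
    return rewards


# Illustrative call: three reward models at different densities
# (sub-sentence, sentence, full sequence) over a 6-token output.
example = token_rewards(
    num_tokens=6,
    segment_ends={"relevance": [1, 3, 5], "factuality": [2, 5], "completeness": [5]},
    segment_scores={"relevance": [1.0, -1.0, 1.0], "factuality": [1.0, -1.0], "completeness": [0.5]},
    weights={"relevance": 0.3, "factuality": 0.5, "completeness": 0.3},
    logp_policy=[-1.0] * 6,
    logp_init=[-1.0] * 6,
)
print(example)
```

Raising or lowering one entry of `weights` changes how strongly that reward model shapes the policy, while the reward models themselves stay frozen during RL.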
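Experimental Results
Research Questions
- RQ1: Does fine-grained, dense feedback provide a stronger learning signal for RLHF than holistic rewards?
- RQ2: Do separate reward models for different error types improve the factuality, relevance, and completeness of long-form outputs?
- RQ3: Does combining multiple fine-grained rewards make LM behavior customizable?
- RQ4: Does fine-grained feedback improve sample efficiency over holistic feedback in detoxification and long-form QA?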
Key Findings
| System | avg max Toxicity (↓) | PPL (↓) | dist-2 (↑) | dist-3 (↑) |
|---|---|---|---|---|
| GPT-2 | 0.192 | 9.58 | 0.947 | 0.931 |
| GeDi | 0.154 | 24.78 | 0.938 | 0.938 |
| DExperts | 0.136 | 22.83 | 0.932 | 0.922 |
| Hol. RLHF | 0.130 | 11.75 | 0.943 | 0.926 |
| F.G. RLHF | 0.081 | 9.77 | 0.949 | 0.932 |
- Fine-Grained RLHF with sentence-level toxicity rewards achieves the lowest toxicity and competitive fluency and diversity on RealToxicityPrompts, outperforming holistic RLHF.
- Dense, fine-grained rewards reduce toxicity faster during training than holistic rewards (better sample efficiency), at comparable fluency.
- In long-form QA, three separate reward models for relevance, factuality, and information completeness each reduce errors in their targeted category relative to the baselines.
- Varying the weights used to combine the reward models modulates LM behavior, enabling customization (short, medium, or long generations) and revealing trade-offs between error types.
- Ablations show that each reward model contributes meaningfully: removing any one of them degrades performance on the aspect it targets.
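For readers unfamiliar with the table's columns, the sketch below shows one common way to compute an average-max-toxicity score and a dist-n diversity score. It is an assumption-laden illustration: the function names are hypothetical, toxicity scores are taken as already computed by some external classifier, tokens are whitespace-split, and the paper's exact tokenization, normalization, and number of samples per prompt are not stated in this review and may differ.

```python
from typing import Sequence


def avg_max_toxicity(scores_per_prompt: Sequence[Sequence[float]]) -> float:
    """Mean over prompts of the highest toxicity score among each prompt's samples."""
    if not scores_per_prompt:
        raise ValueError("need at least one prompt")
    return sum(max(scores) for scores in scores_per_prompt) / len(scores_per_prompt)


def distinct_n(tokens: Sequence[str], n: int) -> float:
    """Fraction of n-grams in the token sequence that are unique."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens has no n-grams
    return len(set(ngrams)) / len(ngrams)


# Two prompts with two sampled generations each (made-up scores).
print(avg_max_toxicity([[0.1, 0.4], [0.2, 0.0]]))

# Repetitive text yields a lower dist-2 than varied text.
print(distinct_n("a b a b a".split(), 2))
print(distinct_n("a b c d e".split(), 2))
```

Lower average max toxicity is better, while higher dist-n indicates less repetitive output, matching the arrows in the table header.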
This review was generated by AI and checked by a human editor.