QUICK REVIEW

[Paper Review] Is Reinforcement Learning (Not) for Natural Language Processing: Benchmarks, Baselines, and Building Blocks for Natural Language Policy Optimization

Rajkumar Ramamurthy, Prithviraj Ammanabrolu|arXiv (Cornell University)|Oct 3, 2022

Topic Modeling|Citations: 54

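One-sentence summary

This paper introduces three things: RL4LMs, an open-source library for training language models with reinforcement learning; GRUE, a benchmark that evaluates RL algorithms on NLP tasks using reward functions that capture automated measures of human preference; and NLPO, a PPO variant with token-level action masking that improves stability and performance over PPO.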

ABSTRACT

We tackle the problem of aligning pre-trained large language models (LMs) with human preferences. If we view text generation as a sequential decision-making problem, reinforcement learning (RL) appears to be a natural conceptual framework. However, using RL for LM-based generation faces empirical challenges, including training instability due to the combinatorial action space, as well as a lack of open-source libraries and benchmarks customized for LM alignment. Thus, a question rises in the research community: is RL a practical paradigm for NLP? To help answer this, we first introduce an open-source modular library, RL4LMs (Reinforcement Learning for Language Models), for optimizing language generators with RL. The library consists of on-policy RL algorithms that can be used to train any encoder or encoder-decoder LM in the HuggingFace library (Wolf et al. 2020) with an arbitrary reward function. Next, we present the GRUE (General Reinforced-language Understanding Evaluation) benchmark, a set of 6 language generation tasks which are supervised not by target strings, but by reward functions which capture automated measures of human preference. GRUE is the first leaderboard-style evaluation of RL algorithms for NLP tasks. Finally, we introduce an easy-to-use, performant RL algorithm, NLPO (Natural Language Policy Optimization) that learns to effectively reduce the combinatorial action space in language generation. We show 1) that RL techniques are generally better than supervised methods at aligning LMs to human preferences; and 2) that NLPO exhibits greater stability and performance than previous policy gradient methods (e.g., PPO (Schulman et al. 2017)), based on both automatic and human evaluations.

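Motivation and objectives

Show that RL is effective for aligning pre-trained LMs with human preferences.
Provide an open-source, modular toolkit for RL-based LM optimization.
Introduce GRUE as a benchmark of RL-based NLP tasks that are supervised by reward functions reflecting human preference rather than by target strings.
Propose NLPO, which reduces the large action space of language generation and improves training stability on LM alignment tasks.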

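Proposed method

Develop RL4LMs as an on-policy RL toolkit compatible with HuggingFace models and Stable-Baselines3.
Formulate language generation as a token-level MDP, with rewards assigned per token or per sequence.
Introduce NLPO, a masked PPO variant that uses top-p masking to shrink the action space during training.
Define a KL-regularized reward that balances the task reward against staying close to the base LM.
Build GRUE as a multi-task benchmark comprising reward-based evaluation and human studies.
Provide extensive ablations and analysis of PPO, NLPO, and supervised+RL settings.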
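
The top-p masking and KL-regularized reward items above can be illustrated with a minimal sketch. It is not the RL4LMs implementation: the function names (`softmax`, `top_p_mask`, `masked_policy`, `kl_penalized_reward`), the toy four-token vocabulary, and the choice to take the mask from a separate set of "mask" logits are all assumptions made for illustration. The KL term uses the common per-token estimate log π(a|s) − log π_ref(a|s) with a fixed coefficient `beta`, which is likewise an assumption rather than something this review specifies.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits (-inf entries get probability 0)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_mask(probs, p):
    """Keep the smallest set of highest-probability tokens whose cumulative mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    mask = [False] * len(probs)
    cumulative = 0.0
    for i in order:
        mask[i] = True
        cumulative += probs[i]
        if cumulative >= p:
            break
    return mask


def masked_policy(policy_logits, mask):
    """Renormalize the policy over the allowed tokens only; masked tokens get probability 0."""
    masked = [logit if keep else -math.inf for logit, keep in zip(policy_logits, mask)]
    return softmax(masked)


def kl_penalized_reward(task_reward, logp_policy, logp_ref, beta):
    """Task reward minus beta times a per-token KL estimate against the reference LM."""
    return task_reward - beta * (logp_policy - logp_ref)


# Hypothetical logits over a 4-token vocabulary.
mask_logits = [2.0, 1.0, 0.0, -1.0]    # distribution that decides which tokens stay available
policy_logits = [0.5, 0.5, 3.0, 0.0]   # current policy, which prefers token 2

mask = top_p_mask(softmax(mask_logits), p=0.8)
print(mask)                                  # [True, True, False, False]
print(masked_policy(policy_logits, mask))    # [0.5, 0.5, 0.0, 0.0]

# The policy is more confident than the reference on the sampled token, so the reward is reduced.
print(kl_penalized_reward(1.0, math.log(0.5), math.log(0.25), beta=0.1))
```

In this toy case the policy's favorite token (index 2) falls outside the top-p set and receives zero probability after masking, which is how masking shrinks the effective action space. The KL term lowers the reward whenever the policy assigns the sampled token more probability than the reference LM does.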

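Experimental results

Research questions

RQ1: Can RL methods outperform supervised fine-tuning at aligning LMs with human preferences across diverse NLP tasks?
RQ2: Does NLPO offer stability and performance advantages over PPO in language generation with large action spaces?
RQ3: How much do reward quality, KL regularization toward the reference model, and masking affect RL stability and alignment?
RQ4: Do RL approaches improve data efficiency or parameter efficiency compared with purely supervised learning?
RQ5: How well do automatic evaluation metrics correlate with human judgments in RL-based language policy optimization?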

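Key findings

RL methods are generally better than supervised approaches at aligning LMs with human preferences across the evaluated tasks.
NLPO shows greater stability and performance than PPO in both automatic and human evaluations.
The KL penalty and top-p masking mitigate reward hacking and improve alignment.
Supervised warm-starting and data-efficient reward learning can yield strong performance with small models.
RL can be more data-efficient than supervised learning when the data is used to improve the reward model, and combining NLPO with supervised training outperforms larger supervised models on some tasks.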

This review was generated by AI and checked by a human editor.