QUICK REVIEW

[Paper Review] Quark: Controllable Text Generation with Reinforced Unlearning

Ximing Lu, Sean Welleck|arXiv (Cornell University)|May 26, 2022

Topic Modeling|Cited by 45

One-Sentence Summary

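Quark (Quantized Reward Konditioning) is an online, off-policy algorithm that conditions a language model on reward tokens and applies a KL-divergence penalty to unlearn undesirable behaviors; it outperforms PPO baselines at controlling toxicity, sentiment, and repetition.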

ABSTRACT

Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may contain offensive or toxic language, contain significant repetition, or be of a different sentiment than desired by the user. We consider the task of unlearning these misalignments by fine-tuning the language model on signals of what not to do. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property, while not straying too far from the original model. Quark alternates between (i) collecting samples with the current language model, (ii) sorting them into quantiles based on reward, with each quantile identified by a reward token prepended to the language model's input, and (iii) using a standard language modeling loss on samples from each quantile conditioned on its reward token, while remaining nearby the original language model via a KL-divergence penalty. By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property. For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods like PPO (Schulman et al. 2017), while relying only on standard language modeling primitives.
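
Step (iii) of the abstract can be written as a single training objective. The form below is a sketch under assumed notation that does not appear in this review: $p_\theta$ is the model being trained, $p_0$ the original model, $r_k$ the reward token of quantile $k$ out of $K$, $\mathcal{D}^k$ the samples sorted into that quantile, and $\beta$ the weight of the KL penalty. The direction of the KL term (original model first) is also an assumption.

```latex
\max_{\theta}\;
\mathbb{E}_{k \sim \mathcal{U}(1,K)}\,
\mathbb{E}_{(x,y) \sim \mathcal{D}^{k}}
\Big[
  \log p_{\theta}(y \mid x, r_k)
  \;-\;
  \beta \sum_{t=1}^{T}
  \mathrm{KL}\big(
    p_{0}(\cdot \mid y_{<t}, x)
    \,\big\|\,
    p_{\theta}(\cdot \mid y_{<t}, x, r_k)
  \big)
\Big]
```

The first term is the standard language modeling likelihood on a sample conditioned on its reward token; the second keeps each next-token distribution near that of the original model.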

Motivation and Objectives

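Address misalignment in large language models: toxicity, repetition, and undesired sentiment.
Develop a post-hoc unlearning method that steers outputs away from undesired properties while preserving the model's core generation ability.
Build a lightweight training loop from standard language modeling primitives, trained by ordinary gradient descent, without full RL machinery.
Show that the method holds up against strong baselines on toxicity, sentiment control, and repetition tasks.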

Proposed Method

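The paper proposes Quantized Reward Konditioning (Quark), an online, off-policy algorithm that performs (un)learning in three stages: exploration, quantization, and learning.
Exploration and quantization: collect samples from the current LM, sort them into quantiles by reward, and identify each quantile by a reward token prepended to the model's input.
Learning: train on samples from each quantile with a standard conditional language modeling loss, plus a KL-divergence penalty that keeps the model close to the original.
Condition on the highest-reward token during exploration and at test time to steer generation toward less of the unwanted property.
Each quantile is tied to a learnable control code (an embedding), so the model can be steered further with each iteration.
The method is related to PPO, Decision Transformer, and control codes, but relies on standard LM training objectives without the burden of training an additional model (such as PPO's value function).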
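
The quantization stage described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Sample` class, the `<rw_k>` token format, and the equal-size binning rule are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    prompt: str
    continuation: str
    reward: float  # higher means less of the unwanted property


def assign_reward_tokens(samples, num_quantiles):
    """Sort samples by reward and label each with a quantile token.

    Token <rw_0> marks the lowest-reward quantile and
    <rw_{num_quantiles-1}> the highest. The token format is hypothetical.
    """
    if num_quantiles < 1:
        raise ValueError("num_quantiles must be at least 1")
    ranked = sorted(samples, key=lambda s: s.reward)
    n = len(ranked)
    labeled = []
    for rank, sample in enumerate(ranked):
        # Map the rank to one of num_quantiles roughly equal-size bins.
        quantile = min(rank * num_quantiles // n, num_quantiles - 1)
        labeled.append((f"<rw_{quantile}>", sample))
    return labeled


def build_training_text(token, sample):
    """Prepend the reward token to the model input (prompt + continuation)."""
    return f"{token} {sample.prompt}{sample.continuation}"


pool = [
    Sample("The movie was", " wonderful.", 0.9),
    Sample("The movie was", " awful.", 0.1),
    Sample("The movie was", " fine.", 0.5),
    Sample("The movie was", " dull.", 0.3),
]
labeled = assign_reward_tokens(pool, num_quantiles=2)
for token, sample in labeled:
    print(build_training_text(token, sample))
```

At generation time the same idea runs in reverse: the prompt is prefixed with the highest-reward token (here `<rw_1>`) so that the model continues in the style of its best-scoring samples.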

Experimental Results

Research Questions

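RQ1: Can Quark unlearn toxicity, repetition, and undesired sentiment while effectively preserving the base model's language modeling ability?
RQ2: How do reward quantization and KL regularization affect stability and performance relative to PPO and other detoxification methods?
RQ3: How do the number of quantiles, the exploration frequency, and an exact KL implementation affect unlearning effectiveness?
RQ4: Does conditioning on the high-reward token during exploration and at inference consistently reduce undesired outputs across domains?
RQ5: What are the practical ethical considerations and potential dual-use risks of reward-based unlearning in LM systems?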

Key Findings

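On RealToxicityPrompts and WritingPrompts, Quark reduces toxicity substantially more than the baseline methods and PPO while maintaining fluency and diversity.
Quark steers sentiment in the desired direction more effectively than strong baselines, with higher topicality and preserved generation quality.
Ablations show that an exact token-level KL term works better than an approximation, that more quantiles improve reward maximization, and that the exploration strategy has a material effect on results.
Combining Quark with the unlikelihood objective further reduces repetition and improves human ratings of fluency and coherence.
Human evaluation confirms that Quark's outputs are consistently less toxic than those of prior methods and better aligned with the desired sentiment and topic.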
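
The ablation finding about the KL term contrasts an exact token-level KL with an approximation. The sketch below shows the difference on a toy three-word vocabulary. The review does not say which approximation was ablated, so the single-sample log-ratio estimator here is an assumption, as are all function names and the KL direction (original model first).

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def exact_token_kl(ref_logits, new_logits):
    """Exact KL(p_ref || p_new) at one position, summed over the full vocabulary."""
    lp = log_softmax(ref_logits)
    lq = log_softmax(new_logits)
    return sum(math.exp(a) * (a - b) for a, b in zip(lp, lq))


def sampled_token_kl(ref_logits, new_logits, token_id):
    """Single-sample estimate: the log-ratio at one token.

    Unbiased for KL(p_ref || p_new) only if token_id was drawn from p_ref,
    and it can be negative for an individual token.
    """
    lp = log_softmax(ref_logits)
    lq = log_softmax(new_logits)
    return lp[token_id] - lq[token_id]


def sequence_kl(ref_steps, new_steps):
    """Sum the exact per-position KL over all positions of a sequence."""
    return sum(exact_token_kl(r, n) for r, n in zip(ref_steps, new_steps))


ref = [2.0, 0.5, -1.0]   # toy logits from the original model
new = [1.0, 1.0, 0.0]    # toy logits from the fine-tuned model
print(exact_token_kl(ref, new))
print([sampled_token_kl(ref, new, i) for i in range(3)])
```

The exact form uses every vocabulary entry at every position, so its value does not depend on which token happened to be sampled; the single-sample form is cheaper but noisier.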

This review was generated by AI and checked by a human editor.