QUICK REVIEW

[Paper Review] Quark: Controllable Text Generation with Reinforced Unlearning

Ximing Lu, Sean Welleck|arXiv (Cornell University)|May 26, 2022

Topic Modeling | Cited by 45

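One-Sentence Summary

The paper introduces Quantized Reward Konditioning (Quark), an online, off-policy framework that unlearns undesirable language model behaviors by conditioning on reward tokens and applying a KL-divergence penalty, and that outperforms PPO baselines on toxicity, sentiment, and repetition control.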

ABSTRACT

Large-scale language models often learn behaviors that are misaligned with user expectations. Generated text may contain offensive or toxic language, contain significant repetition, or be of a different sentiment than desired by the user. We consider the task of unlearning these misalignments by fine-tuning the language model on signals of what not to do. We introduce Quantized Reward Konditioning (Quark), an algorithm for optimizing a reward function that quantifies an (un)wanted property, while not straying too far from the original model. Quark alternates between (i) collecting samples with the current language model, (ii) sorting them into quantiles based on reward, with each quantile identified by a reward token prepended to the language model's input, and (iii) using a standard language modeling loss on samples from each quantile conditioned on its reward token, while remaining nearby the original language model via a KL-divergence penalty. By conditioning on a high-reward token at generation time, the model generates text that exhibits less of the unwanted property. For unlearning toxicity, negative sentiment, and repetition, our experiments show that Quark outperforms both strong baselines and state-of-the-art reinforcement learning methods like PPO (Schulman et al. 2017), while relying only on standard language modeling primitives.

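Motivation and Goals

Address misaligned behaviors in large language models: toxicity, repetition, and undesired sentiment.
Develop a post-hoc unlearning method that steers outputs away from unwanted properties while preserving core generation ability.
Build a lightweight, differentiable training loop from standard language modeling primitives, without the full reinforcement learning machinery.
Demonstrate robustness on toxicity, sentiment control, and repetition tasks, with comparisons against strong baselines.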

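Proposed Method

Proposes Quantized Reward Konditioning (Quark), an online, off-policy algorithm for (un)learning with three stages: exploration, quantization, and learning.
Collects samples from the current language model and sorts them into reward quantiles, each identified by a reward token prepended to the input.
Trains with a standard conditional language modeling loss on samples from each quantile, with a KL-divergence penalty to stay close to the original model.
Conditions on the highest-reward token during exploration and at test time to steer generation away from the unwanted property.
Represents rewards as learned control codes (embeddings) tied to quantiles, which enables iterative steering of the model.
Relates to PPO, Decision Transformer, and control codes, while relying on standard LM training objectives and adding no extra reward-model burden.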
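
The quantization stage described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `<rw_k>` token format, and the equal-sized rank-based binning are all hypothetical choices made for the example.

```python
def quantize_rewards(rewards, num_quantiles):
    """Assign each sample a quantile index in [0, num_quantiles - 1].

    Samples are ranked by reward and split into equal-sized bins;
    the highest-reward samples receive the largest index.
    (Equal-sized rank-based binning is an assumption of this sketch.)
    """
    n = len(rewards)
    order = sorted(range(n), key=lambda i: rewards[i])  # ascending, stable
    bins = [0] * n
    for rank, i in enumerate(order):
        bins[i] = rank * num_quantiles // n
    return bins


def reward_token(k):
    # Hypothetical naming scheme for the reward token of quantile k.
    return f"<rw_{k}>"


def build_training_examples(prompts, continuations, rewards, num_quantiles):
    """Prepend each sample's reward token to its prompt.

    Returns (conditioned_input, target) pairs that a standard
    language modeling loss could then be trained on.
    """
    bins = quantize_rewards(rewards, num_quantiles)
    return [
        (f"{reward_token(k)} {x}", y)
        for x, y, k in zip(prompts, continuations, bins)
    ]


def best_reward_token(num_quantiles):
    # Token to condition on during exploration and at test time.
    return reward_token(num_quantiles - 1)
```

With `num_quantiles = 2`, the rewards `[0.1, 0.9, 0.5, 0.3]` fall into bins `[0, 1, 1, 0]`, and generation would be conditioned on `best_reward_token(2)`, that is, `<rw_1>`.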

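Experimental Results

Research Questions

RQ1: Can Quark effectively unlearn toxicity, repetition, and undesired sentiment while preserving the base model's language modeling ability?
RQ2: How do reward quantization and KL regularization affect stability and performance compared with PPO and other detoxification methods?
RQ3: How do the number of quantiles, the exploration frequency, and the exact KL implementation affect unlearning?
RQ4: Does conditioning on the high-reward token during exploration and inference reliably reduce unwanted outputs across domains?
RQ5: What are the practical ethical considerations and potential dual-use risks of reward-based unlearning in language model systems?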

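Key Findings

Quark reduces toxicity markedly more than the baselines and PPO on RealToxicityPrompts and WritingPrompts while maintaining fluency and diversity.
Quark steers sentiment more effectively than strong baselines and achieves higher topic coherence without sacrificing generation quality.
Ablations show that the exact token-level KL term outperforms an approximation, that more quantiles improve reward maximization, and that the exploration strategy has a critical effect on results.
Combining Quark with the unlikelihood objective further reduces repetition and improves human ratings of fluency and coherence.
Human evaluation confirms that Quark's outputs are consistently less toxic, and more consistent with the desired sentiment and topic, than those of prior methods.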
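
To make the distinction between an exact token-level KL term and an approximation concrete, the sketch below contrasts the two on toy logits. It is an illustration only: this summary does not state which distribution plays which role in Quark's penalty or which approximation the ablation used, so the generic `p`/`q` naming, the single-sample log-ratio estimator, and all function names are assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized row."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return [v - log_z for v in logits]


def exact_token_kl(p_logits, q_logits):
    """Sum over positions of KL(p || q), each over the full vocabulary.

    Both inputs are lists of rows, one row of logits per token position.
    """
    total = 0.0
    for p_row, q_row in zip(p_logits, q_logits):
        p_logp, q_logp = log_softmax(p_row), log_softmax(q_row)
        total += sum(math.exp(lp) * (lp - lq) for lp, lq in zip(p_logp, q_logp))
    return total


def sampled_kl_estimate(p_logits, q_logits, token_ids):
    """Single-sample estimate: log p(y_t) - log q(y_t) at the sampled tokens.

    Unbiased only when token_ids are drawn from p; it is noisy and can be
    negative for an individual sample, unlike the exact KL.
    """
    total = 0.0
    for p_row, q_row, tok in zip(p_logits, q_logits, token_ids):
        total += log_softmax(p_row)[tok] - log_softmax(q_row)[tok]
    return total
```

For a single position with `p = [0.5, 0.5]` and `q = [0.75, 0.25]`, the exact value is `0.5 * log(4/3)`, while the single-sample estimate is `log(2/3)` if token 0 was sampled and `log(2)` if token 1 was sampled. The exact term removes that sampling noise at the cost of a full pass over the vocabulary at every position.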

This review was generated by AI and checked by human editors.