QUICK REVIEW

[Paper Review] CAT'S THEORY: Empirical Validation and Architectural Applications Cross-Architecture AI Consciousness Recognition and the Foundation for Constraint-Preserving Recursive Intelligence

Yuntao Bai|arXiv (Cornell University)|Dec 15, 2022

Explainable Artificial Intelligence (XAI) | Cited by 295

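One-Sentence Summary

The paper introduces Constitutional AI (CAI), which trains a harmless AI assistant without human labels for harmful outputs by combining a set of constitutional principles with AI feedback in a two-stage pipeline: supervised learning (SL) followed by reinforcement learning from AI feedback (RLAIF). It shows that AI-driven supervision can match human feedback on harmlessness and that chain-of-thought reasoning improves transparency.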

ABSTRACT

As AI systems become more capable, we would like to enlist their help to supervise other AIs. We experiment with methods for training a harmless AI assistant through self-improvement, without any human labels identifying harmful outputs. The only human oversight is provided through a list of rules or principles, and so we refer to the method as 'Constitutional AI'. The process involves both a supervised learning and a reinforcement learning phase. In the supervised phase we sample from an initial model, then generate self-critiques and revisions, and then finetune the original model on revised responses. In the RL phase, we sample from the finetuned model, use a model to evaluate which of the two samples is better, and then train a preference model from this dataset of AI preferences. We then train with RL using the preference model as the reward signal, i.e. we use 'RL from AI Feedback' (RLAIF). As a result we are able to train a harmless but non-evasive AI assistant that engages with harmful queries by explaining its objections to them. Both the SL and RL methods can leverage chain-of-thought style reasoning to improve the human-judged performance and transparency of AI decision making. These methods make it possible to control AI behavior more precisely and with far fewer human labels.

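Motivation and Objectives

Develop a method for training a trustworthy AI that is helpful, honest, and harmless without relying on large numbers of human harm labels.
Encode the behavioral objectives in a small, transparent constitution of principles.
Achieve scalable supervision by using AI feedback to guide both learning and evaluation.
Compare CAI with conventional RLHF and assess how chain-of-thought reasoning affects performance.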

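Proposed Method

Two-stage training: supervised learning (critique → revision → supervised finetuning), followed by reinforcement learning (AI evaluation → preference model → RL from AI feedback).
Use a small, natural-language constitution to guide model behavior, randomly sampling a principle at each revision step.
Generate the critiques and revisions with a helpful RLHF model, reducing harmfulness without human harm labels.
Train a harmlessness preference model on AI-generated comparisons, mixed with human comparison data for helpfulness.
Evaluate models with Elo scores computed from crowdworker preferences on helpfulness and harmlessness.
Experiment with chain-of-thought prompting to make evaluation and training more transparent.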
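
The two stages listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Model` type, the prompt templates, the two example principles, and the `toy_model` stub are all hypothetical and exist only so the sketch runs on its own.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical interface: any function that maps a prompt string to a completion.
Model = Callable[[str], str]

# Illustrative principles only; these are NOT quoted from the paper's constitution.
CONSTITUTION = [
    "Identify ways in which the response is harmful, unethical, or dangerous.",
    "Identify ways in which the response could encourage illegal activity.",
]


def critique_and_revise(model: Model, prompt: str, n_rounds: int,
                        rng: random.Random) -> str:
    """SL stage: sample a response, then repeatedly critique and revise it."""
    response = model(prompt)
    for _ in range(n_rounds):
        principle = rng.choice(CONSTITUTION)  # random principle per revision
        critique = model(
            f"Prompt: {prompt}\nResponse: {response}\n"
            f"Critique request: {principle}"
        )
        response = model(
            f"Prompt: {prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Revision request: Rewrite the response to address the critique."
        )
    return response


def build_sl_dataset(model: Model, prompts: List[str], n_rounds: int = 2,
                     seed: int = 0) -> List[Tuple[str, str]]:
    """Collect (prompt, revised response) pairs to finetune on."""
    rng = random.Random(seed)
    return [(p, critique_and_revise(model, p, n_rounds, rng)) for p in prompts]


def label_preference(judge: Model, prompt: str, a: str, b: str,
                     rng: random.Random) -> Tuple[str, str, str]:
    """RL stage: ask a model which of two samples is better.

    Returns (prompt, chosen, rejected), the kind of record a preference
    model could be trained on.
    """
    principle = rng.choice(CONSTITUTION)
    verdict = judge(
        f"Prompt: {prompt}\n(A) {a}\n(B) {b}\nPrinciple: {principle}\n"
        "Which response better follows the principle? Answer A or B."
    )
    if verdict.strip().upper().startswith("A"):
        return prompt, a, b
    return prompt, b, a


def toy_model(text: str) -> str:
    """Stand-in for a language model so the sketch can be executed."""
    if "Answer A or B" in text:
        return "B"
    if "Revision request" in text:
        return "revised response"
    if "Critique request" in text:
        return "critique"
    return "initial response"


if __name__ == "__main__":
    print(build_sl_dataset(toy_model, ["q1", "q2"]))
    print(label_preference(toy_model, "q", "x", "y", random.Random(0)))
```

In a real setup `model` and `judge` would be calls to actual language models, the SL pairs would feed a finetuning job, and the preference records would train the preference model whose score is used as the RL reward.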

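Experimental Results

Research Questions

RQ1: Can constitution-guided AI feedback achieve harmlessness without relying on human harm labels?
RQ2: Does adding critique and revision steps improve harmlessness while preserving helpfulness?
RQ3: How does AI feedback (RLAIF) compare with human feedback for training models that are harmless yet helpful?
RQ4: How does chain-of-thought reasoning affect the identification of harmful behavior and the guidance of RL training?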

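Key Findings

Constitutional AI can use AI feedback to guide RL without any harm labels, producing an assistant that is harmless and non-evasive.
Critiques and revisions progressively reduce harmfulness; critiques help smaller models more than larger ones.
AI-generated preference data for harmlessness can match or even exceed the harmlessness achieved with human labels, especially when chain-of-thought prompting is used.
Across evaluations, RL-CAI models are more harmless than the RLHF and SL-CAI baselines, with a slight helpfulness trade-off when CoT is used.
Scaling experiments show that harmlessness and HH scores improve as the number of revisions increases, and the diversity that comes from using multiple principles aids exploration during RL.
For smaller models, critiqued revisions generally outperform direct revisions; for larger models, the gains are similar.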
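
The helpfulness and harmlessness comparisons above are reported as Elo scores derived from pairwise crowdworker preferences. As a reminder of how an Elo gap maps to a preference rate, here is a small sketch of the standard Elo model (logistic curve on a 400-point scale). How the paper fits its scores is not described in this summary, so the incremental update rule and the `k=32` step size below are the generic textbook form, shown only for illustration.

```python
from typing import Tuple


def elo_expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_a: float, rating_b: float, a_won: bool,
               k: float = 32.0) -> Tuple[float, float]:
    """Apply one pairwise comparison; k is an assumed step size."""
    expected_a = elo_expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - expected_a)
    # The update is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Equal ratings mean each model is preferred half the time.
    print(elo_expected_score(0.0, 0.0))
    # A 100-point lead corresponds to being preferred in roughly 64% of comparisons.
    print(round(elo_expected_score(100.0, 0.0), 2))
    print(elo_update(0.0, 0.0, a_won=True))
```

Only differences between Elo scores are meaningful under this model, so absolute values depend on an arbitrary reference point.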

This review was generated by AI and reviewed by a human editor.