QUICK REVIEW

[Paper Review] What are human values, and how do we align AI to them?

Oliver Klingefjord, Ryan Lowe|arXiv (Cornell University)|Mar 27, 2024

Ethics and Social Impacts of AI | Cited by 6

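One-Sentence Summary

This paper introduces Moral Graph Elicitation (MGE), a process for eliciting and reconciling human values into a formal alignment target called a moral graph. It is evaluated in a case study with 500 Americans and shows promising legitimacy, fairness, and robustness across six criteria.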

ABSTRACT

There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but it remains unclear how to apply this to language models in practice. We split the problem of "aligning to human values" into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts, and ask the question: what are "good" ways to synthesize diverse human inputs about values into a target for aligning language models? To answer this question, we first define a set of 6 criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses a large language model to interview participants about their values in particular contexts; our approach is inspired by the philosophy of values advanced by Taylor (1977), Chang (2004), and others. We trial MGE with a representative sample of 500 Americans, on 3 intentionally divisive prompts (e.g. advice about abortion). Our results demonstrate that MGE is promising for improving model alignment across all 6 criteria. For example, almost all participants (89.1%) felt well represented by the process, and (89%) thought the final moral graph was fair, even if their value wasn't voted as the wisest. Our process often results in "expert" values (e.g. values from women who have solicited abortion advice) rising to the top of the moral graph, without defining who is considered an expert in advance.

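Motivation and Objectives

Define six criteria that an alignment target must satisfy in order to shape model behavior in accordance with human values.
Introduce the moral graph as a new alignment target, grounded in philosophically motivated value cards.
Describe the Moral Graph Elicitation process for eliciting and reconciling values.
Show through a case study that MGE is promising on all six criteria and yields meaningful participant feedback.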

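Proposed Method

Use value cards as a concrete encapsulation of a value in a specific context.
Build a moral graph that records a context, a pair of values, and which of the two is wiser in that context.
Use a large language model to interview participants and surface their values in particular contexts.
Drawing on Taylor (1977) and Chang (2004a), apply an iterative reconciliation process to determine which values are wiser in each context.
Evaluate the process with a representative sample of 500 Americans on three intentionally divisive prompts.
Compare the moral graph with existing alignment targets on criteria such as legitimacy, auditability, and robustness.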
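
The structure described above (a context, a pair of values, and a judgment about which is wiser) can be pictured with a small sketch. This is only an illustration under stated assumptions: the class names `ValueCard` and `MoralGraph`, their fields, and the method names are hypothetical, and the scoring is a plain count of "wiser" votes chosen for simplicity, not the aggregation used by the authors. The example values and context are made-up placeholders, not data from the study.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ValueCard:
    """Hypothetical value card: a value articulated for a specific context."""
    title: str
    description: str


@dataclass
class MoralGraph:
    """Hypothetical moral graph: votes that one value is wiser than another in a context."""
    # Maps (context, less_wise_title, wiser_title) -> number of participant votes.
    edges: Counter = field(default_factory=Counter)

    def add_vote(self, context: str, less_wise: ValueCard, wiser: ValueCard) -> None:
        """Record one participant's judgment that `wiser` is wiser than `less_wise`."""
        self.edges[(context, less_wise.title, wiser.title)] += 1

    def wiser_vote_counts(self, context: str) -> Counter:
        """Count, per value, the votes naming it the wiser one in `context`.

        Assumption: a simple vote tally, used here only for illustration.
        """
        counts = Counter()
        for (ctx, _less_wise, wiser), votes in self.edges.items():
            if ctx == context:
                counts[wiser] += votes
        return counts


# Placeholder values and context, invented for this example.
reassure = ValueCard("Quick reassurance", "Calm the person as fast as possible.")
autonomy = ValueCard("Informed autonomy", "Help the person weigh options themselves.")

graph = MoralGraph()
graph.add_vote("user facing a difficult decision", reassure, autonomy)
graph.add_vote("user facing a difficult decision", reassure, autonomy)
print(graph.wiser_vote_counts("user facing a difficult decision").most_common(1))
```

Keying each edge on the context keeps the judgments context-specific: the same pair of values can be ordered differently in another context without conflict.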

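Experimental Results

Research Questions

RQ1: How can diverse human inputs about values be elicited in a context-specific and interpretable form?
RQ2: How can the elicited values be reconciled into a fine-grained, generalizable, and scalable alignment target for language models?
RQ3: Can the Moral Graph Elicitation process produce an alignment target that is legitimate, robust, auditable, and scalable?
RQ4: When MGE is applied to real-world prompts, what are the practical outcomes and how do participants perceive them?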

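Key Findings

Participants reported feeling well represented: 89.1% felt the process represented them well.
Similarly, 89% of participants thought the final moral graph was fair, even if their own value was not voted the wisest.
The process tends to surface "expert" values (e.g. values from women who have solicited abortion advice) without defining in advance who counts as an expert.
The six-criteria framework (fine-grained, generalizable, scalable, robust, legitimate, auditable) shows promising results in the case study.
MGE encourages wiser values to emerge in the moral graph by comparing values, balancing context-specific considerations.
The authors argue that aligning AI to human values through a moral graph can complement law and broader AI ethics work.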


This review was generated by AI and reviewed by a human editor.