QUICK REVIEW

[Paper Review] What are human values, and how do we align AI to them?

Oliver Klingefjord, Ryan Lowe|arXiv (Cornell University)|Mar 27, 2024

Ethics and Social Impacts of AI | Citations: 6

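One-line summary

This paper proposes Moral Graph Elicitation (MGE), a process that elicits human values and reconciles them into a formal alignment target called a moral graph. A case study with 500 US participants suggests the approach is promising across six criteria, including legitimacy and robustness, and most participants judged the resulting graph to be fair.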

ABSTRACT

There is an emerging consensus that we need to align AI systems with human values (Gabriel, 2020; Ji et al., 2024), but it remains unclear how to apply this to language models in practice. We split the problem of "aligning to human values" into three parts: first, eliciting values from people; second, reconciling those values into an alignment target for training ML models; and third, actually training the model. In this paper, we focus on the first two parts, and ask the question: what are "good" ways to synthesize diverse human inputs about values into a target for aligning language models? To answer this question, we first define a set of 6 criteria that we believe must be satisfied for an alignment target to shape model behavior in accordance with human values. We then propose a process for eliciting and reconciling values called Moral Graph Elicitation (MGE), which uses a large language model to interview participants about their values in particular contexts; our approach is inspired by the philosophy of values advanced by Taylor (1977), Chang (2004), and others. We trial MGE with a representative sample of 500 Americans, on 3 intentionally divisive prompts (e.g. advice about abortion). Our results demonstrate that MGE is promising for improving model alignment across all 6 criteria. For example, almost all participants (89.1%) felt well represented by the process, and (89%) thought the final moral graph was fair, even if their value wasn't voted as the wisest. Our process often results in "expert" values (e.g. values from women who have solicited abortion advice) rising to the top of the moral graph, without defining who is considered an expert in advance.

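Motivation and objectives

Define six criteria that an alignment target must satisfy in order to shape model behavior in accordance with human values.
Propose a new alignment target, the moral graph, built from values cards and grounded in the philosophy of values.
Describe the Moral Graph Elicitation process for eliciting and reconciling values.
Show through a case study that MGE is promising on the six criteria and yields meaningful participant feedback.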

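Proposed method

Propose values cards, each of which concretely encapsulates a context-specific value.
Construct a moral graph in which each edge records a context, a pair of values, and which of the two is judged wiser in that context.
Use a large language model to interview participants and surface their values in concrete contexts.
Apply an iterative reconciliation process inspired by Taylor (1977) and Chang (2004a) to determine which values are wiser in each context.
Evaluate the process with a representative sample of 500 Americans on three intentionally divisive prompts.
Compare the moral graph with existing alignment targets on criteria such as legitimacy, auditability, and robustness.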
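
The values-card and moral-graph structures described above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the field names, the `rank_values` helper, and the net-endorsement score are not taken from the paper, which may aggregate judgments differently. The example cards and contexts are placeholders, not study data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ValuesCard:
    """A context-specific value. Field names are illustrative assumptions."""
    title: str
    description: str


@dataclass(frozen=True)
class Edge:
    """One judgment: in `context`, `wiser` is considered wiser than `less_wise`."""
    context: str
    less_wise: ValuesCard
    wiser: ValuesCard


def rank_values(edges, context):
    """Rank the values that appear in `context`, wisest first.

    Assumed scoring rule (a simple stand-in, not the paper's method):
    score = times judged wiser - times judged less wise.
    """
    score = defaultdict(int)
    for edge in edges:
        if edge.context != context:
            continue
        score[edge.wiser] += 1
        score[edge.less_wise] -= 1
    # Highest score first; ties broken by title so the order is deterministic.
    return sorted(score, key=lambda card: (-score[card], card.title))


# Placeholder cards and contexts, purely hypothetical.
value_a = ValuesCard("Value A", "placeholder description")
value_b = ValuesCard("Value B", "placeholder description")
value_c = ValuesCard("Value C", "placeholder description")

edges = [
    Edge("context 1", less_wise=value_a, wiser=value_b),
    Edge("context 1", less_wise=value_b, wiser=value_c),
    Edge("context 1", less_wise=value_a, wiser=value_c),
    Edge("context 2", less_wise=value_c, wiser=value_a),
]

ranking = rank_values(edges, "context 1")
print([card.title for card in ranking])
```

Because edges are keyed by context, the same pair of values can be ordered differently in different contexts, which is the property the moral graph is meant to capture. In this sketch, `value_c` ranks first in "context 1" while `value_a` ranks first in "context 2".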

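Experimental results

Research questions

RQ1 How can diverse human inputs about values be elicited in a context-specific, interpretable form?
RQ2 How can the elicited values be reconciled into a fine-grained, generalizable, and scalable alignment target for language models?
RQ3 Can the Moral Graph Elicitation process produce an alignment target that is legitimate, robust, auditable, and scalable?
RQ4 What are the practical outcomes and participant perceptions when MGE is applied to real-world prompts?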

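Key findings

Participants reported feeling well represented: 89.1% said the process represented them well.
Similarly, 89% of participants judged the final moral graph to be fair, even if their own value was not voted the wisest.
The approach tends to surface so-called "expert" values without defining in advance who counts as an expert.
The case study gave promising evidence on all six criteria (fine-grained, generalizable, scalable, robust, legitimate, auditable).
MGE encourages wiser values to emerge in the moral graph and balances context-specific considerations through comparisons between values.
The authors argue that aligning to human values via the moral graph can complement laws and regulation as well as broader AI ethics efforts.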


This review was generated by AI and checked by a human editor.