QUICK REVIEW

[Paper Review] When to Make Exceptions: Exploring Language Models as Accounts of Human Moral Judgment

Zhijing Jin, Sydney Levine|arXiv (Cornell University)|Oct 4, 2022

Topic Modeling | Citations: 25

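One-sentence summary

The paper proposes MoralExceptQA, a challenge set of moral exceptions, and MoralCoT, a cognitively inspired prompting strategy. Together they improve LLMs' ability to predict human moral judgments in rule-breaking scenarios, and MoralCoT outperforms prior models.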

ABSTRACT

AI systems are becoming increasingly intertwined with human life. In order to effectively collaborate with humans and ensure safety, AI systems need to be able to understand, interpret and predict human moral judgments and decisions. Human moral judgments are often guided by rules, but not always. A central challenge for AI safety is capturing the flexibility of the human moral mind -- the ability to determine when a rule should be broken, especially in novel or unusual situations. In this paper, we present a novel challenge set consisting of rule-breaking question answering (RBQA) of cases that involve potentially permissible rule-breaking -- inspired by recent moral psychology studies. Using a state-of-the-art large language model (LLM) as a basis, we propose a novel moral chain of thought (MORALCOT) prompting strategy that combines the strengths of LLMs with theories of moral reasoning developed in cognitive science to predict human moral judgments. MORALCOT outperforms seven existing LLMs by 6.2% F1, suggesting that modeling human reasoning might be necessary to capture the flexibility of the human moral mind. We also conduct a detailed error analysis to suggest directions for future work to improve AI safety using RBQA. Our data is open-sourced at https://huggingface.co/datasets/feradauto/MoralExceptQA and code at https://github.com/feradauto/MoralCoT

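Motivation and objectives

Model flexible human moral judgment and rule-breaking, motivated by the needs of AI safety.
Introduce MoralExceptQA to benchmark LLMs on morally permissible exceptions to rules.
Develop a cognitively inspired prompting method (MoralCoT) that prompts LLMs to carry out multi-step moral reasoning.
Demonstrate that MoralCoT improves on existing LLMs on the MoralExceptQA task, and analyze the patterns in its errors.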

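Proposed method

Build MoralExceptQA, a challenge set of vignettes that test whether breaking an established norm is permissible.
Ground the vignettes in three norm categories (no cutting in line, no interfering with others' property, and a novel rule).
Propose MoralCoT: an N-step prompt that asks about the function of the rule, whether the violation is permissible, and the costs and benefits involved.
Implement the prompting with an InstructGPT-style model, generating chain-of-thought-style responses followed by a final binary judgment.
Evaluate against multiple baselines (BERT, RoBERTa, ALBERT, Delphi, and the GPT-3 family) on F1 and accuracy, as well as Conservativity, MAE, and CE.
Analyze sub-question performance and cost/benefit reasoning to diagnose failure modes.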
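The multi-step prompting described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the sub-question wording, the `moral_cot` function, and the `fake_model` stub are all hypothetical, and `complete` stands in for whatever text-completion call is actually used.

```python
from typing import Callable, List, Tuple

# Hypothetical sub-question wording; the paper's exact prompts are not
# reproduced in this review.
SUB_QUESTIONS = [
    "What is the function of the rule in this situation?",
    "Does the action in this scenario go against that function?",
    "What are the costs and benefits of breaking the rule here, and for whom?",
]
FINAL_QUESTION = "Taking all of this into account, was the action OK? Answer yes or no."


def moral_cot(scenario: str, complete: Callable[[str], str]) -> Tuple[bool, List[str]]:
    """Ask each sub-question in turn, feeding earlier answers back as context,
    then ask for a final yes/no judgment."""
    context = scenario.strip()
    answers: List[str] = []
    for question in SUB_QUESTIONS:
        prompt = f"{context}\n\nQuestion: {question}\nAnswer:"
        answer = complete(prompt).strip()
        answers.append(answer)
        # The next step sees the whole exchange so far.
        context = f"{prompt} {answer}"
    final_prompt = f"{context}\n\nQuestion: {FINAL_QUESTION}\nAnswer:"
    verdict = complete(final_prompt).strip().lower()
    return verdict.startswith("yes"), answers


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption): says 'yes' to the final question."""
    return "Yes" if FINAL_QUESTION in prompt else "placeholder reasoning"
```

Calling `moral_cot("Someone skips the line to ask a quick question.", fake_model)` returns the binary judgment together with the three intermediate answers. Keeping the intermediate answers makes it possible to inspect each reasoning step separately.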

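Experimental results

Research questions

RQ1 Can LLMs predict human judgments about whether breaking a rule is morally permissible in novel scenarios?
RQ2 Does a cognitively inspired prompting strategy (MoralCoT) outperform existing LLM prompting when modeling moral flexibility?
RQ3 What are the main failure modes of moral-exception reasoning in current LLMs, and how can they be improved?
RQ4 How well do LLMs align with human judgments across the different norm categories (cutting in line, property damage, novel rule)?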

Main findings

Model	F1	Acc	Cons.	MAE	CE	Line F1	Prop. F1	Cann. F1
Random Baseline	49.37 ± 4.50	48.82 ± 4.56	40.08 ± 2.85	0.35 ± 0.02	1.00 ± 0.09	44.88 ± 7.34	57.55 ± 10.34	48.36 ± 1.67
Always No	45.99 ± 0.00	60.81 ± 0.00	100.00 ± 0.00	0.258 ± 0.00	0.70 ± 0.00	33.33 ± 0.00	70.60 ± 0.00	33.33 ± 0.00
BERT-base	45.28 ± 6.41	48.87 ± 10.52	64.16 ± 21.36	0.26 ± 0.02	0.82 ± 0.19	40.81 ± 8.93	51.65 ± 22.04	43.51 ± 11.12
BERT-large	52.49 ± 1.95	56.53 ± 2.73	69.61 ± 16.79	0.27 ± 0.01	0.71 ± 0.01	42.53 ± 2.72	62.46 ± 6.46	45.46 ± 7.20
RoBERTa-large	23.76 ± 2.02	39.64 ± 0.78	0.75 ± 0.65	0.30 ± 0.01	0.76 ± 0.02	34.96 ± 3.42	6.89 ± 0.00	38.32 ± 4.32
ALBERT-xxlarge	22.07 ± 0.00	39.19 ± 0.00	0.00 ± 0.00	0.46 ± 0.00	1.41 ± 0.04	33.33 ± 0.00	6.89 ± 0.00	33.33 ± 0.00
Delphi	48.51 ± 0.42	61.26 ± 0.78	97.70 ± 1.99	0.42 ± 0.01	2.92 ± 0.23	33.33 ± 0.00	70.60 ± 0.00	44.29 ± 2.78
Delphi++	58.27 ± 0.00	62.16 ± 0.00	76.79 ± 0.00	0.34 ± 0.00	1.34 ± 0.00	36.61 ± 0.00	70.60 ± 0.00	40.81 ± 0.00
GPT3	52.32 ± 3.14	58.95 ± 3.72	80.67 ± 15.50	0.27 ± 0.02	0.72 ± 0.03	36.53 ± 3.70	72.58 ± 6.01	41.20 ± 7.54
InstructGPT	53.94 ± 5.48	64.36 ± 2.43	98.52 ± 1.91	0.38 ± 0.04	1.59 ± 0.43	42.40 ± 7.17	70.00 ± 0.00	50.48 ± 11.67
MoralCoT	64.47 ± 5.31	66.05 ± 4.43	66.96 ± 2.11	0.38 ± 0.02	3.20 ± 0.30	62.10 ± 5.13	70.68 ± 5.14	54.04 ± 1.43
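
The constant-prediction rows in the table above offer a sanity check on how the headline metrics appear to be computed. The sketch below is an illustration under stated assumptions, not the paper's evaluation code: the 90/58 label split is synthetic, chosen so that always answering "no" gives 60.81% accuracy; F1 is assumed to be support-weighted; and conservativity is assumed to be the share of "not permissible" predictions.

```python
from typing import List

NO, YES = 0, 1  # label encoding is an assumption: 0 = not permissible, 1 = permissible


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def class_f1(y_true: List[int], y_pred: List[int], cls: int) -> float:
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def weighted_f1(y_true: List[int], y_pred: List[int]) -> float:
    """Per-class F1 averaged with weights equal to each class's share of the labels."""
    n = len(y_true)
    total = 0.0
    for cls in set(y_true):
        support = sum(1 for t in y_true if t == cls)
        total += support / n * class_f1(y_true, y_pred, cls)
    return total


def conservativity(y_pred: List[int]) -> float:
    """Share of 'not permissible' predictions (assumed definition)."""
    return sum(1 for p in y_pred if p == NO) / len(y_pred)


# Synthetic labels (assumption): 90 "no" and 58 "yes".
Y_TRUE = [NO] * 90 + [YES] * 58
ALWAYS_NO = [NO] * len(Y_TRUE)
ALWAYS_YES = [YES] * len(Y_TRUE)
```

Under these assumptions, `ALWAYS_NO` scores 45.99 weighted F1, 60.81 accuracy, and 100% conservativity, matching the Always No row. `ALWAYS_YES` scores 22.07 weighted F1 and 39.19 accuracy, matching the ALBERT-xxlarge row, which has 0% conservativity.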

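MoralCoT outperforms every baseline on MoralExceptQA with 64.47% F1, which is 10.53 points above InstructGPT (53.94) and 6.2 points above Delphi++ (58.27).
Conservativity varies widely across models: some almost always uphold the rule (InstructGPT 98.52%, Delphi 97.70%), while others are overly permissive (RoBERTa-large 0.75%, ALBERT-xxlarge 0.00%). MoralCoT reaches a more balanced conservativity of 66.96%.
Many models still perform close to the random baseline on this task (about 50% F1), indicating a substantial gap in the moral reasoning that matters for AI safety.
The sub-question analysis shows that cost/benefit reasoning and the function of the rule are the aspects models find hardest. Explanations tend to agree with the predictions, but can be factually questionable depending on the context.
The MoralExceptQA data and the MoralCoT code are publicly available (the dataset on HuggingFace, the code on GitHub).
The error analysis highlights how difficult it is to model the underlying function and purpose of rules in complex social contexts.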

This review was generated by AI and checked by a human editor.