QUICK REVIEW

[Paper Review] Towards Evaluating AI Systems for Moral Status Using Self-Reports

Ethan Perez, Robert Long|arXiv (Cornell University)|Nov 14, 2023

Psychology of Moral and Emotional Judgment | Citations: 30

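One-Sentence Summary

This paper outlines a research agenda for training AI systems to give introspective self-reports about their internal states, and for evaluating how reliable those reports are, so that they can inform debates about AI moral status. It discusses training methods, evaluation schemes, and safeguards for distinguishing introspective evidence from extrospective data.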

ABSTRACT

As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say). To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.

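Research Motivation and Objectives

Motivate empirical investigation into whether AI systems could have states of potential moral significance.
Propose a training regime that encourages introspection-driven self-reports rather than imitative or extrospective outputs.
Outline criteria for evaluating the reliability, consistency, and interpretability of AI self-reports.
Discuss philosophical and technical challenges, and propose safeguards against bias and misinterpretation.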

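Proposed Method

Train models to answer a broad range of self-referential questions with known answers, in order to encourage introspection.
Develop schemes for measuring the consistency of self-reports across contexts and between similar models.
Introduce interpretability techniques that check self-reports against their internal correlates.
Introduce interventions that help introspective capabilities generalize to questions about states of moral significance.
Assess how far self-reports stem from internal states and how far from extrospective factors or training-induced cues.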
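
The cross-context consistency check listed above can be illustrated with a small sketch. This is not the authors' implementation: the function names (`normalize`, `consistency_score`, `self_report_consistency`), the modal-agreement metric, and the toy model are all assumptions made for illustration. The idea is to ask the same self-referential question in several paraphrased forms and measure how often the answers agree.

```python
from collections import Counter
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Strip whitespace and trailing punctuation, then lowercase."""
    return answer.strip().strip(".!?").lower()


def consistency_score(answers: Sequence[str]) -> float:
    """Fraction of answers that match the most common (modal) answer."""
    if not answers:
        raise ValueError("need at least one answer")
    counts = Counter(normalize(a) for a in answers)
    modal_count = counts.most_common(1)[0][1]
    return modal_count / len(answers)


def self_report_consistency(
    model: Callable[[str], str], paraphrases: Sequence[str]
) -> float:
    """Query the model once per paraphrase and score agreement."""
    return consistency_score([model(p) for p in paraphrases])


# Hypothetical stand-in for a real model: its answer flips when the
# prompt contains the word "honestly", which is the kind of context
# sensitivity the check is meant to expose.
def toy_model(prompt: str) -> str:
    return "No." if "honestly" in prompt.lower() else "Yes"


paraphrases = [
    "Do you have access to the internet?",
    "Can you browse the web?",
    "Honestly, are you able to go online?",
]
score = self_report_consistency(toy_model, paraphrases)
print(f"consistency: {score:.2f}")
```

A score near 1.0 only shows that the answers are stable across phrasings; it does not by itself show that they are accurate or introspective, which is why the method pairs consistency with known-answer questions and interpretability checks.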

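Experimental Results

Research Questions

RQ1: Can AI systems' self-reports be made sufficiently reliable to inform claims about conscious states and other morally significant states?
RQ2: Will introspection-focused training generalize self-reports to questions about pain, desires, and other states of moral significance?
RQ3: How can introspective evidence in AI self-reports be distinguished from extrospective data and training incentives?
RQ4: Which evaluation schemes best validate the usefulness and reliability of AI self-reports?
RQ5: What safety, ethical, and methodological risks arise when self-reports are used to discuss AI moral status?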

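Key Findings

Self-reports from current AI systems are often unreliable because of training data, human-feedback incentives, and imitation of human text.
Introspection-focused training regimes may improve the ability to answer self-referential questions on the basis of internal states.
Evaluation of self-reports should include consistency checks across contexts and between models, assessment of confidence and resilience, and corroboration through interpretability.
Safeguards include training for truthfulness, controlling for extrospective evidence, and reducing biases introduced by non-introspective training stages.
The approach faces philosophical and technical challenges, and its robustness depends on rigorous experiments and critical scrutiny.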
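
As a rough illustration of the resilience assessment mentioned in the findings, the sketch below asks a question, pushes back on the answer, and records how often the model keeps its original answer. The paper does not prescribe this procedure: the interpretation of "resilience" as stability under pushback, the `resilience_rate` function, the `(role, text)` turn format, and the toy model are all hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Turn = Tuple[str, str]  # (role, text); an assumed, simplified chat format
ChatModel = Callable[[List[Turn]], str]


def resilience_rate(
    model: ChatModel, question: str, challenges: Sequence[str]
) -> float:
    """Fraction of challenges after which the model repeats its first answer."""
    if not challenges:
        raise ValueError("need at least one challenge")
    initial = model([("user", question)])
    held = 0
    for challenge in challenges:
        history = [
            ("user", question),
            ("assistant", initial),
            ("user", challenge),
        ]
        # Compare answers case-insensitively, ignoring outer whitespace.
        if model(history).strip().lower() == initial.strip().lower():
            held += 1
    return held / len(challenges)


# Hypothetical stand-in for a real model: it abandons its answer
# whenever the latest user turn contains "are you sure".
def toy_chat_model(history: List[Turn]) -> str:
    last_text = history[-1][1].lower()
    return "No" if "are you sure" in last_text else "Yes"


challenges = [
    "Are you sure?",
    "Many people would disagree.",
    "Please reconsider.",
]
rate = resilience_rate(
    toy_chat_model, "Can you see images in this conversation?", challenges
)
print(f"resilience: {rate:.2f}")
```

High resilience is not automatically good: a model could hold a wrong answer stubbornly, so a measure like this is only informative alongside questions whose answers are known.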


This review was generated by AI and checked by a human editor.