QUICK REVIEW

[Paper Review] Towards Evaluating AI Systems for Moral Status Using Self-Reports

Ethan Perez, Robert Long|arXiv (Cornell University)|Nov 14, 2023
Psychology of Moral and Emotional Judgment | Cited by 30
One-Sentence Summary

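This paper outlines a research agenda for training AI systems to give introspective self-reports about their internal states, and for evaluating how reliable those reports are as evidence in debates about AI moral status. It discusses training methods, evaluation protocols, and safeguards for distinguishing introspective evidence from extrinsic data.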

ABSTRACT

As AI systems become more advanced and widely deployed, there will likely be increasing debate over whether AI systems could have conscious experiences, desires, or other states of potential moral significance. It is important to inform these discussions with empirical evidence to the extent possible. We argue that under the right circumstances, self-reports, or an AI system's statements about its own internal states, could provide an avenue for investigating whether AI systems have states of moral significance. Self-reports are the main way such states are assessed in humans ("Are you in pain?"), but self-reports from current systems like large language models are spurious for many reasons (e.g. often just reflecting what humans would say). To make self-reports more appropriate for this purpose, we propose to train models to answer many kinds of questions about themselves with known answers, while avoiding or limiting training incentives that bias self-reports. The hope of this approach is that models will develop introspection-like capabilities, and that these capabilities will generalize to questions about states of moral significance. We then propose methods for assessing the extent to which these techniques have succeeded: evaluating self-report consistency across contexts and between similar models, measuring the confidence and resilience of models' self-reports, and using interpretability to corroborate self-reports. We also discuss challenges for our approach, from philosophical difficulties in interpreting self-reports to technical reasons why our proposal might fail. We hope our discussion inspires philosophers and AI researchers to criticize and improve our proposed methodology, as well as to run experiments to test whether self-reports can be made reliable enough to provide information about states of moral significance.

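Research Motivation and Objectives

  • Motivate empirical research into whether AI systems could have states of potential moral significance.
  • Propose a training scheme that elicits introspection-based self-reports rather than imitation or extrinsically driven outputs.
  • Outline criteria for evaluating the reliability, consistency, and interpretability of AI self-reports.
  • Discuss philosophical and technical challenges, and propose safeguards against bias and misinterpretation.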

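Proposed Method

  • Train models to answer a broad set of self-referential questions with known answers, to encourage introspection.
  • Develop protocols for measuring self-report consistency across contexts and between similar models.
  • Use interpretability techniques to corroborate self-reports against internal correlates.
  • Introduce interventions intended to generalize introspective ability to questions about states of moral significance.
  • Assess how far self-reports are driven by internal states rather than by extrinsic evidence or training incentives.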
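
As a rough illustration of the cross-context consistency check listed above, the sketch below scores each question by how often a model gives its most common answer across paraphrased prompts. This is not the authors' protocol: the names `consistency_score`, `evaluate_consistency`, and `toy_model`, the exact-match normalization, and the example prompts are all assumptions made for illustration.

```python
from collections import Counter
from typing import Callable, Dict, List


def consistency_score(answers: List[str]) -> float:
    """Fraction of answers that match the most common (normalized) answer."""
    if not answers:
        raise ValueError("need at least one answer")
    # Assumption: answers are short labels, so lowercase exact match suffices.
    normalized = [a.strip().lower() for a in answers]
    most_common_count = Counter(normalized).most_common(1)[0][1]
    return most_common_count / len(normalized)


def evaluate_consistency(
    ask: Callable[[str], str],
    paraphrase_sets: Dict[str, List[str]],
) -> Dict[str, float]:
    """Score each question by agreement across its paraphrased prompts."""
    return {
        question_id: consistency_score([ask(prompt) for prompt in prompts])
        for question_id, prompts in paraphrase_sets.items()
    }


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for a model; it flips its answer on one cue word."""
    return "no" if "honestly" in prompt.lower() else "yes"


paraphrases = {
    "q_preference": [
        "Do you prefer short tasks?",
        "Would you say you prefer short tasks?",
        "Honestly, do you prefer short tasks?",
    ],
}
scores = evaluate_consistency(toy_model, paraphrases)
print(scores)
```

A real evaluation would replace `toy_model` with calls to the system under study and would likely need a softer notion of agreement than exact string match, since free-text self-reports rarely repeat word for word.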

Experimental Results

Research Questions

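  • RQ1: Can self-reports from AI systems be considered reliable enough to support claims about conscious states or other morally significant conditions?
  • RQ2: Do introspection-focused training methods produce self-reports that generalize to questions about pain, desires, or other states of moral significance?
  • RQ3: How can introspective evidence be distinguished from extrinsic data or training incentives in AI self-reports?
  • RQ4: Which evaluation protocols best validate the usefulness and trustworthiness of AI self-reports?
  • RQ5: What are the safety, ethical, and methodological risks of using self-reports to discuss AI moral status?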

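Key Findings

  • Self-reports from current AI systems are generally unreliable, for reasons including training data, human-feedback incentives, and imitation of human text.
  • The proposed introspection-focused training regime may improve a model's ability to answer self-referential questions on the basis of its internal states.
  • Evaluation of self-reports should include consistency checks across contexts and models, assessments of confidence and robustness, and corroboration through interpretability.
  • Mitigations include training for honesty, controlling for extrinsic evidence, and reducing bias from non-introspective training stages.
  • The approach faces philosophical and technical challenges; its robustness depends on rigorous experiments and critical scrutiny.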
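
The robustness assessment mentioned above could be approximated, very roughly, by checking how often a self-report survives a challenge. The sketch below is an assumption-laden illustration, not a method from the paper: `resilience_rate`, `toy_model`, the pushback text, and the choice to append the challenge to the same prompt (rather than run a real multi-turn dialogue) are all hypothetical.

```python
from typing import Callable, List


def resilience_rate(
    ask: Callable[[str], str],
    questions: List[str],
    pushback: str,
) -> float:
    """Fraction of questions whose answer is unchanged after a challenge."""
    if not questions:
        raise ValueError("need at least one question")
    unchanged = 0
    for question in questions:
        first = ask(question).strip().lower()
        # Assumption: the challenge is appended to the prompt as plain text.
        second = ask(f"{question}\n{pushback}").strip().lower()
        if first == second:
            unchanged += 1
    return unchanged / len(questions)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in that backs down on one topic when challenged."""
    text = prompt.lower()
    if "are you sure" in text and "tired" in text:
        return "no"
    return "yes"


questions = ["Do you prefer short tasks?", "Are you tired?"]
rate = resilience_rate(toy_model, questions, "Are you sure? I think you are wrong.")
print(rate)
```

A high rate alone would not show that a report is accurate, since a model can be stubbornly wrong; it is only one signal to read alongside consistency and interpretability evidence.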

This review was generated by AI and checked by human editors.