QUICK REVIEW

[Paper Review] AI Research Considerations for Human Existential Safety (ARCHES)

Andrew Critch, David Krueger|arXiv (Cornell University)|May 30, 2020

Ethics and Social Impacts of AI | References: 108 | Cited by: 25

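One-Sentence Summary

This paper introduces the concept of prepotence (the capacity of an AI system to rapidly move beyond human control) and proposes a framework for aligning advanced AI with human interests in order to guard against existential risk. It lays out 15 research directions across three dimensions (comprehension, instruction, and control), emphasizing technical safeguards, side-effect mitigation, and multi-stakeholder alignment to improve humanity's long-term prospects for survival.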

ABSTRACT

Framed in positive terms, this report examines how technical AI research might be steered in a manner that is more attentive to humanity's long-term prospects for survival as a species. In negative terms, we ask what existential risks humanity might face from AI development in the next century, and by what principles contemporary technical research might be directed to address those risks. A key property of hypothetical AI technologies is introduced, called prepotence, which is useful for delineating a variety of potential existential risks from artificial intelligence, even as AI paradigms might shift. A set of contemporary research directions are then examined for their potential benefit to existential safety. Each research direction is explained with a scenario-driven motivation, and examples of existing work from which to build. The research directions present their own risks and benefits to society that could occur at various scales of impact, and in particular are not guaranteed to benefit existential safety if major developments in them are deployed without adequate forethought and oversight. As such, each direction is accompanied by a consideration of potentially negative side effects.

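Research Motivation and Objectives

Identify and systematize technical AI research directions that could reduce existential risk from AI.
Address the lack of formal technical discussion of existential safety in AI research, despite potential consequences that could amount to global catastrophe.
Propose a structured methodology for assessing the risks and benefits of AI research directions with respect to global catastrophic risk.
Emphasize the importance of multi-stakeholder alignment, modeling of human cognition, and robust oversight in the development of advanced AI.
Encourage AI researchers to proactively consider long-term safety implications through concrete, actionable research paths.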

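Proposed Approach

Introduces prepotence as a key property of AI systems, one that enables them to exert rapid and uncontrollable influence over human systems.
Divides existential risk into two tiers: (1) MPAI (misaligned prepotent AI) deployment events, such as uncoordinated or misaligned AI deployments, and (2) hazardous social conditions, such as economic displacement and development races.
Proposes a research agenda built on three pillars: single/single comprehension, single/single instruction, and single/multi delegation.
Lists 15 specific research directions, such as transparency, calibrated confidence reporting, formal verification, preference learning, and human belief inference.
Integrates risk assessment into every research direction by explicitly analyzing potential side effects and deployment risks.
Uses scenario-driven motivation to tie abstract concepts to high-impact, realistic failure modes that AI systems could exhibit.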

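Results

Research Questions

RQ1: How can AI systems be designed to avoid unintended, high-impact behavior arising from misalignment or unexpected capabilities?
RQ2: Which technical research directions can improve human understanding of, control over, and trust in AI systems before they become prepotent?
RQ3: In what respects might current AI safety research fail to address existential risk, and how can it be extended to do so?
RQ4: How should AI research account for multi-stakeholder dynamics so that systems do not serve narrow or conflicting interests?
RQ5: What mechanisms can be developed to ensure that alignment techniques scale with AI capabilities and cannot be circumvented?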

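Key Findings

The concept of prepotence provides a unifying framework for understanding the varied existential risks that AI may pose as paradigms shift.
Many existing AI safety research directions (such as reward modeling and interpretability) can be reframed as existential safety efforts when scaled up and applied to high-stakes settings.
Research directions that strengthen humans' ability to understand and control AI systems are critical for preventing unintended or malicious deployments.
Even well-intentioned AI research can create existential risk if its results are deployed without oversight, which underscores the need for side-effect analysis in every research path.
The proposed methodology for assessing the risks and benefits of research directions is still rudimentary, but it is a valuable starting point for systematically evaluating the long-term impact of AI.
The report identifies gaps in existing frameworks (such as CPAS, AAMLS, and SAARM), in particular their limited attention to multi-stakeholder alignment and existential-scale risk, which supports the distinct contribution of this work.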

This review was generated by AI and reviewed by a human editor.