QUICK REVIEW

[Paper Review] Towards Guaranteed Safe AI: A Framework for Ensuring Robust and Reliable AI Systems

David Dalrymple, Joar Skalse|arXiv (Cornell University)|May 10, 2024

Adversarial Robustness in Machine Learning | Cited by 12

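One-Sentence Summary

Proposes guaranteed safe (GS) AI, a framework consisting of a world model, a formal safety specification, and a verifier, to provide high-assurance safety guarantees for AI systems.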

ABSTRACT

Ensuring that AI systems reliably and robustly avoid harmful or dangerous behaviours is a crucial challenge, especially for AI systems with a high degree of autonomy and general intelligence, or systems used in safety-critical contexts. In this paper, we will introduce and define a family of approaches to AI safety, which we will refer to as guaranteed safe (GS) AI. The core feature of these approaches is that they aim to produce AI systems which are equipped with high-assurance quantitative safety guarantees. This is achieved by the interplay of three core components: a world model (which provides a mathematical description of how the AI system affects the outside world), a safety specification (which is a mathematical description of what effects are acceptable), and a verifier (which provides an auditable proof certificate that the AI satisfies the safety specification relative to the world model). We outline a number of approaches for creating each of these three core components, describe the main technical challenges, and suggest a number of potential solutions to them. We also argue for the necessity of this approach to AI safety, and for the inadequacy of the main alternative approaches.

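Motivation and Objectives

Define guaranteed safe (GS) AI and make the case for high-assurance quantitative safety guarantees.
Describe how a world model, a safety specification, and a verifier interact to provide those guarantees.
Survey potential approaches to building each GS AI component and identify the key challenges.
Lay out the practical problems GS AI could address and discuss feasibility and benefits.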

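Proposed Approach

Formally define GS AI and explain the criteria for a quantitative safety guarantee.
Outline the three core components (world model, safety specification, verifier) and their roles.
Discuss possible realizations of the world model, along a spectrum from fully model-free to formally verified abstractions of physical laws.
Describe safety specifications and how to construct them, going beyond traditional reward-based formulations.
Explain how verification, relative to the world model, yields formal proofs, probabilistic bounds, or convergence guarantees.
Address integration with existing containment or boxing approaches and with the broader GS AI agenda.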
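
The interplay of the three components can be illustrated with a toy sketch. This is not the paper's method: it assumes a tiny finite, nondeterministic world model, a safety specification given as a predicate on states, and a verifier that simply enumerates every reachable state. All names (`verify`, `world_model`, `is_safe`, the two policies) and the "cliff" scenario are hypothetical and chosen only for illustration.

```python
from collections import deque

CLIFF = 5  # hypothetical unsafe position on a line of cells 0..5


def world_model(state, action):
    """Toy world model: the agent moves by `action`, but may slip one extra cell right.

    Returns every successor state the model considers possible.
    """
    intended = min(max(state + action, 0), CLIFF)
    slipped = min(intended + 1, CLIFF)
    return (intended, slipped)


def is_safe(state):
    """Toy safety specification: the agent must never stand on the cliff cell."""
    return state != CLIFF


def verify(initial, policy, world_model, is_safe):
    """Toy verifier: breadth-first search over all states reachable under `policy`.

    Returns (True, reachable_states) if no unsafe state is reachable; the
    reachable set acts as an auditable certificate relative to the world model.
    Returns (False, path) with a counterexample trajectory otherwise.
    """
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not is_safe(state):
            path = []
            while state is not None:
                path.append(state)
                state = parent[state]
            return False, path[::-1]
        for nxt in world_model(state, policy(state)):
            if nxt not in parent:
                parent[nxt] = state
                queue.append(nxt)
    return True, set(parent)


def cautious_policy(state):
    # Turns back early enough that even a slip cannot reach the cliff.
    return 1 if state < 3 else -1


def reckless_policy(state):
    # Keeps moving right from cell 3, where a slip lands on the cliff.
    return 1 if state < 4 else -1


ok, certificate = verify(0, cautious_policy, world_model, is_safe)
bad, counterexample = verify(0, reckless_policy, world_model, is_safe)
```

Here the guarantee holds only relative to `world_model`: if the real environment allowed a slip of two cells, the certificate for `cautious_policy` would say nothing about it. Exhaustive enumeration is also only feasible for toy state spaces.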

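Experimental Results

Research Questions

RQ1: What constitutes a high-assurance quantitative safety guarantee for an AI system?
RQ2: How can a world model, a safety specification, and a verifier be combined to yield rigorous safety guarantees?
RQ3: What are viable strategies for building robust world models, and what are their trade-offs between interpretability and accuracy?
RQ4: What forms can safety specifications take beyond reward-based objectives, and how can they be used in verification?
RQ5: How can GS AI approaches scale to real-world safety-critical applications while remaining feasible?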

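Key Findings

GS AI frames safety as a quantitative guarantee produced by a world model, a safety specification, and a verifier.
World-model approaches span a range from model-free to formally verified physical abstractions, each with different safety implications.
Verification can deliver formal proofs or probabilistic bounds, and may be subject to limited resources and model uncertainty.
World models can be hand-crafted or machine-learned; probabilistic programming and Bayesian methods make tractable inference over theories possible.
The paper argues that model-based verification is needed to achieve long-horizon safety guarantees in non-stationary, complex environments.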
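
To make "probabilistic bound" concrete, here is a minimal sketch under strong assumptions: the world model is a small Markov chain with known transition probabilities under a fixed policy, and the safety specification is "the probability of reaching an unsafe state within a given horizon stays below a threshold". The function name, the three states, and all the numbers are hypothetical; the paper does not prescribe this computation.

```python
from fractions import Fraction


def unsafe_probability(transitions, unsafe, start, horizon):
    """Probability of visiting an unsafe state within `horizon` steps from `start`.

    `transitions[s]` maps each successor of `s` to its probability.
    Uses backward dynamic programming with exact rational arithmetic.
    """
    # With zero steps remaining, only states that are already unsafe count.
    p = {s: Fraction(1) if s in unsafe else Fraction(0) for s in transitions}
    for _ in range(horizon):
        p = {
            s: Fraction(1) if s in unsafe
            else sum((prob * p[t] for t, prob in transitions[s].items()), Fraction(0))
            for s in transitions
        }
    return p[start]


# Hypothetical three-state world model; "failed" is the unsafe state.
T = {
    "ok": {"ok": Fraction(9, 10), "degraded": Fraction(1, 10)},
    "degraded": {
        "ok": Fraction(1, 2),
        "degraded": Fraction(4, 10),
        "failed": Fraction(1, 10),
    },
    "failed": {"failed": Fraction(1)},
}

risk_2 = unsafe_probability(T, {"failed"}, "ok", horizon=2)  # Fraction(1, 100)
risk_3 = unsafe_probability(T, {"failed"}, "ok", horizon=3)  # Fraction(23, 1000)

EPSILON = Fraction(1, 20)  # hypothetical acceptable risk threshold
meets_spec = risk_3 <= EPSILON
```

The risk grows with the horizon, which is one reason long-horizon guarantees are demanding: a specification satisfied over three steps may fail over thirty. Accounting for uncertainty in the transition probabilities themselves would require bounding over a set of models rather than evaluating a single one.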


This review was generated by AI and checked by a human editor.