QUICK REVIEW

[Paper Review] Safe Reinforcement Learning in Constrained Markov Decision Processes

Akifumi Wachi, Yanan Sui | arXiv (Cornell University) | Aug 15, 2020

Reinforcement Learning in Robotics | Cited by 54

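One-sentence summary

Introduces SNO-MDP, a safe, near-optimal reinforcement learning algorithm for constrained MDPs with unknown safety constraints, together with ES2 for accelerating safe exploration; both are validated on GP-Safety-Gym and on Mars terrain data.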

ABSTRACT

Safe reinforcement learning has been a promising approach for optimizing the policy of an agent that operates in safety-critical applications. In this paper, we propose an algorithm, SNO-MDP, that explores and optimizes Markov decision processes under unknown safety constraints. Specifically, we take a stepwise approach for optimizing safety and cumulative reward. In our method, the agent first learns safety constraints by expanding the safe region, and then optimizes the cumulative reward in the certified safe region. We provide theoretical guarantees on both the satisfaction of the safety constraint and the near-optimality of the cumulative reward under proper regularity assumptions. In our experiments, we demonstrate the effectiveness of SNO-MDP through two experiments: one uses a synthetic data in a new, openly-available environment named GP-SAFETY-GYM, and the other simulates Mars surface exploration by using real observation data.

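Motivation and objectives

Advance safe reinforcement learning for safety-critical applications in which safety and reward must be traded off.
Develop a stepwise approach that first learns the safety constraints and then optimizes cumulative reward within the certified safe region.
Provide PAC-MDP-style theoretical guarantees, under regularity assumptions, on safety satisfaction and near-optimal reward.
Propose ES2 to accelerate the exploration of safety while preserving the safety guarantee.
Demonstrate effectiveness through experiments on the synthetic GP-Safety-Gym environment and a simulation based on Mars terrain data.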

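Proposed method

Model safety and reward with Gaussian processes (GPs) to capture the unknown functions and derive optimistic and pessimistic safe spaces.
Define a pessimistic safe space S_t^- and an optimistic safe space S_t^+, subject to reachability and returnability constraints, so that the safe region is expanded safely.
Use GP-derived confidence intervals, together with the parameters beta_t and alpha_t, to bound g(s) and r(s) with high probability.
Implement SNO-MDP as a two-phase algorithm: first expand the safe region, then optimize reward within the certified safe region.
Introduce ES2, which evaluates an auxiliary MDP M_y against a stopping condition and ends the exploration of safety early once further exploration cannot improve reward.
Give theoretical guarantees under RKHS and Lipschitz assumptions (Theorem 1: safety and completeness; Theorem 2: near-optimality; Theorem 3: near-optimality under ES2).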
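
The confidence-interval step above can be illustrated with a minimal sketch. It is not the authors' implementation: the 1-D state space, the RBF kernel, the threshold name h, the sqrt(beta) scaling (a common GP-UCB convention) and all numeric values are assumptions made for illustration. It also omits the Lipschitz-based expansion and the reachability and returnability checks that the method applies on top of these bounds.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    """Squared-exponential kernel between two 1-D arrays of state positions."""
    diff = a[:, None] - b[None, :]
    return np.exp(-0.5 * (diff / length_scale) ** 2)

def gp_posterior(x_obs, y_obs, x_query, noise=1e-2, length_scale=1.0):
    """Posterior mean and standard deviation of a zero-mean GP at x_query."""
    k_oo = rbf_kernel(x_obs, x_obs, length_scale) + noise * np.eye(len(x_obs))
    k_qo = rbf_kernel(x_query, x_obs, length_scale)
    mean = k_qo @ np.linalg.solve(k_oo, y_obs)
    # The RBF prior variance is 1 at every point; subtract the explained part.
    var = 1.0 - np.sum(k_qo * np.linalg.solve(k_oo, k_qo.T).T, axis=1)
    return mean, np.sqrt(np.clip(var, 0.0, None))

def safe_sets(mean, std, beta, h):
    """Classify states using lower and upper confidence bounds on g(s).

    h is a hypothetical safety threshold: a state counts as safe if g(s) >= h.
    """
    lower = mean - np.sqrt(beta) * std
    upper = mean + np.sqrt(beta) * std
    pessimistic = lower >= h  # safe even under the lower bound
    optimistic = upper >= h   # still possibly safe under the upper bound
    return pessimistic, optimistic

# Hypothetical data: ten states on a line, safety observed at the first three.
states = np.arange(10, dtype=float)
x_obs = np.array([0.0, 1.0, 2.0])
y_obs = np.array([1.0, 0.9, 0.8])
mean, std = gp_posterior(x_obs, y_obs, states)
pessimistic, optimistic = safe_sets(mean, std, beta=4.0, h=0.2)
```

In this toy setup the observed states fall in the pessimistic set, while distant, unobserved states have wide intervals and so belong only to the optimistic set. Every pessimistic state is also optimistic, because the lower bound never exceeds the upper bound.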

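Experimental results

Research questions

RQ1: Can the safety constraint be guaranteed while learning an unknown reward function in a constrained MDP?
RQ2: Can the stepwise approach (learn safety first, then optimize reward) obtain a near-optimal policy while retaining the safety guarantee?
RQ3: How can the exploration of safety be accelerated without sacrificing the safety guarantee?
RQ4: Under the regularity assumptions, do SNO-MDP and its ES2 variant still satisfy the theoretical PAC-MDP-style guarantees?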

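Key findings

SNO-MDP guarantees safety with high probability during exploration and achieves near-optimal cumulative reward within the safe region.
The algorithm converges to a safe region that contains the ε_g-approximate safe reachable set, which ensures completeness of safe exploration under the stated conditions.
After sufficient exploration, SNO-MDP attains ε_V-near-optimal reward together with a high-probability safety guarantee (PAC-MDP style).
ES2 stops the exploration of safety once further exploration cannot improve reward, reducing the number of exploration steps while preserving the near-optimality guarantee.
P-ES2 is a practical refinement that treats safety probabilistically, although it comes without a formal near-optimality guarantee.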
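
The intuition behind optimizing reward inside a certified region, and behind stopping exploration when it cannot pay off, can be sketched as follows. This is a simplified stand-in, not the paper's auxiliary MDP M_y or its stopping condition, whose details are not given in this review. The deterministic 1-D chain, the function names value_iteration and exploration_can_help, the slack parameter and all numbers are assumptions made for illustration.

```python
import numpy as np

def value_iteration(reward, allowed, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Value iteration on a 1-D chain where only `allowed` states may be visited.

    Actions are deterministic: move left, stay, or move right.
    States outside `allowed` keep value -inf, so no action selects them.
    Assumes at least one state is allowed.
    """
    n = len(reward)
    value = np.where(allowed, 0.0, -np.inf)
    for _ in range(max_iter):
        new_value = value.copy()
        for s in np.flatnonzero(allowed):
            neighbors = [t for t in (s - 1, s, s + 1) if 0 <= t < n and allowed[t]]
            new_value[s] = reward[s] + gamma * max(value[t] for t in neighbors)
        if np.max(np.abs(new_value[allowed] - value[allowed])) < tol:
            return new_value
        value = new_value
    return value

def exploration_can_help(reward_upper, pessimistic, optimistic, start,
                         gamma=0.9, slack=1e-4):
    """Heuristic check: could states beyond the certified region raise the value?

    Compares the best value under an optimistic reward estimate when the agent
    is confined to the pessimistic safe set against the best value when it may
    use the whole optimistic safe set.
    """
    v_certified = value_iteration(reward_upper, pessimistic, gamma)[start]
    v_hopeful = value_iteration(reward_upper, optimistic, gamma)[start]
    return v_hopeful - v_certified > slack

# Hypothetical six-state chain: the first three states are certified safe.
pessimistic = np.array([True, True, True, False, False, False])
optimistic = np.ones(6, dtype=bool)
best_inside = np.array([0.1, 1.0, 0.2, 0.1, 0.1, 0.1])
best_outside = np.array([0.1, 0.2, 0.2, 0.1, 0.1, 2.0])
stop_early = not exploration_can_help(best_inside, pessimistic, optimistic, start=0)
keep_exploring = exploration_can_help(best_outside, pessimistic, optimistic, start=0)
```

When the most rewarding state already lies inside the certified region, enlarging that region cannot raise the value, so the sketch reports that exploration can stop. When a high optimistic reward sits outside the certified region, it reports that exploration may still help.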


This review was generated by AI and reviewed by a human editor.