QUICK REVIEW

[Paper Digest] Provably Safe Reinforcement Learning for Stochastic Reach-Avoid Problems with Entropy Regularization

Abhijit Mazumdar, Rafal Wisniewski|VBN Forskningsportal (Aalborg Universitet)|Jan 13, 2026

Reinforcement Learning in Robotics | Cited by 0

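One-Sentence Summary

This paper introduces p-safe RL (pSRL) and ER-pSRL, its entropy-regularized variant, for stochastic reach-avoid CMDPs, and provides theoretical safety guarantees and regret bounds.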

ABSTRACT

We consider the problem of learning the optimal policy for Markov decision processes with safety constraints. We formulate the problem in a reach-avoid setup. Our goal is to design online reinforcement learning algorithms that ensure safety constraints with arbitrarily high probability during the learning phase. To this end, we first propose an algorithm based on the optimism in the face of uncertainty (OFU) principle. Based on the first algorithm, we propose our main algorithm, which utilizes entropy regularization. We investigate the finite-sample analysis of both algorithms and derive their regret bounds. We demonstrate that the inclusion of entropy regularization improves the regret and drastically controls the episode-to-episode variability that is inherent in OFU-based safe RL algorithms.

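Research Motivation and Goals

Motivated by the need to learn safely in safety-critical MDPs, with safety expressed probabilistically (p-safety) in a reach-avoid setting.
Develop pSRL, an online OFU-based algorithm that guarantees safety with high probability during learning.
Introduce entropy regularization (ER-pSRL) to improve regret and reduce episode-to-episode variability.
Introduce a proxy set in the analysis to speed up learning, and analyze how prior structure affects performance.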

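Proposed Method

Formalizes the problem as a CMDP with a terminal target set and non-terminal unsafe and living sets.
Develops pSRL using an OFU-based extended linear program (LP) over occupation measures to enforce the safety constraint.
Provides a safe baseline policy under probabilistic safety for cases where no deterministically safe action may exist.
Introduces ER-pSRL by adding an entropy regularization term to the LP objective, promoting exploration and stability.
Derives finite-sample regret bounds for pSRL and ER-pSRL, along with improvements in variability and convergence.
Adds a proxy set to speed up learning when the structure of the state space is known in advance.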
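
The reach-avoid constraint can be made concrete with a small sketch. The example below assumes a hypothetical four-state MDP (the states, actions, transition probabilities, and function names are all illustrative and not taken from the paper) and takes p-safety to mean that the probability of visiting the unsafe set before reaching the target is at most p. It only evaluates that probability for a fixed policy by fixed-point iteration; it is not the paper's OFU-based LP.

```python
# Hypothetical toy model: every name and number here is an illustrative assumption.
TARGET = {3}       # terminal target set
UNSAFE = {2}       # unsafe set
LIVING = {0, 1}    # remaining ("living") states

# P[state][action] -> {next_state: probability}
P = {
    0: {"a": {1: 0.9, 2: 0.1}, "b": {0: 0.5, 1: 0.5}},
    1: {"a": {3: 0.8, 2: 0.2}, "b": {3: 0.6, 0: 0.4}},
}


def unsafe_hit_probability(policy, start, sweeps=1000):
    """Probability of visiting UNSAFE before TARGET under a stationary policy.

    policy[state] -> {action: probability}. Iterates
    h(s) = sum_a pi(a|s) * sum_s' P(s'|s,a) * h(s'),
    with h = 1 on UNSAFE and h = 0 on TARGET.
    """
    h = {s: 0.0 for s in LIVING}
    for _ in range(sweeps):
        new_h = {}
        for s in LIVING:
            total = 0.0
            for action, action_prob in policy[s].items():
                for nxt, trans_prob in P[s][action].items():
                    if nxt in UNSAFE:
                        total += action_prob * trans_prob
                    elif nxt in LIVING:
                        total += action_prob * trans_prob * h[nxt]
                    # Reaching TARGET contributes 0: the episode ended safely.
            new_h[s] = total
        h = new_h
    return h[start]


def is_p_safe(policy, start, p):
    """True if the policy's unsafe-hit probability from `start` is at most p."""
    return unsafe_hit_probability(policy, start) <= p


aggressive = {0: {"a": 1.0}, 1: {"a": 1.0}}  # always take the fast, risky action
cautious = {0: {"b": 1.0}, 1: {"b": 1.0}}    # always take the slow, safe action

print(unsafe_hit_probability(aggressive, 0))
print(is_p_safe(aggressive, 0, 0.3), is_p_safe(aggressive, 0, 0.2))
print(unsafe_hit_probability(cautious, 0))
```

In this toy model, always choosing "a" gives an unsafe-hit probability of about 0.28 from state 0, so that policy is p-safe for p = 0.3 but not for p = 0.2, while always choosing "b" never reaches the unsafe state.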

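Experimental Results

Research Questions

RQ1: Can an OFU-based reinforcement learning method guarantee safety with high probability during learning in stochastic reach-avoid CMDPs?
RQ2: What finite-sample regret guarantees does p-safe RL admit, and does entropy regularization improve regret and stability?
RQ3: Can introducing an auxiliary proxy set speed up learning without compromising safety?
RQ4: How does entropy regularization affect policy sparsity and the episode-to-episode variance of the regret?
RQ5: Can a provably safe baseline policy be constructed when no deterministically safe action is available?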

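Key Findings

Under the proposed framework, the pSRL algorithm achieves safety (p-safety) with high probability.
ER-pSRL, with entropy regularization, improves the cumulative regret bound and reduces episode-to-episode variance.
Compared with OFU-based pSRL, entropy regularization yields smoother policy updates and more exploration.
When the structure of the state space is known in advance, adding a proxy set speeds up learning and improves performance.
The analysis provides finite-sample regret bounds and establishes safety guarantees for the proposed algorithms.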
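
To give some intuition for why entropy regularization can smooth policy updates, here is a minimal single-state sketch. It is not the paper's occupation-measure LP: it relies on the standard fact that maximizing expected value plus temperature times entropy over a probability vector yields a softmax, and it compares that with a greedy (unregularized) choice. The value estimates, the temperature, and all function names are assumptions made for illustration.

```python
import math


def greedy_policy(values):
    """Unregularized choice: all probability mass on the highest-valued action."""
    best = max(range(len(values)), key=lambda i: values[i])
    return [1.0 if i == best else 0.0 for i in range(len(values))]


def softmax_policy(values, temperature):
    """Maximizer of sum_i p_i * v_i + temperature * entropy(p) over the simplex."""
    top = max(values)  # subtract the max for numerical stability
    weights = [math.exp((v - top) / temperature) for v in values]
    normalizer = sum(weights)
    return [w / normalizer for w in weights]


def total_variation(p, q):
    """Total variation distance between two action distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


# Hypothetical value estimates from two consecutive episodes; a small
# estimation error swaps which action looks best.
episode_1 = [1.00, 1.01]
episode_2 = [1.01, 1.00]

greedy_shift = total_variation(greedy_policy(episode_1), greedy_policy(episode_2))
soft_shift = total_variation(
    softmax_policy(episode_1, temperature=0.5),
    softmax_policy(episode_2, temperature=0.5),
)

print(greedy_shift, soft_shift)
```

With these numbers the greedy policy flips completely between episodes (total variation 1.0), while the entropy-regularized policy moves by only about 0.01. The regularized policy also keeps positive probability on both actions, which is the sense in which it is less sparse and explores more.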

This digest was generated by AI and reviewed by a human editor.