QUICK REVIEW

[Paper Review] Online Risk-Averse Planning in POMDPs Using Iterated CVaR Value Function

Yaacov Pariente, Vadim Indelman|arXiv (Cornell University)|Jan 28, 2026

Reinforcement Learning in Robotics|Cited by 0

ONE-SENTENCE SUMMARY

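The paper applies Iterated CVaR (ICVaR) to online risk-averse planning in POMDPs, gives finite-time guarantees for policy evaluation and Sparse Sampling, and extends POMCPOW and PFT-DPW to optimize ICVaR; experiments show lower tail risk than the risk-neutral baselines.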

ABSTRACT

We study risk-sensitive planning under partial observability using the dynamic risk measure Iterated Conditional Value-at-Risk (ICVaR). A policy evaluation algorithm for ICVaR is developed with finite-time performance guarantees that do not depend on the cardinality of the action space. Building on this foundation, three widely used online planning algorithms--Sparse Sampling, Particle Filter Trees with Double Progressive Widening (PFT-DPW), and Partially Observable Monte Carlo Planning with Observation Widening (POMCPOW)--are extended to optimize the ICVaR value function rather than the expectation of the return. Our formulations introduce a risk parameter $\alpha$, where $\alpha = 1$ recovers standard expectation-based planning and $\alpha < 1$ induces increasing risk aversion. For ICVaR Sparse Sampling, we establish finite-time performance guarantees under the risk-sensitive objective, which further enable a novel exploration strategy tailored to ICVaR. Experiments on benchmark POMDP domains demonstrate that the proposed ICVaR planners achieve lower tail risk compared to their risk-neutral counterparts.
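
To make the role of the risk parameter concrete, here is a minimal sketch of an empirical CVaR estimate over sampled costs. It assumes a cost convention (higher is worse) and a plain tail-average estimator; the name `empirical_cvar` is hypothetical, and the paper's own estimator may differ.

```python
import math

def empirical_cvar(costs, alpha):
    """Average of the worst (largest) alpha-fraction of the sampled costs.

    Assumption: costs, so "worst" means largest. alpha = 1 averages every
    sample (the ordinary mean); smaller alpha looks only at the bad tail.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    ordered = sorted(costs, reverse=True)  # worst outcomes first
    # Number of tail samples; the small epsilon guards against float round-up.
    k = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    return sum(ordered[:k]) / k

costs = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
mean_cost = empirical_cvar(costs, 1.0)   # all ten samples
tail_cost = empirical_cvar(costs, 0.2)   # only the two largest samples
```

With these ten samples, `alpha = 1.0` returns the sample mean, while `alpha = 0.2` averages only the two largest costs, so the estimate is driven entirely by the bad tail.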

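RESEARCH MOTIVATION AND GOALS

Motivate the need for risk-averse planning under partial observability as a way to improve safety and robustness.
Introduce ICVaR as a dynamic risk measure for the POMDP value function.
Develop policy evaluation and online planning algorithms that optimize ICVaR rather than the expected return.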
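
For orientation, the block below writes out one common way to define CVaR and a nested (iterated) value recursion built on it. The notation is an assumption made for illustration: a cost $c(b,a)$ on belief $b$ and action $a$, a discount $\gamma$, and a finite horizon $T$. The paper may use different symbols or a reward convention.

$$
\mathrm{CVaR}_{\alpha}(X) \;=\; \inf_{t \in \mathbb{R}} \left\{ t + \frac{1}{\alpha}\, \mathbb{E}\big[(X - t)^{+}\big] \right\}, \qquad \alpha \in (0, 1]
$$

$$
Q^{\pi}_{t}(b, a) \;=\; c(b, a) + \gamma\, \mathrm{CVaR}_{\alpha}\big[\, V^{\pi}_{t+1}(b') \;\big|\; b, a \,\big], \qquad V^{\pi}_{t}(b) \;=\; Q^{\pi}_{t}\big(b, \pi(b)\big), \qquad V^{\pi}_{T}(b) \;=\; 0
$$

Under this form, setting $\alpha = 1$ reduces $\mathrm{CVaR}_{\alpha}$ to the expectation, so the recursion becomes the usual expected-cost Bellman recursion. Applying the risk measure at every step, rather than once to the total return, is what makes the measure "iterated".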

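PROPOSED METHOD

Define the ICVaR action-value and value functions for the PB-MDP derived from a POMDP.
Develop an ICVaR policy evaluation algorithm with finite-time performance guarantees (Algorithm 1).
Extend Sparse Sampling to ICVaR Sparse Sampling for risk-sensitive planning (Algorithm 2).
Extend the MCTS-based planners POMCPOW and PFT-DPW to optimize ICVaR (Algorithms 5 and 4, respectively).
Propose an exploration strategy based on the ICVaR guarantees (ICVaR Progressive Widening).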
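
The Sparse Sampling extension can be illustrated with a toy sketch. This is not the paper's Algorithm 2: it assumes a cost convention, applies a simple tail average to the sampled one-step cost plus discounted cost-to-go, and runs on a hypothetical one-state problem (`toy_step`) instead of particle beliefs. All function names are illustrative.

```python
import math
import random

def tail_mean(values, alpha):
    """Average of the worst (largest) alpha-fraction of sampled costs."""
    ordered = sorted(values, reverse=True)
    k = max(1, math.ceil(alpha * len(ordered) - 1e-9))
    return sum(ordered[:k]) / k

def icvar_sparse_sampling(state, depth, actions, step, alpha, width, gamma, rng):
    """Return (value, action) minimising the sampled nested-CVaR cost-to-go."""
    if depth == 0:
        return 0.0, None
    best_value, best_action = math.inf, None
    for action in actions:
        samples = []
        for _ in range(width):  # `width` generative samples per action
            next_state, cost = step(state, action, rng)
            future, _ = icvar_sparse_sampling(
                next_state, depth - 1, actions, step, alpha, width, gamma, rng)
            samples.append(cost + gamma * future)
        q = tail_mean(samples, alpha)  # alpha = 1 gives the plain sample mean
        if q < best_value:
            best_value, best_action = q, action
    return best_value, best_action

def toy_step(state, action, rng):
    """Hypothetical problem: 'safe' always costs 1; 'risky' costs 5 w.p. 0.1, else 0."""
    if action == "safe":
        return state, 1.0
    return state, (5.0 if rng.random() < 0.1 else 0.0)

def plan(alpha, seed=0):
    rng = random.Random(seed)
    return icvar_sparse_sampling(
        0, 1, ["safe", "risky"], toy_step, alpha, 400, 0.95, rng)

neutral_value, neutral_action = plan(alpha=1.0)
averse_value, averse_action = plan(alpha=0.1)
```

In this toy problem the risky action has the lower expected cost (0.5 versus 1), so the planner with `alpha = 1.0` prefers it, while the planner with `alpha = 0.1` looks only at the worst tenth of the samples and prefers the safe action. In a POMDP setting the `state` argument would be a belief and `step` a belief-transition sampler.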

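EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: How can ICVaR be incorporated into online planning for POMDPs?
RQ2: What finite-time guarantees can be established for ICVaR policy evaluation and planning in PB-MDPs?
RQ3: On benchmark POMDPs, does ICVaR-based planning reduce tail risk relative to risk-neutral planning?
RQ4: How should the exploration strategy be adapted when optimizing ICVaR rather than the expected return?
RQ5: What practical benefits does tail-risk reduction bring across different POMDP domains?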

KEY FINDINGS

Environment	POMCPOW	ICVaR-POMCPOW	PFT-DPW	ICVaR-PFT-DPW
LaserTag	15.06±0.40	12.47±0.46	26.04±0.91	16.33±0.61
LightDark	25.73±0.96	16.72±0.08	37.68±1.68	18.52±0.23

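The ICVaR planners achieve lower tail risk than their risk-neutral counterparts on benchmark POMDP domains.
ICVaR policy evaluation and ICVaR Sparse Sampling both come with finite-time performance guarantees.
The ICVaR-based MCTS planners (ICVaR-POMCPOW and ICVaR-PFT-DPW) show tail-risk improvements in the experiments.
In LaserTag, the tail metric in the table above drops from 15.06 to 12.47 for POMCPOW and from 26.04 to 16.33 for PFT-DPW; in LightDark, it drops from 25.73 to 16.72 and from 37.68 to 18.52.
An exploration strategy tailored to the ICVaR objective is introduced, replacing standard Hoeffding-based exploration.
Overall, the experiments indicate that ICVaR planning outperforms the risk-neutral baselines on tail risk in the benchmarks tested.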
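
As a quick arithmetic check on the table, the sketch below computes the relative reduction of each ICVaR planner against its risk-neutral counterpart. It assumes the tabulated metric is a tail-risk measure where lower is better, and it uses only the mean values, ignoring the ± terms.

```python
# Means copied from the table above: (risk-neutral planner, ICVaR variant).
# Assumption: lower is better for this metric.
results = {
    ("LaserTag", "POMCPOW"): (15.06, 12.47),
    ("LaserTag", "PFT-DPW"): (26.04, 16.33),
    ("LightDark", "POMCPOW"): (25.73, 16.72),
    ("LightDark", "PFT-DPW"): (37.68, 18.52),
}

def relative_reduction(baseline, icvar):
    """Fraction by which the ICVaR value is lower than the baseline value."""
    return (baseline - icvar) / baseline

reductions = {key: relative_reduction(*pair) for key, pair in results.items()}

for (env, planner), r in reductions.items():
    print(f"{env:9s} {planner:8s} {100 * r:.1f}% lower")
```

Under that assumption, the reductions come to roughly 17% and 37% on LaserTag and roughly 35% and 51% on LightDark, for POMCPOW and PFT-DPW respectively.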


This review was generated by AI and reviewed by human editors.