QUICK REVIEW

[Paper Review] Distribution Estimation in Discounted MDPs via a Transformation.

Shuai Ma, Jia Yuan Yu | arXiv (Cornell University) | Apr 16, 2018

Formal Methods in Verification | References: 24 | Cited by: 2

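One-Sentence Summary

This paper proposes a transformation for discounted MDPs that converts transition-based reward functions into state-based reward functions while preserving the distribution of the cumulative discounted reward. Because the distribution is preserved, risk-sensitive objectives such as Value-at-Risk (VaR) are evaluated correctly even when rewards depend on state transitions; the distribution itself is then estimated under the assumption that it is approximately normal.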

ABSTRACT

Although the general deterministic reward function in MDPs takes three arguments - current state, action, and next state; it is often simplified to a function of two arguments - current state and action. The former is called a transition-based reward function, whereas the latter is called a state-based reward function. When the objective is a function of the expected cumulative reward only, this simplification works perfectly. However, when the objective is risk-sensitive - e.g., depends on the reward distribution, this simplification leads to incorrect values of the objective. This paper studies the distribution estimation of the cumulative discounted reward in infinite-horizon MDPs with finite state and action spaces. First, by taking the Value-at-Risk (VaR) objective as an example, we illustrate and analyze the error from the above simplification on the reward distribution. Next, we propose a transformation for MDPs to preserve the reward distribution and convert transition-based reward functions to deterministic state-based reward functions. This transformation works whether the transition-based reward function is deterministic or stochastic. Lastly, we show how to estimate the reward distribution after applying the proposed transformation in different settings, provided that the distribution is approximately normal.

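Motivation and Objectives

Address the error introduced in risk-sensitive MDPs when transition-based rewards (state, action, next state) are simplified to state-based rewards (state, action).
Preserve the true distribution of the cumulative discounted reward while working with a state-based reward function.
Enable accurate estimation of the reward distribution after the transformation, in particular for risk-sensitive objectives such as Value-at-Risk (VaR).
Develop a general transformation that applies to both deterministic and stochastic transition-based reward functions.
Provide a framework for estimating the reward distribution after the transformation, under the assumption that the distribution is approximately normal.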
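The error described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two-state chain, the reward table `R`, the discount factor, and the helper `simulate_returns` are all hypothetical and do not come from the paper. The simplified reward replaces each transition reward with its expectation over the next state, which keeps the mean of the discounted return but can change its variance.

```python
import random
from statistics import mean, pvariance

def simulate_returns(P, reward, gamma, horizon, episodes, seed=0):
    """Sample truncated discounted returns of a Markov chain started in state 0."""
    rng = random.Random(seed)
    n = len(P)
    returns = []
    for _ in range(episodes):
        x, g, disc = 0, 0.0, 1.0
        for _ in range(horizon):
            y = rng.choices(range(n), weights=P[x])[0]  # sample next state
            g += disc * reward(x, y)
            disc *= gamma
            x = y
        returns.append(g)
    return returns

# Hypothetical chain: the next state is a fair coin flip from either state.
P = [[0.5, 0.5], [0.5, 0.5]]
# Hypothetical transition-based reward R[x][y].
R = [[1.0, -1.0], [1.0, -1.0]]
gamma = 0.9

def r_transition(x, y):
    return R[x][y]

def r_state(x, y):
    # Simplified reward: expectation over the next state, so y is ignored.
    return sum(P[x][z] * R[x][z] for z in range(len(P)))

returns_transition = simulate_returns(P, r_transition, gamma, horizon=60, episodes=2000)
returns_state = simulate_returns(P, r_state, gamma, horizon=60, episodes=2000)

print("mean:    ", mean(returns_transition), mean(returns_state))
print("variance:", pvariance(returns_transition), pvariance(returns_state))
```

In this toy chain the simplified reward is identically zero, so the simplified return has no variance at all, while the transition-based return has a variance close to 1 / (1 - 0.9^2). Any quantile-based objective computed from the simplified model would therefore be wrong even though the means agree.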

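Proposed Method

Propose a transformation that maps the original MDP with transition-based rewards to an equivalent MDP with state-based rewards, while preserving the distribution of the cumulative discounted reward.
Define the transformation through state augmentation: transition information is encoded into the state space so that the reward distribution is preserved.
Apply the transformation to both deterministic and stochastic transition-based reward functions, demonstrating its generality.
Estimate the distribution of the cumulative discounted reward in the transformed MDP by the method of moments, assuming the distribution is approximately normal.
Use the structure of the transformed MDP to compute risk-sensitive measures such as Value-at-Risk (VaR) with standard dynamic programming or learning techniques.
Verify correctness by proving that the distribution of the cumulative reward in the transformed MDP matches that in the original MDP.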
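The state-augmentation idea can be sketched for the simpler case of a Markov chain under a fixed policy. This is one common construction and not necessarily the paper's exact transformation: each augmented state is a pair (x, y) meaning "the chain is moving from x to y", and the transition reward R[x][y] becomes a deterministic reward attached to that pair. The function names `augment_chain` and `expected_return`, and the example numbers, are assumptions made for illustration.

```python
def augment_chain(P, R, mu):
    """Turn a chain with transition rewards R[x][y] into one with state rewards.

    Augmented states are pairs (x, y) with P[x][y] > 0.
    Returns the pair list, augmented transition matrix, rewards, and initial law.
    """
    n = len(P)
    pairs = [(x, y) for x in range(n) for y in range(n) if P[x][y] > 0]
    index = {pair: i for i, pair in enumerate(pairs)}
    P_aug = [[0.0] * len(pairs) for _ in pairs]
    for (x, y), i in index.items():
        for z in range(n):
            if P[y][z] > 0:
                # From "moving x -> y" the chain next moves y -> z.
                P_aug[i][index[(y, z)]] = P[y][z]
    r_aug = [R[x][y] for (x, y) in pairs]          # reward depends on the state only
    mu_aug = [mu[x] * P[x][y] for (x, y) in pairs]  # law of the first transition
    return pairs, P_aug, r_aug, mu_aug

def expected_return(P, r, mu, gamma, iters=500):
    """Expected discounted return with state rewards r, by fixed-point iteration."""
    m = len(r)
    v = [0.0] * m
    for _ in range(iters):
        v = [r[i] + gamma * sum(P[i][j] * v[j] for j in range(m)) for i in range(m)]
    return sum(w * vi for w, vi in zip(mu, v))

# Hypothetical two-state example.
P = [[0.7, 0.3], [0.4, 0.6]]
R = [[1.0, 5.0], [0.0, 2.0]]
mu = [1.0, 0.0]
gamma = 0.9

pairs, P_aug, r_aug, mu_aug = augment_chain(P, R, mu)

# Sanity check on the mean: expected transition reward per state vs. augmented chain.
r_bar = [sum(P[x][y] * R[x][y] for y in range(len(P))) for x in range(len(P))]
v_orig = expected_return(P, r_bar, mu, gamma)
v_aug = expected_return(P_aug, r_aug, mu_aug, gamma)
print(v_orig, v_aug)
```

The mean check is only a sanity test. The stronger property in this sketch is that the augmented path (X_t, X_{t+1}) is a function of the original path and collects the same reward at every step, so the whole return distribution carries over, not just its expectation.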

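Experimental Results

Research Questions

RQ1: How does simplifying transition-based rewards to state-based rewards distort the distribution of the cumulative discounted reward in risk-sensitive MDPs?
RQ2: Can a transformation be designed that converts transition-based rewards to state-based rewards without changing the distribution of the cumulative reward?
RQ3: Does the proposed transformation preserve the reward distribution for both deterministic and stochastic transition-based reward functions?
RQ4: After applying the transformation, how can the reward distribution be estimated, in particular under the assumption that it is approximately normal?
RQ5: How does the transformation affect the accuracy of estimating risk-sensitive objectives such as Value-at-Risk (VaR) in infinite-horizon MDPs?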

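Key Findings

Simplifying transition-based rewards to state-based rewards leaves the expected cumulative reward unchanged but introduces error in the estimated reward distribution, which leads to incorrect values for risk-sensitive objectives.
The proposed transformation preserves the distribution of the cumulative discounted reward when converting a transition-based reward function to a state-based one.
The transformation works for both deterministic and stochastic transition-based reward functions.
After the transformation, the reward distribution can be estimated accurately provided that it is approximately normal, which supports reliable risk-sensitive analysis.
Because the transformed MDP retains the true distribution of the cumulative reward, Value-at-Risk (VaR) and similar risk measures are estimated correctly.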
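Under the normality assumption, a risk measure such as VaR follows directly from the mean and variance of the return. The sketch below assumes VaR is defined as the alpha-quantile of the cumulative discounted reward (definitions vary between sources) and fits the normal distribution from sample moments; the helper names `fit_normal` and `var_normal` are hypothetical, and the paper may obtain the moments by other means.

```python
from statistics import NormalDist, mean, pvariance

def fit_normal(samples):
    """Method-of-moments fit: match the sample mean and variance."""
    return mean(samples), pvariance(samples)

def var_normal(mu, variance, alpha):
    """alpha-quantile of N(mu, variance), used here as the VaR estimate.

    Assumes variance > 0 and 0 < alpha < 1.
    """
    return NormalDist(mu=mu, sigma=variance ** 0.5).inv_cdf(alpha)

# Hypothetical moments of the discounted return.
estimate = var_normal(10.0, 4.0, 0.05)
print(round(estimate, 3))  # about 6.71: the 5% quantile lies 1.645 standard deviations below the mean
```

The quality of this estimate depends entirely on how close the true return distribution is to normal; for heavily skewed returns the quantile can be far off.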

This review was generated by AI and reviewed by a human editor.