QUICK REVIEW

[Paper Review] If MaxEnt RL is the Answer, What is the Question?

Benjamin Eysenbach, Sergey Levine|arXiv (Cornell University)|Oct 4, 2019

Reinforcement Learning in Robotics | References: 60 | Cited by: 32

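ONE-SENTENCE SUMMARY

This paper shows that maximum entropy reinforcement learning (MaxEnt RL) optimally solves control problems in which the reward is uncertain, such as meta-POMDPs and adversarial-reward settings. It proves that MaxEnt RL is equivalent both to regret minimization in a meta-POMDP and to robust-reward control, which helps explain its empirical success in stochastic, uncertain environments.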

ABSTRACT

Experimentally, it has been observed that humans and animals often make decisions that do not maximize their expected utility, but rather choose outcomes randomly, with probability proportional to expected utility. Probability matching, as this strategy is called, is equivalent to maximum entropy reinforcement learning (MaxEnt RL). However, MaxEnt RL does not optimize expected utility. In this paper, we formally show that MaxEnt RL does optimally solve certain classes of control problems with variability in the reward function. In particular, we show (1) that MaxEnt RL can be used to solve a certain class of POMDPs, and (2) that MaxEnt RL is equivalent to a two-player game where an adversary chooses the reward function. These results suggest a deeper connection between MaxEnt RL, robust control, and POMDPs, and provide insight for the types of problems for which we might expect MaxEnt RL to produce effective solutions. Specifically, our results suggest that domains with uncertainty in the task goal may be especially well-suited for MaxEnt RL methods.
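
The link between probability matching and the MaxEnt objective can be checked numerically in the simplest case. The sketch below is a minimal illustration, not code from the paper: it assumes a one-step (bandit) problem, a temperature of 1, and a reward defined as the log of a hypothetical utility vector; all function and variable names are invented for this example. Under those assumptions, the policy that maximizes expected reward plus entropy is a softmax over rewards, which is proportional to the utilities themselves.

```python
import numpy as np

def maxent_bandit_policy(rewards, temperature=1.0):
    """Maximizer of E_pi[r] + temperature * H(pi) in a one-step problem (a softmax)."""
    z = np.asarray(rewards, dtype=float) / temperature
    z = z - z.max()  # shift for numerical stability; does not change the softmax
    weights = np.exp(z)
    return weights / weights.sum()

def maxent_objective(policy, rewards, temperature=1.0):
    """Expected reward plus temperature-weighted entropy of the policy."""
    policy = np.asarray(policy, dtype=float)
    rewards = np.asarray(rewards, dtype=float)
    nonzero = policy > 0
    entropy = -np.sum(policy[nonzero] * np.log(policy[nonzero]))
    return float(policy @ rewards + temperature * entropy)

# Hypothetical utilities (already normalized), with reward = log utility (assumption).
utilities = np.array([0.5, 0.3, 0.2])
rewards = np.log(utilities)

policy = maxent_bandit_policy(rewards)        # matches `utilities`
best_value = maxent_objective(policy, rewards)

# Compare against randomly sampled policies: none should score higher.
rng = np.random.default_rng(0)
random_values = [maxent_objective(rng.dirichlet(np.ones(3)), rewards) for _ in range(1000)]
```

A deterministic policy that always picks the highest-utility action maximizes expected reward but has zero entropy, so it scores lower on this objective than the probability-matching policy.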

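MOTIVATION AND OBJECTIVES

Identify the underlying control problems for which MaxEnt RL is the optimal solution.
Explain why MaxEnt RL performs well in practice even though it optimizes a different objective from standard reinforcement learning.
Formalize settings with reward variability in which MaxEnt RL yields the optimal stochastic policy.
Establish connections between MaxEnt RL, robust control, and partially observable decision problems.
Show that MaxEnt RL arises naturally in problems involving reward uncertainty, such as adversarial and meta-learning settings.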

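PROPOSED METHOD

Formalizes MaxEnt RL as minimizing expected regret in a meta-POMDP, where the reward function is unobserved and varies across episodes.
Models the adversary's choice of reward function as a distribution over MDPs; MaxEnt RL solves the resulting robust control problem.
Uses the maximum entropy principle to derive a unique optimal policy under uncertainty, ensuring robustness to worst-case realizations of the reward.
Applies variational inference and marginal distribution matching to show the equivalence between MaxEnt RL and a mixture of policies over trajectories.
Uses convex duality and the KKT conditions to show that MaxEnt RL solves a regularized reinforcement learning problem equivalent to robust-reward control.
Proves that MaxEnt RL reduces to solving a standard reinforcement learning problem with a convex combination of reward functions, whose optimal policy is guaranteed to be unique by entropy regularization.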
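
The meta-POMDP view can be illustrated with a toy search problem. This is a simplified sketch under stated assumptions, not necessarily the paper's exact construction: a hidden target is drawn from a hypothetical distribution `target_dist`, a memoryless agent samples candidates independently from its policy until it hits the target, and the cost is the expected number of attempts, sum_i p_i / pi_i. Minimizing that cost gives a policy proportional to sqrt(p), which is the MaxEnt (softmax) policy for the reward 0.5 * log p. All names below are invented for the example.

```python
import numpy as np

def expected_attempts(policy, target_dist):
    """Expected number of i.i.d. draws from `policy` needed to hit a target ~ `target_dist`.

    For a fixed target i the number of draws is geometric with mean 1 / policy[i].
    """
    policy = np.asarray(policy, dtype=float)
    target_dist = np.asarray(target_dist, dtype=float)
    return float(np.sum(target_dist / policy))

def maxent_search_policy(target_dist):
    """Softmax policy for the assumed reward 0.5 * log p, i.e. proportional to sqrt(p)."""
    rewards = 0.5 * np.log(np.asarray(target_dist, dtype=float))
    weights = np.exp(rewards - rewards.max())
    return weights / weights.sum()

# Hypothetical distribution over three possible targets.
target_dist = np.array([0.7, 0.2, 0.1])

maxent_policy = maxent_search_policy(target_dist)
cost_maxent = expected_attempts(maxent_policy, target_dist)

# Baselines: exact probability matching, and a nearly deterministic policy.
cost_matching = expected_attempts(target_dist, target_dist)   # equals len(target_dist)
cost_greedy = expected_attempts(np.array([0.98, 0.01, 0.01]), target_dist)

# Randomly sampled policies should not beat the MaxEnt policy.
rng = np.random.default_rng(0)
random_costs = [expected_attempts(rng.dirichlet(np.ones(3)), target_dist) for _ in range(1000)]
```

In this toy model the nearly deterministic policy is by far the worst of the three, because it almost never tries the less likely targets. That is the intuition behind preferring a stochastic policy when the task goal is unobserved.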

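EXPERIMENTAL RESULTS

Research Questions

RQ1: Given that MaxEnt RL does not maximize expected utility, which control problems does it solve optimally?
RQ2: Under what conditions is MaxEnt RL the optimal strategy when the reward function is uncertain?
RQ3: How does MaxEnt RL relate to robust control and partially observable decision problems?
RQ4: Why does MaxEnt RL outperform standard reinforcement learning in practice despite optimizing a different objective?
RQ5: Beyond entropy maximization, can it be formally proven that MaxEnt RL solves a well-defined problem?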

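Key Findings

MaxEnt RL is equivalent to regret minimization in a meta-POMDP, where the reward function is unobserved and varies across episodes.
MaxEnt RL provides the optimal solution to robust-reward control, in which an adversary chooses the reward function from a set of possibilities.
Because of entropy regularization, the optimal policy under MaxEnt RL is unique, which ensures robustness and prevents degenerate solutions.
MaxEnt RL reduces to solving a standard reinforcement learning problem with a convex combination of reward functions, and its optimal policy is the unique solution.
The approach applies to general robust control problems, covering not only reward variability but also joint uncertainty in dynamics and reward.
The theoretical framework explains the empirical success of MaxEnt RL in real-world and simulated control tasks, especially in uncertain or adversarial environments.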
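
The adversarial-reward reading can also be checked in a one-step problem. The sketch below relies on a standard identity (Gibbs' inequality): the entropy of a policy equals the minimum over distributions q of the cross-entropy E_pi[-log q]. It assumes, for illustration only, that the adversary may pay any reward of the form r(a) - log q(a); the reward vector and all names are hypothetical, and this is not presented as the paper's exact adversary set. Under that assumption the MaxEnt objective equals the worst-case expected return.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    p = np.asarray(p, dtype=float)
    nonzero = p > 0
    return float(-np.sum(p[nonzero] * np.log(p[nonzero])))

def maxent_return(policy, rewards):
    """Expected reward plus policy entropy (temperature 1)."""
    return float(np.dot(policy, rewards)) + entropy(policy)

def adversarial_return(policy, rewards, q):
    """Expected return when the adversary pays r(a) - log q(a) for a distribution q."""
    return float(np.dot(policy, rewards - np.log(q)))

# Hypothetical nominal rewards for three actions.
rewards = np.array([1.0, 0.0, -0.5])
policy = np.exp(rewards) / np.exp(rewards).sum()   # softmax policy

# The adversary's best response is q = policy, which recovers the MaxEnt objective.
worst_case = adversarial_return(policy, rewards, q=policy)

# Any other q gives the agent at least as much return.
rng = np.random.default_rng(0)
sampled = [adversarial_return(policy, rewards, rng.dirichlet(np.ones(3))) for _ in range(1000)]
```

Here `worst_case` coincides with `maxent_return(policy, rewards)`, so maximizing the MaxEnt objective is the same as maximizing return against the most unfavorable reward in the assumed set.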

This review was generated by AI and checked by human editors.