QUICK REVIEW

[Paper Review] Tsallis Reinforcement Learning: A Unified Framework for Maximum Entropy Reinforcement Learning

Kyungjae Lee, Sungyub Kim|arXiv (Cornell University)|Jan 31, 2019

stochastic dynamics and bifurcation | References: 27 | Cited by: 18

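One-Sentence Summary

This paper proposes Tsallis reinforcement learning, a framework that uses Tsallis entropy with a tunable entropic index q to unify and generalize maximum entropy reinforcement learning. By controlling q, the method balances exploration and exploitation in a model-free actor-critic algorithm, achieves state-of-the-art performance on MuJoCo environments, and comes with theoretical convergence guarantees.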

ABSTRACT

In this paper, we present a new class of Markov decision processes (MDPs), called Tsallis MDPs, with Tsallis entropy maximization, which generalizes existing maximum entropy reinforcement learning (RL). A Tsallis MDP provides a unified framework for the original RL problem and RL with various types of entropy, including the well-known standard Shannon-Gibbs (SG) entropy, using an additional real-valued parameter, called an entropic index. By controlling the entropic index, we can generate various types of entropy, including the SG entropy, and a different entropy results in a different class of the optimal policy in Tsallis MDPs. We also provide a full mathematical analysis of Tsallis MDPs, including the optimality condition, performance error bounds, and convergence. Our theoretical result enables us to use any positive entropic index in RL. To handle complex and large-scale problems, we propose a model-free actor-critic RL method using Tsallis entropy maximization. We evaluate the regularization effect of the Tsallis entropy with various values of entropic indices and show that the entropic index controls the exploration tendency of the proposed method. For a different type of RL problems, we find that a different value of the entropic index is desirable. The proposed method is evaluated using the MuJoCo simulator and achieves the state-of-the-art performance.

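Motivation and Objectives

Unify the various forms of entropy regularization used in reinforcement learning under a single framework.
Analyze the theoretical properties of Tsallis MDPs, including the optimality condition, performance error bounds, and convergence for any positive entropic index.
Develop a model-free actor-critic algorithm based on Tsallis entropy for large-scale continuous control problems.
Verify experimentally that the entropic index q controls exploration behavior, and improve sample efficiency.
Show that different values of q are best for different reinforcement learning tasks, which supports task-specific hyperparameter tuning.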

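Proposed Method

Introduces the Tsallis MDP, a new class of Markov decision process that adds Tsallis entropy maximization through a tunable entropic index q.
Derives the Tsallis Bellman optimality equation and establishes optimality and convergence of Tsallis policy iteration and value iteration for all positive q.
Develops the Tsallis actor-critic (TAC) algorithm for continuous action spaces, using reparameterized gradients and policy gradient updates based on the q-logarithm.
Implements a numerical stabilization technique that caps the policy density at 10^(8/(q-1)) to prevent exploding gradients, which matters most when q ≥ 2.
Uses an experience replay buffer together with target networks updated at a soft update rate τ to keep training stable.
Applies a tanh squashing function for bounded continuous actions and computes the q-log-likelihood used in the policy gradient.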
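
The q-logarithm, the Tsallis entropy built from it, and the density cap described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the authors' implementation: it assumes the common definition ln_q(x) = (x^(q-1) - 1)/(q-1) with ln_1(x) = log(x), and the function names, the 1-D Gaussian, and the small epsilon in the tanh correction are all hypothetical choices made for the example.

```python
import numpy as np


def q_log(x, q):
    """q-logarithm: (x**(q-1) - 1) / (q - 1), reducing to log(x) at q = 1."""
    x = np.asarray(x, dtype=np.float64)
    if np.isclose(q, 1.0):
        return np.log(x)
    return (x ** (q - 1.0) - 1.0) / (q - 1.0)


def tsallis_entropy(p, q):
    """Tsallis entropy of a discrete distribution: S_q(p) = E_p[-ln_q(p)]."""
    p = np.asarray(p, dtype=np.float64)
    p = p[p > 0]  # zero-probability actions contribute nothing
    return float(np.sum(p * -q_log(p, q)))


def clipped_q_log_density(density, q, exponent=8.0):
    """q-log of a density after capping it at 10**(exponent / (q - 1)).

    With this cap, density**(q - 1) never exceeds 10**exponent, so the
    q-log term stays bounded. Only meaningful for q > 1.
    """
    if q <= 1.0:
        raise ValueError("the cap 10**(exponent / (q - 1)) requires q > 1")
    cap = 10.0 ** (exponent / (q - 1.0))
    return q_log(np.minimum(density, cap), q)


def tanh_gaussian_density(u, mean, std, eps=1e-6):
    """Density of a = tanh(u) for u ~ N(mean, std**2), evaluated via u.

    Change of variables: divide the Gaussian density by |d tanh(u)/du|.
    eps is an assumed guard against division by zero near |a| = 1.
    """
    gauss = np.exp(-0.5 * ((u - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))
    return gauss / (1.0 - np.tanh(u) ** 2 + eps)
```

For a uniform distribution over four actions, `tsallis_entropy` gives log(4) at q = 1 (the Shannon-Gibbs case) and 1 - 4 * 0.25^2 = 0.75 at q = 2. In an actor-critic update, the value of `tanh_gaussian_density` would be passed through `clipped_q_log_density` before it enters the policy loss.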

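Experimental Results

Research Questions

RQ1: Can Tsallis entropy with a tunable entropic index q unify the various forms of entropy regularization in reinforcement learning, including Shannon-Gibbs entropy and sparse Tsallis entropy?
RQ2: How does the value of the entropic index q affect the exploration-exploitation trade-off in policy optimization?
RQ3: Does the proposed Tsallis MDP framework retain its theoretical convergence and optimality guarantees for all positive q?
RQ4: Can the Tsallis actor-critic method achieve state-of-the-art performance on continuous control benchmarks such as MuJoCo?
RQ5: Are there task-specific optimal values of q that improve sample efficiency and final performance?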

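Key Findings

The entropic index q controls exploration behavior. In this framework q = 1 recovers Shannon-Gibbs entropy and q = 2 recovers sparse Tsallis entropy, so values near 1 (e.g., 1.2) keep the policy broad and exploratory, while larger values (e.g., 2.0) push it toward sparse, greedy behavior.
On Hopper-v2 and Swimmer-v2, q = 2.0 performs slightly better than other values, suggesting that a sparser, more exploitative policy helps on these tasks.
On HalfCheetah-v2 and Ant-v2, q = 1.5 gives the best performance, suggesting that a balanced exploration-exploitation trade-off suits these locomotion tasks.
On Pusher-v2 and Humanoid-v2, q = 1.2 performs best, suggesting that broader exploration is more effective in complex manipulation and high-dimensional control tasks.
The Tsallis actor-critic method achieves state-of-the-art performance on all tested MuJoCo environments, outperforming standard SAC and other entropy-regularized baselines.
Capping the policy density at 10^(8/(q-1)) prevents exploding gradients, most notably when q ≥ 2, and keeps training stable.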
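
To make the effect of q concrete, the two best-known special cases can be compared on a discrete action set: with Shannon-Gibbs entropy (q = 1) the entropy-regularized optimal policy is a softmax over action values, and with sparse Tsallis entropy (q = 2) it is a sparsemax. The sketch below assumes the regularizer alpha * S_q with S_2(pi) = 1 - sum(pi^2), which makes the q = 2 policy sparsemax(Q / (2 * alpha)); other scaling conventions only change the temperature. The function names and example values are hypothetical.

```python
import numpy as np


def softmax_policy(q_values, alpha=1.0):
    """Optimal policy under Shannon-Gibbs entropy (q = 1): softmax(Q / alpha)."""
    z = np.asarray(q_values, dtype=np.float64) / alpha
    z = z - z.max()  # shift for numerical stability; does not change the result
    e = np.exp(z)
    return e / e.sum()


def sparsemax_policy(q_values, alpha=1.0):
    """Optimal policy under sparse Tsallis entropy (q = 2).

    Maximizes <pi, Q> + alpha * (1 - sum(pi**2)) over the simplex,
    which equals the Euclidean projection of Q / (2 * alpha) onto the simplex.
    """
    z = np.asarray(q_values, dtype=np.float64) / (2.0 * alpha)
    z_sorted = np.sort(z)[::-1]            # action scores in descending order
    cumsum = np.cumsum(z_sorted)
    k = np.arange(1, z.size + 1)
    support = 1.0 + k * z_sorted > cumsum  # which sorted entries stay in the support
    k_max = k[support][-1]                 # size of the support set
    tau = (cumsum[k_max - 1] - 1.0) / k_max  # threshold subtracted from every score
    return np.maximum(z - tau, 0.0)


action_values = [1.0, 0.5, -1.0]
dense = softmax_policy(action_values, alpha=1.0)     # every action keeps some mass
sparse = sparsemax_policy(action_values, alpha=0.5)  # -> [0.75, 0.25, 0.0]
```

The softmax policy assigns nonzero probability to every action, while the sparsemax policy drops the worst action entirely, which is the sense in which q = 2 explores less than q = 1.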

This review was generated by AI and reviewed by a human editor.