QUICK REVIEW

[Paper Review] Sparse Q-learning with Mirror Descent

Sridhar Mahadevan, Bo Liu|arXiv (Cornell University)|Oct 16, 2012

Model Reduction and Neural Networks | References: 31 | Cited by: 21

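One-Sentence Summary

This paper proposes a novel sparse Q-learning algorithm based on mirror descent, a proximal optimization method built on Bregman divergences, to solve high-dimensional reinforcement learning problems efficiently. By combining l1 regularization with Bregman divergences such as p-norm functions and the Mahalanobis distance, the method obtains sparse value-function representations at significantly lower computational cost than previous second-order methods.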

ABSTRACT

This paper explores a new framework for reinforcement learning based on online convex optimization, in particular mirror descent and related algorithms. Mirror descent can be viewed as an enhanced gradient method, particularly suited to minimization of convex functions in high-dimensional spaces. Unlike traditional gradient methods, mirror descent undertakes gradient updates of weights in both the dual space and primal space, which are linked together using a Legendre transform. Mirror descent can be viewed as a proximal algorithm where the distance generating function used is a Bregman divergence. A new class of proximal-gradient based temporal-difference (TD) methods are presented based on different Bregman divergences, which are more powerful than regular TD learning. Examples of Bregman divergences that are studied include p-norm functions, and Mahalanobis distance based on the covariance of sample gradients. A new family of sparse mirror-descent reinforcement learning methods are proposed, which are able to find sparse fixed points of an l1-regularized Bellman equation at significantly less computational cost than previous methods based on second-order matrix methods. An experimental study of mirror-descent reinforcement learning is presented using discrete and continuous Markov decision processes.
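
The abstract's description of mirror descent as both a proximal algorithm and a primal-dual gradient method can be written out as follows. This is the standard textbook form rather than a quotation from the paper; the symbols \psi (distance-generating function), \alpha_t (step size), and f (the convex objective) are notation assumed here for illustration.

```latex
% Bregman divergence induced by a strictly convex, differentiable \psi
D_\psi(w, u) = \psi(w) - \psi(u) - \langle \nabla\psi(u),\, w - u \rangle

% Proximal view of one mirror-descent step on a convex f, step size \alpha_t
w_{t+1} = \arg\min_{w} \Big\{ \langle w, \partial f(w_t) \rangle
          + \tfrac{1}{\alpha_t} D_\psi(w, w_t) \Big\}

% Equivalent primal-dual view, with \psi^* the Legendre conjugate of \psi
w_{t+1} = \nabla\psi^*\big( \nabla\psi(w_t) - \alpha_t\, \partial f(w_t) \big)

% Sanity check: \psi(w) = \tfrac12 \|w\|_2^2 gives D_\psi(w,u) = \tfrac12 \|w-u\|_2^2
% and \nabla\psi = \nabla\psi^* = I, so the step is ordinary (sub)gradient descent.
```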

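Research Motivation and Objectives

Address the challenge of high-dimensional value function approximation in reinforcement learning with a sparsity-inducing optimization framework.
Reduce the computational burden of existing l1-regularized Q-learning methods, which rely on expensive second-order matrix updates.
Develop a scalable, proximal-gradient temporal-difference learning method based on online convex optimization.
Achieve efficient learning in discrete and continuous Markov decision processes (MDPs) by using mirror descent with adaptive Bregman divergences.
Demonstrate that first-order mirror descent finds sparse fixed points of the l1-regularized Bellman equation more efficiently than second-order methods.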

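Proposed Method

The method treats mirror descent as a proximal algorithm, with a Bregman divergence serving as the distance-generating function.
Gradient updates are carried out in both the dual space and the primal space, which are linked by a Legendre transform; this enables efficient optimization in high-dimensional spaces.
Several Bregman divergences are explored, including p-norm functions and a Mahalanobis distance based on the covariance of sample gradients.
The method builds a proximal-gradient TD algorithm that regularizes the Q-value update with an l1 penalty to promote sparsity.
The algorithm iteratively updates the Q-values with mirror-descent steps, minimizing the regularized Bellman error while preserving sparsity.
The method is applied to both discrete and continuous MDPs to assess its scalability and robustness across environments.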
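
The steps listed above can be sketched as a single update in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes linear function approximation, the p-norm distance-generating function 0.5 * ||w||_q^2 with 1/p + 1/q = 1, and an l1 penalty handled by soft-thresholding in the dual space. The names `pnorm_link`, `soft_threshold`, and `sparse_md_td_step`, and all parameter names, are hypothetical.

```python
import numpy as np


def pnorm_link(x, r):
    """Gradient of 0.5 * ||x||_r^2; maps weights between primal and dual space."""
    norm = np.linalg.norm(x, ord=r)
    if norm == 0.0:
        return np.zeros_like(x)
    return np.sign(x) * np.abs(x) ** (r - 1) / norm ** (r - 2)


def soft_threshold(theta, tau):
    """Proximal operator of tau * ||.||_1: shrinks entries toward zero."""
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)


def sparse_md_td_step(w, phi, phi_next, reward, gamma, alpha, lam, p):
    """One sparse mirror-descent TD update (illustrative sketch).

    w: primal weights; phi, phi_next: feature vectors of the current and
    next state-action pair; alpha: step size; lam: l1 penalty weight;
    p: dual norm exponent (assumed > 1).
    """
    q = p / (p - 1.0)                           # conjugate exponent, 1/p + 1/q = 1
    delta = reward + gamma * (phi_next @ w) - phi @ w   # TD error
    theta = pnorm_link(w, q)                    # primal -> dual
    theta = theta + alpha * delta * phi         # gradient step in the dual space
    theta = soft_threshold(theta, alpha * lam)  # l1 shrinkage induces sparsity
    return pnorm_link(theta, p)                 # dual -> primal


# Usage: with p = 2 both link functions are the identity, so the step
# reduces to a TD update followed by soft-thresholding.
w = sparse_md_td_step(
    w=np.zeros(3),
    phi=np.array([1.0, 0.0, 0.0]),
    phi_next=np.zeros(3),
    reward=1.0, gamma=0.9, alpha=0.5, lam=0.1, p=2.0,
)
```

Each step costs time linear in the number of features, which is where the saving over second-order matrix methods would come from. The choice of p is left as a free parameter in this sketch.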

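Experimental Results

Research Questions

RQ1: Can mirror descent with Bregman divergences be used effectively to regularize Q-learning and induce sparsity in the value function representation?
RQ2: How does the computational cost of mirror-descent Q-learning compare with that of second-order methods for l1-regularized Q-learning?
RQ3: Does using the Mahalanobis distance as the Bregman divergence improve the convergence speed and sparsity of value function approximation in high-dimensional MDPs?
RQ4: Can the proposed method find sparse fixed points of the l1-regularized Bellman equation more efficiently than existing methods?
RQ5: How well does the performance of sparse mirror-descent Q-learning scale across discrete and continuous control tasks?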

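Key Findings

The proposed mirror-descent Q-learning method has a significantly lower computational cost than previous second-order matrix methods, reaching sparse fixed points of the l1-regularized Bellman equation with less overhead.
Using the Mahalanobis distance as the Bregman divergence speeds up convergence and improves sparsity in high-dimensional value function approximation.
The method performs well on both discrete and continuous Markov decision processes (MDPs), which supports its scalability.
p-norm Bregman divergences are effective for regularizing the Q-value function and controlling its sparsity.
According to the experiments, the algorithm maintains high sample efficiency and robustness across a range of reinforcement learning environments.
The framework offers a computationally efficient alternative to second-order l1-regularized Q-learning, making sparse value function learning more practical.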


This review was generated by AI and reviewed by a human editor.