QUICK REVIEW

[Paper Review] Sparse Q-learning with Mirror Descent

Sridhar Mahadevan, Bo Liu | arXiv (Cornell University) | Oct 16, 2012

Model Reduction and Neural Networks | 31 references | 21 citations

TL;DR

This paper introduces a sparse Q-learning algorithm built on mirror descent, a proximal optimization method based on Bregman divergences, for high-dimensional reinforcement learning problems. Combining l1 regularization with mirror-descent updates under Bregman divergences such as p-norm functions and Mahalanobis distance, the method finds sparse fixed points of the l1-regularized Bellman equation at significantly lower computational cost than prior second-order matrix methods.

ABSTRACT

This paper explores a new framework for reinforcement learning based on online convex optimization, in particular mirror descent and related algorithms. Mirror descent can be viewed as an enhanced gradient method, particularly suited to minimization of convex functions in high-dimensional spaces. Unlike traditional gradient methods, mirror descent undertakes gradient updates of weights in both the dual space and primal space, which are linked together using a Legendre transform. Mirror descent can be viewed as a proximal algorithm where the distance generating function used is a Bregman divergence. A new class of proximal-gradient based temporal-difference (TD) methods is presented based on different Bregman divergences, which are more powerful than regular TD learning. Examples of Bregman divergences that are studied include p-norm functions, and Mahalanobis distance based on the covariance of sample gradients. A new family of sparse mirror-descent reinforcement learning methods is proposed, which are able to find sparse fixed points of an l1-regularized Bellman equation at significantly less computational cost than previous methods based on second-order matrix methods. An experimental study of mirror-descent reinforcement learning is presented using discrete and continuous Markov decision processes.
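
As a hedged illustration of the dual-space and proximal views described in the abstract, the standard mirror-descent update for the unconstrained case can be written as follows. The symbols are assumptions not fixed by the abstract: w_t is the weight vector, \psi a strictly convex distance-generating function with Legendre conjugate \psi^*, \alpha_t a step size, and g_t a (sub)gradient of the objective at w_t.

```latex
% Dual-space form: map to the dual, take a gradient step, map back.
w_{t+1} = \nabla\psi^{*}\bigl(\nabla\psi(w_t) - \alpha_t\, g_t\bigr)

% Equivalent proximal form, with the Bregman divergence D_\psi as the proximity term.
w_{t+1} = \arg\min_{w}\Bigl(\langle w, g_t\rangle + \tfrac{1}{\alpha_t}\, D_\psi(w, w_t)\Bigr),
\qquad
D_\psi(w, w') = \psi(w) - \psi(w') - \langle \nabla\psi(w'),\, w - w'\rangle
```

With \psi(w) = \tfrac{1}{2}\|w\|_2^2 both forms reduce to ordinary gradient descent, which is why mirror descent can be read as a generalization of the usual gradient method.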

Motivation & Objective

To address the challenge of high-dimensional value function approximation in reinforcement learning by introducing a sparsity-inducing optimization framework.
To reduce the computational burden of existing l1-regularized Q-learning methods that rely on expensive second-order matrix updates.
To develop a scalable, proximal-gradient-based temporal-difference learning method grounded in online convex optimization.
To enable efficient learning in both discrete and continuous Markov decision processes using mirror descent with adaptive Bregman divergences.
To demonstrate that sparse fixed points of the l1-regularized Bellman equation can be found more efficiently using first-order mirror descent than second-order alternatives.

Proposed method

The method employs mirror descent as a proximal algorithm in which the proximity term is a Bregman divergence induced by a distance-generating function.
It performs gradient updates in both primal and dual spaces linked by a Legendre transform, enabling efficient optimization in high-dimensional spaces.
Different Bregman divergences are explored, including p-norms and Mahalanobis distance based on sample gradient covariance.
The approach formulates a proximal-gradient TD method that regularizes the Q-value update with an l1 penalty, promoting sparsity.
The algorithm iteratively updates Q-values using mirror descent steps that maintain sparsity while minimizing the regularized Bellman error.
The method is applied to both discrete and continuous MDPs, demonstrating scalability and robustness across environments.
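
The steps listed above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes linear value function approximation over feature vectors, a p-norm link function with psi(w) = 0.5 * ||w||_q^2 and conjugate exponent q = p / (p - 1), and soft-thresholding in the dual space as the way the l1 penalty is applied. The function names (`pnorm_link`, `soft_threshold`, `sparse_md_td_step`) and the parameters `alpha` (step size) and `beta` (l1 weight) are hypothetical.

```python
import numpy as np


def pnorm_link(v, r):
    """Gradient of 0.5 * ||v||_r^2, used to move between primal and dual weights."""
    norm = np.linalg.norm(v, ord=r)
    if norm == 0.0:
        # The gradient at the origin is the zero vector.
        return np.zeros_like(v)
    return np.sign(v) * np.abs(v) ** (r - 1) / norm ** (r - 2)


def soft_threshold(theta, tau):
    """Shrink every entry toward zero by tau; entries smaller than tau become 0."""
    return np.sign(theta) * np.maximum(np.abs(theta) - tau, 0.0)


def sparse_md_td_step(w, phi, phi_next, reward, gamma, alpha, beta, p):
    """One sparse mirror-descent TD step (illustrative sketch, assumed form)."""
    q = p / (p - 1.0)                            # conjugate exponent: 1/p + 1/q = 1
    delta = reward + gamma * (phi_next @ w) - phi @ w  # temporal-difference error
    theta = pnorm_link(w, q)                     # primal -> dual
    theta = theta + alpha * delta * phi          # TD update in the dual space
    theta = soft_threshold(theta, alpha * beta)  # l1 shrinkage promotes sparsity
    return pnorm_link(theta, p)                  # dual -> primal


if __name__ == "__main__":
    w = np.zeros(3)
    phi = np.array([1.0, 0.0, 0.0])
    w = sparse_md_td_step(w, phi, np.zeros(3), reward=1.0,
                          gamma=0.9, alpha=0.5, beta=0.1, p=2.0)
    print(w)
```

With p = 2 the link function is the identity, so the step reduces to a plain TD update followed by soft-thresholding. For a Q-learning variant, `phi_next` would hold the features of the greedy next state-action pair, which is again an assumption of this sketch.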

Experimental results

Research questions

RQ1: Can mirror descent with Bregman divergences be effectively used to regularize Q-learning and induce sparsity in value function representations?
RQ2: How does the computational cost of mirror-descent-based Q-learning compare to second-order methods for l1-regularized Q-learning?
RQ3: Does the use of Mahalanobis distance as a Bregman divergence improve convergence and sparsity in high-dimensional MDPs?
RQ4: Can the proposed method achieve sparse fixed points of the l1-regularized Bellman equation more efficiently than existing approaches?
RQ5: How does the performance of sparse mirror-descent Q-learning scale across discrete and continuous control tasks?

Key findings

The proposed mirror-descent Q-learning method achieves sparse fixed points of the l1-regularized Bellman equation at a significantly lower computational cost than prior second-order matrix methods.
Using Mahalanobis distance as a Bregman divergence leads to faster convergence and improved sparsity in high-dimensional value function approximation.
The method demonstrates strong performance on both discrete and continuous Markov decision processes, validating its scalability.
The use of p-norm Bregman divergences enables effective regularization and sparsity control in the Q-value function.
Empirical results show that the algorithm maintains high sample efficiency and robustness across diverse RL environments.
The framework provides a computationally efficient alternative to second-order l1-regularized Q-learning, making sparse value function learning more practical.
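
To make the Mahalanobis-distance finding more concrete, here is a small sketch of a single mirror-descent step with psi(w) = 0.5 * w^T A w. It assumes a diagonal, positive matrix A passed in as the vector `A_diag`; the review does not say how the paper builds A from the covariance of sample gradients, so that construction is left out. The function name `mahalanobis_md_step` is hypothetical.

```python
import numpy as np


def mahalanobis_md_step(w, grad, alpha, A_diag):
    """One mirror-descent step with psi(w) = 0.5 * w^T A w, A diagonal (assumed)."""
    theta = A_diag * w            # primal -> dual: grad psi(w) = A w
    theta = theta - alpha * grad  # gradient step in the dual space
    return theta / A_diag         # dual -> primal: grad psi*(theta) = A^{-1} theta


if __name__ == "__main__":
    w = np.array([1.0, 2.0])
    grad = np.array([1.0, 1.0])
    print(mahalanobis_md_step(w, grad, alpha=0.5, A_diag=np.array([1.0, 2.0])))
```

Algebraically the step equals w - alpha * A^{-1} * grad, a preconditioned gradient step. With a diagonal A it costs time linear in the number of features, which is in line with the first-order cost argument made above.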

This review was created by AI and reviewed by human editors.