QUICK REVIEW

[Paper Review] Regularized Off-Policy TD-Learning

Bo 博 Liu 刘, Sridhar Mahadevan|arXiv (Cornell University)|Jun 6, 2020

Stochastic Gradient Optimization Techniques | 21 references | 19 citations

TL;DR

This paper proposes RO-TD, a novel $l_1$-regularized off-policy temporal difference learning algorithm that achieves sparse value function representation with low computational cost. By formulating the off-policy TD problem as a convex-concave saddle-point stochastic optimization problem, RO-TD enables first-order solvers and effective feature selection while maintaining off-policy convergence.

ABSTRACT

We present a novel $l_1$ regularized off-policy convergent TD-learning method (termed RO-TD), which is able to learn sparse representations of value functions with low computational complexity. The algorithmic framework underlying RO-TD integrates two key ideas: off-policy convergent gradient TD methods, such as TDC, and a convex-concave saddle-point formulation of non-smooth convex optimization, which enables first-order solvers and feature selection using online convex regularization. A detailed theoretical and experimental analysis of RO-TD is presented. A variety of experiments are presented to illustrate the off-policy convergence, sparse feature selection capability and low computational cost of the RO-TD algorithm.

Motivation & Objective

Address the challenge of learning sparse value function representations in off-policy temporal difference learning with low computational cost.
Develop a convergent off-policy RL algorithm that integrates $l_1$ regularization for feature selection without relying on second-order methods.
Bridge the gap between off-policy convergence and sparsity in value function approximation using first-order optimization techniques.
Enable scalable reinforcement learning in high-dimensional feature spaces by combining TDC-style off-policy learning with online convex regularization.
Provide a unified framework for regularized, convergent off-policy RL using convex optimization and stochastic first-order methods.

Proposed method

Reformulate the off-policy TD learning problem as a convex-concave saddle-point stochastic approximation problem using the TDC algorithm's linear equation formulation.
Apply a proximal gradient method to solve the resulting non-smooth convex optimization problem, enabling $l_1$ regularization and feature selection.
Use online convex regularization to incrementally update the value function estimate with sparse feature representation.
Integrate the TDC algorithm’s two-time-scale update rule with $l_1$ regularization via a dual formulation, ensuring off-policy convergence.
Leverage the saddle-point formulation to enable first-order solvers that scale linearly with the number of features and samples.
Tune regularization parameters $\rho_1$ and $\rho_2$ to balance sparsity and convergence, with $\rho_2$ controlling the influence of the TDC correction term.
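
The interaction between a TDC-style update and $l_1$ shrinkage can be sketched in a few lines. This is a simplified illustration, not the RO-TD algorithm itself: it applies a plain soft-thresholding (proximal) step after a standard TDC update instead of solving the saddle-point problem, and it omits importance weighting for off-policy samples. All function and parameter names (`tdc_l1_step`, `soft_threshold`, `alpha`, `beta`, `rho1`) are hypothetical and chosen only for this example.

```python
def dot(a, b):
    """Inner product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1: shrink each entry toward zero by tau."""
    return [max(abs(v) - tau, 0.0) * (1.0 if v > 0 else -1.0) for v in x]


def tdc_l1_step(theta, w, phi, phi_next, reward, gamma, alpha, beta, rho1):
    """One TDC update followed by an l1 shrinkage step on theta.

    Assumption: this is a simplified proximal variant for illustration,
    not the saddle-point solver described in the review.
    """
    # TD error under the current linear value estimate.
    delta = reward + gamma * dot(phi_next, theta) - dot(phi, theta)
    phi_w = dot(phi, w)
    # Main weights: TD step plus the TDC gradient-correction term.
    theta = [t + alpha * (delta * f - gamma * fn * phi_w)
             for t, f, fn in zip(theta, phi, phi_next)]
    # Auxiliary weights: track the expected TD error given the features.
    w = [wi + beta * (delta - phi_w) * f for wi, f in zip(w, phi)]
    # Shrinkage sets small weights exactly to zero (feature selection).
    return soft_threshold(theta, alpha * rho1), w
```

Each step touches every feature a constant number of times, so the per-sample cost of this sketch is linear in the number of features.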

Experimental results

Research questions

RQ1: Can $l_1$ regularization be effectively integrated into off-policy TD learning while preserving convergence?
RQ2: Does the proposed RO-TD algorithm achieve sparse feature selection without sacrificing sample efficiency or computational scalability?
RQ3: How does the saddle-point formulation of the optimization problem enable first-order, low-complexity learning in off-policy settings?
RQ4: What is the empirical performance of RO-TD in comparison to existing methods like TDC, LARS-TD, and $l_2$ LSTD in terms of convergence and sparsity?
RQ5: Can RO-TD outperform existing methods in high-dimensional, under-actuated control tasks with noisy or irrelevant features?
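
RQ3 rests on a standard identity from convex analysis, sketched here under assumed notation: $A$, $b$, $x$, and $y$ are illustrative symbols not defined in this review, and the paper's exact objective may differ. For any norm $\|\cdot\|$ with dual norm $\|\cdot\|_*$, the norm can be written as a maximum over the dual unit ball, which turns a non-smooth regularized problem into a saddle-point problem:

$$
\|z\| = \max_{\|y\|_* \le 1} y^\top z
\quad\Longrightarrow\quad
\min_x \; \|Ax - b\| + \rho_1 \|x\|_1
\;=\;
\min_x \max_{\|y\|_* \le 1} \; y^\top (Ax - b) + \rho_1 \|x\|_1 .
$$

The coupling term $y^\top (Ax - b)$ is linear in $x$ for fixed $y$ and linear in $y$ for fixed $x$, so the objective is convex in $x$ and concave in $y$. A first-order method then only needs matrix-vector products and a shrinkage step on $x$, which is where the low per-step cost comes from.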

Key findings

RO-TD successfully performs feature selection in the grid world task, succeeding in all 20 runs (a 100% success rate), while TDC and TD failed entirely.
In the triple-link inverted pendulum task, RO-GQ($\lambda$) required only 6.9 ± 4.82 episodes on average to succeed, outperforming GQ($\lambda$) (11.3 ± 9.58 episodes) and LARS-TD, which failed due to poor sample quality.
RO-TD achieved a mean of 147.40 ± 13.31 steps to convergence in the grid world, slightly higher than LARS-TD (142.25 ± 9.74), but with guaranteed off-policy convergence and sparsity.
The algorithm’s computational complexity is $O(Nd)$, significantly lower than LARS-TD’s $O(Ndp^3)$, especially when $p$ is sublinear in $d$.
Tuning $\rho_2$ allows interpolation between TD and TDC behavior, where large $\rho_2$ reduces the TDC correction term and makes updates more similar to standard TD.
RO-GQ($\lambda$) outperforms GQ($\lambda$) in both experiments on the triple-link pendulum, demonstrating robustness and scalability in high-dimensional, nonlinear domains.
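
The complexity gap reported above can be made concrete with a small calculation. The sizes below are invented for illustration, constant factors are ignored, and $p$ is assumed to denote the number of active (non-zero) features, which this review does not define.

```python
def rotd_cost(n_samples, n_features):
    """Operation count implied by O(N d), ignoring constants."""
    return n_samples * n_features


def lars_td_cost(n_samples, n_features, n_active):
    """Operation count implied by O(N d p^3), ignoring constants."""
    return n_samples * n_features * n_active ** 3


# Hypothetical problem sizes, not taken from the paper.
N, d, p = 10_000, 1_000, 20
ratio = lars_td_cost(N, d, p) / rotd_cost(N, d)
print(ratio)  # 8000.0
```

Under these assumptions the ratio is always $p^3$, independent of $N$ and $d$, so even a modest active set makes the difference substantial.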

This review was created by AI and reviewed by human editors.