QUICK REVIEW

[Paper Review] Scalable Bilinear $π$ Learning Using State and Action Features

Yichen Chen, Lihong Li|arXiv (Cornell University)|Apr 27, 2018
Reinforcement Learning in Robotics|29 references|22 citations
TL;DR

This paper proposes bilinear π learning, a scalable, model-free reinforcement learning algorithm that uses state and action features to approximate value functions and state-action distributions via bilinear models. It achieves sample-efficient, online policy optimization with linear sample complexity in the feature dimension, independent of MDP size, through a primal-dual stochastic optimization framework solving a Bellman saddle-point problem.

ABSTRACT

Approximate linear programming (ALP) represents one of the major algorithmic families to solve large-scale Markov decision processes (MDP). In this work, we study a primal-dual formulation of the ALP, and develop a scalable, model-free algorithm called bilinear $π$ learning for reinforcement learning when a sampling oracle is provided. This algorithm enjoys a number of advantages. First, it adopts (bi)linear models to represent the high-dimensional value function and state-action distributions, using given state and action features. Its run-time complexity depends on the number of features, not the size of the underlying MDPs. Second, it operates in a fully online fashion without having to store any sample, thus having minimal memory footprint. Third, we prove that it is sample-efficient, solving for the optimal policy to high precision with a sample complexity linear in the dimension of the parameter space.

Motivation & Objective

  • Develop a scalable, model-free RL algorithm for large MDPs with huge state and action spaces.
  • Enable efficient policy optimization using only a sampling oracle and given state and action features.
  • Achieve low computational and memory complexity independent of MDP size by leveraging feature-based compact representations.
  • Provide strong theoretical guarantees on sample efficiency and convergence for policy learning in large-scale MDPs.

Proposed method

  • Formulates policy optimization as a primal-dual saddle-point problem based on the Bellman equation.
  • Uses bilinear models to represent the value function and state-action distribution using state features φ(s) ∈ ℝ^D and action features ψ(a) ∈ ℝ^U.
  • Employs stochastic primal-dual updates that process one transition at a time, enabling online learning with minimal memory.
  • Introduces a compact parameterization where the state-action distribution is modeled as a bilinear function of state and action features.
  • Derives convergence guarantees by analyzing the coupled primal-dual dynamics in the context of approximate linear programming (ALP).
  • Leverages strong duality to couple value and policy updates, ensuring stable and efficient optimization.
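
To make the parameterization above concrete, here is a minimal Python sketch. It is not the paper's update rule. It assumes a discounted MDP, a value function that is linear in the state features, v_w(s) = φ(s)ᵀw, and a state-action weight of the form μ_Θ(s, a) = φ(s)ᵀΘψ(a). The names `value`, `sa_weight`, and `primal_dual_step`, the step size, and the clipping of Θ are all hypothetical choices made for illustration. The step is plain gradient descent-ascent on the single-sample term μ_Θ(s, a) · (r + γ v_w(s') − v_w(s)); it leaves out the initial-state term of the objective and whatever projection or step-size schedule the paper actually uses.

```python
import numpy as np

D, U = 4, 3      # feature dimensions (hypothetical sizes)
GAMMA = 0.9      # discount factor (assumes a discounted MDP)

def value(w, phi_s):
    """Linear value estimate v_w(s) = phi(s)^T w."""
    return float(phi_s @ w)

def sa_weight(Theta, phi_s, psi_a):
    """Bilinear state-action weight mu_Theta(s, a) = phi(s)^T Theta psi(a)."""
    return float(phi_s @ Theta @ psi_a)

def primal_dual_step(w, Theta, phi_s, psi_a, reward, phi_next, lr=0.1):
    """One generic descent-ascent step on a single sampled transition.

    Illustrative only: this is NOT the update rule from the paper.
    """
    # Sampled Bellman residual: r + gamma * v(s') - v(s)
    delta = reward + GAMMA * value(w, phi_next) - value(w, phi_s)
    mu = sa_weight(Theta, phi_s, psi_a)
    # Gradient of mu * delta with respect to w (primal variable, descent)
    grad_w = mu * (GAMMA * phi_next - phi_s)
    # Gradient of mu * delta with respect to Theta (dual variable, ascent)
    grad_Theta = delta * np.outer(phi_s, psi_a)
    w_new = w - lr * grad_w
    # Crude clipping keeps Theta entrywise nonnegative; with nonnegative
    # features this keeps mu_Theta(s, a) >= 0 (an assumption of this sketch)
    Theta_new = np.maximum(Theta + lr * grad_Theta, 0.0)
    return w_new, Theta_new, delta

# Usage with random nonnegative features standing in for phi(s), psi(a), phi(s')
rng = np.random.default_rng(0)
w, Theta = np.zeros(D), np.ones((D, U)) / (D * U)
phi_s, psi_a, phi_next = rng.random(D), rng.random(U), rng.random(D)
w, Theta, delta = primal_dual_step(w, Theta, phi_s, psi_a, reward=1.0, phi_next=phi_next)
print(w.shape, Theta.shape, delta)
```

Each step touches only the D entries of `w` and the D × U entries of `Theta`, and no transition is stored after it has been used, which is the sense in which the cost depends on the feature dimensions rather than on the number of states or actions.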


Research questions

  • RQ1: Can a primal-dual formulation of policy optimization be made scalable and sample-efficient using feature-based compact representations?
  • RQ2: How can bilinear models of state and action features be used to approximate high-dimensional value functions and state-action distributions?
  • RQ3: What is the sample complexity of learning an ϵ-optimal policy using this approach, and how does it scale with the feature dimensions?
  • RQ4: Can the algorithm maintain low computational and memory complexity while achieving high-precision policy learning in large MDPs?
  • RQ5: How does the approximation error in value function and state-action distribution models affect the optimality gap of the learned policy?

Key findings

  • The bilinear π learning algorithm achieves a sample complexity of O(DU / ϵ²) to find an ϵ-optimal policy, linear in the feature dimensions D and U.
  • The algorithm’s runtime and memory complexity depend only on D and U, not on |S| or |A|, enabling scalability to large MDPs.
  • The method is fully online and requires no storage of past samples, resulting in minimal memory footprint.
  • The gap between the solution of the approximate Bellman saddle-point problem and the solution of the true Bellman equation is bounded in terms of the ℓ∞ and ℓ1 approximation errors of the function approximators.
  • In the realizable case (zero approximation error), solving the saddle-point problem is equivalent to solving the original Bellman equation.
  • The algorithm ensures provably stable convergence with a finite-sample rate, unlike many ADP methods that may diverge or oscillate.
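
For readers unfamiliar with the term, the Bellman saddle-point problem referred to above can be written in its standard linear-programming form as follows. This is a sketch under assumptions: it takes a discounted MDP with discount factor γ, initial distribution ξ, reward r, and transition kernel P, none of which are spelled out in this review, and the paper's exact normalization and constraint sets may differ.

$$
\min_{v}\; \max_{\mu \ge 0}\;\; (1-\gamma)\,\xi^\top v \;+\; \sum_{s,a} \mu(s,a)\Big( r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v(s') - v(s) \Big)
$$

With v and μ unrestricted, this is the Lagrangian of the classical MDP linear program: the optimal v is the optimal value function, which satisfies the Bellman equation $v^*(s) = \max_a \big[ r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, v^*(s') \big]$, and the optimal μ is the normalized discounted state-action occupancy of an optimal policy. Restricting v and μ to feature-based families, for example $v(s) = \phi(s)^\top w$ and $\mu(s,a) = \phi(s)^\top \Theta\, \psi(a)$ (an assumed form), gives the approximate problem; if those families contain the unrestricted optimum, nothing is lost, which is the realizable case described above.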


This review was created by AI and reviewed by human editors.