QUICK REVIEW

[Paper Review] Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Running Time

Mengdi Wang|arXiv (Cornell University)|Apr 6, 2017

Reinforcement Learning in Robotics|26 references|20 citations

TL;DR

This paper proposes a randomized linear programming algorithm that solves the discounted Markov decision problem (DMDP) in nearly-linear time by leveraging value-policy duality, adaptive sampling, and binary-tree data structures for efficient primal-dual updates. It achieves an $\epsilon$-optimal policy with nearly-linear runtime in the worst case and sublinear runtime when the MDP is ergodic and structured, offering a new complexity benchmark for stochastic dynamic programming.

ABSTRACT

We propose a novel randomized linear programming algorithm for approximating the optimal policy of the discounted Markov decision problem. By leveraging the value-policy duality and binary-tree data structures, the algorithm adaptively samples state-action-state transitions and makes exponentiated primal-dual updates. We show that it finds an $ε$-optimal policy using nearly-linear run time in the worst case. When the Markov decision process is ergodic and specified in some special data formats, the algorithm finds an $ε$-optimal policy using run time linear in the total number of state-action pairs, which is sublinear in the input size. These results provide a new venue and complexity benchmarks for solving stochastic dynamic programs.

Motivation & Objective

To develop a randomized algorithm that approximates the optimal policy of the discounted Markov decision problem (DMDP) with improved runtime complexity.
To reduce dependence on state and action space sizes $|\mathcal{S}|$ and $|\mathcal{A}|$ by trading off exact optimality for computational efficiency.
To establish new complexity benchmarks for solving stochastic dynamic programs by achieving nearly-linear or sublinear runtime in specific structural cases.
To exploit value-policy duality and information projection via exponentiated updates to enable efficient policy learning.

Proposed method

Formulates the DMDP as a stochastic saddle point problem using value-policy duality and specially constructed constraints and weight vectors.
Employs adaptive action sampling based on the current randomized policy to reduce computational overhead.
Uses exponentiated primal-dual updates with information projection onto a constraint set to maintain policy feasibility and promote convergence.
Leverages binary-tree data structures to simulate state transitions and perform policy updates in $\tilde{\mathcal{O}}(1)$ time per update.
Introduces a Lyapunov function $\mathcal{E}^t$ combining KL divergence and value function error to analyze convergence.
Derives a recursive expectation bound (Equation 14) showing that $\mathcal{E}^{t+1}$ decreases in expectation when the duality gap $\mathcal{G}^t$ is large.
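
The binary-tree idea in the list above can be illustrated with a sum tree: leaves hold unnormalized weights, each internal node holds the sum of its children, and both sampling an index and rescaling one weight touch only a single root-to-leaf path. The sketch below is a generic illustration under that assumption, not the paper's exact data structure; the class name `SumTree`, its methods, and the values `beta` and `delta` are hypothetical. On $n$ leaves, each operation takes $\mathcal{O}(\log n)$ time, which is one way to read the $\tilde{\mathcal{O}}(1)$ per-update cost.

```python
import math
import random


class SumTree:
    """Complete binary tree whose leaves hold nonnegative weights.

    Each internal node stores the sum of its two children, so the root
    holds the total weight. Illustrative only; not the paper's layout.
    """

    def __init__(self, weights):
        self.n = len(weights)
        # Round the leaf count up to a power of two so the tree is complete.
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = float(w)
        # Fill internal nodes bottom-up.
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def total(self):
        return self.tree[1]

    def weight(self, i):
        return self.tree[self.size + i]

    def scale(self, i, factor):
        """Multiply leaf i by factor and repair the sums on its root path."""
        node = self.size + i
        self.tree[node] *= factor
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def sample(self, rng=random):
        """Draw index i with probability weight(i) / total()."""
        u = rng.random() * self.total()
        node = 1
        while node < self.size:
            left = 2 * node
            if u < self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        # Clamp guards against rounding that lands on a zero-weight padding leaf.
        return min(node - self.size, self.n - 1)


if __name__ == "__main__":
    tree = SumTree([1.0, 1.0, 1.0, 1.0])
    beta, delta = 0.1, 2.0  # hypothetical step size and update signal
    # Exponentiated update of one entry; normalization is implicit in total().
    tree.scale(2, math.exp(beta * delta))
    print(tree.sample(), tree.weight(2) / tree.total())
```

Keeping the weights unnormalized means a multiplicative update to one entry never forces a pass over the whole distribution; the normalizing constant is simply the value stored at the root.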

Experimental results

Research questions

RQ1: Can a randomized algorithm achieve nearly-linear runtime for solving the discounted MDP while maintaining $\epsilon$-optimality?
RQ2: Under what structural conditions (e.g., ergodicity, data format) can the algorithm achieve sublinear runtime in the input size?
RQ3: How do adaptive sampling and binary-tree data structures contribute to reducing runtime complexity in policy update steps?
RQ4: What is the theoretical convergence rate of the proposed primal-dual method in terms of the duality gap $\mathcal{G}^t$?
RQ5: Can the value-policy duality formulation with information projection lead to stable and efficient policy updates?

Key findings

The algorithm finds an $\epsilon$-optimal policy using nearly-linear runtime in the worst case, i.e., $\tilde{\mathcal{O}}(|\mathcal{S}|^2|\mathcal{A}|)$ operations, hiding polylogarithmic factors.
When the MDP is ergodic and specified in special data formats, the runtime becomes linear in the total number of state-action pairs, which is sublinear in the input size $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$.
The expected duality gap $\mathcal{G}^t$ decays at a rate of $\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$, ensuring convergence to an $\epsilon$-optimal policy.
The algorithm achieves $\epsilon$-optimality with a step size $\beta = (1-\gamma)\sqrt{\frac{\log(|\mathcal{S}||\mathcal{A}|)+1}{2|\mathcal{S}||\mathcal{A}|T}}$, balancing convergence and stability.
The Lyapunov function $\mathcal{E}^t$ decreases in expectation while the duality gap $\mathcal{G}^t$ is large, and its initial value satisfies $\mathcal{E}^1 \leq \log(|\mathcal{S}||\mathcal{A}|) + 1$; together these yield the convergence bound.
The use of binary trees enables $\tilde{\mathcal{O}}(1)$-time policy updates, making the algorithm scalable to large state-action spaces.
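
As a small worked illustration of the step-size formula in the findings above, the helper below evaluates $\beta$ for given problem sizes. It is a sketch under one assumption: the numerator is read as $\log(|\mathcal{S}||\mathcal{A}|) + 1$, matching the bound on $\mathcal{E}^1$. The function name `step_size` and its parameter names are hypothetical.

```python
import math


def step_size(n_states, n_actions, gamma, num_iters):
    """Evaluate beta = (1 - gamma) * sqrt((log(|S||A|) + 1) / (2 |S||A| T)).

    Assumes the numerator groups as log(|S||A|) + 1.
    """
    if not 0.0 <= gamma < 1.0:
        raise ValueError("gamma must lie in [0, 1)")
    if n_states < 1 or n_actions < 1 or num_iters < 1:
        raise ValueError("sizes and iteration count must be positive")
    sa = n_states * n_actions
    return (1.0 - gamma) * math.sqrt((math.log(sa) + 1.0) / (2.0 * sa * num_iters))


if __name__ == "__main__":
    # Example values chosen only for illustration.
    print(step_size(n_states=10, n_actions=10, gamma=0.9, num_iters=1000))
```

Under this reading, $\beta$ shrinks in proportion to $1/\sqrt{T}$, which is consistent with the $\mathcal{O}(1/\sqrt{T})$ decay of the expected duality gap.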

This review was created by AI and reviewed by human editors.