[Paper Review] Randomized Linear Programming Solves the Discounted Markov Decision Problem In Nearly-Linear (Sometimes Sublinear) Running Time
This paper proposes a randomized linear programming algorithm for the discounted Markov decision problem (DMDP) that combines value-policy duality, adaptive sampling, and binary-tree data structures for efficient primal-dual updates. It finds an $\epsilon$-optimal policy in nearly-linear time in the worst case, and in sublinear time when the MDP is ergodic and specified in special data formats, offering a new complexity benchmark for stochastic dynamic programming.
We propose a novel randomized linear programming algorithm for approximating the optimal policy of the discounted Markov decision problem. By leveraging the value-policy duality and binary-tree data structures, the algorithm adaptively samples state-action-state transitions and makes exponentiated primal-dual updates. We show that it finds an $ε$-optimal policy using nearly-linear run time in the worst case. When the Markov decision process is ergodic and specified in some special data formats, the algorithm finds an $ε$-optimal policy using run time linear in the total number of state-action pairs, which is sublinear in the input size. These results provide a new venue and complexity benchmarks for solving stochastic dynamic programs.
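For orientation, the value-policy duality referred to here can be written in its textbook linear programming form. This is a sketch of the standard formulation, not the paper's exact saddle point problem, which adds further constraints and weight vectors. The symbols $\xi$ (an arbitrary positive weight vector over states), $P_a$ (the transition matrix under action $a$), $r_a$ (the reward vector under action $a$), and $\mu_a$ (the dual variable for action $a$) are notation assumed for this illustration, and the paper's normalization may differ.

$$
\text{(primal)}\qquad \min_{v}\ \xi^\top v \quad \text{s.t.}\quad (I - \gamma P_a)\,v \ \ge\ r_a \quad \text{for all } a \in \mathcal{A}
$$

$$
\text{(dual)}\qquad \max_{\mu \ge 0}\ \sum_{a \in \mathcal{A}} \mu_a^\top r_a \quad \text{s.t.}\quad \sum_{a \in \mathcal{A}} (I - \gamma P_a)^\top \mu_a \ =\ \xi
$$

$$
\text{(saddle point)}\qquad \min_{v}\ \max_{\mu \ge 0}\ \ \xi^\top v + \sum_{a \in \mathcal{A}} \mu_a^\top \big( r_a - (I - \gamma P_a)\,v \big)
$$

The primal variable $v$ plays the role of the value function, while the dual variable $\mu$ is proportional to a discounted state-action occupancy measure, so normalizing $\mu$ over actions at each state gives a randomized policy.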
Motivation & Objective
- To develop a randomized algorithm that approximates the optimal policy of the discounted Markov decision problem (DMDP) with improved runtime complexity.
- To reduce dependence on state and action space sizes $|\mathcal{S}|$ and $|\mathcal{A}|$ by trading off exact optimality for computational efficiency.
- To establish new complexity benchmarks for solving stochastic dynamic programs by achieving nearly-linear or sublinear runtime in specific structural cases.
- To exploit value-policy duality and information projection via exponentiated updates to enable efficient policy learning.
Proposed method
- Formulates the DMDP as a stochastic saddle point problem using value-policy duality and specially constructed constraints and weight vectors.
- Employs adaptive action sampling based on the current randomized policy to reduce computational overhead.
- Uses exponentiated primal-dual updates with information projection onto a constraint set to maintain policy feasibility and promote convergence.
- Leverages binary-tree data structures to simulate state transitions and perform policy updates in $\tilde{\mathcal{O}}(1)$ time per update.
- Introduces a Lyapunov function $\mathcal{E}^t$ combining KL divergence and value function error to analyze convergence.
- Derives a recursive expectation bound (Equation 14) showing that $\mathcal{E}^{t+1}$ decreases in expectation when the duality gap $\mathcal{G}^t$ is large.
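The binary-tree idea in the list above can be illustrated with a generic sum tree, sketched here in Python. This is a minimal sketch under stated assumptions, not the paper's data structure: the class name `SumTree` and its methods are hypothetical, and the tree holds unnormalized weights so that sampling proportional to them stands in for normalization. Changing one weight and drawing one sample each touch only one root-to-leaf path, which is the kind of $\tilde{\mathcal{O}}(1)$ cost the review describes.

```python
import random


class SumTree:
    """Complete binary tree over n nonnegative weights (illustrative sketch).

    Leaves hold the weights and each internal node holds the sum of its two
    children, so `update` and `sample` both visit O(log n) nodes.
    """

    def __init__(self, weights):
        self.n = len(weights)
        # Round the leaf count up to a power of two; padding leaves weigh 0.
        self.size = 1
        while self.size < self.n:
            self.size *= 2
        self.tree = [0.0] * (2 * self.size)
        for i, w in enumerate(weights):
            self.tree[self.size + i] = float(w)
        for node in range(self.size - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def total(self):
        """Sum of all weights, stored at the root."""
        return self.tree[1]

    def get(self, index):
        """Current weight of one leaf."""
        return self.tree[self.size + index]

    def update(self, index, weight):
        """Set one leaf weight and refresh the sums on its path to the root."""
        node = self.size + index
        self.tree[node] = float(weight)
        node //= 2
        while node >= 1:
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]
            node //= 2

    def find(self, u):
        """Return the leaf whose cumulative-weight interval contains u."""
        node = 1
        while node < self.size:
            left = 2 * node
            if u < self.tree[left]:
                node = left
            else:
                u -= self.tree[left]
                node = left + 1
        # Clamp guards against landing on a padding leaf through rounding.
        return min(node - self.size, self.n - 1)

    def sample(self, rng=random):
        """Draw an index with probability proportional to its weight."""
        return self.find(rng.random() * self.total())
```

An exponentiated update of a single coordinate then looks like `tree.update(i, tree.get(i) * math.exp(beta * delta))`, where `beta` is the step size and `delta` is a hypothetical stochastic gradient estimate for coordinate `i`. Only the path above leaf `i` is recomputed, and later calls to `tree.sample()` follow the updated distribution without an explicit renormalization pass.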
Experimental results
Research questions
- RQ1: Can a randomized algorithm achieve nearly-linear runtime for solving the discounted MDP while maintaining $\epsilon$-optimality?
- RQ2: Under what structural conditions (e.g., ergodicity, data format) can the algorithm achieve sublinear runtime in the input size?
- RQ3: How do adaptive sampling and binary-tree data structures contribute to reducing runtime complexity in policy update steps?
- RQ4: What is the theoretical convergence rate of the proposed primal-dual method in terms of the duality gap $\mathcal{G}^t$?
- RQ5: Can the value-policy duality formulation with information projection lead to stable and efficient policy updates?
Key findings
- The algorithm finds an $\epsilon$-optimal policy using nearly-linear runtime in the worst case, i.e., $\tilde{\mathcal{O}}(|\mathcal{S}|^2|\mathcal{A}|)$ operations, hiding polylogarithmic factors.
- When the MDP is ergodic and specified in special data formats, the runtime becomes linear in the total number of state-action pairs, which is sublinear in the input size $\mathcal{O}(|\mathcal{S}|^2|\mathcal{A}|)$.
- The expected duality gap $\mathcal{G}^t$ decays at a rate of $\mathcal{O}\left(\frac{1}{\sqrt{T}}\right)$, ensuring convergence to an $\epsilon$-optimal policy.
- The algorithm achieves $\epsilon$-optimality with a step size $\beta = (1-\gamma)\sqrt{\frac{\log(|\mathcal{S}||\mathcal{A}|)+1}{2|\mathcal{S}||\mathcal{A}|T}}$, balancing convergence and stability.
- The Lyapunov function $\mathcal{E}^t$ decreases in expectation whenever the duality gap $\mathcal{G}^t$ is large, and its initial value satisfies $\mathcal{E}^1 \leq \log(|\mathcal{S}||\mathcal{A}|) + 1$, enabling tight convergence bounds.
- The use of binary trees enables $\tilde{\mathcal{O}}(1)$-time policy updates, making the algorithm scalable to large state-action spaces.
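To make the step size in the findings above concrete, the following Python sketch evaluates the stated formula for $\beta$. The function name `step_size` and its parameter names are hypothetical, and the natural logarithm is assumed since the review does not specify the base.

```python
import math


def step_size(n_states, n_actions, gamma, num_iters):
    """Evaluate beta = (1 - gamma) * sqrt((log(|S||A|) + 1) / (2 |S||A| T)).

    Assumes the natural logarithm; the base is not stated in the review.
    """
    n_pairs = n_states * n_actions
    numerator = math.log(n_pairs) + 1.0
    denominator = 2.0 * n_pairs * num_iters
    return (1.0 - gamma) * math.sqrt(numerator / denominator)
```

Because $\beta$ scales as $1/\sqrt{T}$, quadrupling the iteration budget `num_iters` halves the step size, which mirrors the $\mathcal{O}(1/\sqrt{T})$ decay of the expected duality gap. The step size also shrinks linearly in $1-\gamma$ as the discount factor approaches 1.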
This review was created by AI and reviewed by human editors.