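[Paper Summary] Variance-reduced $Q$-learning is minimax optimal
The paper proposes a variance-reduced variant of Q-learning for γ-discounted finite MDPs and proves that its sample complexity is minimax optimal up to a logarithmic factor in the discount complexity.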
We introduce and analyze a form of variance-reduced $Q$-learning. For $\gamma$-discounted MDPs with finite state space $\mathcal{X}$ and action space $\mathcal{U}$, we prove that it yields an $\epsilon$-accurate estimate of the optimal $Q$-function in the $\ell_\infty$-norm using $\mathcal{O} \left( \frac{D}{\epsilon^2 (1-\gamma)^3} \log \left( \frac{D}{1-\gamma} \right) \right)$ samples, where $D = |\mathcal{X}| \times |\mathcal{U}|$. This guarantee matches known minimax lower bounds up to a logarithmic factor in the discount complexity. In contrast, our past work shows that ordinary $Q$-learning has worst-case quartic scaling in the discount complexity.
Motivation and Objectives
- Motivate the study of variance-reduced methods in Q-learning for γ-discounted finite MDPs.
- Propose a practical variance-reduced Q-learning algorithm inspired by SVRG.
- Establish non-asymptotic, high-probability convergence guarantees.
- Show minimax-optimal sample complexity up to logarithmic factors in the discount complexity 1/(1−γ).
- Compare with prior Q-learning results and identify improvements in dependence on (1−γ).
Proposed Method
- Define a variance-reduced Q-learning operator that uses a Monte Carlo approximation of the Bellman update with unbiased recentering.
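- Structure the algorithm into epochs. Within an epoch, the variance-reduced update is $\theta_{k+1} = (1-\lambda_k)\,\theta_k + \lambda_k\bigl(\widehat{T}_k(\theta_k) - \widehat{T}_k(\bar{\theta}) + \widetilde{T}_N(\bar{\theta})\bigr)$, where $\bar{\theta}$ is the epoch's recentering point, $\widehat{T}_k$ is the empirical Bellman operator at iteration $k$, and $\widetilde{T}_N(\bar{\theta})$ is an unbiased Monte Carlo estimate of $T(\bar{\theta})$.
- Use the epoch length $K$ and the recentering sample sizes $N_m$ to control bias and variance, with step sizes $\lambda_k = 1/(1+(1-\gamma)k)$.
- Provide the RunEpoch subroutine and the overall variance-reduced Q-learning algorithm, which runs $M$ epochs, each of length $K$, with $N_m$ recentering samples in epoch $m$.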
- Derive parameter choices: $K = c_1 \frac{\log\left(\frac{8MD}{(1-\gamma)\delta}\right)}{(1-\gamma)^3}$ and $N_m = c_2\, 4^m \frac{\log(8MD/\delta)}{(1-\gamma)^2}$.
- Prove geometric convergence over epochs and give explicit total-sample bounds culminating in minimax-optimal results up to logarithmic factors.
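The epoch structure and recentered update described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's reference implementation: it assumes a synchronous generative model (one sampled next state per state-action pair per operator draw), a hypothetical two-state toy MDP, and small illustrative values of `M`, `K`, and `N0` rather than the theoretical parameter choices. The function names and the exact value-iteration baseline `exact_q` are introduced here only for the example.

```python
import random

def draw_next_states(P, rng):
    """Generative model: one sampled next state for every (state, action) pair."""
    return [[rng.choices(range(len(p)), weights=p)[0] for p in P_x] for P_x in P]

def empirical_bellman(theta, nxt, r, gamma):
    """Empirical Bellman operator built from the sampled next states `nxt`."""
    return [[r[x][u] + gamma * max(theta[nxt[x][u]]) for u in range(len(r[x]))]
            for x in range(len(r))]

def run_epoch(theta_bar, P, r, gamma, K, N, rng):
    nX, nU = len(r), len(r[0])
    # Recentering term: Monte Carlo average of N empirical operators at theta_bar.
    T_tilde = [[0.0] * nU for _ in range(nX)]
    for _ in range(N):
        est = empirical_bellman(theta_bar, draw_next_states(P, rng), r, gamma)
        for x in range(nX):
            for u in range(nU):
                T_tilde[x][u] += est[x][u] / N
    theta = [row[:] for row in theta_bar]
    for k in range(1, K + 1):
        lam = 1.0 / (1.0 + (1.0 - gamma) * k)
        nxt = draw_next_states(P, rng)  # the same sample is used at theta and theta_bar
        at_theta = empirical_bellman(theta, nxt, r, gamma)
        at_bar = empirical_bellman(theta_bar, nxt, r, gamma)
        theta = [[(1 - lam) * theta[x][u]
                  + lam * (at_theta[x][u] - at_bar[x][u] + T_tilde[x][u])
                  for u in range(nU)] for x in range(nX)]
    return theta

def variance_reduced_q_learning(P, r, gamma, M, K, N0, seed=0):
    rng = random.Random(seed)
    theta = [[0.0] * len(r[0]) for _ in r]  # initial recentering point is zero
    for m in range(1, M + 1):
        # Recentering sample size grows like 4^m; N0 is an illustrative constant.
        theta = run_epoch(theta, P, r, gamma, K, N0 * 4 ** m, rng)
    return theta

def exact_q(P, r, gamma, iters=500):
    """Exact Q-value iteration with the known model, used only as a baseline."""
    theta = [[0.0] * len(r[0]) for _ in r]
    for _ in range(iters):
        theta = [[r[x][u] + gamma * sum(p * max(theta[y]) for y, p in enumerate(P[x][u]))
                  for u in range(len(r[x]))] for x in range(len(r))]
    return theta

# Hypothetical toy MDP: P[x][u] is the next-state distribution, r[x][u] the reward.
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
r = [[0.0, 1.0], [0.5, 0.2]]
gamma = 0.5

theta_vr = variance_reduced_q_learning(P, r, gamma, M=4, K=200, N0=25)
theta_star = exact_q(P, r, gamma)
err = max(abs(a - b) for ra, rb in zip(theta_vr, theta_star) for a, b in zip(ra, rb))
print(f"sup-norm error after 4 epochs: {err:.4f}")
```

Reusing the same sampled next states at both the current iterate and the recentering point is what makes the difference term small once the iterate is close to the recentering point, so most of the remaining noise comes from the Monte Carlo recentering term.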
Experimental Results
Research Questions
- RQ1: Can a simple variance-reduction extension of Q-learning achieve minimax-optimal sample complexity for estimating the optimal Q-function in ℓ∞-norm?
- RQ2: How should epoch structure, recentering, and step sizes be designed to balance bias and variance in variance-reduced Q-learning?
- RQ3: What are the precise non-asymptotic, high-probability guarantees (convergence rate and sample complexity) for the proposed method?
- RQ4: How does the proposed method compare to existing Q-learning and Q-value iteration approaches in terms of dependence on (1−γ)?
Key Findings
- The variance-reduced Q-learning algorithm achieves geometric convergence over epochs with high probability.
- The final error after $M$ epochs satisfies $\|\theta_M - \theta^*\|_\infty \le \frac{\|\sigma(\theta^*)\|_\infty + \|\theta^*\|_\infty (1-\gamma)}{2^M}$ with probability at least $1-\delta$.
- The total number of samples needed for $\epsilon$-accuracy is $\mathcal{O}\left(\frac{D}{\epsilon^2 (1-\gamma)^3} \log\left(\frac{D}{1-\gamma}\right)\right)$, which improves on the quartic dependence of ordinary Q-learning and matches minimax lower bounds up to log factors.
- In the worst case over γ-discounted MDPs with $r_{\max}$-bounded rewards, the method attains the cubic $1/(1-\gamma)^3$ scaling, matching known minimax lower bounds up to logarithmic factors (Proposition 1).
- A refined analysis shows that starting from an initialization within $r_{\max}/\sqrt{1-\gamma}$ of $\theta^*$ yields minimax-optimal sample complexity (Proposition 1).
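To get a feel for how the sample budget splits between epoch iterations and recentering, the parameter choices can be evaluated numerically. The sketch below reads them as $K = c_1 \log\bigl(8MD/((1-\gamma)\delta)\bigr)/(1-\gamma)^3$ and $N_m = c_2\, 4^m \log(8MD/\delta)/(1-\gamma)^2$. The constants `c1 = c2 = 1` are placeholders, since the summary does not give their values, and the total count assumes a synchronous generative model in which every operator draw costs one sample per state-action pair. The helper names are hypothetical.

```python
import math

def vr_q_parameters(gamma, delta, M, D, c1=1.0, c2=1.0):
    """Epoch length K and recentering sizes N_1..N_M (c1, c2 are placeholder constants)."""
    K = math.ceil(c1 * math.log(8 * M * D / ((1 - gamma) * delta)) / (1 - gamma) ** 3)
    N = [math.ceil(c2 * 4 ** m * math.log(8 * M * D / delta) / (1 - gamma) ** 2)
         for m in range(1, M + 1)]
    return K, N

def total_samples(K, N, D):
    # Assumption: each of the K iterations and each recentering draw uses D samples.
    return D * (len(N) * K + sum(N))

K, N = vr_q_parameters(gamma=0.9, delta=0.1, M=5, D=20)
print("epoch length K:", K)
print("recentering sizes N_m:", N)
print("total samples:", total_samples(K, N, D=20))
```

Because $N_m$ grows by a factor of four per epoch while the error bound halves, the last epoch's recentering cost dominates the sum of the $N_m$, which is consistent with the $1/\epsilon^2$ dependence in the overall bound.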
This summary was generated by AI and reviewed by human editors.