QUICK REVIEW

[Paper Review] A Two-Timescale Framework for Bilevel Optimization: Complexity Analysis and Application to Actor-Critic

Mingyi Hong, Hoi To Wai | arXiv (Cornell University) | Jul 10, 2020

Adaptive Dynamic Programming Control | References: 62 | Cited by: 52

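One-Sentence Summary

Introduces a two-timescale stochastic approximation (TTSA) algorithm for bilevel optimization with an unconstrained, strongly convex inner problem and a smooth outer objective, derives its convergence rates, and applies TTSA to a two-timescale natural actor-critic policy optimization method, for which it also establishes a rate.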

ABSTRACT

This paper analyzes a two-timescale stochastic algorithm framework for bilevel optimization. Bilevel optimization is a class of problems which exhibit a two-level structure, and its goal is to minimize an outer objective function with variables which are constrained to be the optimal solution to an (inner) optimization problem. We consider the case when the inner problem is unconstrained and strongly convex, while the outer problem is constrained and has a smooth objective function. We propose a two-timescale stochastic approximation (TTSA) algorithm for tackling such a bilevel problem. In the algorithm, a stochastic gradient update with a larger step size is used for the inner problem, while a projected stochastic gradient update with a smaller step size is used for the outer problem. We analyze the convergence rates for the TTSA algorithm under various settings: when the outer problem is strongly convex (resp. weakly convex), the TTSA algorithm finds an $\mathcal{O}(K^{-2/3})$-optimal (resp. $\mathcal{O}(K^{-2/5})$-stationary) solution, where $K$ is the total iteration number. As an application, we show that a two-timescale natural actor-critic proximal policy optimization algorithm can be viewed as a special case of our TTSA framework. Importantly, the natural actor-critic algorithm is shown to converge at a rate of $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward compared to a globally optimal policy.
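A minimal sketch of the TTSA loop described in the abstract: a noisy gradient step on the inner variable with the larger step size, then a projected noisy surrogate-gradient step on the outer variable with the smaller one. It assumes a hypothetical scalar toy problem that is not from the paper: inner $g(x,y) = y^2 - xy$ (so $y^\star(x) = x/2$), outer $f(x,y) = \tfrac12 (y-3)^2 + \tfrac14 x^2$, with $x$ constrained to $[-10, 10]$. The schedules $\alpha_k = 2/(k+10)$ and $\beta_k = (k+10)^{-2/3}$, the Gaussian gradient noise, and the iteration count are illustrative choices, not the paper's constants.

```python
import random

# Hypothetical toy bilevel problem (not from the paper):
#   inner:  g(x, y) = (MU/2) * y**2 - C * x * y   -> y*(x) = C * x / MU
#   outer:  f(x, y) = 0.5 * (y - B)**2 + (LAM/2) * x**2,  x constrained to [LO, HI]
MU, C, B, LAM = 2.0, 1.0, 3.0, 0.5
LO, HI = -10.0, 10.0


def grad_y_g(x, y):
    """Inner gradient d/dy g(x, y)."""
    return MU * y - C * x


def surrogate_grad(x, y):
    """grad_x f - grad_xy g * (grad_yy g)^{-1} * grad_y f for this toy problem."""
    grad_x_f = LAM * x
    grad_xy_g = -C  # mixed second derivative of g
    grad_yy_g = MU  # inner Hessian (strong convexity modulus)
    grad_y_f = y - B
    return grad_x_f - grad_xy_g / grad_yy_g * grad_y_f


def project(x):
    """Euclidean projection onto the interval [LO, HI]."""
    return min(max(x, LO), HI)


def ttsa(num_iters=5000, noise=0.1, seed=0):
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    for k in range(num_iters):
        # Illustrative schedules: beta_k > alpha_k for every k, and beta_k decays more slowly.
        alpha = 2.0 / (k + 10)
        beta = 1.0 / (k + 10) ** (2.0 / 3.0)
        # Inner step: noisy gradient descent on g(x, .) with the larger step size.
        y = y - beta * (grad_y_g(x, y) + noise * rng.gauss(0.0, 1.0))
        # Outer step: projected noisy surrogate-gradient step using the fresh y.
        x = project(x - alpha * (surrogate_grad(x, y) + noise * rng.gauss(0.0, 1.0)))
    return x, y


x_final, y_final = ttsa()
print(x_final, y_final)  # expected to be close to x* = 2.0 and y*(x*) = 1.0
```

For this toy problem $\ell(x) = f(x, x/2)$ has gradient $0.75x - 1.5$, so the iterates should settle near $x = 2$, $y = 1$. Because $\beta_k/\alpha_k$ grows with $k$, $y$ moves on the faster timescale, which is what lets it track $y^\star(x)$ while $x$ drifts slowly.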

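Motivation and Objectives

Motivate and formalize bilevel optimization with a strongly convex inner problem and a smooth outer objective.
Propose a single-loop TTSA algorithm that updates the inner and outer variables on different timescales.
Establish convergence rates for TTSA when the outer objective is strongly convex, convex, or weakly convex.
Use implicit differentiation through the inner solution to construct a surrogate gradient for the outer objective.
Demonstrate an application to reinforcement learning through a two-timescale natural actor-critic PPO framework.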
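The problem class in the list above, and the implicit-differentiation step behind the surrogate gradient, can be written out as follows. The symbols $X$ (outer constraint set), $\ell$, $d_2$, and the matrix-calculus conventions are notation chosen for this sketch; the derivation only uses strong convexity of $g(x,\cdot)$, which makes $\nabla_{yy}^2 g$ invertible.

```latex
% Bilevel problem: constrained outer level, unconstrained strongly convex inner level
\min_{x \in X} \; \ell(x) := f\bigl(x, y^\star(x)\bigr),
\qquad
y^\star(x) = \operatorname*{arg\,min}_{y \in \mathbb{R}^{d_2}} g(x, y).

% First-order optimality of the inner problem holds for every x:
\nabla_y g\bigl(x, y^\star(x)\bigr) = 0 .

% Differentiate in x (implicit function theorem; \nabla_{yy}^2 g \succ 0 by strong convexity):
\nabla_{yx}^2 g + \nabla_{yy}^2 g \, \frac{\partial y^\star(x)}{\partial x} = 0
\;\Longrightarrow\;
\frac{\partial y^\star(x)}{\partial x} = -\bigl[\nabla_{yy}^2 g\bigr]^{-1} \nabla_{yx}^2 g .

% Chain rule, all derivatives at (x, y^\star(x)), using \nabla_{xy}^2 g = (\nabla_{yx}^2 g)^\top:
\nabla \ell(x)
= \nabla_x f + \Bigl(\frac{\partial y^\star(x)}{\partial x}\Bigr)^{\!\top} \nabla_y f
= \nabla_x f - \nabla_{xy}^2 g \,\bigl[\nabla_{yy}^2 g\bigr]^{-1} \nabla_y f .
```

Evaluating the same expression at the current iterate $y$ instead of $y^\star(x)$ gives the surrogate $\overline{\nabla}_x f(x,y)$, which is why the quality of the outer update depends on how well $y$ tracks $y^\star(x)$.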

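Proposed Method

Build TTSA so that $y$ is updated with a larger step size and $x$ with a smaller one, ensuring that $y$ tracks $y^\star(x)$ as $x$ changes.
Replace the outer gradient with a surrogate evaluated at the current $y$: $\overline{\nabla}_x f(x,y) = \nabla_x f(x,y) - \nabla_{xy}^2 g(x,y)\,[\nabla_{yy}^2 g(x,y)]^{-1}\,\nabla_y f(x,y)$; at $y = y^\star(x)$ this equals the gradient of $\ell(x) = f(x, y^\star(x))$.
Work with stochastic estimates of the gradients and of the Hessian/Jacobian matrices whose bias and variance are controlled (Assumptions 3 and 7).
Propose a gradient estimator $h_f^k$, built from random samples, that approximates $\overline{\nabla}_x f$ by exploiting the strong convexity of the inner problem.
Analyze coupled inequalities for the outer iterates and the tracking error $\Delta_y^k$ to establish convergence rates for both the outer and inner recursions.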
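The surrogate-gradient formula can be checked numerically. The sketch below assumes a hypothetical quadratic instance that is not from the paper: $g(x,y) = \tfrac12 y^\top A y - y^\top B x$ with $A$ symmetric positive definite (so $y^\star(x) = A^{-1}Bx$), and $f(x,y) = \tfrac12\|y-b\|^2 + \tfrac{\lambda}{2}\|x\|^2$. It evaluates $\overline{\nabla}_x f$ at $y = y^\star(x)$ and compares it with a central finite difference of $\ell(x) = f(x, y^\star(x))$. It uses exact Hessians, so it illustrates the formula rather than the stochastic estimator $h_f^k$.

```python
import numpy as np

# Hypothetical quadratic instance (not from the paper):
#   g(x, y) = 0.5 * y^T A y - y^T B x   (A symmetric positive definite -> strongly convex in y)
#   f(x, y) = 0.5 * ||y - b||^2 + 0.5 * lam * ||x||^2
rng = np.random.default_rng(0)
d1, d2, lam = 3, 4, 0.1
M = rng.standard_normal((d2, d2))
A = M @ M.T + d2 * np.eye(d2)
B = rng.standard_normal((d2, d1))
b = rng.standard_normal(d2)


def y_star(x):
    """Exact inner solution of grad_y g = A y - B x = 0."""
    return np.linalg.solve(A, B @ x)


def surrogate_grad(x, y):
    """grad_x f - grad_xy^2 g [grad_yy^2 g]^{-1} grad_y f, evaluated at (x, y)."""
    grad_x_f = lam * x
    grad_y_f = y - b
    hess_xy_g = -B.T  # (d1 x d2) mixed second derivatives of g
    hess_yy_g = A     # (d2 x d2) inner Hessian
    # Solve a linear system rather than forming the inverse explicitly.
    return grad_x_f - hess_xy_g @ np.linalg.solve(hess_yy_g, grad_y_f)


def ell(x):
    """Outer objective with the inner problem solved exactly: f(x, y*(x))."""
    y = y_star(x)
    return 0.5 * np.sum((y - b) ** 2) + 0.5 * lam * np.sum(x ** 2)


def finite_diff_grad(fun, x, eps=1e-6):
    """Central finite-difference gradient of a scalar function."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        grad[i] = (fun(x + e) - fun(x - e)) / (2 * eps)
    return grad


x = rng.standard_normal(d1)
exact = surrogate_grad(x, y_star(x))
numeric = finite_diff_grad(ell, x)
print(np.max(np.abs(exact - numeric)))  # only finite-difference/rounding error remains
```

Evaluating `surrogate_grad` at some $y \neq y^\star(x)$ in this instance shifts the result by $B^\top A^{-1}(y - y^\star(x))$, a bias proportional to the tracking error, which is the quantity the two-timescale step sizes are meant to keep small.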

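Results

Research Questions

RQ1: Can the single-loop TTSA algorithm converge on bilevel problems with a strongly convex inner problem and a smooth outer objective?
RQ2: What are the convergence rates of TTSA when the outer objective is strongly convex, convex, or weakly convex?
RQ3: How do the two-timescale dynamics affect the tracking error and practical convergence?
RQ4: Can TTSA be applied effectively within reinforcement learning frameworks such as actor-critic methods?
RQ5: Which surrogate-gradient forms make the outer-objective gradient in TTSA practical to compute?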

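Key Findings

With a strongly convex outer objective and diminishing step sizes, TTSA reaches an $\mathcal{O}(K^{-2/3})$-optimal solution, where $K$ is the total number of iterations.
With a weakly convex outer objective, TTSA reaches an $\mathcal{O}(K^{-2/5})$-stationary solution.
With a convex outer objective and suitably chosen step sizes, TTSA achieves an $\mathcal{O}(K^{-1/4})$ rate for the outer problem and an $\mathcal{O}(K^{-1/2})$ rate for the inner problem.
The implicit-differentiation-based surrogate gradient admits a nearly unbiased stochastic estimator whose bias and variance are controlled.
Applied to two-timescale natural actor-critic PPO, the framework converges at rate $\mathcal{O}(K^{-1/4})$ in terms of the gap in expected discounted reward relative to a globally optimal policy.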
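The finding that the surrogate-gradient estimator has controlled bias and variance depends on how the factor $[\nabla_{yy}^2 g]^{-1}\nabla_y f$ is estimated from samples; this summary does not spell out the construction. One standard choice in the stochastic bilevel literature, assumed here rather than confirmed by this summary, is a truncated Neumann series $H^{-1}v \approx \tfrac{1}{L}\sum_{i=0}^{t-1}(I - H/L)^i v$ for a Hessian $H$ with eigenvalues in $[\mu, L]$, together with a randomized version that draws $p$ uniformly from $\{0,\dots,t-1\}$ and multiplies $p$ independent noisy Hessian samples. The sketch below checks that the truncation bias obeys $\|H^{-1}v - \text{approx}\| \le (1-\mu/L)^t\|v\|/\mu$ and that the randomized estimator averages to the truncated series. Here $H$ stands in for $\nabla_{yy}^2 g$ and $v$ for $\nabla_y f$; the matrix, the noise level, and all function names are hypothetical.

```python
import numpy as np


def truncated_neumann(H, v, L, t):
    """Deterministic truncated Neumann series: (1/L) * sum_{i=0}^{t-1} (I - H/L)^i v, approx. H^{-1} v."""
    out = np.zeros_like(v)
    term = v.copy()
    for _ in range(t):
        out += term
        term = term - H @ term / L
    return out / L


def randomized_neumann(sample_hessian, v, L, t, rng):
    """One stochastic sample whose mean equals truncated_neumann(E[H_i], v, L, t).

    Draws p uniformly from {0, ..., t-1} and applies p independent factors (I - H_i / L).
    """
    p = rng.integers(0, t)  # upper bound is exclusive, so p is in {0, ..., t-1}
    out = v.copy()
    for _ in range(p):
        H_i = sample_hessian(rng)
        out = out - H_i @ out / L
    return (t / L) * out


rng = np.random.default_rng(0)
H = np.diag([1.0, 1.5, 2.0, 3.0, 4.0])  # hypothetical inner Hessian, eigenvalues in [mu, L]
mu, L, t = 1.0, 4.0, 20
v = np.ones(5)


def sample_hessian(rng):
    """Hypothetical noisy Hessian oracle: symmetric zero-mean perturbation of H."""
    E = 0.05 * rng.standard_normal(H.shape)
    return H + (E + E.T) / 2


exact = np.linalg.solve(H, v)
approx = truncated_neumann(H, v, L, t)
bias_bound = (1 - mu / L) ** t * np.linalg.norm(v) / mu
print(np.linalg.norm(exact - approx), "<=", bias_bound)

samples = [randomized_neumann(sample_hessian, v, L, t, rng) for _ in range(4000)]
print(np.max(np.abs(np.mean(samples, axis=0) - approx)))  # Monte Carlo error only
```

Increasing $t$ shrinks the bias geometrically, while the $t/L$ scaling in the randomized estimator raises its variance; this is the bias/variance trade-off that a choice of truncation length has to balance.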

This review was generated by AI and checked by human editors.