[Paper Review] Neural Policy Gradient Methods: Global Optimality and Rates of Convergence
The paper proves global optimality and sublinear convergence rates for neural policy gradient methods in overparameterized two-layer networks, highlighting the importance of compatibility between actor and critic.
Policy gradient methods with actor-critic schemes demonstrate tremendous empirical successes, especially when the actors and critics are parameterized by neural networks. However, it remains less clear whether such "neural" policy gradient methods converge to globally optimal policies and whether they even converge at all. We answer both the questions affirmatively in the overparameterized regime. In detail, we prove that neural natural policy gradient converges to a globally optimal policy at a sublinear rate. Also, we show that neural vanilla policy gradient converges sublinearly to a stationary point. Meanwhile, by relating the suboptimality of the stationary points to the representation power of neural actor and critic classes, we prove the global optimality of all stationary points under mild regularity conditions. Particularly, we show that a key to the global optimality and convergence is the "compatibility" between the actor and critic, which is ensured by sharing neural architectures and random initializations across the actor and critic. To the best of our knowledge, our analysis establishes the first global optimality and convergence guarantees for neural policy gradient methods.
Motivation & Objective
- Clarify which theoretical guarantees hold for neural policy gradient methods in actor-critic settings.
- Analyze convergence and optimality under overparameterization with shared architectures.
- Establish rates of convergence for vanilla and natural policy gradient methods.
- Show that compatibility between actor and critic, ensured by sharing neural architectures and random initializations, is key to convergence and global optimality.
Proposed method
- Represent policy as a two-layer neural network with ReLU activations and a softmax over actions (energy-based form).
- Use TD(0) with independent sampling for the critic to estimate policy gradients.
- Analyze two settings: vanilla policy gradient (gradient ascent) and natural policy gradient (Fisher information-based update).
- Prove a 1/√T convergence rate, measured in the expected squared norm of the policy gradient, for neural vanilla policy gradient.
- Prove a 1/√T convergence rate to a globally optimal policy for neural natural policy gradient under KL regularization.
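The ingredients listed above (a two-layer ReLU network, an energy-based softmax policy, a TD(0) critic, and an initialization shared by actor and critic) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the paper's algorithm: the 1/√m output scaling, the Gaussian and random-sign initialization, the temperature `tau`, the step size, and all function names are choices made here for the example.

```python
import numpy as np

def init_two_layer(m, d, rng):
    """Random init (assumed scheme): rows of W ~ N(0, I/d), fixed output signs b in {-1, +1}."""
    W = rng.normal(scale=1.0 / np.sqrt(d), size=(m, d))
    b = rng.choice([-1.0, 1.0], size=m)
    return W, b

def f(W, b, x):
    """Two-layer ReLU network: f(x) = (1/sqrt(m)) * sum_r b_r * relu(W_r . x)."""
    m = W.shape[0]
    return b @ np.maximum(W @ x, 0.0) / np.sqrt(m)

def grad_f(W, b, x):
    """Gradient of f with respect to W, shape (m, d)."""
    m = W.shape[0]
    active = (W @ x > 0.0).astype(float)  # ReLU derivative per hidden unit
    return (b * active)[:, None] * x[None, :] / np.sqrt(m)

def policy(W, b, state, actions, tau=1.0):
    """Energy-based policy: pi(a | s) proportional to exp(tau * f(s, a))."""
    logits = np.array([tau * f(W, b, np.concatenate([state, a])) for a in actions])
    logits -= logits.max()  # subtract the max for numerical stability
    p = np.exp(logits)
    return p / p.sum()

def td0_step(W_critic, b, x, reward, x_next, gamma=0.9, lr=0.1):
    """One semi-gradient TD(0) update of the critic Q(x) = f(W_critic, b, x)."""
    delta = f(W_critic, b, x) - reward - gamma * f(W_critic, b, x_next)
    return W_critic - lr * delta * grad_f(W_critic, b, x)

rng = np.random.default_rng(0)
state_dim, action_dim, m = 3, 2, 64
W0, b = init_two_layer(m, state_dim + action_dim, rng)

# Actor and critic share the architecture, the output signs b, and the
# initial weights W0, mirroring the shared-initialization idea in the review.
W_actor, W_critic = W0.copy(), W0.copy()

actions = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # two one-hot actions
state = rng.normal(size=state_dim)
probs = policy(W_actor, b, state, actions)
```

Because both networks start from the same `W0` and `b`, the actor's energy function and the critic's value estimate initially coincide on every state-action input; this only illustrates the shared setup and does not reproduce the paper's compatibility argument.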
Experimental results
Research questions
- RQ1: Do neural policy gradient methods converge to globally optimal policies under overparameterization?
- RQ2: What are the convergence rates for neural policy gradient and neural natural policy gradient in actor-critic settings?
- RQ3: How does compatibility between actor and critic (shared architecture and initialization) affect convergence and optimality?
- RQ4: Can stationary points of neural policy gradient be globally optimal under mild regularity conditions?
Key findings
- Neural vanilla policy gradient converges to a stationary point at a 1/√T rate in the squared gradient norm.
- Neural natural policy gradient converges to a globally optimal policy at a 1/√T rate in the total reward.
- All stationary points are globally optimal under mild regularity conditions; their suboptimality is controlled by the representation power of the neural actor and critic classes.
- Global guarantees rely on a compatibility notion between actor and critic achieved via shared architectures and random initializations.
- The analysis covers overparameterized two-layer networks with TD(0) critic in an independent-sampling setting.
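Read schematically, the two rate statements in the findings above have the following shape. This is a sketch, not the paper's exact theorems: the objective $J$, the optimal policy $\pi^*$, the constants $C_1, C_2$, and the residual terms $\varepsilon_1, \varepsilon_2$ (standing in for critic and approximation errors) are notation assumed here for illustration.

```latex
% Vanilla policy gradient: stationarity measured by the squared gradient norm
\min_{0 \le t < T} \mathbb{E}\!\left[\big\|\nabla_\theta J(\pi_{\theta_t})\big\|_2^2\right]
  \;\le\; \frac{C_1}{\sqrt{T}} + \varepsilon_1

% Natural policy gradient: optimality gap in total reward
\min_{0 \le t < T} \mathbb{E}\!\left[J(\pi^*) - J(\pi_{\theta_t})\right]
  \;\le\; \frac{C_2}{\sqrt{T}} + \varepsilon_2
```

The first bound only certifies an approximate stationary point; the second bounds the gap to the optimal total reward directly, which is why the natural-gradient result is the stronger of the two.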
This review was created by AI and reviewed by human editors.