QUICK REVIEW

[Paper Review] Approximate Inference and Stochastic Optimal Control

Konrad Rawlik, Marc Toussaint|arXiv (Cornell University)|Sep 20, 2010

Reinforcement Learning in Robotics | 33 references | 20 citations

TL;DR

This paper reformulates stochastic optimal control as an approximate inference problem, enabling a novel class of iterative, model-free, off-policy reinforcement learning algorithms. By leveraging a natural relaxation of the dual formulation, the method achieves convergence to near-optimal policies in both discrete and continuous control tasks, including a linear quadratic Gaussian (LQG) pendulum problem, with stable learning even from unstable initial policies.

ABSTRACT

We propose a novel reformulation of the stochastic optimal control problem as an approximate inference problem, demonstrating that such an interpretation leads to new practical methods for the original problem. In particular, we characterise a novel class of iterative solutions to the stochastic optimal control problem based on a natural relaxation of the exact dual formulation. These theoretical insights are applied to the Reinforcement Learning problem, where they lead to new model-free, off-policy methods for discrete and continuous problems.

Motivation & Objective

To develop a new theoretical framework that unifies stochastic optimal control and probabilistic inference.
To derive iterative, model-free, off-policy reinforcement learning algorithms from this reformulation.
To demonstrate practical applicability on continuous control problems, including LQG systems.
To show convergence to near-optimal policies even when starting from unstable initial policies.
To provide a generalization beyond prior work by enabling analytical solutions in continuous settings without Monte Carlo approximations.

Proposed method

Reformulates the stochastic optimal control problem as an approximate inference problem using a variational Bayesian approach.
Derives a relaxed dual formulation that enables iterative optimization via natural gradient updates.
Applies the Expectation-Maximization framework to derive a novel class of iterative solutions for control problems.
Proposes the LSΨ algorithm for continuous control, using basis functions to represent policy parameters and updating them via trajectory sampling.
Employs episodic sampling with constraints to ensure stable learning and numerical stability via variance baseline adjustment.
Uses Monte Carlo estimation of expected cost and L2 norm of policy error for evaluation.
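
The reformulation and the iterative relaxation named in the first items above can be sketched in formulas. The notation is an assumption made here for illustration, since this review gives no equations: x_t and u_t are state and control, C_t is the cost, η > 0 is a scaling constant, r_t is an auxiliary binary variable, q_π is the trajectory distribution under policy π, and π^i is the i-th policy iterate. It is one common way to write a control-as-inference objective and should be checked against the paper before being relied on.

```latex
% Cost enters as the likelihood of an auxiliary binary "success" variable
P(r_t = 1 \mid x_t, u_t) = \exp\{-\eta\, C_t(x_t, u_t)\}

% Trajectory distribution induced by a policy \pi
q_\pi(x_{0:T}, u_{0:T}) = P(x_0) \prod_{t=0}^{T} \pi(u_t \mid x_t) \prod_{t=0}^{T-1} P(x_{t+1} \mid x_t, u_t)

% Iterative scheme: each new policy is fitted to the posterior induced by the previous one
\pi^{i+1} = \arg\min_{\pi} \mathrm{KL}\big( q_\pi(x_{0:T}, u_{0:T}) \,\|\, p_{\pi^i}(x_{0:T}, u_{0:T} \mid r_{0:T} = 1) \big)
```

Read this way, low-cost trajectories are the likely ones under the posterior, and each iteration pulls the policy's trajectory distribution toward them.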

Experimental results

Research questions

RQ1: Can stochastic optimal control be exactly reformulated as an approximate inference problem without additional assumptions?
RQ2: How can the dual formulation of stochastic control be relaxed to yield practical iterative solution methods?
RQ3: Can this reformulation lead to new model-free, off-policy reinforcement learning algorithms for both discrete and continuous problems?
RQ4: What are the convergence properties of the resulting algorithms when initialized with unstable policies?
RQ5: Can analytical solutions be derived in continuous control settings, avoiding costly numerical or Monte Carlo approximations?

Key findings

The LSΨ algorithm successfully learns near-optimal policy gains in a continuous LQG pendulum control problem, as shown by a decreasing L2 norm of policy error over time.
Expected cost under the LSΨ policy converges toward the optimal value, with performance comparable to state-of-the-art methods despite starting from a substantially worse initial policy.
The algorithm stabilizes the system after approximately 600–700 episodes, evidenced by increasing episode lengths, even though the initial policy was unstable.
The method achieves convergence without requiring the initial policy to be stable or the cost function to be discounted, unlike prior approaches.
The use of basis functions allows analytical updates in the continuous case, reducing reliance on computationally expensive Monte Carlo methods.
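
The two evaluation quantities mentioned in the findings, the L2 norm of the policy error and a Monte Carlo estimate of expected cost, can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration: the policy is taken to be linear feedback u = -Kx, the reference gain is computed by a plain Riccati iteration, and the matrices A, B, Q, R describe a hypothetical unstable two-state system, not the paper's pendulum. The function names are hypothetical as well.

```python
import numpy as np

def lqr_gain(A, B, Q, R, iters=500):
    """Infinite-horizon discrete-time LQR gain via Riccati iteration."""
    P = Q.copy()
    for _ in range(iters):
        # K = (R + B'PB)^{-1} B'PA, then P = Q + A'P(A - BK)
        K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
        P = Q + A.T @ P @ (A - B @ K)
    return K

def policy_error(K, K_ref):
    """L2 (Frobenius) norm of the difference between two gain matrices."""
    return np.linalg.norm(K - K_ref)

def mc_expected_cost(K, A, B, Q, R, x0, horizon, noise_std, n_rollouts, rng):
    """Average quadratic cost of u = -Kx over sampled noisy rollouts."""
    total = 0.0
    for _ in range(n_rollouts):
        x = x0.copy()
        for _ in range(horizon):
            u = -K @ x
            total += x @ Q @ x + u @ R @ u
            x = A @ x + B @ u + noise_std * rng.standard_normal(x.shape)
    return total / n_rollouts

# Hypothetical system (assumption): open-loop unstable, controllable.
A = np.array([[1.0, 0.1], [0.1, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
R = 0.1 * np.eye(1)
x0 = np.array([1.0, 0.0])

K_star = lqr_gain(A, B, Q, R)
K_zero = np.zeros_like(K_star)  # "no control", which leaves the system unstable

cost_star = mc_expected_cost(K_star, A, B, Q, R, x0, 50, 0.01, 20,
                             np.random.default_rng(0))
cost_zero = mc_expected_cost(K_zero, A, B, Q, R, x0, 50, 0.01, 20,
                             np.random.default_rng(0))
print("policy error of zero gain:", policy_error(K_zero, K_star))
print("estimated cost (reference gain):", cost_star)
print("estimated cost (zero gain):", cost_zero)
```

Tracking both quantities over training episodes gives curves of the kind the findings describe: the gain error shrinking and the estimated cost approaching the reference value.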

This review was created by AI and reviewed by human editors.