QUICK REVIEW

[Paper Review] HJB Optimal Feedback Control with Deep Differential Value Functions and Action Constraints

Michael Lutter, Boris Belousov|arXiv (Cornell University)|Jan 1, 2019

Reinforcement Learning in Robotics|Cited by 6

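One-Sentence Summary

This paper proposes deep optimal feedback control, which embeds a deep differential network in the Hamilton-Jacobi-Bellman (HJB) equation to learn a feedback policy for continuous-time robotic systems that is globally optimal and stable on the training domain. A strictly convex action cost enforces the action limits, and the discounting is decreased during training from short-sighted to far-sighted; the learned value function then yields optimal trajectories from any initial state without replanning, and on non-linear systems the method achieves optimal feedback control where standard optimal control methods require frequent replanning.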

ABSTRACT

Learning optimal feedback control laws capable of executing optimal trajectories is essential for many robotic applications. Such policies can be learned using reinforcement learning or planned using optimal control. While reinforcement learning is sample inefficient, optimal control only plans an optimal trajectory from a specific starting configuration. In this paper we propose deep optimal feedback control to learn an optimal feedback policy rather than a single trajectory. By exploiting the inherent structure of the robot dynamics and strictly convex action cost, we can derive principled cost functions such that the optimal policy naturally obeys the action limits, is globally optimal and stable on the training domain given the optimal value function. The corresponding optimal value function is learned end-to-end by embedding a deep differential network in the Hamilton-Jacobi-Bellman differential equation and minimizing the error of this equality while simultaneously decreasing the discounting from short- to far-sighted to enable the learning. Our proposed approach enables us to learn an optimal feedback control law in continuous time that, in contrast to existing approaches, generates an optimal trajectory from any point in state-space without the need for replanning. The resulting approach is evaluated on non-linear systems and achieves optimal feedback control, where standard optimal control methods require frequent replanning.

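Motivation and Objectives

Develop a feedback control policy that is globally optimal and stable on the entire training domain, rather than planning a single optimal trajectory.
Address the limitation that standard optimal control methods must replan frequently whenever the initial condition departs from the planned trajectory.
Embed a deep neural network in the HJB partial differential equation to learn the optimal value function end-to-end while satisfying the action constraints.
Achieve continuous-time optimal control by learning a feedback law that generates optimal trajectories from any initial state in the state space.
Minimize the HJB error while decreasing the discounting from short-sighted to far-sighted, with the aim of improving sample efficiency and generalization.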

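Proposed Method

The method formulates the optimal control problem with the Hamilton-Jacobi-Bellman (HJB) equation, which characterizes the optimal value function of a continuous-time system.
A deep differential network is embedded in the HJB equation to parameterize the optimal value function, enabling end-to-end learning by gradient descent.
Action constraints are imposed through a strictly convex action cost term in the cost function, so that the optimal policy naturally obeys the actuator limits.
The training objective minimizes the residual of the HJB equation over the state space while the discounting is decreased, moving the optimization from a short horizon to a long horizon.
The architecture is differentiable, so gradients can be backpropagated through the HJB equation and the value function, together with the policy derived from it, is optimized end-to-end.
The resulting policy is a feedback law derived from the gradient of the learned value function; given the optimal value function, it is globally optimal and stable on the training domain.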
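
The points above can be written out as a short derivation. This is a sketch in generic notation, not the paper's own: the control-affine dynamics a(x) + B(x)u, the separable cost q(x) + g(u), the discount rate \rho, the tangent-type example cost with scale \beta, and the squared-residual loss are all assumptions chosen for illustration.

```latex
% Assumed setup: control-affine dynamics, separable cost, discount rate \rho > 0.
\dot{x} = a(x) + B(x)\,u, \qquad c(x,u) = q(x) + g(u), \quad g \text{ strictly convex}

% Discounted HJB equation for the optimal value function V^*:
\rho\, V^*(x) = \min_{u} \Big[ q(x) + g(u) + \nabla_x V^*(x)^{\top} \big( a(x) + B(x)\,u \big) \Big]

% First-order condition in u. With \tilde{g} the convex conjugate of g,
% \nabla \tilde{g} = (\nabla g)^{-1}, so the feedback law is available in closed form:
\nabla g(u^*) = -B(x)^{\top} \nabla_x V^*(x)
\;\Longrightarrow\;
u^*(x) = \nabla \tilde{g}\big( -B(x)^{\top} \nabla_x V^*(x) \big)

% Example (assumed) barrier-type cost for a scalar action with limit |u| < u_{\max}:
g(u) = -\frac{2 \beta u_{\max}}{\pi} \log \cos\!\Big( \frac{\pi u}{2 u_{\max}} \Big)
\;\Longrightarrow\;
\nabla g(u) = \beta \tan\!\Big( \frac{\pi u}{2 u_{\max}} \Big), \qquad
u^*(x) = \frac{2 u_{\max}}{\pi} \arctan\!\Big( -\tfrac{1}{\beta}\, B(x)^{\top} \nabla_x V^*(x) \Big)

% Assumed training loss for a network V_\theta: squared HJB residual on sampled states.
\mathcal{L}(\theta) = \mathbb{E}_{x}\Big[ \big( \rho V_\theta(x) - q(x) - g(u^*_\theta(x))
  - \nabla_x V_\theta(x)^{\top} ( a(x) + B(x)\,u^*_\theta(x) ) \big)^2 \Big]
```

Because the arctangent is bounded by \pi/2 in magnitude, the example policy stays inside the action limits for any value-function gradient. A large \rho weights near-term cost, and lowering it during training makes the objective progressively more far-sighted.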

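Experimental Results

Research Questions

RQ1: Can a deep neural network be embedded effectively in the HJB equation to learn, end-to-end, an optimal feedback policy that satisfies the action constraints?
RQ2: Does the proposed method achieve globally optimal and stable control on the entire training domain without replanning?
RQ3: Can the discount schedule balance short-term and long-term optimization during training and thereby improve how well the policy generalizes?
RQ4: Compared with standard optimal control methods, how does the feedback control law perform in terms of trajectory optimality and robustness to changes in the initial conditions?
RQ5: How well does the method generalize to new initial states without replanning, and how does it maintain optimality on non-linear systems?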

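Key Findings

The proposed method learns a feedback control policy that is globally optimal and stable on the entire training domain, with no replanning from new initial states.
The deep differential network embedded in the HJB equation effectively minimizes the residual error, yielding a close approximation of the optimal value function.
With a strictly convex action cost, the policy naturally obeys the actuator limits, which keeps it physically realizable.
Decreasing the discounting from short-sighted to far-sighted lets the network learn both short-term and long-term cost contributions, which improves convergence and generalization.
On non-linear systems the method achieves optimal feedback control, whereas standard optimal control methods need frequent replanning to maintain performance.
The resulting policy generates optimal trajectories from any point in the state space, showing robustness and generalization beyond a single initial trajectory.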
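
Two of these findings can be checked by hand on a toy problem. The sketch below is not the paper's experiment: it assumes a scalar linear system dx/dt = a*x + b*u with cost q*x^2 + r*u^2 and discount rate rho, for which the discounted HJB equation has the exact solution V(x) = p*x^2, and it assumes an arctangent-shaped policy as one example of what a barrier-type convex action cost produces. All function and parameter names are hypothetical.

```python
import math


def lq_value_coefficient(a, b, q, r, rho):
    """Positive root p of (b^2/r) p^2 + (rho - 2a) p - q = 0, so that V(x) = p x^2."""
    k = b * b / r
    c = rho - 2.0 * a
    return (-c + math.sqrt(c * c + 4.0 * k * q)) / (2.0 * k)


def hjb_residual(x, p, a, b, q, r, rho):
    """Residual rho*V(x) - min_u [q x^2 + r u^2 + V'(x) (a x + b u)] for V(x) = p x^2."""
    value = p * x * x
    grad = 2.0 * p * x
    # Minimizer of the quadratic in u: 2 r u + b V'(x) = 0.
    u_star = -b * grad / (2.0 * r)
    hamiltonian = q * x * x + r * u_star ** 2 + grad * (a * x + b * u_star)
    return rho * value - hamiltonian


def bounded_action(w, u_max, beta=1.0):
    """Assumed arctan policy u = (2 u_max / pi) * atan(w / beta), with w = -B^T grad V."""
    return (2.0 * u_max / math.pi) * math.atan(w / beta)
```

For example, `hjb_residual(x, lq_value_coefficient(1.0, 1.0, 1.0, 0.5, 0.1), 1.0, 1.0, 1.0, 0.5, 0.1)` vanishes up to rounding for any `x`, while a perturbed coefficient leaves a residual that grows with `x**2`; this is the quantity a learned value function would be trained to drive to zero. `bounded_action` stays strictly inside `(-u_max, u_max)` however large the value gradient becomes, which is the sense in which the policy obeys the action limits by construction.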

This review was generated by AI and reviewed by a human editor.