QUICK REVIEW

[Paper Review] Verifiable Reinforcement Learning via Policy Extraction

Osbert Bastani, Yewen Pu|arXiv (Cornell University)|May 22, 2018

Reinforcement Learning in Robotics|References: 33|Cited by: 115

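One-Sentence Summary

This paper proposes VIPER, a method that extracts compact, verifiable decision tree policies from a high-performing DNN policy (the oracle) and its Q-function, so that the safety, robustness, and stability of policies for reinforcement learning tasks can be verified efficiently.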

ABSTRACT

While deep reinforcement learning has successfully solved many challenging control tasks, its real-world applicability has been limited by the inability to ensure the safety of learned policies. We propose an approach to verifiable reinforcement learning by training decision tree policies, which can represent complex policies (since they are nonparametric), yet can be efficiently verified using existing techniques (since they are highly structured). The challenge is that decision tree policies are difficult to train. We propose VIPER, an algorithm that combines ideas from model compression and imitation learning to learn decision tree policies guided by a DNN policy (called the oracle) and its Q-function, and show that it substantially outperforms two baselines. We use VIPER to (i) learn a provably robust decision tree policy for a variant of Atari Pong with a symbolic state space, (ii) learn a decision tree policy for a toy game based on Pong that provably never loses, and (iii) learn a provably stable decision tree policy for cart-pole. In each case, the decision tree policy achieves performance equal to that of the original DNN policy.

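Motivation and Objectives

Motivate the need for verifiable policies in safety-critical reinforcement learning settings.
Develop a policy extraction pipeline that produces verifiable, nonparametric decision trees from deep policies.
Outperform prior imitation learning baselines in sample efficiency and policy size by exploiting the Q-function.
Demonstrate verifiability through correctness, robustness, and stability analyses across multiple tasks.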

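Proposed Method

Define Q-DAGGER, an imitation learning algorithm that uses the oracle's Q-function to guide training.
Introduce VIPER, which extracts decision tree policies by resampling the data with weights based on a convex loss surrogate and training the trees with CART.
Give a theoretical comparison showing that Q-DAGGER admits a tighter performance bound than prior work.
Apply VIPER to extract compact trees that achieve optimal or perfect reward on the selected tasks.
Adapt verification techniques to check the correctness (toy Pong), robustness (Atari Pong), and stability (cart-pole) of the extracted trees.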
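The Q-weighted resampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and the choice V(s) = max_a Q(s, a) (which assumes a greedy oracle) are assumptions made for illustration. Each state is weighted by how much reward is lost by taking the worst action there, so states where the action choice matters are drawn more often.

```python
import random

def viper_weights(q_values):
    """Weight per state: V(s) - min_a Q(s, a), assuming V(s) = max_a Q(s, a)."""
    return [max(q) - min(q) for q in q_values]

def resample(states, actions, q_values, n, rng):
    """Draw n (state, action) pairs with probability proportional to the weight."""
    weights = viper_weights(q_values)
    if sum(weights) <= 0:
        # Assumption: fall back to uniform sampling if no state is critical.
        weights = [1.0] * len(states)
    idx = rng.choices(range(len(states)), weights=weights, k=n)
    return [states[i] for i in idx], [actions[i] for i in idx]

# Hypothetical toy data: three states, oracle actions, and oracle Q-values.
states = [(0.0, 1.0), (0.5, -1.0), (1.0, 0.0)]
actions = [0, 1, 1]
q_values = [[1.0, 1.0], [0.0, 5.0], [2.0, 3.0]]

rng = random.Random(0)
xs, ys = resample(states, actions, q_values, n=200, rng=rng)
# The first state has weight 0 (both actions are equally good), so it is never drawn.
```

In a full pipeline the resampled pairs would then be passed to a CART learner (for example scikit-learn's `DecisionTreeClassifier`), and the loop would repeat with states visited by the newly trained tree and relabeled by the oracle.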

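Experimental Results

Research Questions

RQ1: Can decision tree policies learned from a DNN oracle match the performance of the original policy?
RQ2: Does exploiting the Q-function in imitation learning yield smaller, more readily verifiable policies than DAGGER?
RQ3: Can the correctness, robustness, and stability of the extracted decision tree policies be verified efficiently on benchmark tasks?
RQ4: What trade-offs exist among policy size, verifiability, and achieved reward in these settings?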

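Key Findings

VIPER learns relatively small decision trees (<1000 nodes) that achieve perfect or near-perfect reward on Atari Pong (symbolic state space), a toy game based on Pong, and cart-pole.
Compared with DAGGER, VIPER produces substantially smaller trees (e.g., 31-769 nodes rather than thousands) while matching the oracle's performance.
VIPER enables more efficient verification of correctness, robustness, and stability than methods that work directly on DNN policies.
In Atari Pong, the tree extracted by VIPER achieves perfect reward; robustness measured at several sampled states yields quantifiable bounds.
In cart-pole, a small tree achieves perfect reward, and an SOS-based method verifies stability near the origin for a fifth-order Taylor model.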
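To illustrate why a decision tree's structure makes robustness easy to quantify, the sketch below computes, for a single state, the smallest L-infinity perturbation that could reach a leaf with a different action. It is an illustrative leaf enumeration, not the verification procedure used in the paper; the tuple-based tree encoding and all function names are hypothetical.

```python
import math

# Hypothetical encoding: ("leaf", action) or ("node", feature, threshold, left, right),
# where the left branch is taken when x[feature] <= threshold.

def leaves(tree, box):
    """Yield (box, action) per reachable leaf; box maps feature -> (lo, hi]."""
    if tree[0] == "leaf":
        yield box, tree[1]
        return
    _, f, t, left, right = tree
    lo, hi = box.get(f, (-math.inf, math.inf))
    if lo < min(hi, t):  # skip empty regions
        yield from leaves(left, {**box, f: (lo, min(hi, t))})
    if max(lo, t) < hi:
        yield from leaves(right, {**box, f: (max(lo, t), hi)})

def predict(tree, x):
    """Follow the splits from the root down to a leaf and return its action."""
    while tree[0] != "leaf":
        _, f, t, left, right = tree
        tree = left if x[f] <= t else right
    return tree[1]

def linf_distance(x, box):
    """L-infinity distance from point x to an axis-aligned box."""
    d = 0.0
    for f, (lo, hi) in box.items():
        d = max(d, lo - x[f], x[f] - hi)
    return d

def robustness_radius(tree, x):
    """Perturbations with L-infinity norm strictly below this keep the action."""
    a = predict(tree, x)
    dists = (linf_distance(x, b) for b, act in leaves(tree, {}) if act != a)
    return min(dists, default=math.inf)

tree = ("node", 0, 0.5, ("leaf", "up"),
        ("node", 1, 0.0, ("leaf", "down"), ("leaf", "up")))
r1 = robustness_radius(tree, [0.3, -0.4])  # limited by the split on feature 0
r2 = robustness_radius(tree, [0.3, 0.5])   # limited by the split on feature 1
```

Because every leaf of an axis-aligned tree is a box, the distance to each leaf has a closed form, and the cost is linear in the number of leaves. This is one reason small trees are attractive as verification targets.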


This review was generated by AI and reviewed by human editors.