QUICK REVIEW

[Paper Review] Conservative Q-Learning for Offline Reinforcement Learning

Aviral Kumar, Aurick Zhou|arXiv (Cornell University)|Jun 8, 2020

Reinforcement Learning in Robotics|References: 60|Cited by: 535

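One-Sentence Summary

Conservative Q-learning (CQL) learns a conservative Q-function whose expected value lower-bounds the true value of a policy in offline RL, which reduces overestimation and improves performance on both discrete and continuous tasks.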

ABSTRACT

Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.

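Research Motivation and Goals

Motivates offline RL as a data-efficient alternative to online interaction in RL.
Addresses overestimation and distributional shift when training on a fixed dataset.
Proposes a conservative Q-function framework that lower-bounds the policy value.
Demonstrates robustness and practical compatibility through minimal code changes and strong empirical results.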

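Proposed Method

Introduces conservative Q-learning (CQL) as a regularized Q-function objective: on top of the standard Bellman error, it adds a term that minimizes Q-values under a state-action distribution whose states come from the dataset.
Derives theoretical guarantees showing that the learned Q-function lower-bounds the true Q-function and the policy value.
Provides two instantiations within a unified optimization framework, written CQL(R) for a regularizer R: the entropy-based CQL(H) and an optional KL-based variant, CQL(ρ).
Integrates CQL into existing RL algorithms with minimal implementation effort (roughly 20 lines of code on top of SAC or QR-DQN).
Provides safety guarantees: safe policy improvement and a gap-expanding backup that mitigates out-of-distribution (OOD) actions.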
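
The regularizer described above can be sketched for the discrete-action case. This is a minimal illustration, not the authors' implementation: it assumes the CQL(H) form, in which the added term is the log-sum-exp of Q over all actions minus the Q-value of the dataset action, averaged over dataset states. The names `cql_h_loss`, `td_targets`, and `alpha` are hypothetical, the Bellman targets are assumed to be precomputed, and plain NumPy stands in for a deep learning framework.

```python
import numpy as np

def logsumexp(x, axis=-1):
    """Numerically stable log(sum(exp(x))) along `axis`."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))

def cql_h_loss(q_values, actions, td_targets, alpha=1.0):
    """Bellman error plus a CQL(H)-style conservative regularizer.

    q_values:   (batch, num_actions) array of Q(s, .) at dataset states
    actions:    (batch,) integer actions taken in the dataset
    td_targets: (batch,) precomputed Bellman targets (assumed given)
    alpha:      regularizer weight (hypothetical default, needs tuning)
    """
    # Q-values of the actions that actually appear in the dataset.
    q_data = q_values[np.arange(len(actions)), actions]
    # Standard squared Bellman error on dataset transitions.
    bellman_error = 0.5 * np.mean((q_data - td_targets) ** 2)
    # Push down a soft maximum of Q over all actions, push up dataset actions.
    regularizer = np.mean(logsumexp(q_values, axis=1) - q_data)
    return bellman_error + alpha * regularizer, regularizer

# Illustrative batch of two transitions with three actions each.
q = np.array([[1.0, 2.0, 0.5], [0.0, 0.0, 0.0]])
loss, reg = cql_h_loss(q, np.array([1, 0]), np.array([2.0, 0.0]))
print(loss, reg)
```

The regularizer is never negative, because the log-sum-exp is at least as large as any single Q-value. It approaches zero only when the dataset action dominates all others, so minimizing it penalizes high Q-values on actions the dataset does not support.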

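Experimental Results

Research Questions

RQ1: Can a conservative Q-function give a reliable lower bound on the policy value in offline RL?
RQ2: Can CQL provide safe, performance-improving policy updates without explicitly modeling the behavior policy?
RQ3: How does CQL perform across continuous and discrete domains and on complex, multi-modal datasets?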
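
RQ1 can be illustrated in the tabular case. The sketch below is an idealized toy example, not an experiment from the paper. It assumes an exact Bellman operator (no sampling error) and the closed-form tabular update obtained by setting the derivative of the regularized objective to zero: each backup is reduced by alpha * mu(a|s) / pi_beta(a|s), here with mu set to the evaluated policy. The 2-state MDP, both policies, and `alpha` are invented for illustration.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; every number here is illustrative.
gamma = 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],       # P[s, a, s'] transition probabilities
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[1.0, 0.0],                     # R[s, a] rewards
              [0.5, 2.0]])
pi = np.array([[0.5, 0.5], [0.2, 0.8]])       # evaluated policy pi(a|s)
pi_beta = np.array([[0.9, 0.1], [0.6, 0.4]])  # behavior policy pi_beta(a|s)
alpha = 0.5                                   # regularizer weight (assumed)

def bellman_backup(q):
    """Exact policy-evaluation backup: R + gamma * E_{s'}[V(s')]."""
    v = np.sum(pi * q, axis=1)   # V(s') = sum_a pi(a|s') Q(s', a)
    return R + gamma * (P @ v)   # (2, 2, 2) @ (2,) -> (2, 2)

def evaluate(penalty, iters=500):
    """Iterate the backup, subtracting a fixed per-(s, a) penalty each step."""
    q = np.zeros_like(R)
    for _ in range(iters):
        q = bellman_backup(q) - penalty
    return q

q_true = evaluate(np.zeros_like(R))      # ordinary policy evaluation
q_cql = evaluate(alpha * pi / pi_beta)   # conservative evaluation with mu = pi

print(q_true)
print(q_cql)
```

Because the penalty is positive for every state-action pair and the backup is monotone, the conservative estimate stays below the true Q-function at every entry in this idealized setting. With sampled data the bound additionally depends on choosing alpha large enough to cover the sampling error.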

Key Findings

Task Name	SAC	BC	BEAR	BRAC-p	BRAC-v	CQL(H)
halfcheetah-random	30.5	2.1	25.5	23.5	28.1	35.4
hopper-random	11.3	9.8	9.5	11.1	12.0	10.8
walker2d-random	4.1	1.6	6.7	0.8	0.5	7.0
halfcheetah-medium	-4.3	36.1	38.6	44.0	45.5	44.4
walker2d-medium	0.9	6.6	33.2	72.7	81.3	79.2
hopper-medium	0.8	29.0	47.6	31.2	32.3	58.0
halfcheetah-expert	-1.9	107.0	108.2	3.8	-1.1	104.8
hopper-expert	0.7	109.0	110.3	6.6	3.7	109.9
walker2d-expert	-0.3	125.7	106.1	-0.2	-0.0	15

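On several benchmark tasks, CQL often attains 2-5 times higher final return than existing offline RL methods.
CQL generally outperforms simple behavioral cloning on realistic datasets.
The method remains robust to Q-function estimation error and supports both Q-learning and actor-critic implementations.
CQL can be implemented as a small code addition on top of existing online RL algorithms, using a simple regularization term.
The empirical results cover high-dimensional visual inputs and multi-modal data distributions, indicating broad applicability.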

This review was generated by AI and reviewed by a human editor.