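[Paper Explained] Conservative Q-Learning for Offline Reinforcement Learning
Conservative Q-learning (CQL) learns a conservative Q-function that lower-bounds the policy value in offline RL, reducing overestimation and improving performance on both discrete and continuous tasks.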
Effectively leveraging large, previously collected datasets in reinforcement learning (RL) is a key challenge for large-scale real-world applications. Offline RL algorithms promise to learn effective policies from previously-collected, static datasets without further interaction. However, in practice, offline RL presents a major challenge, and standard off-policy RL methods can fail due to overestimation of values induced by the distributional shift between the dataset and the learned policy, especially when training on complex and multi-modal data distributions. In this paper, we propose conservative Q-learning (CQL), which aims to address these limitations by learning a conservative Q-function such that the expected value of a policy under this Q-function lower-bounds its true value. We theoretically show that CQL produces a lower bound on the value of the current policy and that it can be incorporated into a policy learning procedure with theoretical improvement guarantees. In practice, CQL augments the standard Bellman error objective with a simple Q-value regularizer which is straightforward to implement on top of existing deep Q-learning and actor-critic implementations. On both discrete and continuous control domains, we show that CQL substantially outperforms existing offline RL methods, often learning policies that attain 2-5 times higher final return, especially when learning from complex and multi-modal data distributions.
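Research Motivation and Objectives
- Motivate offline RL as a data-efficient alternative to online interaction in RL.
- Address overestimation and distributional shift when training on a fixed dataset.
- Propose a conservative Q-function framework that provides a lower bound on the policy value.
- Demonstrate robustness and practical compatibility through minimal code changes and strong empirical results.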
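Proposed Method
- Introduces conservative Q-learning (CQL) as a regularized Q-function objective that minimizes Q-values under a state-action distribution whose state marginal matches the dataset.
- Derives theoretical guarantees showing that the learned Q-function lower-bounds the true Q-function and the policy value.
- Provides two instantiations within a unified optimization framework, CQL(H) and CQL(ρ), with an optional KL-based regularizer.
- Integrates CQL into offline RL algorithms with minimal implementation effort (roughly 20 lines of code on top of SAC or QR-DQN).
- Provides safety guarantees: conservative policy improvement and a gap-expanding backup that mitigates out-of-distribution (OOD) actions.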
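To make the regularized objective concrete, the CQL(H) variant can be written roughly as below. This is a sketch in the paper's notation as we understand it, not a verbatim quotation: $\alpha$ is the trade-off weight, $\mathcal{D}$ the offline dataset, $\hat{\pi}_\beta$ the empirical behavior policy, and $\hat{\mathcal{B}}^{\pi_k}\hat{Q}^{k}$ the empirical Bellman backup of the previous Q-iterate.

```latex
\min_{Q}\;
\alpha\,\mathbb{E}_{s \sim \mathcal{D}}\!\left[
  \log \sum_{a} \exp Q(s, a)
  \;-\; \mathbb{E}_{a \sim \hat{\pi}_\beta(a \mid s)}\big[ Q(s, a) \big]
\right]
\;+\;
\frac{1}{2}\,\mathbb{E}_{s, a, s' \sim \mathcal{D}}\!\left[
  \Big( Q(s, a) - \hat{\mathcal{B}}^{\pi_k} \hat{Q}^{k}(s, a) \Big)^{2}
\right]
```

The first term pushes Q-values down across all actions (through the log-sum-exp) while pushing them up on actions that actually appear in the data; the second term is the standard Bellman error.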
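Experimental Results
Research Questions
- RQ1: Can a conservative Q-function give a reliable lower bound on the policy value in offline RL?
- RQ2: Can CQL deliver safe, performance-improving policy updates without explicitly modeling the behavior policy?
- RQ3: How does CQL perform across continuous and discrete domains and on complex, multi-modal datasets?
Key Findings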
| Task Name | SAC | BC | BEAR | BRAC-p | BRAC-v | CQL(H) |
|---|---|---|---|---|---|---|
| halfcheetah-random | 30.5 | 2.1 | 25.5 | 23.5 | 28.1 | 35.4 |
| hopper-random | 11.3 | 9.8 | 9.5 | 11.1 | 12.0 | 10.8 |
| walker2d-random | 4.1 | 1.6 | 6.7 | 0.8 | 0.5 | 7.0 |
| halfcheetah-medium | -4.3 | 36.1 | 38.6 | 44.0 | 45.5 | 44.4 |
| walker2d-medium | 0.9 | 6.6 | 33.2 | 72.7 | 81.3 | 79.2 |
| hopper-medium | 0.8 | 29.0 | 47.6 | 31.2 | 32.3 | 58.0 |
| halfcheetah-expert | -1.9 | 107.0 | 108.2 | 3.8 | -1.1 | 104.8 |
| hopper-expert | 0.7 | 109.0 | 110.3 | 6.6 | 3.7 | 109.9 |
| walker2d-expert | -0.3 | 125.7 | 106.1 | -0.2 | -0.0 | 15 |
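- On several benchmark tasks, CQL often attains 2-5 times higher final return than existing offline RL methods.
- CQL generally outperforms simple behavioral cloning on realistic datasets.
- The method remains robust to Q-function estimation error and supports both Q-learning and actor-critic implementations.
- CQL can be implemented as a small code addition on top of existing online RL algorithms, using a simple regularization term.
- The empirical results cover high-dimensional visual inputs and multi-modal data distributions, indicating broad applicability.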
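The "small code addition" mentioned in the findings above can be illustrated with a minimal NumPy sketch of a CQL(H)-style loss for discrete actions. This is not the authors' implementation: the names `logsumexp` and `cql_h_loss`, the default `alpha`, and the assumption that TD targets are precomputed elsewhere are all hypothetical choices made for illustration.

```python
import numpy as np


def logsumexp(x, axis):
    """Numerically stable log(sum(exp(x))) along the given axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return np.squeeze(m, axis=axis) + np.log(np.sum(np.exp(x - m), axis=axis))


def cql_h_loss(q_values, actions, td_targets, alpha=1.0):
    """Sketch of a CQL(H)-style loss for a batch of discrete-action transitions.

    q_values:   (batch, n_actions) Q(s, .) for each dataset state
    actions:    (batch,) integer actions taken in the dataset
    td_targets: (batch,) precomputed Bellman targets (assumed given)
    alpha:      weight of the conservative regularizer (hypothetical default)
    Returns (total_loss, regularizer).
    """
    q_values = np.asarray(q_values, dtype=float)
    actions = np.asarray(actions, dtype=int)
    td_targets = np.asarray(td_targets, dtype=float)

    rows = np.arange(q_values.shape[0])
    q_data = q_values[rows, actions]  # Q(s, a) on dataset actions

    # Standard Bellman error term.
    bellman = 0.5 * np.mean((q_data - td_targets) ** 2)

    # Conservative term: push Q down over all actions (log-sum-exp),
    # push Q up on the actions that appear in the data.
    regularizer = np.mean(logsumexp(q_values, axis=1) - q_data)

    return bellman + alpha * regularizer, regularizer


if __name__ == "__main__":
    q = np.array([[1.0, 2.0, 0.5], [0.0, 0.0, 0.0]])
    loss, reg = cql_h_loss(q, actions=[1, 0], td_targets=[2.0, 0.0])
    print(f"loss={loss:.4f} regularizer={reg:.4f}")
```

Because the log-sum-exp is never smaller than any single Q-value, the regularizer is non-negative, and it grows when Q-values on actions outside the dataset are inflated. In an actor-critic setting with continuous actions, the sum over actions would have to be approximated by sampling, which this sketch does not cover.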
This explainer was generated by AI and reviewed by a human editor.