QUICK REVIEW

[Paper Review] Offline Reinforcement Learning with Implicit Q-Learning

Ilya Kostrikov, Ashvin Nair|arXiv (Cornell University)|Oct 12, 2021

Reinforcement Learning in Robotics | References: 23 | Cited by: 129

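One-Sentence Summary

Implicit Q-learning (IQL) approximates the value of the best in-distribution action with a state-conditional upper expectile, so it never evaluates unseen actions during offline training. This enables multi-step dynamic programming and yields strong results on the D4RL benchmark.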

ABSTRACT

Offline reinforcement learning requires reconciling two conflicting aims: learning a policy that improves over the behavior policy that collected the dataset, while at the same time minimizing the deviation from the behavior policy so as to avoid errors due to distributional shift. This trade-off is critical, because most current offline reinforcement learning methods need to query the value of unseen actions during training to improve the policy, and therefore need to either constrain these actions to be in-distribution, or else regularize their values. We propose an offline RL method that never needs to evaluate actions outside of the dataset, but still enables the learned policy to improve substantially over the best behavior in the data through generalization. The main insight in our work is that, instead of evaluating unseen actions from the latest policy, we can approximate the policy improvement step implicitly by treating the state value function as a random variable, with randomness determined by the action (while still integrating over the dynamics to avoid excessive optimism), and then taking a state conditional upper expectile of this random variable to estimate the value of the best actions in that state. This leverages the generalization capacity of the function approximator to estimate the value of the best available action at a given state without ever directly querying a Q-function with this unseen action. Our algorithm alternates between fitting this upper expectile value function and backing it up into a Q-function. Then, we extract the policy via advantage-weighted behavioral cloning. We dub our method implicit Q-learning (IQL). IQL demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline reinforcement learning. We also demonstrate that IQL achieves strong performance fine-tuning using online interaction after offline initialization.
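
For reference, the three objectives the abstract describes can be written out as below. This is a sketch in the notation commonly used for IQL. The symbols \psi, \theta, \phi (parameters of V, Q, and the policy), \hat\theta (target Q parameters), the dataset \mathcal{D}, and the inverse temperature \beta do not appear in this review and are introduced here as assumptions.

```latex
\begin{align}
L_2^\tau(u) &= \lvert \tau - \mathbb{1}(u < 0) \rvert \, u^2 \\
L_V(\psi) &= \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[ L_2^\tau\big(Q_{\hat\theta}(s,a) - V_\psi(s)\big) \big] \\
L_Q(\theta) &= \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\big[ \big(r(s,a) + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2 \big] \\
L_\pi(\phi) &= \mathbb{E}_{(s,a)\sim\mathcal{D}}\big[ \exp\big(\beta\,(Q_{\hat\theta}(s,a) - V_\psi(s))\big) \log \pi_\phi(a \mid s) \big]
\end{align}
```

The first two losses are minimized, while the last expression is maximized. Every expectation is taken over state-action pairs in the dataset, so no term evaluates Q at an action outside the data.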

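Research Motivation and Goals

Advance offline reinforcement learning for settings where data comes from a fixed dataset and online exploration is costly or dangerous.
Introduce a method that never queries actions absent from the dataset during value learning.
Use expectile regression to perform policy improvement implicitly, within the action coverage of the dataset.
Enable multi-step dynamic programming without an explicit policy during training, followed by a simple policy extraction step.
Demonstrate strong empirical performance on the D4RL benchmark, and show that the offline-trained agent is a good initialization for online fine-tuning.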

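Proposed Method

Define an asymmetric expectile regression objective for estimating state-action values, with targets restricted to actions in the dataset.
Use a separate value function V that approximates an upper expectile of Q with respect to the dataset's action distribution, then back up Q with r(s,a)+γV(s′).
Train Q and V with alternating updates, using the expectile loss for V and a SARSA-like TD target for Q, so out-of-distribution actions are never queried.
Extract the policy with advantage-weighted regression (AWR), a form of advantage-weighted behavioral cloning that uses Q and V without querying unseen actions.
Apply clipped double Q-learning, taking the minimum of two Q-functions as the target estimate, to stabilize the V and policy updates.
Provide an implementation that is a small modification of a standard SARSA-like update and runs efficiently on modern hardware.
Discuss online fine-tuning, in which learning continues alongside newly collected online data.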
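
The steps listed above can be sketched as plain loss functions. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names, the `done` mask, the weight cap `max_weight`, and the default values of `tau`, `beta`, and `gamma` are placeholders chosen for this example, and a real implementation would compute these losses on neural network outputs with automatic differentiation.

```python
import numpy as np

def expectile_loss(diff, tau):
    """Asymmetric squared loss |tau - 1(diff < 0)| * diff**2, averaged over the batch."""
    weight = np.where(diff < 0, 1.0 - tau, tau)
    return float(np.mean(weight * diff ** 2))

def value_loss(q1_target, q2_target, v, tau=0.7):
    """Fit V toward an upper expectile of Q, using only dataset (s, a) pairs."""
    # Clipped double Q: take the smaller of the two target Q estimates.
    q = np.minimum(q1_target, q2_target)
    return expectile_loss(q - v, tau)

def q_loss(q_pred, reward, next_v, done, gamma=0.99):
    """SARSA-like TD loss with target r + gamma * V(s'); no action at s' is queried."""
    # Assumption: `done` is 1.0 at terminal transitions and 0.0 otherwise.
    target = reward + gamma * (1.0 - done) * next_v
    return float(np.mean((target - q_pred) ** 2))

def awr_weights(q1_target, q2_target, v, beta=3.0, max_weight=100.0):
    """Advantage weights exp(beta * (Q - V)), capped for numerical stability (assumption)."""
    advantage = np.minimum(q1_target, q2_target) - v
    return np.minimum(np.exp(beta * advantage), max_weight)

def policy_loss(log_prob, weights):
    """Weighted behavioral cloning: negative weighted log-likelihood of dataset actions."""
    return float(-np.mean(weights * log_prob))

# Toy batch of three transitions (made-up numbers, for illustration only).
q1 = np.array([1.0, 2.0, 0.5])
q2 = np.array([1.2, 1.8, 0.4])
v = np.array([0.9, 1.5, 0.6])
log_prob = np.array([-1.0, -0.5, -2.0])  # log pi(a|s) for the dataset actions

print("V loss:", value_loss(q1, q2, v))
print("Q loss:", q_loss(q1, np.array([0.0, 1.0, 0.0]), v, np.zeros(3)))
print("policy loss:", policy_loss(log_prob, awr_weights(q1, q2, v)))
```

Each function consumes only quantities evaluated at dataset states and actions, which is the property the method relies on to avoid out-of-distribution queries.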

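Experimental Results

Research Questions

RQ1: Can offline RL achieve substantial policy improvement over the behavior policy without querying out-of-distribution actions?
RQ2: Can in-support action-value learning based on expectile regression achieve effective multi-step dynamic programming in offline RL?
RQ3: How does IQL compare with multi-step and one-step offline RL methods on the D4RL benchmark, particularly on the Ant Maze tasks?
RQ4: Is a simple policy extraction method (advantage-weighted regression) sufficient when out-of-distribution queries are ruled out?
RQ5: Can IQL be fine-tuned online effectively after offline initialization?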

Key Findings

Dataset	BC	10%BC	DT	AWAC	Onestep RL	TD3+BC	CQL	IQL (ours)
halfcheetah-medium-v2	42.6	42.5	42.6	43.5	48.4	48.3	44.0	47.4
hopper-medium-v2	52.9	56.9	67.6	57.0	59.6	59.3	58.5	66.3
walker2d-medium-v2	75.3	75.0	74.0	72.4	81.8	83.7	72.5	78.3
halfcheetah-medium-replay-v2	36.6	40.6	36.6	40.5	38.1	44.6	45.5	44.2
hopper-medium-replay-v2	18.1	75.9	82.7	37.2	97.5	60.9	95.0	94.7
walker2d-medium-replay-v2	26.0	62.5	66.6	27.0	49.5	81.8	77.2	73.9
halfcheetah-medium-expert-v2	55.2	92.9	86.8	42.8	93.4	90.7	91.6	86.7
hopper-medium-expert-v2	52.5	110.9	107.6	55.8	103.3	98.0	105.4	91.5
walker2d-medium-expert-v2	107.5	109.0	108.1	74.5	113.0	110.1	108.8	109.6
locomotion-v2 total	466.7	666.2	672.6	450.7	684.6	677.4	698.5	692.4
antmaze-umaze-v0	54.6	62.8	59.2	56.7	64.3	78.6	74.0	87.5
antmaze-umaze-diverse-v0	45.6	50.2	53.0	49.3	60.7	71.4	84.0	62.2
antmaze-medium-play-v0	0.0	5.4	0.0	0.0	0.3	10.6	61.2	71.2
antmaze-medium-diverse-v0	0.0	9.8	0.0	0.7	0.0	3.0	53.7	70.0
antmaze-large-play-v0	0.0	0.0	0.0	0.0	0.0	0.2	15.8	39.6
antmaze-large-diverse-v0	0.0	6.0	0.0	1.0	0.0	0.0	14.9	47.5
antmaze-v0 total	100.2	134.2	112.2	107.7	125.3	163.8	303.6	378.0
total	566.9	800.4	784.8	558.4	809.9	841.2	1002.1	1070.4

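IQL achieves state-of-the-art performance on the Ant Maze tasks, a domain that requires multi-step dynamic programming to stitch together suboptimal trajectories.
On the MuJoCo locomotion tasks, IQL is competitive with the best prior methods, notably CQL.
IQL is computationally efficient: for example, 1M updates complete in under 20 minutes on a GTX 1080, faster than the reimplemented baselines.
A larger expectile τ is critical for tasks that require stitching; in Ant Maze, higher τ gives a closer approximation to Q-learning.
The offline results are complemented by online fine-tuning experiments: after IQL initialization and further online interaction, final performance is competitive with or better than AWAC and CQL in the reported settings.
IQL extracts an effective policy through simple weighted behavioral cloning and avoids explicit queries of out-of-distribution actions during value learning.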
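
The role of τ noted above can be illustrated numerically. The sketch below computes the τ-expectile of a small set of made-up Q-values for the dataset actions at a single state. The helper `expectile` and the sample values are hypothetical and exist only for this example. At τ = 0.5 the expectile equals the mean (a SARSA-style average over dataset actions), and as τ approaches 1 it moves toward the maximum over dataset actions, which is the sense in which a higher τ gives a closer approximation to Q-learning.

```python
import numpy as np

def expectile(samples, tau, iters=200):
    """tau-expectile: the m minimizing mean(|tau - 1(x < m)| * (x - m)**2).

    Solved by fixed-point iteration on the first-order condition
    sum(w_i * (x_i - m)) = 0, with w_i = tau if x_i > m else 1 - tau.
    """
    samples = np.asarray(samples, dtype=float)
    m = float(np.mean(samples))
    for _ in range(iters):
        w = np.where(samples > m, tau, 1.0 - tau)
        m = float(np.sum(w * samples) / np.sum(w))
    return m

# Hypothetical Q-values of the actions observed in the dataset at one state.
q_values = np.array([0.0, 1.0, 10.0])

for tau in (0.5, 0.7, 0.9, 0.99):
    print(f"tau={tau}: expectile={expectile(q_values, tau):.3f}")
```

The estimate never exceeds the largest value in the sample, so even a high τ only interpolates toward the best action that is actually supported by the data.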


This review was generated by AI and reviewed by a human editor.