QUICK REVIEW

[Paper Review] Offline Reinforcement Learning with Fisher Divergence Critic Regularization

Ilya Kostrikov, Jonathan Tompson|arXiv (Cornell University)|Mar 14, 2021

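Reinforcement Learning in Robotics|References 32|Cited by 25

One-Sentence Summary

Fisher-BRC introduces a critic built on the log-behavior-policy, with a gradient penalty that implements Fisher divergence regularization, and delivers state-of-the-art offline RL performance with faster convergence and better stability.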

ABSTRACT

Many modern approaches to offline Reinforcement Learning (RL) utilize behavior regularization, typically augmenting a model-free actor critic algorithm with a penalty measuring divergence of the policy from the offline data. In this work, we propose an alternative approach to encouraging the learned policy to stay close to the data, namely parameterizing the critic as the log-behavior-policy, which generated the offline data, plus a state-action value offset term, which can be learned using a neural network. Behavior regularization then corresponds to an appropriate regularizer on the offset term. We propose using a gradient penalty regularizer for the offset term and demonstrate its equivalence to Fisher divergence regularization, suggesting connections to the score matching and generative energy-based model literature. We thus term our resulting algorithm Fisher-BRC (Behavior Regularized Critic). On standard offline RL benchmarks, Fisher-BRC achieves both improved performance and faster convergence over existing state-of-the-art methods.

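Research Motivation and Goals

Propose a regularizer for the critic in offline RL that prevents extrapolation to unseen actions.
Propose a critic parameterization that ties Q-values to the behavior policy through a log-density term plus a learnable offset.
Derive a gradient penalty regularizer for the offset that is equivalent to Fisher divergence regularization.
Demonstrate the empirical advantages of Fisher-BRC in performance and efficiency over existing offline RL methods.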
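
The equivalence between the gradient penalty and Fisher divergence regularization can be sketched in a few lines. This is a reconstruction from the definitions in the abstract, not a quotation of the paper's derivation. It assumes the critic is written as Q_θ = O_θ + log μ, and the symbols p_θ (the Boltzmann policy induced by the critic) and Z_θ (its normalizer) are notation introduced here for illustration.

```latex
% Fisher divergence between two densities p and q
F(p, q) = \mathbb{E}_{x \sim p}\left[ \left\| \nabla_x \log p(x) - \nabla_x \log q(x) \right\|^2 \right]

% Boltzmann policy induced by the critic Q_\theta = O_\theta + \log \mu
p_\theta(a \mid s) = \frac{\exp\left( O_\theta(s,a) + \log \mu(a \mid s) \right)}{Z_\theta(s)}

% Z_\theta(s) does not depend on a, so it vanishes under \nabla_a
\nabla_a \log p_\theta(a \mid s) = \nabla_a O_\theta(s,a) + \nabla_a \log \mu(a \mid s)

% Substituting p = p_\theta(\cdot \mid s) and q = \mu(\cdot \mid s), the \log \mu terms cancel
F\left( p_\theta(\cdot \mid s), \mu(\cdot \mid s) \right)
  = \mathbb{E}_{a \sim p_\theta(\cdot \mid s)}\left[ \left\| \nabla_a O_\theta(s,a) \right\|^2 \right]
```

Because only gradients with respect to the action appear, the normalizer Z_θ(s) never has to be computed, which is why the regularizer needs no explicit normalization.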

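Proposed Method

Parameterize the critic as Q(s,a) = Oθ(s,a) + log μ(a|s), where μ is the behavior policy learned by behavior cloning.
Regularize the critic with a gradient penalty on Oθ: minimize J(Oθ + log μ) + λ E_{s∼D, a∼πφ(·|s)}[||∇a Oθ(s,a)||^2], where J is the standard critic loss and λ weights the penalty.
Connect the gradient penalty to the Fisher divergence between the Boltzmann policy exp(Q)/Z and the behavior policy μ, which avoids explicit normalization.
Train the actor (policy) with entropy regularization, using the offset-based critic to pull actions toward the data while still allowing generalization beyond the dataset.
Relate Fisher-BRC to BRAC and CQL, noting a computational advantage over log-sum-exp-based objectives.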
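
The critic parameterization and the gradient penalty described above can be illustrated with a toy, dependency-free sketch. It is not the authors' implementation. Everything concrete in it is an assumption made for illustration: a single fixed state, a 1-D action, a Gaussian behavior policy `log_mu`, a quadratic `offset` standing in for the neural network Oθ, a finite-difference gradient in place of automatic differentiation, and a hand-picked list standing in for actions sampled from πφ.

```python
import math

def log_mu(action, mean=0.0, std=0.5):
    """Log-density of an assumed 1-D Gaussian behavior policy mu(a|s)."""
    z = (action - mean) / std
    return -0.5 * z * z - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def offset(action, theta):
    """Hypothetical quadratic offset O_theta(s, a) for one fixed state."""
    t0, t1, t2 = theta
    return t0 + t1 * action + t2 * action * action

def critic(action, theta):
    """Q(s, a) = O_theta(s, a) + log mu(a|s)."""
    return offset(action, theta) + log_mu(action)

def offset_grad(action, theta, eps=1e-5):
    """Central finite-difference estimate of dO_theta/da."""
    return (offset(action + eps, theta) - offset(action - eps, theta)) / (2.0 * eps)

def gradient_penalty(policy_actions, theta, lam):
    """lam * mean of (dO_theta/da)^2 over actions drawn from the current policy."""
    squared = [offset_grad(a, theta) ** 2 for a in policy_actions]
    return lam * sum(squared) / len(squared)

policy_actions = [-1.0, 0.0, 1.0]  # stand-ins for samples a ~ pi_phi(.|s)
flat = (1.0, 0.0, 0.0)             # constant offset: critic is log mu plus a constant
tilted = (0.0, 2.0, 0.0)           # offset with slope 2 in the action

print(gradient_penalty(policy_actions, flat, lam=0.1))
print(gradient_penalty(policy_actions, tilted, lam=0.1))
```

With the flat offset the penalty vanishes and the critic ranks actions exactly as log μ does, so the preferred action is the behavior mean. The tilted offset shifts the critic's preference away from the data and pays a penalty proportional to λ for doing so.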

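Experimental Results

Research Questions

RQ1: Can a critic built on the log-behavior-policy, combined with gradient regularization of the offset, achieve robust offline RL performance?
RQ2: Does the gradient penalty on the critic offset implement Fisher divergence regularization, and does it bring gains over conventional divergence-penalty strategies?
RQ3: Does Fisher-BRC converge faster and more stably on standard benchmarks than state-of-the-art offline RL baselines?

Key Findings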

Environment	BC	BRAC-p	BRAC-v	MBOP	CQL (GitHub)	CQL (ours)	F-BRC (ours)
halfcheetah-random	30.5	23.5	28.1	6.3±4.0	27.1±1.3	20.7±0.6	33.3±1.3
hopper-random	11.3	11.1	12.0	10.8±0.3	10.6±0.1	10.4±0.1	11.3±0.2
walker2d-random	4.1	0.8	0.5	8.1±5.5	1.1±2.2	10.0±4.6	1.5±0.7
halfcheetah-medium	36.1	44.0	45.5	44.6±0.8	40.3±0.3	38.9±0.3	41.3±0.3
walker2d-medium	6.6	72.7	81.3	41.0±29.4	77.3±3.8	69.2±8.3	78.8±1.0
hopper-medium	29.0	31.2	32.3	48.8±26.8	42.2±15.5	30.5±0.7	99.4±0.3
halfcheetah-expert	107.0	3.8	-1.1	-	54.4±45.8	103.5±1.3	108.4±0.5
hopper-expert	109.0	6.6	3.7	-	67.7±54.7	112.2±0.2	112.3±0.1
walker2d-expert	125.7	-0.2	-0.0	-	84.7±42.7	107.2±3.8	103.0±5.0
halfcheetah-medium-expert	35.8	43.8	45.3	105.9±17.8	21.7±6.8	58.6±8.7	93.3±10.2
walker2d-medium-expert	11.3	-0.3	0.9	70.2±36.2	104.0±10.1	104.6±10.4	105.2±3.9
hopper-medium-expert	111.9	1.1	0.8	55.1±44.3	111.3±2.1	112.4±0.2	112.4±0.3
halfcheetah-mixed	38.4	45.6	45.9	42.3±0.9	44.9±1.1	42.0±1.1	43.2±1.5
hopper-mixed	11.8	0.7	0.8	12.4±5.8	31.6±3.6	29.0±0.5	35.6±1.0
walker2d-mixed	11.3	-0.3	0.9	9.7±5.3	16.8±3.1	16.5±4.9	41.8±7.9
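
The table can be summarized with a quick tally of F-BRC against the CQL (ours) column. The sketch below copies the mean scores from those two columns and ignores the ± spreads, so it is only a rough reading: several of the per-task differences fall within the reported spread. The helper name `tally` is introduced here for illustration.

```python
# Mean scores copied from the "CQL (ours)" and "F-BRC (ours)" columns,
# in the same row order as the table. The ± spreads are deliberately ignored.
CQL_OURS = [20.7, 10.4, 10.0, 38.9, 69.2, 30.5, 103.5, 112.2,
            107.2, 58.6, 104.6, 112.4, 42.0, 29.0, 16.5]
F_BRC = [33.3, 11.3, 1.5, 41.3, 78.8, 99.4, 108.4, 112.3,
         103.0, 93.3, 105.2, 112.4, 43.2, 35.6, 41.8]

def tally(ours, baseline):
    """Count tasks where `ours` is above, equal to, or below `baseline`."""
    wins = sum(o > b for o, b in zip(ours, baseline))
    ties = sum(o == b for o, b in zip(ours, baseline))
    losses = sum(o < b for o, b in zip(ours, baseline))
    return wins, ties, losses

wins, ties, losses = tally(F_BRC, CQL_OURS)
mean_fbrc = sum(F_BRC) / len(F_BRC)
mean_cql = sum(CQL_OURS) / len(CQL_OURS)

print(f"F-BRC vs CQL (ours): {wins} wins, {ties} ties, {losses} losses")
print(f"mean score: F-BRC {mean_fbrc:.1f}, CQL (ours) {mean_cql:.1f}")
```

On these 15 tasks F-BRC has the higher mean on 12, ties on hopper-medium-expert, and trails on walker2d-random and walker2d-expert.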

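Fisher-BRC achieves results on the D4RL benchmark comparable to the state of the art, with more consistent performance than several baselines.
The gradient penalty is essential: λ=0 degrades performance, while a very large λ over-constrains the policy.
Fisher-BRC converges faster than CQL and BRAC in both gradient steps and wall-clock time.
The method performs especially well on the medium and expert datasets, showing robustness across tasks.
By avoiding expensive log-sum-exp computations, the method reduces the computational burden relative to CQL.
Empirically, F-BRC matches or exceeds the baselines on most tasks and shows improved convergence.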

This review was generated by AI and reviewed by a human editor.