QUICK REVIEW

[Paper Review] DisCor: Corrective Feedback in Reinforcement Learning via Distribution Correction

Aviral Kumar, Abhishek Gupta|arXiv (Cornell University)|Mar 16, 2020

Reinforcement Learning in Robotics|References: 51|Cited by: 36

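One-Sentence Summary

The paper identifies a lack of corrective feedback in bootstrapping-based RL methods and proposes DisCor, a distribution-correction re-weighting strategy that improves stability and performance in multi-task and noisy-reward settings.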

ABSTRACT

Deep reinforcement learning can learn effective policies for a wide range of tasks, but is notoriously difficult to use due to instability and sensitivity to hyperparameters. The reasons for this remain unclear. When using standard supervised methods (e.g., for bandits), on-policy data collection provides "hard negatives" that correct the model in precisely those states and actions that the policy is likely to visit. We call this phenomenon "corrective feedback." We show that bootstrapping-based Q-learning algorithms do not necessarily benefit from this corrective feedback, and training on the experience collected by the algorithm is not sufficient to correct errors in the Q-function. In fact, Q-learning and related methods can exhibit pathological interactions between the distribution of experience collected by the agent and the policy induced by training on that experience, leading to potential instability, sub-optimal convergence, and poor results when learning from noisy, sparse or delayed rewards. We demonstrate the existence of this problem, both theoretically and empirically. We then show that a specific correction to the data distribution can mitigate this issue. Based on these observations, we propose a new algorithm, DisCor, which computes an approximation to this optimal distribution and uses it to re-weight the transitions used for training, resulting in substantial improvements in a range of challenging RL settings, such as multi-task learning and learning from noisy reward signals. Blog post presenting a summary of this work is available at: https://bair.berkeley.edu/blog/2020/03/16/discor/.

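Motivation and Objectives

Investigate why bootstrapped value targets in RL based on approximate dynamic programming (ADP) fail to benefit from corrective feedback.
Show, theoretically and empirically, that the interaction between the data distribution and the value function leads to instability and sub-optimal convergence.
Develop a practical data-distribution correction that restores corrective feedback and stabilizes learning.
Demonstrate that DisCor improves performance in multi-task and noisy-reward scenarios.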

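Proposed Method

Analyze the notion of corrective feedback using bandit-style intuition and a formal definition.
Derive the optimal data distribution p_k that maximizes corrective feedback under Bellman updates.
Propose tractable surrogates for the Q*-dependent quantities and re-weight replay-buffer samples with importance weights.
Introduce a practical weight function w_k(s,a) proportional to exp(-gamma [P^{pi_{k-1}} Δ_{k-1}](s,a) / tau).
Train a secondary model Δ_phi to estimate the bootstrapping/backup error Δ_k, which is used for weighting and error modeling.
Provide the DisCor algorithm, which combines weighted Bellman backups with the secondary Δ model on top of standard DQN/SAC frameworks.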
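
The weighting and error-model steps listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`discor_weights`, `weighted_bellman_loss`, `delta_target`) are hypothetical, `tau` is treated as a fixed hyperparameter, the weights are normalized over the batch, and `next_delta` stands for the secondary model's output Δ_phi(s', a') at the next state-action pair, used as a single-sample estimate of [P^{pi_{k-1}} Δ_{k-1}](s,a). All of these choices are assumptions of the sketch.

```python
import numpy as np


def discor_weights(next_delta, gamma=0.99, tau=1.0):
    """Batch-normalized weights w ∝ exp(-gamma * next_delta / tau).

    next_delta: assumed estimate of the error model at (s', a'),
    standing in for [P^pi Δ](s, a). tau is a fixed temperature here.
    """
    logits = -gamma * np.asarray(next_delta, dtype=float) / tau
    logits = logits - logits.max()  # subtract max for numerical stability
    w = np.exp(logits)
    return w / w.sum()


def weighted_bellman_loss(q_pred, q_target, weights):
    """Squared Bellman error, re-weighted per transition."""
    td_error = np.asarray(q_pred, dtype=float) - np.asarray(q_target, dtype=float)
    return float(np.sum(weights * td_error ** 2))


def delta_target(q_pred, q_target, next_delta, gamma=0.99, done=None):
    """Regression target for the error model:
    |Q(s,a) - target(s,a)| + gamma * Δ(s', a'), with the bootstrap term
    masked out on terminal transitions (the masking is an assumption)."""
    q_pred = np.asarray(q_pred, dtype=float)
    q_target = np.asarray(q_target, dtype=float)
    next_delta = np.asarray(next_delta, dtype=float)
    not_done = 1.0 if done is None else 1.0 - np.asarray(done, dtype=float)
    return np.abs(q_pred - q_target) + gamma * not_done * next_delta


# Usage: transitions whose bootstrapped targets are estimated to be less
# reliable (larger next_delta) receive smaller weights.
w = discor_weights([0.0, 1.0, 10.0], gamma=0.99, tau=1.0)
loss = weighted_bellman_loss([1.0, 2.0, 0.0], [0.5, 3.0, 0.0], w)
```

In a DQN- or SAC-style training loop, the same weights would multiply the per-sample critic loss, while the error model would be regressed toward `delta_target` on the same batch.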

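Experimental Results

Research Questions

RQ1: What mechanisms cause the absence of corrective feedback in bootstrapping-based RL methods?
RQ2: How can the data distribution be corrected during training to maximize corrective feedback?
RQ3: Does re-weighting transitions according to the optimal distribution improve stability and performance in practice?
RQ4: How does DisCor perform in challenging settings such as multi-task RL and learning from noisy rewards?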

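Key Findings

Corrective feedback can be absent in ADP methods, leading to sub-optimal convergence and instability even with a replay buffer.
An optimal training distribution p_k assigns higher probability to regions with high Bellman error while accounting for closeness to Q*; the dependence on the unknown Q* is handled through tractable surrogates.
Re-weighting replay-buffer transitions with weights w_k, based on the estimated corrective potential, reduces error accumulation and stabilizes learning.
DisCor improves performance in challenging settings, notably on the MT10 multi-task benchmark, where its final success rate is reported to be roughly 50% higher than SAC's.
The method is compatible with standard ADP-based deep RL algorithms such as DQN and SAC, and supports learning from noisy reward signals and in multi-task scenarios.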
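
The relation between the optimal distribution p_k and the practical weights w_k mentioned in the findings can be written out roughly as follows. This is a hedged reconstruction of the chain of reasoning, not a quotation of the paper: the symbols \mathcal{B}^* (Bellman optimality operator) and \lambda^* (normalizing constant) do not appear in this summary and are assumptions about the paper's notation, and the exact conditions under which the bound holds should be checked against the paper.

```latex
\begin{align*}
p_k(s,a) &\propto \exp\!\big(-|Q_k - Q^*|(s,a)\big)\,
            \frac{|Q_k - \mathcal{B}^* Q_{k-1}|(s,a)}{\lambda^*}
  && \text{(optimal distribution; depends on the unknown } Q^*\text{)} \\
|Q_k - Q^*| &\le \Delta_k, \qquad
\Delta_k = |Q_k - \mathcal{B}^* Q_{k-1}| + \gamma\, P^{\pi_{k-1}} \Delta_{k-1}
  && \text{(tractable upper bound on the error)} \\
w_k(s,a) &\propto \exp\!\left(-\frac{\gamma\,[P^{\pi_{k-1}} \Delta_{k-1}](s,a)}{\tau}\right)
  && \text{(practical weight)}
\end{align*}
```

Read this way, the weight down-weights transitions whose bootstrapped targets are themselves estimated to be inaccurate, which is the sense in which the re-weighting limits error accumulation.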


This review was generated by AI and reviewed by human editors.