QUICK REVIEW

[Paper Digest] Boosting Soft Actor-Critic: Emphasizing Recent Experience without Forgetting the Past

Che Wang, Keith W. Ross | arXiv (Cornell University) | Jun 10, 2019

Reinforcement Learning in Robotics | References: 34 | Cited by: 40

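One-Sentence Summary

The paper proposes Emphasizing Recent Experience (ERE), which improves SAC by biasing replay sampling toward recent data while still retaining older experience. It also explores combining ERE with Prioritized Experience Replay (PER), and evaluates all variants on MuJoCo environments.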

ABSTRACT

Soft Actor-Critic (SAC) is an off-policy actor-critic deep reinforcement learning (DRL) algorithm based on maximum entropy reinforcement learning. By combining off-policy updates with an actor-critic formulation, SAC achieves state-of-the-art performance on a range of continuous-action benchmark tasks, outperforming prior on-policy and off-policy methods. The off-policy method employed by SAC samples data uniformly from past experience when performing parameter updates. We propose Emphasizing Recent Experience (ERE), a simple but powerful off-policy sampling technique, which emphasizes recently observed data while not forgetting the past. The ERE algorithm samples more aggressively from recent experience, and also orders the updates to ensure that updates from old data do not overwrite updates from new data. We compare vanilla SAC and SAC+ERE, and show that ERE is more sample efficient than vanilla SAC for continuous-action Mujoco tasks. We also consider combining SAC with Priority Experience Replay (PER), a scheme originally proposed for deep Q-learning which prioritizes the data based on temporal-difference (TD) error. We show that SAC+PER can marginally improve the sample efficiency performance of SAC, but much less so than SAC+ERE. Finally, we propose an algorithm which integrates ERE and PER and show that this hybrid algorithm can give the best results for some of the Mujoco tasks.

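Research Motivation and Objectives

Motivation: off-policy reinforcement learning needs to use replay data more effectively than uniform sampling does.
Propose Emphasizing Recent Experience (ERE), which prioritizes recent transitions while retaining past data.
Compare SAC+ERE, SAC, SAC+PER, and SAC+ERE+PER on continuous-control tasks.
Assess the simplicity, computational cost, and robustness of ERE relative to PER.
Provide guidance on how much hyperparameters and update ordering matter for ERE.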

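Proposed Method

Describe Soft Actor-Critic (SAC) and its uniform-replay baseline.
Propose SAC+ERE: sample from recent data and use an ordered update scheme so that updates from old data do not overwrite updates from new data.
Propose SAC+PER: apply proportional prioritized experience replay, with priorities based on TD error, to SAC.
Propose SAC+ERE+PER: combine non-uniform sampling of recent data with TD-error prioritization.
Provide pseudocode and discuss implementation simplicity and hyperparameter sensitivity.
Evaluate on MuJoCo continuous-control tasks using multiple seeds and a fixed architecture.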
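The ERE sampling step can be illustrated with a minimal sketch. It assumes the window schedule c_k = max(N * eta^(k * 1000 / K), c_min) as we understand it from the original paper, where N is the number of stored transitions and K is the number of updates in the current update phase. The function names, the oldest-to-newest index layout, and the clamp to the buffer size are illustrative assumptions, not the authors' code.

```python
import random


def ere_window_size(k, num_updates, buffer_size, eta, c_min):
    """Size c_k of the 'recent' window for the k-th update (k = 1..num_updates).

    Assumed schedule: c_k = max(N * eta ** (k * 1000 / K), c_min).
    The extra min() keeps the window inside the buffer (our own safeguard).
    """
    c_k = buffer_size * eta ** (k * 1000.0 / num_updates)
    return int(min(buffer_size, max(c_k, c_min)))


def ere_sample_indices(k, num_updates, buffer_size, eta, c_min, batch_size, rng):
    """Uniformly sample indices from the c_k most recent transitions.

    Assumes indices 0..buffer_size-1 are ordered from oldest to newest.
    """
    c_k = ere_window_size(k, num_updates, buffer_size, eta, c_min)
    low = buffer_size - c_k
    return [rng.randrange(low, buffer_size) for _ in range(batch_size)]


# Hypothetical settings for illustration only.
N, K = 100_000, 1000
rng = random.Random(0)

# Updates run in order k = 1..K, so the window shrinks toward the newest data
# and the last updates of the phase are computed from the most recent samples.
sizes = [ere_window_size(k, K, N, eta=0.996, c_min=5000) for k in range(1, K + 1)]
last_batch = ere_sample_indices(K, K, N, 0.996, 5000, 256, rng)
```

Because early updates in a phase still draw from almost the whole buffer, old data keeps contributing; only the later updates narrow to recent transitions, which is one way to read the "ordered update" idea in the list above.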

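Experimental Results

Research Questions

RQ1: Can ERE improve the sample efficiency of SAC without sacrificing robustness?
RQ2: How does ERE compare with PER applied to SAC in terms of performance gains and complexity?
RQ3: Does combining ERE with PER (SAC+ERE+PER) yield additional gains over either method alone?
RQ4: How do key hyperparameters (such as eta and c_min) and the update order affect ERE's performance?
RQ5: Do the observed improvements from ERE generalize across MuJoCo environments and random seeds?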

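Key Findings

SAC+ERE clearly outperforms vanilla SAC on six MuJoCo environments, in both the early and late stages of training.
SAC+ERE reaches higher performance faster (for example, on Ant-v2) and is fairly robust to eta values in the range (0.994, 0.999) and to annealing.
SAC+PER can improve performance in some environments (notably Ant-v2), but in most environments its benefit is less consistent than that of SAC+ERE.
SAC+ERE+PER achieves the best results in some environments, but it is computationally more expensive and less simple than SAC+ERE alone.
SAC+ERE shows improved robustness: over 1.5M time steps in several environments, its performance variation across seeds is lower or comparable.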
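To make the role of eta and its annealing more concrete, the sketch below shows one plausible annealing rule. It assumes a linear schedule that moves eta from a starting value toward 1.0 over training, at which point sampling becomes uniform over the buffer. The linear form, the end value of 1.0, the default start of 0.996, and the exponent of 1000 are assumptions for illustration; the summary above only states that eta in (0.994, 0.999) and annealing were examined.

```python
def annealed_eta(step, total_steps, eta_start=0.996, eta_end=1.0):
    """Linearly interpolate eta from eta_start to eta_end (assumed schedule)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return eta_start + (eta_end - eta_start) * frac


def smallest_window_fraction(eta, exponent=1000):
    """Fraction of the buffer used by the last update of a phase, before
    any c_min floor is applied (assumes the eta ** 1000 form of the schedule)."""
    return eta ** exponent


total_steps = 1_500_000  # matches the 1.5M time steps mentioned above
for step in (0, 750_000, 1_500_000):
    eta = annealed_eta(step, total_steps)
    print(step, round(eta, 4), round(smallest_window_fraction(eta), 4))
```

Under these assumptions, a smaller eta concentrates the final updates of each phase on a much smaller slice of recent data, while eta close to 1 approaches plain uniform replay, which gives some intuition for why the method could be sensitive to values outside the reported range.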


This digest was generated by AI and reviewed by a human editor.