QUICK REVIEW

[Paper Review] On the Effectiveness of Fine-tuning Versus Meta-reinforcement Learning

Zhao Mandi, Pieter Abbeel|arXiv (Cornell University)|Jun 7, 2022

Reinforcement Learning in Robotics | Cited by 20

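One-Sentence Summary

This paper compares meta-RL against multi-task pretraining followed by fine-tuning on diverse vision-based tasks, and finds that fine-tuning usually matches or even outperforms meta-RL while being simpler and cheaper.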

ABSTRACT

Intelligent agents should have the ability to leverage knowledge from previously learned tasks in order to learn new ones quickly and efficiently. Meta-learning approaches have emerged as a popular solution to achieve this. However, meta-reinforcement learning (meta-RL) algorithms have thus far been restricted to simple environments with narrow task distributions. Moreover, the paradigm of pretraining followed by fine-tuning to adapt to new tasks has emerged as a simple yet effective solution in supervised and self-supervised learning. This calls into question the benefits of meta-learning approaches also in reinforcement learning, which typically come at the cost of high complexity. We hence investigate meta-RL approaches in a variety of vision-based benchmarks, including Procgen, RLBench, and Atari, where evaluations are made on completely novel tasks. Our findings show that when meta-learning approaches are evaluated on different tasks (rather than different variations of the same task), multi-task pretraining with fine-tuning on new tasks performs equally as well, or better, than meta-pretraining with meta test-time adaptation. This is encouraging for future research, as multi-task pretraining tends to be simpler and computationally cheaper than meta-RL. From these findings, we advocate for evaluating future meta-RL methods on more challenging tasks and including multi-task pretraining with fine-tuning as a simple, yet strong baseline.

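Motivation and Goals

Investigate whether meta-reinforcement learning offers any advantage over simple multi-task pretraining plus fine-tuning in vision-based RL across diverse task distributions.
Evaluate representative meta-RL algorithms (Reptile, PEARL, RL2) against multi-task pretraining plus fine-tuning.
Evaluate on completely novel test tasks across three benchmarks (Procgen, RLBench, Atari).
Highlight the implications for evaluation protocols and baseline selection in future meta-RL research.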

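Proposed Method

Compares three meta-RL methods (Reptile, PEARL, RL2) against a multi-task training plus fine-tuning baseline.
Uses PPO as the base algorithm on Procgen, C2F-ARM on RLBench, and Rainbow DQN on Atari, with per-task replay buffers.
Evaluates adaptation by fine-tuning on unseen tasks, with training from scratch as the reference baseline.
Test-time adaptation consists of fine-tuning for 2 million environment steps on each test level/task (where applicable).
Runs large-scale experiments on the three benchmarks, covering diverse task distributions and high-dimensional observations.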
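
The comparison protocol above (pretrain on training tasks, then adapt to a held-out task under a fixed budget, with training from scratch as a reference) can be sketched on a toy problem. This is only an illustration and not the paper's setup: each "task" is assumed to be a one-dimensional quadratic loss standing in for an RL objective, all function names and hyperparameters are hypothetical, and the Reptile step shown is the generic interpolation update rather than the exact variant used in the experiments. The numbers it produces say nothing about the paper's results.

```python
import random

def grad(theta, center):
    # Gradient of the toy per-task loss (theta - center)^2.
    return 2.0 * (theta - center)

def sgd_steps(theta, center, steps, lr):
    # Plain gradient descent on a single task; used for inner loops and fine-tuning.
    for _ in range(steps):
        theta -= lr * grad(theta, center)
    return theta

def multitask_pretrain(train_centers, iters=200, lr=0.05):
    # Multi-task pretraining: descend the average gradient over all training tasks.
    theta = 0.0
    for _ in range(iters):
        g = sum(grad(theta, c) for c in train_centers) / len(train_centers)
        theta -= lr * g
    return theta

def reptile_pretrain(train_centers, iters=200, inner_steps=5,
                     inner_lr=0.05, meta_lr=0.1, seed=0):
    # Reptile-style meta-pretraining: adapt to one sampled task,
    # then move the initialization part of the way toward the adapted value.
    rng = random.Random(seed)
    theta = 0.0
    for _ in range(iters):
        center = rng.choice(train_centers)
        phi = sgd_steps(theta, center, inner_steps, inner_lr)
        theta += meta_lr * (phi - theta)
    return theta

def finetune_loss(theta_init, test_center, budget=10, lr=0.05):
    # Test-time adaptation: fine-tune on the held-out task for a fixed budget,
    # then report the remaining loss on that task.
    theta = sgd_steps(theta_init, test_center, budget, lr)
    return (theta - test_center) ** 2

train_centers = [4.0, 5.0, 6.0]   # hypothetical training tasks
test_center = 5.5                 # hypothetical held-out task

results = {
    "scratch": finetune_loss(0.0, test_center),
    "multitask": finetune_loss(multitask_pretrain(train_centers), test_center),
    "reptile": finetune_loss(reptile_pretrain(train_centers), test_center),
}

for name, loss in results.items():
    print(f"{name}: loss after fine-tuning = {loss:.4f}")
```

The point of the sketch is the shape of the evaluation: every initialization receives the same adaptation budget on a task it never saw during pretraining, so any difference in final loss comes from the pretraining stage alone.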

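Experimental Results

Research Questions

RQ1: Does multi-task pretraining followed by fine-tuning on new tasks perform as well as or better than meta-RL methods on vision-based RL benchmarks?
RQ2: How do popular meta-RL algorithms (Reptile, PEARL, RL2) compare with multi-task pretraining plus fine-tuning across diverse task distributions?
RQ3: In settings with sparse rewards and high-dimensional observations, what are the relative strengths and limitations of meta-RL compared with simple pretraining and fine-tuning?
RQ4: Should future meta-RL evaluation move toward richer task distributions and include a strong multi-task pretraining baseline?
RQ5: How do the results on Procgen, RLBench, and Atari change when the test tasks are strictly unseen?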

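Main Findings

In vision-based environments, multi-task pretraining followed by fine-tuning on new tasks performs as well as, or better than, the meta-RL baselines.
On Procgen, RLBench, and Atari, the simple baseline is usually competitive with, or better than, meta-RL methods under genuinely diverse task distributions.
The RLBench results show that multi-task pretraining can overcome sparse rewards on unseen tasks and outperforms training from scratch.
RL2 generally fails to adapt to new levels/games, consistent with earlier observations that meta-RL adaptation is limited in hard settings.
PEARL struggles to adapt under disjoint train-test splits in which the training and test tasks differ markedly in visual appearance.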

This review was generated by AI and reviewed by a human editor.