QUICK REVIEW

[Paper Review] Online Multi-Task Learning Using Biased Sampling

Sahil Sharma, Balaraman Ravindran|arXiv (Cornell University)|Feb 20, 2017

Advanced Bandit Algorithms Research | Cited by 1

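One-Sentence Summary

This paper proposes an online, expert-free multi-task reinforcement learning framework that uses biased sampling to prioritize the tasks the agent finds hard during training. By modeling task selection as a multi-armed bandit or reinforcement learning problem, the method achieves strong performance across diverse Atari 2600 tasks without pre-trained expert policies, and demonstrates efficient learning in 6-, 8-, 12-, and 21-task settings.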

ABSTRACT

One of the long-standing challenges in Artificial Intelligence for learning goal-directed behavior is to build a single agent which can solve multiple tasks. Recent progress in multi-task learning for goal-directed sequential problems has been in the form of distillation based learning wherein a student network learns from multiple task-specific expert networks by mimicking the task-specific policies of the expert networks. While such approaches offer a promising solution to the multi-task learning problem, they require supervision from large expert networks which require extensive data and computation time for training. In this work, we propose an efficient multi-task learning framework which solves multiple goal-directed tasks in an on-line setup without the need for expert supervision. Our work uses active learning principles to achieve multi-task learning by sampling the harder tasks more than the easier ones. We propose three distinct models under our active sampling framework. An adaptive method with extremely competitive multi-tasking performance. A UCB-based meta-learner which casts the problem of picking the next task to train on as a multi-armed bandit problem. A meta-learning method that casts the next-task picking problem as a full Reinforcement Learning problem and uses actor critic methods for optimizing the multi-tasking performance directly. We demonstrate results in the Atari 2600 domain on seven multi-tasking instances: three 6-task instances, one 8-task instance, two 12-task instances and one 21-task instance.

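Research Motivation and Objectives

Address the challenge of training a single agent to solve multiple goal-directed tasks simultaneously without relying on pre-trained expert networks.
Overcome the high computational and data costs of distillation-based multi-task learning methods, which depend on supervision from large expert models.
Develop an online learning framework that selects tasks dynamically according to their difficulty, improving sample efficiency and performance.
Investigate whether actively sampling harder tasks outperforms uniform or random task selection in multi-task reinforcement learning.
Evaluate the effectiveness of meta-learning and bandit-based strategies for choosing the next training task in a continual learning setting.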

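Proposed Method

Apply active learning principles: biased sampling shifts the sampling distribution toward harder tasks, so they are trained on more often during online learning.
Propose an adaptive sampling strategy that adjusts task selection probabilities dynamically based on observed learning progress and task difficulty.
Use a UCB-based meta-learner that models task selection as a multi-armed bandit problem, balancing exploration and exploitation when choosing tasks.
Develop a full reinforcement-learning meta-learner that uses actor-critic methods to learn a task-selection policy optimizing long-term multi-task performance.
Apply these methods in an online, continual learning setting in which the agent trains on one task at a time and the learned selection strategy decides which task to train on next.
Train and evaluate all models on seven multi-task Atari 2600 instances: three with 6 tasks, one with 8, two with 12, and one with 21.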
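The idea of biasing sampling toward harder tasks can be illustrated with a small sketch. This is not the paper's implementation: it assumes a hypothetical per-task difficulty score in [0, 1] (higher means the agent is currently doing worse on that task) and turns those scores into sampling probabilities with a softmax. The names `task_probabilities`, `sample_task`, and the `temperature` parameter are illustrative assumptions.

```python
import math
import random


def task_probabilities(difficulty, temperature=0.1):
    """Softmax over difficulty scores, so harder tasks get more probability.

    `difficulty` is a hypothetical list of per-task scores (higher = harder);
    how such a score is measured is an assumption, not taken from the paper.
    """
    # Subtract the maximum before exponentiating for numerical stability.
    top = max(difficulty)
    exps = [math.exp((d - top) / temperature) for d in difficulty]
    total = sum(exps)
    return [e / total for e in exps]


def sample_task(difficulty, temperature=0.1, rng=random):
    """Draw the index of the next task to train on, biased toward hard tasks."""
    probs = task_probabilities(difficulty, temperature)
    return rng.choices(range(len(difficulty)), weights=probs, k=1)[0]
```

A lower `temperature` concentrates training on the hardest tasks, while a higher one moves the distribution back toward uniform sampling. In an online loop, the difficulty scores would be refreshed periodically from recent performance before each call to `sample_task`.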

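Experimental Results

Research Questions

RQ1: Can online multi-task learning be achieved effectively without expert supervision or pre-trained policies?
RQ2: Does prioritizing harder tasks through biased sampling significantly improve multi-task performance compared with uniform or random task selection?
RQ3: How do the different meta-learning strategies (UCB-based bandit selection and full actor-critic reinforcement learning task selection) affect learning efficiency and final performance?
RQ4: How well does the adaptive sampling strategy generalize across diverse multi-task instances with different numbers of tasks?
RQ5: Can the proposed framework achieve competitive performance on complex, high-dimensional control tasks, such as those in the Atari 2600 suite, without distillation or expert demonstrations?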

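Key Findings

The proposed methods achieve competitive multi-task performance on all seven tested Atari 2600 multi-task instances without any expert supervision.
The adaptive sampling method performs strongly, significantly outperforming baselines that sample tasks uniformly or at random.
The UCB-based meta-learner balances exploration and exploitation effectively in task selection, giving stable and efficient training across tasks.
The full reinforcement-learning meta-learner, which uses actor-critic methods, optimizes multi-task performance directly and is especially strong in the more complex 12- and 21-task settings.
The framework solves multiple goal-directed tasks in an online, continual learning setting, significantly reducing the dependence on costly expert networks.
The approach generalizes well across settings with different numbers of tasks (6, 8, 12, and 21), indicating robustness to changes in scale.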
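To make the exploration-exploitation balance of a bandit-style task picker concrete, here is a minimal sketch of the standard UCB1 rule applied to task selection. It is not necessarily the variant used in the paper. The class name `UCBTaskPicker`, the `exploration` constant, and the reward signal (assumed here to be larger when the agent is currently weak on a task) are all illustrative assumptions.

```python
import math


class UCBTaskPicker:
    """Standard UCB1 over tasks; each task is one arm of the bandit."""

    def __init__(self, num_tasks, exploration=2.0):
        self.counts = [0] * num_tasks         # times each task was selected
        self.mean_reward = [0.0] * num_tasks  # running mean reward per task
        self.exploration = exploration        # hypothetical exploration constant

    def select(self):
        """Return the index of the next task to train on."""
        # Select every task once before the UCB scores are used.
        for task, count in enumerate(self.counts):
            if count == 0:
                return task
        total = sum(self.counts)
        scores = [
            self.mean_reward[t]
            + math.sqrt(self.exploration * math.log(total) / self.counts[t])
            for t in range(len(self.counts))
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, task, reward):
        """Record the reward observed after training on `task`."""
        self.counts[task] += 1
        # Incremental update of the running mean.
        self.mean_reward[task] += (reward - self.mean_reward[task]) / self.counts[task]
```

The bonus term shrinks as a task is selected more often, so rarely chosen tasks are revisited from time to time (exploration) while tasks with a high mean reward are chosen most often (exploitation). Because task difficulty changes as the agent learns, a practical version might need a windowed or discounted mean rather than the plain running mean used here.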

This review was generated by AI and reviewed by a human editor.