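[Paper Summary] Third-Person Imitation Learning
This paper proposes an unsupervised third-person imitation learning method that uses domain confusion and a GAN-like framework to learn from demonstrations captured from a different viewpoint, enabling policy learning in a new domain. It succeeds on simple MuJoCo tasks (pointmass, reacher, inverted pendulum) without requiring first-person demonstrations.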
Reinforcement learning (RL) makes it possible to train agents capable of achieving sophisticated goals in complex and uncertain environments. A key difficulty in reinforcement learning is specifying a reward function for the agent to optimize. Traditionally, imitation learning in RL has been used to overcome this problem. Unfortunately, hitherto imitation learning methods tend to require that demonstrations are supplied in the first-person: the agent is provided with a sequence of states and a specification of the actions that it should have taken. While powerful, this kind of imitation learning is limited by the relatively hard problem of collecting first-person demonstrations. Humans address this problem by learning from third-person demonstrations: they observe other humans perform tasks, infer the task, and accomplish the same task themselves. In this paper, we present a method for unsupervised third-person imitation learning. Here third-person refers to training an agent to correctly achieve a simple goal in a simple environment when it is provided a demonstration of a teacher achieving the same goal but from a different viewpoint; and unsupervised refers to the fact that the agent receives only these third-person demonstrations, and is not provided a correspondence between teacher states and student states. Our method's primary insight is that recent advances from domain confusion can be utilized to yield domain-agnostic features which are crucial during the training process. To validate our approach, we report successful experiments on learning from third-person demonstrations in a pointmass domain, a reacher domain, and inverted pendulum.
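Motivation and Objectives
- Address the problem of learning from third-person demonstrations when no correspondence between teacher states and student states is available.
- Develop domain-agnostic representations and reward signals that guide imitation from raw observations.
- Enable policy learning in the novice domain using expert demonstrations from a different domain and viewpoint.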
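Proposed Method
- Formulate third-person imitation as an RL-GAN problem in which a discriminator distinguishes expert trajectories from non-expert trajectories based on domain-agnostic features.
- Split the discriminator into a feature extractor D_F and a classifier D_R; add a domain classifier D_D, and obtain domain invariance through gradient reversal.
- Use a mutual-information-based objective to ensure that D_F removes domain-specific information while still supporting discrimination.
- Introduce a gradient-flipping operation (G) that backpropagates the domain loss with the opposite sign, encouraging domain-agnostic features.
- Train the imitator policy π_θ with Trust Region Policy Optimization (TRPO), using the discriminator-based reward −log D_R.
- Extend the input to multi-time-step observations (o_t, o_{t+n}) to strengthen the discriminative signal.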
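As a rough illustration of three of the pieces listed above, the following minimal sketch shows how (o_t, o_{t+n}) pairs could be built from a trajectory, how a reward of the form −log D_R could be computed, and how a sign-flipped domain gradient could be combined with the classification gradient before it reaches D_F. The function names, the `eps` clamp, and the use of plain Python lists in place of network tensors are assumptions made for this example; it is not the authors' implementation, and the summary does not specify which class probability D_R outputs.

```python
import math


def pair_observations(observations, n):
    """Build multi-time-step inputs (o_t, o_{t+n}) from one trajectory.

    `observations` is any sequence; `n` is the look-ahead in frames.
    """
    return [(observations[t], observations[t + n])
            for t in range(len(observations) - n)]


def imitation_reward(d_r_output, eps=1e-8):
    """Reward of the form -log D_R, as stated in the summary.

    `d_r_output` is assumed to be a probability in (0, 1]; `eps` is a
    hypothetical clamp added here only to avoid log(0).
    """
    return -math.log(max(d_r_output, eps))


def feature_extractor_gradient(grad_class, grad_domain, lam):
    """Gradient that reaches D_F after the gradient flip.

    The classification-loss gradient passes through unchanged, while the
    domain-loss gradient is scaled by `lam` and has its sign reversed, so
    D_F is pushed towards features the domain classifier cannot exploit.
    """
    return [gc - lam * gd for gc, gd in zip(grad_class, grad_domain)]


if __name__ == "__main__":
    frames = ["o0", "o1", "o2", "o3", "o4", "o5"]
    print(pair_observations(frames, n=4))
    print(imitation_reward(0.5))
    print(feature_extractor_gradient([1.0, 2.0], [0.5, -1.0], lam=2.0))
```

In a real training loop the gradients would come from automatic differentiation and the resulting reward would be passed to the TRPO update; the sketch only makes the sign convention and the input pairing concrete.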
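Experimental Results
Research Questions
- RQ1: Can third-person imitation learning be solved on simple tasks when observations come from a different domain and viewpoint?
- RQ2: Do domain confusion and multi-time-step inputs improve performance on third-person imitation tasks?
- RQ3: How sensitive is the method to hyperparameters such as the domain confusion weight λ and the number of look-ahead frames?
- RQ4: How does the difference in camera angle between the expert domain and the novice domain affect learning?
- RQ5: How does the proposed method compare with baselines such as RL with the true reward and first-person imitation learning?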
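Key Findings
- The method learns reasonable policies for pointmass, reacher, and inverted pendulum from third-person demonstrations.
- Domain confusion is essential for strong performance on all three tasks; multi-time-step inputs provide additional gains.
- The learned feature representation becomes domain-agnostic, indicating that third-person learning from raw observations is achieved.
- The method is competitive with first-person imitation and in some cases approaches the performance of RL with the true reward; directly applying a first-person policy in the third-person domain can fail.
- Hyperparameter analysis shows that λ must be balanced carefully, and a look-ahead window of about 4 frames works well across tasks.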
This summary was generated by AI and reviewed by a human editor.