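[Paper Explained] Time-Contrastive Networks: Self-Supervised Learning from Video
The paper proposes Time-Contrastive Networks (TCN), a self-supervised, multi-view representation learning method trained on unlabeled video, which enables third-person imitation and RL-based robot control from visual input alone.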
We propose a self-supervised approach for learning representations and robotic behaviors entirely from unlabeled videos recorded from multiple viewpoints, and study how this representation can be used in two robotic imitation settings: imitating object interactions from videos of humans, and imitating human poses. Imitation of human behavior requires a viewpoint-invariant representation that captures the relationships between end-effectors (hands or robot grippers) and the environment, object attributes, and body pose. We train our representations using a metric learning loss, where multiple simultaneous viewpoints of the same observation are attracted in the embedding space, while being repelled from temporal neighbors which are often visually similar but functionally different. In other words, the model simultaneously learns to recognize what is common between different-looking images, and what is different between similar-looking images. This signal causes our model to discover attributes that do not change across viewpoint, but do change across time, while ignoring nuisance variables such as occlusions, motion blur, lighting and background. We demonstrate that this representation can be used by a robot to directly mimic human poses without an explicit correspondence, and that it can be used as a reward function within a reinforcement learning algorithm. While representations are learned from an unlabeled collection of task-related videos, robot behaviors such as pouring are learned by watching a single 3rd-person demonstration by a human. Reward functions obtained by following the human demonstrations under the learned representation enable efficient reinforcement learning that is practical for real-world robotic systems. Video results, open-source code and dataset are available at https://sermanet.github.io/imitate
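Motivation and Goals
- Learn viewpoint-invariant, disentangled representations of object interactions and pose from unlabeled multi-view video.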
- Enable imitation of human behavior from third-person video without explicit pose labels or correspondences.
- Provide a reward signal for reinforcement learning using TCN embeddings learned from video data.
- Demonstrate pouring and dish-rack manipulation tasks in simulation and on real robots using TCN-based guidance.
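Proposed Method
- Train an embedding f(x) with a triplet loss: co-occurring frames from different viewpoints form (anchor, positive) pairs, which are contrasted against temporally nearby negatives.
- Use multi-view data to identify and factor out visual variation, yielding invariance to viewpoint, occlusion, lighting, and background.
- When multi-view data is unavailable, optionally use a single-view time-contrastive (TC) loss, with positives drawn from a window around the anchor.
- Build the reinforcement learning reward function from 32-dimensional TCN embeddings, combining a squared-distance term with a Huber-style term.
- Integrate TCN features into PILQR-based policy optimization to learn manipulation tasks from video demonstrations.
- Imitate poses directly via self-regression, using a shared TCN embedding trained on both human and robot motion.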
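The triplet construction described in the list above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names, the margin of 0.2, and the frame-index sampling rules (`neg_margin`, `pos_window`) are assumptions introduced here, and real training would apply the loss to embeddings produced by a neural network rather than to hand-written vectors.

```python
import random


def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on embeddings.

    The loss is zero once the negative is at least `margin` farther
    (in squared distance) from the anchor than the positive is.
    The default margin is an illustrative value, not taken from the paper.
    """
    return max(0.0, sq_dist(anchor, positive) - sq_dist(anchor, negative) + margin)


def sample_multiview_indices(num_frames, neg_margin, rng):
    """Return (t_anchor, t_positive, t_negative) for a multi-view triplet.

    Anchor and positive share a timestep but would be read from different
    cameras; the negative is at least `neg_margin` frames away in time.
    """
    t = rng.randrange(num_frames)
    far = [s for s in range(num_frames) if abs(s - t) >= neg_margin]
    if not far:
        raise ValueError("video too short for the requested negative margin")
    return t, t, rng.choice(far)


def sample_singleview_indices(num_frames, pos_window, neg_margin, rng):
    """Single-view variant: the positive is a nearby frame of the same video."""
    if neg_margin <= pos_window:
        raise ValueError("neg_margin must exceed pos_window")
    t = rng.randrange(num_frames)
    near = [s for s in range(num_frames) if 0 < abs(s - t) <= pos_window]
    far = [s for s in range(num_frames) if abs(s - t) >= neg_margin]
    if not near or not far:
        raise ValueError("video too short for the requested windows")
    return t, rng.choice(near), rng.choice(far)


# Usage: pick frame indices, then score three (hypothetical) embeddings.
rng = random.Random(0)
t_a, t_p, t_n = sample_multiview_indices(num_frames=100, neg_margin=10, rng=rng)
loss = triplet_loss([0.0, 0.0], [0.1, 0.0], [1.0, 0.0])
```

The multi-view sampler forces the model to match frames that look different (other camera, same moment) while separating frames that look alike (same camera, other moment); the single-view sampler keeps only the temporal half of that signal.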
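Experimental Results
Research Questions
- RQ1: Can Time-Contrastive Networks learn representations that disentangle pose and object interactions while remaining invariant to viewpoint and appearance?
- RQ2: Do the learned TCN embeddings provide a robust reward signal for RL to acquire complex manipulation skills from third-person demonstrations?
- RQ3: Is imitation from third-person video possible without explicit pose or correspondence labels?
- RQ4: How do multi-view and single-view training signals affect representation quality and robot learning outcomes?
- RQ5: Can TCN support real-time, continuous imitation of human poses without pose labels?
Key Findings
| Method | Alignment error | Classification error | Training iterations |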
|---|---|---|---|
| Random | 28.1% | 54.2% | - |
| Inception-ImageNet | 29.8% | 51.9% | - |
| shuffle & learn [31] | 22.8% | 27.0% | 575k |
| single-view TCN (triplet) | 25.8% | 24.3% | 266k |
| multi-view TCN (npairs) | 18.1% | 22.2% | 938k |
| multi-view TCN (triplet) | 18.8% | 21.4% | 397k |
| multi-view TCN (lifted) | 18.0% | 19.6% | 119k |
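This summary does not define the alignment error reported in the table. One plausible reading, sketched below as an assumption rather than the paper's exact protocol, is a label-free metric over two time-synchronized views: for each frame in one view, find its nearest neighbor in the other view's embedding space and measure how far that neighbor's timestep is from the true one, normalized by video length. The function name and the normalization are hypothetical.

```python
def sq_dist(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def alignment_error(view1, view2):
    """Mean normalized time offset of cross-view nearest neighbors.

    view1, view2: lists of embedding vectors from two synchronized
    cameras, where list index equals timestep. Returns a value in [0, 1);
    0 means every frame retrieves its own timestep in the other view.
    """
    n = len(view1)
    if n == 0 or n != len(view2):
        raise ValueError("views must be non-empty and of equal length")
    total = 0.0
    for t, emb in enumerate(view1):
        # Index of the frame in view2 whose embedding is closest to emb.
        nearest = min(range(n), key=lambda s: sq_dist(emb, view2[s]))
        total += abs(t - nearest) / n
    return total / n


# Usage with toy 1-D embeddings: a perfectly aligned pair of views.
frames = [[0.0], [1.0], [2.0], [3.0]]
error = alignment_error(frames, frames)
```

Under this reading, a lower value means the embedding places simultaneous frames from different cameras close together, which is the property the multi-view loss is trained for.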
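- Multi-view TCNs (mvTCN) outperform all baselines on both alignment and attribute classification for the pouring task: the multi-view variants reach 18.0% to 18.8% alignment error and 19.6% to 22.2% classification error, compared with 22.8% and 27.0% for shuffle & learn.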
- mvTCN enables efficient learning of pouring on a real robot and of dish-rack manipulation in simulation; pouring performance converges after about 10 iterations on the real robot.
- Single-view TCNs and shuffle-and-learn baselines underperform relative to mvTCN, despite identical data; multi-view signals accelerate learning.
- TCN-based rewards enable PILQR-based reinforcement learning to learn pouring with a real robot and a simulated dish rack task, outperforming other representations.
- Direct pose imitation via self-regression with a shared TCN embedding enables end-to-end imitation without joint-level pose labels, and can be augmented with limited human supervision.
- Qualitatively, the approach yields robust imitation from third-person videos and rapid task acquisition.
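To make the TCN-based reward concrete, here is a minimal sketch that combines a squared-distance term with a Huber-style term, comparing the robot's embedding at each timestep to the demonstration's embedding at the same timestep. The square-root form of the Huber-style term and the weights `alpha`, `beta`, and `gamma` are assumptions chosen for illustration; the summary does not give the exact formula or values.

```python
import math


def tcn_reward(robot_emb, demo_emb, alpha=1.0, beta=1.0, gamma=1e-6):
    """Reward that grows as the robot embedding approaches the demo embedding.

    alpha weights the squared-distance term, beta weights the Huber-style
    term sqrt(gamma + d^2), and gamma is a small smoothing constant.
    All three defaults are illustrative, not values from the paper.
    """
    d2 = sum((a - b) ** 2 for a, b in zip(robot_emb, demo_emb))
    return -alpha * d2 - beta * math.sqrt(gamma + d2)


def trajectory_rewards(robot_traj, demo_traj, **weights):
    """Per-timestep rewards for two equal-length embedding sequences."""
    if len(robot_traj) != len(demo_traj):
        raise ValueError("trajectories must have the same number of timesteps")
    return [tcn_reward(r, d, **weights) for r, d in zip(robot_traj, demo_traj)]


# Usage: a robot rollout that drifts away from the demonstration over time.
demo = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]
rollout = [[0.0, 0.0], [1.0, 0.5], [2.0, 2.0]]
rewards = trajectory_rewards(rollout, demo)
```

The squared term dominates when the robot is far from the demonstration, while the square-root term keeps a usable gradient close to the target, which is a common motivation for mixing the two.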
This explainer was generated by AI and reviewed by a human editor.