QUICK REVIEW

[Paper Review] CLIPort: What and Where Pathways for Robotic Manipulation

Mohit Shridhar, Lucas Manuelli|arXiv (Cornell University)|Sep 24, 2021

Multimodal Machine Learning Applications|References: 65|Cited by: 99

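One-sentence summary

CLIPort introduces a two-stream, language-conditioned manipulation framework that fuses a CLIP-based semantic stream with a Transporter-based spatial stream, grounding language in fine-grained actions and achieving strong few-shot and multi-task generalization both in simulation and on real robots.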

ABSTRACT

How can we imbue robots with the ability to manipulate objects precisely but also to reason about them in terms of abstract concepts? Recent works in manipulation have shown that end-to-end networks can learn dexterous skills that require precise spatial reasoning, but these methods often fail to generalize to new goals or quickly learn transferable concepts across tasks. In parallel, there has been great progress in learning generalizable semantic representations for vision and language by training on large-scale internet data, however these representations lack the spatial understanding necessary for fine-grained manipulation. To this end, we propose a framework that combines the best of both worlds: a two-stream architecture with semantic and spatial pathways for vision-based manipulation. Specifically, we present CLIPort, a language-conditioned imitation-learning agent that combines the broad semantic understanding (what) of CLIP [1] with the spatial precision (where) of Transporter [2]. Our end-to-end framework is capable of solving a variety of language-specified tabletop tasks from packing unseen objects to folding cloths, all without any explicit representations of object poses, instance segmentations, memory, symbolic states, or syntactic structures. Experiments in simulated and real-world settings show that our approach is data efficient in few-shot settings and generalizes effectively to seen and unseen semantic concepts. We even learn one multi-task policy for 10 simulated and 9 real-world tasks that is better or comparable to single-task policies.

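Motivation and objectives

Ground abstract semantic concepts (what) in the physical, spatial actions that manipulation requires (where).
Enable language-conditioned control so that concepts transfer across tasks.
Learn data-efficiently from a small number of demonstrations, and support multi-task learning.
Demonstrate transfer from simulation to real robots with very little data.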

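Proposed method

Uses a two-stream architecture: a semantic stream built on pretrained CLIP features, and a spatial stream that processes RGB-D input.
Formulates manipulation as pick-and-place affordance prediction, with Transporter-style fully convolutional networks (FCNs) serving as the Q-functions for picking and placing.
Conditions the semantic stream on CLIP's language encoding, injecting the language features into the decoder layers.
Trains by imitation learning from simulated demonstrations, with a cross-entropy loss over pixel-wise action maps.
Uses a two-step action primitive, where each action is a pair of end-effector poses (start and end), and relies on translation-equivariant networks.
Extends to multi-task learning and to generalization over unseen attributes by randomizing tasks and attributes across the demonstrations.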
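
The steps listed above can be illustrated with a toy, pure-Python sketch. It is not the authors' implementation: every function name, the tile-and-multiply language conditioning, the additive stream fusion, and the scalar feature maps are simplifying assumptions chosen for readability, and rotation of the place pose is omitted. It shows (1) fusing per-pixel features from two streams under a language vector, (2) picking by argmax over a pixel-wise heatmap, (3) placing by cross-correlating a crop around the pick location with a second feature map, and (4) the cross-entropy loss over all pixels.

```python
import math


def fuse_streams(spatial, semantic, lang):
    """Fuse two H x W x C feature maps (nested lists) under a length-C language vector.

    Assumption: the language vector is tiled over every pixel and multiplied
    element-wise into the semantic features; the two streams are then added.
    """
    return [
        [
            [sp + se * g for sp, se, g in zip(sp_px, se_px, lang)]
            for sp_px, se_px in zip(sp_row, se_row)
        ]
        for sp_row, se_row in zip(spatial, semantic)
    ]


def heatmap(features, weights):
    """Toy 1x1 projection head: one scalar logit per pixel."""
    return [[sum(f * w for f, w in zip(px, weights)) for px in row] for row in features]


def argmax_2d(logits):
    """Return the (row, col) of the first largest logit."""
    coords = [(r, c) for r in range(len(logits)) for c in range(len(logits[0]))]
    return max(coords, key=lambda rc: logits[rc[0]][rc[1]])


def crop(fmap, center, k):
    """k x k crop of a scalar map centered at `center` (k odd), zero-padded at the borders."""
    h, w = len(fmap), len(fmap[0])
    r0, c0 = center[0] - k // 2, center[1] - k // 2
    return [
        [fmap[r][c] if 0 <= r < h and 0 <= c < w else 0.0 for c in range(c0, c0 + k)]
        for r in range(r0, r0 + k)
    ]


def cross_correlate(fmap, kernel):
    """Slide `kernel` over `fmap`; the output has the same H x W shape as `fmap`."""
    k = len(kernel)
    out = []
    for r in range(len(fmap)):
        row = []
        for c in range(len(fmap[0])):
            patch = crop(fmap, (r, c), k)
            row.append(sum(patch[i][j] * kernel[i][j] for i in range(k) for j in range(k)))
        out.append(row)
    return out


def pixel_cross_entropy(logits, target):
    """Softmax over all pixels, then negative log-likelihood of the demonstrated pixel."""
    flat = [v for row in logits for v in row]
    m = max(flat)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in flat))
    return log_z - logits[target[0]][target[1]]


if __name__ == "__main__":
    # Hypothetical 2 x 2 scene with 2 feature channels per pixel.
    spatial = [[[0.1, 0.0], [0.0, 0.2]], [[0.3, 0.1], [0.0, 0.0]]]
    semantic = [[[0.0, 1.0], [1.0, 0.0]], [[0.0, 0.0], [0.5, 0.5]]]
    lang = [2.0, 0.0]  # stand-in for a projected sentence embedding
    pick_logits = heatmap(fuse_streams(spatial, semantic, lang), [1.0, 1.0])
    pick = argmax_2d(pick_logits)

    # Toy scalar query/key maps standing in for the place networks' outputs.
    query_map = pick_logits
    key_map = [[0.0, 0.2], [1.0, 0.1]]
    place = argmax_2d(cross_correlate(key_map, crop(query_map, pick, 3)))

    print("pick:", pick, "place:", place)
    print("pick loss:", pixel_cross_entropy(pick_logits, pick))
```

Because both the pick and the place decisions are argmaxes over dense per-pixel maps, shifting the scene shifts the predictions with it, which is the translation equivariance mentioned in the list. A real system would replace the nested lists with convolutional feature maps and learn the projection weights from demonstrations.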

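Experimental results

Research questions

RQ1: How does the language-conditioned two-stream architecture perform on fine-grained manipulation compared with single-stream and baseline methods?
RQ2: Can a single multi-task model generalize across diverse language-conditioned tasks, including ones with unseen attributes?
RQ3: How well do semantic attributes (color, shape, object category) generalize across seen and unseen scenarios?
RQ4: With limited data, how well does the approach transfer from simulation to real-world robot manipulation?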

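Key findings

Two-stream CLIPort outperforms the Transporter-only and CLIP-only baselines, reaching higher success rates with fewer demonstrations (for example, single-task CLIPort exceeds 90% with 100 demonstrations).
A multi-task CLIPort model trained on 10 tasks matches or exceeds the single-task models on many tasks, showing effective cross-task generalization.
On seen attributes, CLIPort (single) performs well; on unseen attributes, grounding is harder, but explicit transfer in the multi-task setting (CLIPort multi-attr) improves performance substantially.
In real-world robot experiments, a multi-task model trained on roughly 179 image-action pairs achieves meaningful success across 9 tasks, at about 70% on the simple tasks.
Unseen attributes lower performance overall, but gains appear when semantic knowledge is transferred across tasks (for example, experience with pink blocks helps on a task where that color is otherwise unseen).
The framework is data-efficient in few-shot settings and supports training a single policy for many tasks whose performance is comparable to or better than that of single-task policies.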

This review was generated by AI and checked by a human editor.