QUICK REVIEW

[Paper Review] Grasp2Vec: Learning Object Representations from Self-Supervised Grasping

Eric Jang, Coline Devin | arXiv (Cornell University) | Nov 16, 2018

Robot Manipulation and Learning | References: 28 | Cited by: 74

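ONE-SENTENCE SUMMARY

Grasp2Vec learns object-centric embeddings from self-supervised robotic grasping by requiring that the difference between the scene embeddings before and after a grasp equal the embedding of the grasped object, which enables localization, instance detection, and goal-directed grasping without labels.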

ABSTRACT

Well structured visual representations can make robot learning faster and can improve generalization. In this paper, we study how we can acquire effective object-centric representations for robotic manipulation tasks without human labeling by using autonomous robot interaction with the environment. Such representation learning methods can benefit from continuous refinement of the representation as the robot collects more experience, allowing them to scale effectively without human intervention. Our representation learning approach is based on object persistence: when a robot removes an object from a scene, the representation of that scene should change according to the features of the object that was removed. We formulate an arithmetic relationship between feature vectors from this observation, and use it to learn a representation of scenes and objects that can then be used to identify object instances, localize them in the scene, and perform goal-directed grasping tasks where the robot must retrieve commanded objects from a bin. The same grasping procedure can also be used to automatically collect training data for our method, by recording images of scenes, grasping and removing an object, and recording the outcome. Our experiments demonstrate that this self-supervised approach for tasked grasping substantially outperforms direct reinforcement learning from images and prior representation learning methods.

MOTIVATION AND GOALS

Enable automated, self-supervised learning of object-centric scene representations for robotic manipulation.

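PROPOSED METHOD

Embed scenes and grasped objects with ResNet-50-based CNNs, producing the representations phi_s and phi_o.
Impose an arithmetic constraint, phi_s(s_pre) - phi_s(s_post) ≈ phi_o(o), to capture object identity and persistence.
Train with an n-pairs loss that aligns each scene-difference embedding with its object embedding and pushes it away from negatives.
Use the learned Grasp2Vec embeddings to localize objects via spatial heatmaps and to condition a goal-directed grasping policy trained with Q-learning.
Collect training data automatically from grasp sequences (s_pre, s_post, o).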
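
The arithmetic constraint and the n-pairs objective above can be illustrated with a minimal sketch. It assumes a common softmax cross-entropy form of the n-pairs loss in which the other objects in a batch act as negatives, applies it in both directions (an assumption of this sketch), and omits any regularization term. The function names `npairs_loss` and `grasp2vec_loss` and the toy 2-D embeddings are hypothetical and stand in for real network outputs.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def npairs_loss(anchors, positives):
    """Mean cross-entropy where positives[i] matches anchors[i] and every
    other positives[j] in the batch serves as a negative."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [dot(anchor, p) for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def grasp2vec_loss(pre, post, obj):
    """pre, post: batches of phi_s(s_pre) and phi_s(s_post); obj: phi_o(o)."""
    # Scene-difference embeddings: phi_s(s_pre) - phi_s(s_post).
    diffs = [[a - b for a, b in zip(p, q)] for p, q in zip(pre, post)]
    # Symmetric form (assumed here): differences vs. objects and the reverse.
    return npairs_loss(diffs, obj) + npairs_loss(obj, diffs)


# Toy batch of two grasps with hand-made 2-D embeddings.
pre = [[4.0, 1.0], [1.0, 4.0]]
post = [[1.0, 1.0], [1.0, 1.0]]
matched_objects = [[3.0, 0.0], [0.0, 3.0]]  # equal to the scene differences
swapped_objects = [[0.0, 3.0], [3.0, 0.0]]  # paired with the wrong grasp

matched_loss = grasp2vec_loss(pre, post, matched_objects)
swapped_loss = grasp2vec_loss(pre, post, swapped_objects)
print(matched_loss, swapped_loss)
```

When the object embeddings equal the scene differences the loss is close to zero, and when the pairing is swapped it is large, which is the behavior the constraint is meant to reward.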

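EXPERIMENTAL RESULTS

Research Questions

RQ1: Can self-supervised embeddings learned from grasping capture object identity and the set of objects in a scene?
RQ2: Can Grasp2Vec embeddings localize objects and distinguish object instances without labeled data?
RQ3: Can a goal-directed grasping policy be trained with rewards derived from Grasp2Vec embeddings, without human annotation?
RQ4: How well does Grasp2Vec generalize to unseen objects, and is it effective both in simulation and in the real world?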

Key Findings

Method	sim seen	sim novel	real seen	real novel
Retrieval (ours)	88%	64%	89%	88%
Outcome neighbor (ImageNet)	-	-	23%	22%
Localization (ours)	96%	77%	83%	81%
Localization (ImageNet)	-	-	18%	15%

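Grasp2Vec retrieval accuracy: 88% (simulation, seen objects), 64% (simulation, novel objects), 89% (real world, seen objects), 88% (real world, novel objects).
Grasp2Vec localization accuracy: 96% (simulation, seen), 77% (simulation, novel), 83% (real world, seen), 81% (real world, novel).
Localization with ImageNet features is far worse on the same task (15-18% on real-world objects).
In simulation, instance grasping with Grasp2Vec-based ES rewards reaches 78-83% on seen objects and 53-59% on unseen objects, depending on the ablation.
In the real world, instance grasping that combines localization with indiscriminate grasping reaches 80.8% on training objects and 62.9% on test objects.
Composite goals formed by adding Grasp2Vec embeddings produce some multi-object goal behavior in simulation (for example, 51.9% on seen objects and 42.9% on unseen objects for certain composite goals).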
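
To make the heatmap localization and the additive composite goals more concrete, here is a small sketch of one way they might fit together. It assumes the heatmap is the dot product between a goal embedding and each cell of a spatial feature map, and that a composite goal is the sum of two object embeddings. The grid, the embeddings, the threshold, and the helper names `heatmap` and `peaks` are all hypothetical. This summary does not state how the paper's grasping policy consumes composite goals, so the block illustrates the idea rather than the authors' pipeline.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def heatmap(feature_map, goal):
    """Score every cell of an H x W grid of D-dim features against a goal."""
    return [[dot(cell, goal) for cell in row] for row in feature_map]


def peaks(scores, threshold):
    """Return (row, col) of every cell whose score exceeds the threshold,
    in row-major order."""
    return [
        (r, c)
        for r, row in enumerate(scores)
        for c, value in enumerate(row)
        if value > threshold
    ]


# Hypothetical 2-D embeddings for two objects.
obj_a = [1.0, 0.0]
obj_b = [0.0, 1.0]
empty = [0.0, 0.0]

# Toy 2 x 3 scene: object A at (0, 0), object B at (1, 2), nothing elsewhere.
feature_map = [
    [obj_a, empty, empty],
    [empty, empty, obj_b],
]

# Single-object goal: only the cell containing object A responds.
single_peaks = peaks(heatmap(feature_map, obj_a), threshold=0.5)

# Composite goal: summing the embeddings makes both cells respond.
composite_goal = [a + b for a, b in zip(obj_a, obj_b)]
composite_peaks = peaks(heatmap(feature_map, composite_goal), threshold=0.5)

print(single_peaks, composite_peaks)
```

Because the dot product is linear, the heatmap for a summed goal is the sum of the per-object heatmaps, which is why a single query can highlight several objects at once.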


This review was generated by AI and checked by human editors.