QUICK REVIEW

[Paper Review] Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Kevin Black, Mitsuhiko Nakamoto|arXiv (Cornell University)|Oct 16, 2023

Domain Adaptation and Few-Shot Learning | Cited by 10

One-Sentence Summary

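SuSIE uses a pretrained image-editing diffusion model to generate future subgoals and a low-level goal-conditioned policy to reach them, enabling zero-shot language-conditioned robotic manipulation with strong generalization.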

ABSTRACT

If generalist robots are to operate in truly unstructured environments, they need to be able to recognize and reason about novel objects and scenarios. Such objects and scenarios might not be present in the robot's own training data. We propose SuSIE, a method that leverages an image-editing diffusion model to act as a high-level planner by proposing intermediate subgoals that a low-level controller can accomplish. Specifically, we finetune InstructPix2Pix on video data, consisting of both human videos and robot rollouts, such that it outputs hypothetical future "subgoal" observations given the robot's current observation and a language command. We also use the robot data to train a low-level goal-conditioned policy to act as the aforementioned low-level controller. We find that the high-level subgoal predictions can utilize Internet-scale pretraining and visual understanding to guide the low-level goal-conditioned policy, achieving significantly better generalization and precision than conventional language-conditioned policies. We achieve state-of-the-art results on the CALVIN benchmark, and also demonstrate robust generalization on real-world manipulation tasks, beating strong baselines that have access to privileged information or that utilize orders of magnitude more compute and training data. The project website can be found at http://rail-berkeley.github.io/susie .

Motivation and Goals

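Enable generalist robots to operate among novel objects and scenarios they did not encounter during training.
Use a pretrained image-editing diffusion model to turn language commands into high-level subgoal plans.
Train a low-level goal-conditioned policy on robot data to reach those subgoals, yielding robust zero-shot generalization.
Demonstrate improved generalization and precision on real-world manipulation tasks and the CALVIN benchmark.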

Proposed Method

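Finetune InstructPix2Pix on language-labeled video data so that, given the current observation and a language command, it outputs a hypothetical future subgoal observation.
Train a low-level goal-conditioned policy with behavior cloning to reach the generated subgoal within k_max steps.
At test time, iteratively generate subgoals and run the low-level policy for a short horizon (k_test steps per subgoal).
During subgoal generation, use classifier-free guidance so the diffusion model is conditioned on both the language and image inputs.
Use a diffusion-based policy that predicts action chunks with temporal averaging to improve robustness.
Decouple high-level subgoal synthesis from low-level control, so the system relies on zero-shot planning and needs no task-specific data.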
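The behavior-cloning step (training the policy to reach a subgoal within k_max steps) can be sketched with hindsight goal relabeling: each training example pairs an observation and its action with a later frame from the same trajectory, which serves as the goal. This is a minimal sketch under assumptions the summary does not state: the goal offset is drawn uniformly from 1 to k_max, and the function name `sample_relabeled_example` is hypothetical.

```python
import random
from typing import Sequence, Tuple, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")


def sample_relabeled_example(
    observations: Sequence[Obs],
    actions: Sequence[Act],
    k_max: int,
    rng: random.Random,
) -> Tuple[Obs, Obs, Act]:
    """Return one (observation, goal, action) triple from a single trajectory.

    Hypothetical helper: the goal is the frame reached 1..k_max steps later,
    so the policy only learns to reach goals at most k_max steps away.
    """
    num_steps = len(actions)  # observations also include the final state
    if len(observations) != num_steps + 1:
        raise ValueError("expected len(observations) == len(actions) + 1")
    t = rng.randrange(num_steps)                     # current time step
    k = rng.randint(1, min(k_max, num_steps - t))    # assumed uniform offset, clipped at trajectory end
    return observations[t], observations[t + k], actions[t]
```

Clipping the offset at the end of the trajectory keeps every sampled goal a frame that was actually reached, which is what makes the relabeled goal a valid target for the recorded action sequence.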
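The test-time procedure (propose a subgoal, run the policy for k_test steps, repeat) is a simple receding-horizon loop. In the sketch below, `generate_subgoal`, `policy`, and `env_step` are hypothetical placeholder callables standing in for the finetuned InstructPix2Pix model, the goal-conditioned policy, and the robot environment. The toy 1-D example only illustrates the control flow.

```python
from typing import Callable, List, TypeVar

Obs = TypeVar("Obs")
Act = TypeVar("Act")


def run_susie_episode(
    obs: Obs,
    instruction: str,
    generate_subgoal: Callable[[Obs, str], Obs],  # hypothetical: image-editing diffusion model
    policy: Callable[[Obs, Obs], Act],            # hypothetical: goal-conditioned low-level policy
    env_step: Callable[[Act], Obs],               # hypothetical: environment transition
    k_test: int,
    num_subgoals: int,
) -> List[Obs]:
    """Alternate between proposing a subgoal and running the policy for k_test steps."""
    trajectory = [obs]
    for _ in range(num_subgoals):
        subgoal = generate_subgoal(obs, instruction)  # re-plan from the latest observation
        for _ in range(k_test):
            action = policy(obs, subgoal)
            obs = env_step(action)
            trajectory.append(obs)
    return trajectory


# Toy usage: a 1-D "robot" moving toward position 10 (purely illustrative).
state = {"x": 0}


def toy_env_step(action: int) -> int:
    state["x"] += action
    return state["x"]


def toy_subgoal(x: int, instruction: str) -> int:
    return min(x + 3, 10)  # propose a point up to 3 units closer to the target


def toy_policy(x: int, goal: int) -> int:
    return (goal > x) - (goal < x)  # step -1, 0, or +1 toward the goal


traj = run_susie_episode(0, "move to 10", toy_subgoal, toy_policy, toy_env_step, k_test=3, num_subgoals=4)
print(traj[-1])  # 10
```

Because each subgoal is generated from the most recent observation, errors in one short segment can be corrected by the next subgoal rather than compounding over a long open-loop plan.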
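For the classifier-free guidance step, InstructPix2Pix combines three noise predictions using a separate image scale s_I and text scale s_T: ε̃ = ε(z, ∅, ∅) + s_I·(ε(z, c_I, ∅) − ε(z, ∅, ∅)) + s_T·(ε(z, c_I, c_T) − ε(z, c_I, ∅)). The sketch below assumes SuSIE keeps this two-scale form. The function name and scale values are illustrative, and noise predictions are flattened to plain float lists to keep the example dependency-free.

```python
from typing import List, Sequence


def instruct_pix2pix_guidance(
    eps_uncond: Sequence[float],  # ε(z, ∅, ∅): no image, no text conditioning
    eps_image: Sequence[float],   # ε(z, c_I, ∅): image conditioning only
    eps_full: Sequence[float],    # ε(z, c_I, c_T): image and text conditioning
    s_image: float,
    s_text: float,
) -> List[float]:
    """Combine three noise predictions with separate image and text guidance scales."""
    if not (len(eps_uncond) == len(eps_image) == len(eps_full)):
        raise ValueError("noise predictions must have the same length")
    return [
        u + s_image * (i - u) + s_text * (f - i)
        for u, i, f in zip(eps_uncond, eps_image, eps_full)
    ]


# Illustrative scales; larger values push harder toward the corresponding conditioning.
print(instruct_pix2pix_guidance([0.0], [1.0], [3.0], s_image=1.5, s_text=7.5))  # [16.5]
```

With both scales set to 1 the expression reduces to the fully conditioned prediction, and with both set to 0 it reduces to the unconditional one; separate scales let the image term (stay close to the current observation) and the text term (follow the command) be tuned independently.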
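Temporal averaging of action chunks can be sketched as follows: at every step the policy predicts the next `chunk_size` actions, and the action actually executed is the mean of all predictions made so far for the current step. This assumes an unweighted mean (ACT-style temporal ensembling uses exponential weights instead). The class name `TemporalAverager` is hypothetical, and the diffusion policy that produces each chunk is out of scope here.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


class TemporalAverager:
    """Average overlapping action-chunk predictions for the current time step (hypothetical helper)."""

    def __init__(self, chunk_size: int) -> None:
        self.chunk_size = chunk_size
        self.t = 0
        # Maps a future time step to every action vector predicted for it so far.
        self.pending: Dict[int, List[Sequence[float]]] = defaultdict(list)

    def step(self, chunk: Sequence[Sequence[float]]) -> List[float]:
        """Register a new chunk predicted at step t and return the averaged action for step t."""
        if len(chunk) != self.chunk_size:
            raise ValueError("chunk must contain chunk_size actions")
        for offset, action in enumerate(chunk):
            self.pending[self.t + offset].append(action)
        predictions = self.pending.pop(self.t)  # all chunks that cover the current step
        self.t += 1
        dim = len(predictions[0])
        return [sum(p[d] for p in predictions) / len(predictions) for d in range(dim)]


# Usage with 1-D actions and chunks of length 2.
avg = TemporalAverager(chunk_size=2)
print(avg.step([[0.0], [2.0]]))  # [0.0]  (only one prediction for step 0)
print(avg.step([[4.0], [6.0]]))  # [3.0]  (mean of 2.0 and 4.0 for step 1)
```

Averaging overlapping predictions smooths out jitter between consecutive chunks while still letting each new observation influence the executed action.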

Experimental Results

Research Questions

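RQ1: In a zero-shot setting, can SuSIE complete tasks in entirely new environments that contain unseen objects and language commands?
RQ2: Compared with language-conditioned policies that do not use subgoals, does subgoal-guided planning improve precision and flexibility?
RQ3: How important are Internet-scale pretraining and video co-training for zero-shot generalization?
RQ4: How does SuSIE perform against strong baselines on real-world manipulation tasks?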

Key Findings

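SuSIE achieves state-of-the-art zero-shot performance on CALVIN (trained on environments A–C, tested on D).
In real-world scenes, SuSIE outperforms baselines including RT-2-X, UniPi, and LCBC, especially in scenes with novel distractors and objects.
Subgoal guidance improves the precision of low-level manipulation, enabling success even on challenging tasks such as grasping a bell pepper.
Internet-scale pretraining and co-training on video data significantly improve subgoal quality and zero-shot generalization.
Co-training the subgoal model on Something-Something data improves performance in unseen scenes (Scenes B and C).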


This review was generated by AI and reviewed by human editors.