QUICK REVIEW

[Paper Review] GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

Yanjie Ze, Ge Yan|arXiv (Cornell University)|Aug 31, 2023
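Domain Adaptation and Few-Shot Learning | Cited by 11
One-Sentence Summary

GNFactor learns a language-conditioned multi-task manipulation policy on top of a shared 3D volumetric representation. The representation is trained through a generalizable neural feature field (GNF) that distills vision-language features, which lets the policy generalize on real robots and in simulation from a limited number of demonstrations.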

ABSTRACT

It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot needs to have a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present $\textbf{GNFactor}$, a visual behavior cloning agent for multi-task robotic manipulation with $\textbf{G}$eneralizable $\textbf{N}$eural feature $\textbf{F}$ields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor. Our project website is https://yanjieze.com/GNFactor/.

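Motivation and Objectives

  • Enable robust, language-conditioned multi-task manipulation from visual observations in unstructured real-world environments.
  • Propose a 3D voxel-based representation, trained through the GNF and shared by the perception and action modules, to improve generalization from limited demonstrations.
  • Incorporate vision-language semantic features from a foundation model into the 3D representation to strengthen scene understanding and task execution.
  • Demonstrate generalization on real robots and on RLBench, and compare GNFactor against state-of-the-art baselines.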

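Proposed Method

  • Represent the observation as a 3D voxel grid (100^3) and encode it into a shared volumetric feature v.
  • Learn a generalizable neural feature field (GNF) that reconstructs RGB views and the vision-language embeddings produced by a diffusion-based foundation model.
  • Use a Perceiver Transformer to map the 3D features, proprioception, and language embeddings to action decisions.
  • Train with a joint objective: the GNF reconstruction loss (RGB and diffusion features) plus cross-entropy action losses over the translation, rotation, gripper, and collision heads.
  • Encode the task instruction with CLIP-based language features to produce a task embedding T that conditions the policy.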
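To make the 100^3 voxel grid in the first bullet concrete, the following is a minimal sketch of how a 3D point can be binned into such a grid. It is an illustration only, not the authors' code: the function names `voxel_index` and `voxelize` and the workspace bounds are hypothetical, and a real pipeline would also store per-voxel color and feature channels.

```python
def voxel_index(point, bounds_min, bounds_max, grid_size=100):
    """Map a 3D point to integer voxel indices, or None if it lies outside the bounds.

    bounds_min / bounds_max are assumed workspace limits (hypothetical values).
    """
    idx = []
    for p, lo, hi in zip(point, bounds_min, bounds_max):
        if not (lo <= p <= hi):
            return None
        cell = (hi - lo) / grid_size
        # Clamp so a point exactly on the upper bound falls in the last voxel.
        idx.append(min(int((p - lo) / cell), grid_size - 1))
    return tuple(idx)


def voxelize(points, bounds_min, bounds_max, grid_size=100):
    """Count how many points fall into each occupied voxel (sparse dict)."""
    occupancy = {}
    for pt in points:
        idx = voxel_index(pt, bounds_min, bounds_max, grid_size)
        if idx is not None:
            occupancy[idx] = occupancy.get(idx, 0) + 1
    return occupancy
```

A sparse dictionary is used here only to keep the sketch dependency-free; a dense array of shape (100, 100, 100, C) would be the more usual layout for a learned 3D encoder.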
Figure 1: Left: Three camera views used in the real robot setup to reconstruct the feature field generated by Stable Diffusion [5]. We segment the foreground feature for better illustration. Right: Three language-conditioned real robot tasks across two different kitchens.
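The joint training objective described in the method list can be sketched as a weighted sum of an action term and a reconstruction term. The code below is a minimal illustration under stated assumptions, not the paper's implementation: the weights `recon_weight` and `feat_weight`, their default values, the use of mean squared error for both reconstruction terms, and all function names are hypothetical choices made for this example.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one categorical prediction, computed stably from raw logits."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum_exp - logits[target]


def mse(pred, target):
    """Mean squared error between two equally long sequences of floats."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def joint_loss(head_logits, head_targets, rgb_pred, rgb_gt, feat_pred, feat_gt,
               recon_weight=1.0, feat_weight=1.0):
    """Action loss plus weighted reconstruction loss.

    head_logits / head_targets: dicts keyed by head name
    (e.g. translation, rotation, gripper, collision).
    recon_weight and feat_weight are placeholder hyperparameters.
    """
    action = sum(cross_entropy(head_logits[h], head_targets[h]) for h in head_logits)
    recon = mse(rgb_pred, rgb_gt) + feat_weight * mse(feat_pred, feat_gt)
    return action + recon_weight * recon
```

Because both terms are computed from the same shared volumetric feature, gradients from the reconstruction term also shape the representation that the policy heads consume, which is the point of training them jointly.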

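Experimental Results

Research Questions

  • RQ1: With limited demonstrations, can GNFactor outperform the baselines on multi-task learning in simulated RLBench?
  • RQ2: With limited data, can GNFactor generalize to unseen scenes and tasks, in simulation and beyond?
  • RQ3: Can GNFactor operate a real robot robustly across different kitchen environments, even when the data is noisy?
  • RQ4: Which components (GNF, diffusion features, depth-guided sampling, skip connections) matter most for performance and generalization?

Key Findings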

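Success rate (%)
Method / Task | close jar | open drawer | sweep to dustpan | meat off grill | turn tap | Average
PerAct | 18.7±8.2 | 54.7±18.6 | 0.0±0.0 | 40.0±17.0 | 38.7±6.8 |
PerAct (4 Cameras) | 21.3±7.5 | 44.0±11.3 | 0.0±0.0 | 65.3±13.2 | 46.7±3.8 |
GNFactor | 25.3±6.8 | 76.0±5.7 | 28.0±15.0 | 57.3±18.9 | 50.7±8.2 | 50.7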
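  • GNFactor outperforms PerAct on multi-task RLBench, with an average improvement of 1.55x on seen tasks and 1.57x on generalization tasks.
  • GNFactor achieves higher success rates than PerAct on individual tasks. In the RLBench comparison, for example, open drawer reaches 76.0% versus 54.7% and sweep to dustpan reaches 28.0% versus 0.0%.
  • In real-robot experiments across two kitchens, GNFactor achieves a higher average success rate and, unlike the baseline, maintains its performance when the environment changes.
  • Ablations show that GNF reconstruction, diffusion features, depth-guided sampling, and skip connections all contribute to performance; removing the RGB objective or the diffusion features degrades the results.
  • A PSNR analysis indicates that view synthesis with GNFactor is feasible, and Grad-CAM visualizations show that the policy attends to the target objects in 3D space.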
Figure 2: Simulation environments and the real robot setup. We show the RGB observations for our 10 RLBench tasks in Figure (a), the sampled views for GNF in Figure (b), and the real robot setup in Figure (c).
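The PSNR analysis mentioned in the findings uses the standard peak signal-to-noise ratio, PSNR = 10 · log10(MAX² / MSE), where a higher value means the synthesized view is closer to the ground-truth image. The sketch below shows that formula on flattened pixel lists. The function name `psnr` and the assumption that pixel values are scaled to [0, 1] are choices made for this example, not details taken from the paper.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flattened images.

    pred / target: equally long sequences of pixel values.
    max_value: largest possible pixel value (1.0 assumes images scaled to [0, 1]).
    """
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)
```

For instance, a rendered view whose pixels differ from the ground truth by 0.1 everywhere (on a [0, 1] scale) has an MSE of 0.01 and therefore a PSNR of about 20 dB.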

This review was generated by AI and reviewed by a human editor.