QUICK REVIEW

[Paper Review] GNFactor: Multi-Task Real Robot Learning with Generalizable Neural Feature Fields

Yanjie Ze, Ge Yan|arXiv (Cornell University)|2023. 08. 31.

Domain Adaptation and Few-Shot Learning | Citations: 11

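One-Line Summary

GNFactor learns a language-conditioned multi-task manipulation policy on top of a shared 3D volumetric representation, trained through a generalizable neural field (GNF) that distills vision-language features into it, which enables generalization on real robots and in simulation from a limited number of demonstrations.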

ABSTRACT

It is a long-standing problem in robotics to develop agents capable of executing diverse manipulation tasks from visual observations in unstructured real-world environments. To achieve this goal, the robot needs to have a comprehensive understanding of the 3D structure and semantics of the scene. In this work, we present $\textbf{GNFactor}$, a visual behavior cloning agent for multi-task robotic manipulation with $\textbf{G}$eneralizable $\textbf{N}$eural feature $\textbf{F}$ields. GNFactor jointly optimizes a generalizable neural field (GNF) as a reconstruction module and a Perceiver Transformer as a decision-making module, leveraging a shared deep 3D voxel representation. To incorporate semantics in 3D, the reconstruction module utilizes a vision-language foundation model ($\textit{e.g.}$, Stable Diffusion) to distill rich semantic information into the deep 3D voxel. We evaluate GNFactor on 3 real robot tasks and perform detailed ablations on 10 RLBench tasks with a limited number of demonstrations. We observe a substantial improvement of GNFactor over current state-of-the-art methods in seen and unseen tasks, demonstrating the strong generalization ability of GNFactor. Our project website is https://yanjieze.com/GNFactor/ .

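Research Motivation and Goals

Enable robust, language-conditioned multi-task manipulation from visual observations in unstructured real-world environments.
Propose a 3D voxel-based representation shared by the reconstruction module (GNF) and the decision-making module to improve generalization from limited demonstrations.
Integrate the vision-language semantics of a foundation model into the 3D representation to improve scene understanding and task execution.
Demonstrate generalization on real robots and across RLBench, and compare GNFactor against state-of-the-art baselines.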

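Proposed Method

Represent the observation as a 3D voxel grid (100^3) and encode it into a shared volumetric feature v.
Train a generalizable neural feature field (GNF) to reconstruct RGB views and the vision-language embeddings produced by a diffusion-based foundation model (Stable Diffusion).
Use a Perceiver Transformer to map the 3D features, proprioception, and language embedding to action decisions.
Train with a joint objective: the GNF reconstruction loss (RGB and diffusion features) plus a cross-entropy action loss over translation, rotation, gripper, and collision heads.
Ground the task instruction with CLIP-based language features to produce a task embedding T that conditions the policy.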
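
The voxelization and joint objective listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `voxel_index`, `cross_entropy`, and `joint_loss`, the workspace bounds, and the weighting factor `lam` (including its default value) are hypothetical and chosen only for illustration. The review states only that the grid is 100^3 and that the action loss is a cross-entropy over translation, rotation, gripper, and collision heads, combined with a reconstruction loss.

```python
import math

GRID = 100  # voxels per axis, matching the 100^3 grid mentioned in the review


def voxel_index(point, lo, hi, grid=GRID):
    """Map a 3D point inside the box [lo, hi] to integer voxel coordinates.

    `lo` and `hi` are assumed workspace bounds; they are not given in the review.
    """
    idx = []
    for p, a, b in zip(point, lo, hi):
        t = (p - a) / (b - a)                  # normalise the coordinate to [0, 1]
        i = int(t * grid)                      # bucket into `grid` bins
        idx.append(min(max(i, 0), grid - 1))   # clamp points on or past the boundary
    return tuple(idx)


def cross_entropy(logits, target):
    """Cross-entropy of one categorical head, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(head_logits, head_targets, recon_rgb, recon_feat, lam=0.01):
    """Action cross-entropy summed over heads, plus weighted reconstruction terms.

    `lam` is an assumed weighting factor; its default here is illustrative only.
    """
    action = sum(cross_entropy(head_logits[k], head_targets[k]) for k in head_logits)
    return action + lam * (recon_rgb + recon_feat)


# Toy usage with two-way heads; a real policy would use far larger discretizations.
logits = {name: [0.2, -0.1] for name in ("translation", "rotation", "gripper", "collision")}
targets = {name: 0 for name in logits}
loss = joint_loss(logits, targets, recon_rgb=0.5, recon_feat=0.3)
```

Summing the two terms into a single scalar is what lets one backward pass update the shared voxel encoder from both the reconstruction and the action signals, which is the design the review attributes to GNFactor.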

Figure 1: Left: Three camera views used in the real robot setup to reconstruct the feature field generated by Stable Diffusion [5]. We segment the foreground feature for better illustration. Right: Three language-conditioned real robot tasks across two different kitchens.

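Experimental Results

Research Questions

RQ1: Can GNFactor outperform baselines on simulated RLBench multi-task manipulation with limited demonstrations?
RQ2: Does GNFactor generalize to unseen scenes and tasks, in simulation and beyond?
RQ3: Does GNFactor perform real robot manipulation robustly across different kitchens, even with noisy data?
RQ4: Which component (GNF, diffusion features, depth-guided sampling, or skip connections) has the largest effect on performance and generalization?

Key Results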

Method / Task (success rate, %)	close jar	open drawer	sweep to dustpan	meat off grill	turn tap	Average
PerAct	18.7±8.2	54.7±18.6	0.0±0.0	40.0±17.0	38.7±6.8
PerAct (4 Cameras)	21.3±7.5	44.0±11.3	0.0±0.0	65.3±13.2	46.7±3.8
GNFactor	25.3±6.8	76.0±5.7	28.0±15.0	57.3±18.9	50.7±8.2	50.7

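GNFactor outperforms PerAct on multi-task RLBench, with an average improvement of 1.55x on seen tasks and 1.57x on generalization tasks.
GNFactor achieves higher success rates than PerAct on several RLBench tasks, for example 76.0% vs. 54.7% on open drawer and 28.0% vs. 0.0% on sweep to dustpan.
In real robot experiments across two kitchens, GNFactor achieves a higher average success rate and maintains its performance when the environment changes, whereas the baseline does not.
Ablations show that GNF reconstruction, diffusion features, depth-guided sampling, and skip connections all contribute to performance; removing the RGB objective or the diffusion features degrades results.
View synthesis by GNFactor is assessed with a PSNR analysis and found to be reasonable, and Grad-CAM visualizations indicate that the policy attends to the target objects in 3D space.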
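
For readers unfamiliar with the metric behind the view-synthesis analysis, PSNR follows the standard definition 10 · log10(MAX² / MSE). The sketch below assumes flattened pixel lists with intensities in [0, 1]; the function name `psnr` and the `max_value` parameter are illustrative and not taken from the paper.

```python
import math


def psnr(rendered, reference, max_value=1.0):
    """Peak signal-to-noise ratio (in dB) between two equally sized pixel lists."""
    if len(rendered) != len(reference) or not rendered:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((a - b) ** 2 for a, b in zip(rendered, reference)) / len(rendered)
    if mse == 0:
        return math.inf  # identical images: PSNR is unbounded
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy usage: a uniform error of 0.1 on a [0, 1] intensity scale gives about 20 dB.
score = psnr([0.6, 0.4], [0.5, 0.5])
```

Higher values indicate that a rendered view is closer to the reference image.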

Figure 2: Simulation environments and the real robot setup. We show the RGB observations for our 10 RLBench tasks in Figure (a), the sampled views for GNF in Figure (b), and the real robot setup in Figure (c).

This review was generated by AI and reviewed by a human editor.