QUICK REVIEW

[Paper Review] Do As I Can, Not As I Say: Grounding Language in Robotic Affordances

Michael J. Ahn, Anthony Brohan|arXiv (Cornell University)|2022. 04. 04.

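Multimodal Machine Learning Applications | Citations: 512

One-line summary

This paper presents SayCan, a framework for grounding large language models (LLMs) in robotics. It connects high-level planning with information from learned affordances, enabling a mobile manipulator to execute long-horizon instructions in the real world.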

ABSTRACT

Large language models can encode a wealth of semantic knowledge about the world. Such knowledge could be extremely useful to robots aiming to act upon high-level, temporally extended instructions expressed in natural language. However, a significant weakness of language models is that they lack real-world experience, which makes it difficult to leverage them for decision making within a given embodiment. For example, asking a language model to describe how to clean a spill might result in a reasonable narrative, but it may not be applicable to a particular agent, such as a robot, that needs to perform this task in a particular environment. We propose to provide real-world grounding by means of pretrained skills, which are used to constrain the model to propose natural language actions that are both feasible and contextually appropriate. The robot can act as the language model's "hands and eyes," while the language model supplies high-level semantic knowledge about the task. We show how low-level skills can be combined with large language models so that the language model provides high-level knowledge about the procedures for performing complex and temporally-extended instructions, while value functions associated with these skills provide the grounding necessary to connect this knowledge to a particular physical environment. We evaluate our method on a number of real-world robotic tasks, where we show the need for real-world grounding and that this approach is capable of completing long-horizon, abstract, natural language instructions on a mobile manipulator. The project's website and the video can be found at https://say-can.github.io/.

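Research motivation and goals

The work is motivated by the observation that LLMs lack real-world grounding and can fail when deployed on embodied agents.
It proposes grounding LLM outputs in the world-aware affordances of pretrained skills.
It aims to produce step-by-step plans that are both executable in the robot's environment and interpretable.
It demonstrates real-world performance on long-horizon kitchen tasks carried out by a mobile robot.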

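Proposed method

Each low-level skill is represented by a policy and a TD-learned value function (its affordance).
Given an instruction i, the LLM computes p(ell_pi|i) for each skill description ell_pi.
The affordance of each skill, p(c_pi|s,ell_pi), is computed as the probability that the skill succeeds from state s.
To select the next skill pi, the two scores are combined as p(c_pi|s,ell_pi) * p(ell_pi|i).
The selected skill is executed and the LLM is queried again with the updated context; this repeats step by step.
Language-conditioned policies are learned with behavioral cloning (BC) or reinforcement learning (RL) in a multi-task setup conditioned on text embeddings.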
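
The scoring and re-query steps above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names (saycan_select, saycan_plan, run_toy_example), the hand-written LLM scores, the toy world state, and the use of a "done" skill to end the plan are all assumptions made for the example. A real system would obtain p(ell_pi|i) from an LLM and p(c_pi|s,ell_pi) from learned value functions.

```python
from typing import Callable, Dict, List, Tuple


def saycan_select(llm_probs: Dict[str, float],
                  affordances: Dict[str, float]) -> Tuple[str, Dict[str, float]]:
    """Pick the skill maximizing p(c_pi|s,ell_pi) * p(ell_pi|i)."""
    scores = {skill: llm_probs[skill] * affordances[skill] for skill in llm_probs}
    best = max(scores, key=scores.get)
    return best, scores


def saycan_plan(instruction: str,
                skills: List[str],
                llm_score: Callable[[str, List[str], str], float],
                affordance: Callable[[str], float],
                execute: Callable[[str], None],
                stop_skill: str = "done",      # assumed termination convention
                max_steps: int = 10) -> List[str]:
    """Repeatedly score all skills, execute the best one, and re-query."""
    history: List[str] = []
    for _ in range(max_steps):
        llm_probs = {sk: llm_score(instruction, history, sk) for sk in skills}
        affs = {sk: affordance(sk) for sk in skills}
        best, _ = saycan_select(llm_probs, affs)
        if best == stop_skill:
            break
        execute(best)          # the world state changes here
        history.append(best)   # updated context for the next LLM query
    return history


def run_toy_example() -> List[str]:
    """Toy run with hand-written numbers standing in for the LLM and value functions."""
    find, pick, bring, done = ("find a sponge", "pick up the sponge",
                               "bring it to you", "done")
    skills = [find, pick, bring, done]
    world = {"near_sponge": False, "holding": False, "delivered": False}

    # Hypothetical LLM scores, indexed by the number of steps already taken.
    toy_llm = [
        {find: 0.30, pick: 0.50, bring: 0.15, done: 0.05},
        {find: 0.05, pick: 0.60, bring: 0.30, done: 0.05},
        {find: 0.05, pick: 0.05, bring: 0.80, done: 0.10},
        {find: 0.02, pick: 0.02, bring: 0.06, done: 0.90},
    ]

    def llm_score(instruction: str, history: List[str], skill: str) -> float:
        return toy_llm[min(len(history), len(toy_llm) - 1)][skill]

    def affordance(skill: str) -> float:
        # Hypothetical success probabilities that depend on the current state.
        if skill == find:
            return 0.1 if world["near_sponge"] else 0.9
        if skill == pick:
            return 0.9 if world["near_sponge"] and not world["holding"] else 0.05
        if skill == bring:
            return 0.9 if world["holding"] and not world["delivered"] else 0.05
        return 1.0  # "done" is assumed to be always feasible

    def execute(skill: str) -> None:
        if skill == find:
            world["near_sponge"] = True
        elif skill == pick:
            world["holding"] = True
        elif skill == bring:
            world["delivered"] = True

    return saycan_plan("bring me a sponge", skills, llm_score, affordance, execute)


if __name__ == "__main__":
    print(run_toy_example())
```

In the toy run, the stand-in LLM initially prefers "pick up the sponge" (0.50), but its affordance is low because the robot is not yet near a sponge, so the product favors "find a sponge" (0.30 * 0.9 = 0.27 versus 0.50 * 0.05 = 0.025). That is the role the affordance term plays in the combined score.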

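Experimental results

Research questions

RQ1: Can an embodied agent execute high-level natural language instructions by grounding LLM knowledge in real-world affordances?
RQ2: Does combining LLM-guided planning with skill affordances improve planning and execution on a real robot?
RQ3: How well does the approach scale to long-horizon, abstract tasks in a kitchen environment?
RQ4: How do different language models and grounding components affect performance?
RQ5: What is the effect of adding new skills to the system?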

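Key results

PaLM-SayCan achieves 84% planning success and 74% execution success in a mock kitchen.
In a real kitchen, the rates drop to 81% for planning and 60% for execution, indicating reasonable generalization to the real world.
Combining affordance grounding with LLM guidance nearly doubles performance relative to non-grounded baselines.
Larger LLMs improve performance: in the full system, PaLM (540B) outperforms FLAN in both planning and execution.
Ablation experiments show that both language grounding and affordance grounding are needed for strong performance.
The system can readily incorporate new skills (e.g., drawer manipulation) while maintaining performance on existing tasks.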


This review was generated by AI and reviewed by a human editor.