QUICK REVIEW

[Paper Review] ViNT: A Foundation Model for Visual Navigation

Dhruv Shah, Ajay Sridhar | arXiv (Cornell University) | 2023-06-26

Multimodal Machine Learning Applications | Citations: 14

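One-Line Summary

ViNT is a Transformer-based foundation model for visual navigation, trained on diverse real-world datasets. It generalizes zero-shot across robots and environments, can be guided by diffusion-based subgoal proposals, and can be fine-tuned for new task modalities.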

ABSTRACT

General-purpose pre-trained models ("foundation models") have enabled practitioners to produce generalizable solutions for individual machine learning problems with datasets that are significantly smaller than those required for learning from scratch. Such models are typically trained on large and diverse datasets with weak supervision, consuming much more training data than is available for any individual downstream application. In this paper, we describe the Visual Navigation Transformer (ViNT), a foundation model that aims to bring the success of general-purpose pre-trained models to vision-based robotic navigation. ViNT is trained with a general goal-reaching objective that can be used with any navigation dataset, and employs a flexible Transformer-based architecture to learn navigational affordances and enable efficient adaptation to a variety of downstream navigational tasks. ViNT is trained on a number of existing navigation datasets, comprising hundreds of hours of robotic navigation from a variety of different robotic platforms, and exhibits positive transfer, outperforming specialist models trained on singular datasets. ViNT can be augmented with diffusion-based subgoal proposals to explore novel environments, and can solve kilometer-scale navigation problems when equipped with long-range heuristics. ViNT can also be adapted to novel task specifications with a technique inspired by prompt-tuning, where the goal encoder is replaced by an encoding of another task modality (e.g., GPS waypoints or routing commands) embedded into the same space of goal tokens. This flexibility and ability to accommodate a variety of downstream problem domains establishes ViNT as an effective foundation model for mobile robotics. For videos, code, and model checkpoints, see our project page at https://visualnav-transformer.github.io.

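Motivation and Objectives

Build a general-purpose pre-trained visual navigation policy that transfers across robot embodiments and environments without task-specific training.
Learn navigation by reaching image-specified subgoals from egocentric visual observations.
Enable zero-shot deployment and efficient fine-tuning for downstream navigation modalities (e.g., GPS waypoints, routing commands).
Leverage large, heterogeneous real-world datasets to induce broad navigational priors and emergent behaviors.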

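Proposed Method

Uses a 31M-parameter Transformer-based architecture that tokenizes past observations and the goal image, with a dedicated goal-fusion encoder that represents the goal relative to the current observation.
Trains end-to-end with a maximum-likelihood objective to predict a sequence of future actions and the dynamical (temporal) distance to the goal.
Adopts an embodiment-agnostic action space of relative waypoints normalized by the robot's top speed, executed by a PD controller.
Proposes candidate subgoals with a diffusion model and uses ViNT to compute the temporal distance and actions for each candidate, which spatially grounds the proposals for long-horizon exploration.
Integrates a topological graph planner as episodic memory to support long-horizon planning and exploration in previously unseen environments.
To demonstrate adaptation to downstream task modalities, maps each new modality into ViNT's goal-token space with a lightweight, prompt-tuning-like mechanism; when needed, the full model can be fine-tuned end-to-end on a small task-specific dataset.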
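
The embodiment-agnostic action space can be illustrated with a minimal sketch. It assumes that a normalized waypoint is rescaled by the distance the robot covers in one control step at top speed, and that a simple PD law on the heading error produces the angular velocity. The function names (`denormalize_waypoint`, `pd_step`), the gains, and the speed heuristic are hypothetical and are not taken from the ViNT codebase.

```python
import math


def denormalize_waypoint(norm_wp, top_speed, dt):
    """Scale a normalized relative waypoint (dx, dy) back to metres.

    Assumption: normalization divided by top_speed * dt, i.e. the distance
    covered in one control step at the robot's top speed.
    """
    scale = top_speed * dt
    return (norm_wp[0] * scale, norm_wp[1] * scale)


def pd_step(waypoint, prev_heading_error, dt, top_speed, max_yaw_rate,
            kp=1.0, kd=0.1):
    """One PD update toward a waypoint in the robot frame (x forward, y left).

    Returns (linear_velocity, angular_velocity, heading_error). The gains
    kp and kd are illustrative values, not tuned for any real robot.
    """
    dx, dy = waypoint
    heading_error = math.atan2(dy, dx)

    # PD law on the heading error, clamped to the robot's yaw-rate limit.
    derivative = (heading_error - prev_heading_error) / dt
    yaw_rate = kp * heading_error + kd * derivative
    yaw_rate = max(-max_yaw_rate, min(max_yaw_rate, yaw_rate))

    # Heuristic (not part of the PD law): slow down when the waypoint is
    # off-axis, never drive backwards, and never overshoot the waypoint.
    linear = top_speed * max(0.0, math.cos(heading_error))
    linear = min(linear, math.hypot(dx, dy) / dt)
    return linear, yaw_rate, heading_error


if __name__ == "__main__":
    # The same normalized output maps to different metric waypoints
    # depending on the robot's top speed.
    for name, speed in [("slow robot", 0.5), ("fast robot", 2.0)]:
        wp = denormalize_waypoint((0.5, 0.2), top_speed=speed, dt=0.5)
        print(name, wp, pd_step(wp, 0.0, 0.5, speed, max_yaw_rate=1.0))
```

Because the policy output is expressed relative to each robot's top speed, the same prediction yields a short waypoint for a slow platform and a longer one for a fast platform, and only the low-level controller is robot-specific.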

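Experimental Results

Research Questions

RQ1: Can ViNT generalize zero-shot to new robots and environments for visual navigation?
RQ2: How well does ViNT perform in long-horizon exploration when combined with diffusion-based subgoal proposals and a topological planner?
RQ3: How effectively can ViNT be fine-tuned or adapted to new task modalities (e.g., GPS waypoints, routing directions) with limited data?
RQ4: Does ViNT exhibit robust emergent navigation behaviors and transfer its navigational priors to unseen tasks?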

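Key Results

ViNT achieves strong zero-shot generalization across multiple robots and environments, including a Go 1 quadruped not seen during training.
Combined with diffusion-based subgoal proposals and a topological planner, ViNT outperforms the baselines on indoor and outdoor goal-reaching tasks (Table 1).
In the indoor GPS and outdoor satellite settings, ViNT achieves high success rates (0.90 indoors, 0.95–1.00 outdoors) with favorable path-quality figures (e.g., 91 m indoors; 1270 m outdoors, SPL 0.84; 1040 m outdoors, SPL 0.94).
Fine-tuning ViNT with just 1 hour of on-task data yields strong performance in new domains (e.g., autonomous driving in CARLA) and with new modalities (Images, Positions, Routing).
New modalities can be adapted through a lightweight mapping into the shared goal-token space, and full end-to-end fine-tuning further improves task performance.
Emergent behaviors include implicit collision avoidance as a default behavior and built-in navigational preferences (e.g., following roads, staying in hallways), along with robustness to dynamic pedestrians.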
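
For readers unfamiliar with the SPL figures quoted above, the following is a minimal sketch. It assumes SPL refers to the standard "Success weighted by Path Length" metric, in which each successful episode contributes the ratio of the shortest path length to the path length actually travelled. The function name `spl` is hypothetical, and the episode values in the example are made up for illustration; they are not results from the paper.

```python
def spl(episodes):
    """Success weighted by Path Length.

    episodes: iterable of (success, shortest_len, path_len) tuples, where
    success is a bool and both lengths are in the same unit (e.g. metres).
    Failed episodes contribute 0; successful ones contribute
    shortest_len / max(path_len, shortest_len).
    """
    episodes = list(episodes)
    if not episodes:
        raise ValueError("need at least one episode")
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            total += shortest / max(taken, shortest)
    return total / len(episodes)


if __name__ == "__main__":
    # Illustrative episodes only: one optimal success, one success with a
    # detour twice as long as the shortest path, and one failure.
    runs = [(True, 10.0, 10.0), (True, 10.0, 20.0), (False, 10.0, 5.0)]
    print(spl(runs))
```

A value close to 1 means the robot both succeeded and took nearly the shortest route, so SPL penalizes detours as well as failures.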

This review was generated by AI and reviewed by a human editor.