QUICK REVIEW

[Paper Review] Vision Language Models in Autonomous Driving: A Survey and Outlook

Xingcheng Zhou, Mingyu Liu | arXiv (Cornell University) | October 22, 2023

Multimodal Machine Learning Applications | Citations: 17

One-Line Summary

This paper surveys Vision-Language Models (VLMs) in Autonomous Driving (AD) and Intelligent Transportation Systems (ITS), categorizing the models, datasets, application areas, and open challenges.

ABSTRACT

The applications of Vision-Language Models (VLMs) in the field of Autonomous Driving (AD) have attracted widespread attention due to their outstanding performance and the ability to leverage Large Language Models (LLMs). By incorporating language data, driving systems can gain a better understanding of real-world environments, thereby enhancing driving safety and efficiency. In this work, we present a comprehensive and systematic survey of the advances in vision language models in this domain, encompassing perception and understanding, navigation and planning, decision-making and control, end-to-end autonomous driving, and data generation. We introduce the mainstream VLM tasks in AD and the commonly utilized metrics. Additionally, we review current studies and applications in various areas and summarize the existing language-enhanced autonomous driving datasets thoroughly. Lastly, we discuss the benefits and challenges of VLMs in AD and provide researchers with the current research gaps and future trends.

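Research Motivation and Objectives

Provide a comprehensive overview of how Vision-Language Models are used in Autonomous Driving and Intelligent Transportation Systems.
Classify VLM architectures by input-output modality (M2T, M2V, V2T) and by cross-modal strategy (VTF vs. VTM).
Summarize the existing datasets and tasks that use VLMs in AD/ITS.
Identify current challenges, gaps, and open research directions to guide future work on VLM-enabled AD/ITS.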

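Proposed Method

Introduces the background on autonomous driving, ITS, LLMs, and VLMs.
Proposes a taxonomy of VLMs for AD/ITS based on input-output modality and cross-modal connection (VTF vs. VTM).
Systematically reviews existing work that uses VLMs in AD (perception, navigation, decision-making, end-to-end driving, data generation) and in ITS (perception, ITS management).
Summarizes the datasets and tasks used in the domain (e.g., image/video, text, and point cloud data) and the types of analysis performed.
Discusses challenges, research gaps, and future directions to support ongoing and future research on VLMs for AD/ITS.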

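Experimental Results

Research Questions

RQ1: Which Vision-Language Model architectures and input-output modalities are currently used in Autonomous Driving and Intelligent Transportation Systems?
RQ2: How are VLMs integrated into perception, navigation, planning, decision-making, end-to-end driving, and data generation tasks within AD/ITS?
RQ3: Which datasets, tasks, and benchmarks are most widely used to evaluate VLMs in AD/ITS?
RQ4: What are the main challenges and gaps that hinder the adoption and advancement of VLMs in AD/ITS, and which directions are promising for future research?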

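Key Findings

The study provides the first comprehensive survey of Vision-Language Models in Autonomous Driving and ITS.
It systematically summarizes and analyzes existing VLM research and datasets across AD and ITS.
It identifies potential applications and technical advances of VLMs in AD and ITS.
It discusses the domain's challenges and research gaps to guide future exploration and development.
It clarifies the taxonomy of VLMs (M2T, M2V, V2T) and the cross-modal strategies (Vision-Text-Fusion vs. Vision-Text-Matching).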
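The review names Vision-Text-Matching (VTM) as one cross-modal strategy but does not describe how matching is computed. As a rough illustration only, the sketch below assumes a common setup in which an image and several captions are already embedded in a shared vector space and are compared by cosine similarity. The function names, the captions, and the hand-written three-dimensional vectors are all hypothetical; they do not come from the paper or from any specific model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def match_caption(image_embedding, caption_embeddings):
    """Return the caption whose embedding is most similar to the image embedding."""
    return max(
        caption_embeddings,
        key=lambda caption: cosine_similarity(
            image_embedding, caption_embeddings[caption]
        ),
    )


# Toy embeddings written by hand (assumption: a real system would obtain
# these from trained image and text encoders).
image_embedding = [0.9, 0.1, 0.0]
caption_embeddings = {
    "a pedestrian crossing the road": [1.0, 0.0, 0.1],
    "an empty highway at night": [0.0, 1.0, 0.2],
}

best = match_caption(image_embedding, caption_embeddings)
print(best)
```

By contrast, a Vision-Text-Fusion approach would combine the two modalities into a joint representation rather than scoring them against each other; that case is not sketched here.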

This review was generated by AI and reviewed by a human editor.