QUICK REVIEW

[Paper Review] DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States for Autonomous Driving

Cui, Erfei, Wenhai Wang | arXiv (Cornell University) | 2023-12-14

Topic Modeling | Citations: 21

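One-Line Summary

DriveMLM introduces a multimodal LLM planner aligned with the behavior planner of a modular autonomous driving system. It enables closed-loop driving in CARLA, explains its decisions, and outperforms Apollo on Town05 Long.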

ABSTRACT

Large language models (LLMs) have opened up new possibilities for intelligent agents, endowing them with human-like thinking and cognitive abilities. In this work, we delve into the potential of large language models (LLMs) in autonomous driving (AD). We introduce DriveMLM, an LLM-based AD framework that can perform close-loop autonomous driving in realistic simulators. To this end, (1) we bridge the gap between the language decisions and the vehicle control commands by standardizing the decision states according to the off-the-shelf motion planning module. (2) We employ a multimodal LLM (MLLM) to model the behavior planning module of a module AD system, which uses driving rules, user commands, and inputs from various sensors (e.g., camera, lidar) as input and makes driving decisions and provide explanations; This model can plug-and-play in existing AD systems such as Autopilot and Apollo for close-loop driving. (3) We design an effective data engine to collect a dataset that includes decision state and corresponding explanation annotation for model training and evaluation. We conduct extensive experiments and show that replacing the decision-making modules of the Autopilot and Apollo with DriveMLM resulted in significant improvements of 3.2 and 4.7 points on the CARLA Town05 Long respectively, demonstrating the effectiveness of our model. We hope this work can serve as a baseline for autonomous driving with LLMs.

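Motivation and Objectives

Bridge the gap between language-based decisions and vehicle control by aligning LLM outputs with behavioral planning states.
Develop a multimodal LLM planner that takes multi-view images, LiDAR, driving rules, and user instructions as input and predicts driving decisions with explanations.
Build an efficient data engine that collects driving data annotated with decision states and explanations for training and evaluation.
Demonstrate closed-loop driving in a realistic simulator and compare against Apollo on a standard benchmark.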

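Proposed Method

Behavioral planning state alignment: maps LLM outputs to executable speed and path decisions in an Apollo-style planner.
MLLM planner: a multimodal tokenizer (temporal QFormer for images, SPT + QFormer for LiDAR, text embeddings) combined with an LLM decoder that outputs decision states and explanations.
Temporal QFormer: designed to process temporal multi-view images efficiently, without linear growth in token count.
Data engine: generates 280 hours of CARLA data with decision states and explanations, using expert driving and GPT-3.5 to augment the annotations.
Training setup: LLaMA-7B with a ViT-g/14 visual encoder, 32 queries for image tokens, GD-MAE for LiDAR, the AdamW optimizer, learning rate 5e-5, 2 epochs, and batch size 256.
Evaluation metrics: Driving Score (DS), Route Completion (RC), Infraction Score (IS), Miles Per Intervention (MPI), and NLP-based metrics for explanations (BLEU-4, CIDEr, METEOR).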
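
The alignment step above can be pictured as turning free-form LLM text into one speed state and one path state that a downstream motion planner can execute. The following is a minimal sketch under stated assumptions: the state names, the `parse_decision` helper, and the conservative fallback are all illustrative and are not taken from this review.

```python
from dataclasses import dataclass
from enum import Enum


class SpeedDecision(Enum):
    # Hypothetical speed states; the review does not enumerate them.
    KEEP = "KEEP"
    ACCELERATE = "ACCELERATE"
    DECELERATE = "DECELERATE"
    STOP = "STOP"


class PathDecision(Enum):
    # Hypothetical path states; the review does not enumerate them.
    FOLLOW = "FOLLOW"
    LEFT_CHANGE = "LEFT_CHANGE"
    RIGHT_CHANGE = "RIGHT_CHANGE"
    LEFT_BORROW = "LEFT_BORROW"
    RIGHT_BORROW = "RIGHT_BORROW"


@dataclass(frozen=True)
class DecisionState:
    speed: SpeedDecision
    path: PathDecision


def parse_decision(llm_text: str) -> DecisionState:
    """Map free-form LLM output to one speed state and one path state.

    Uses a naive substring match on the state names. When no known name
    is found, it falls back to (STOP, FOLLOW); that fallback is an
    assumption of this sketch, not a documented behavior.
    """
    normalized = llm_text.upper().replace(" ", "_")
    speed = next(
        (s for s in SpeedDecision if s.value in normalized),
        SpeedDecision.STOP,
    )
    path = next(
        (p for p in PathDecision if p.value in normalized),
        PathDecision.FOLLOW,
    )
    return DecisionState(speed, path)


state = parse_decision("Decelerate and left change because the lane ahead is blocked")
```

Restricting the output to a small closed set of states is what lets an existing planner consume the result without any change to its control code. A substring match is fragile (the word "following" in an explanation would match FOLLOW), so a real system would need a stricter output format.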

[Figure, panel (a): Rule-Based Autonomous Driving System [3]]

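Experimental Results

Research Questions

RQ1: Can an LLM-based planner be aligned with existing behavioral planning decision states to enable closed-loop autonomous driving?
RQ2: Does the multimodal LLM planner improve decision accuracy and driving safety over rule-based FSM baselines?
RQ3: Are driving decisions and their explanations interpretable and controllable through language-conditioned prompts?
RQ4: How do sensor modalities (images, LiDAR) and temporal processing affect decision accuracy and explainability?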

Key Results

Method	DS	RC	IS	MPI
Roach	43.6	80.4	0.54	-
Interfuser	68.3	95.0	0.72	0.70
ThinkTwice	70.9	95.5	0.75	0.40
Apollo	71.4	92.2	0.80	0.76
DriveMLM	76.1	98.1	0.78	0.96
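
As a quick consistency check on the table, the DS margin of DriveMLM over each baseline can be recomputed from the rows. This is a small sketch: the numbers are copied from the table, while the `RESULTS` dictionary and the `ds_gain` helper are names introduced here for illustration.

```python
# (DS, RC, IS, MPI) per method, copied from the table above.
# None marks the MPI value that the table leaves blank for Roach.
RESULTS = {
    "Roach": (43.6, 80.4, 0.54, None),
    "Interfuser": (68.3, 95.0, 0.72, 0.70),
    "ThinkTwice": (70.9, 95.5, 0.75, 0.40),
    "Apollo": (71.4, 92.2, 0.80, 0.76),
    "DriveMLM": (76.1, 98.1, 0.78, 0.96),
}


def ds_gain(method: str, baseline: str, results=RESULTS) -> float:
    """Return the Driving Score difference, rounded to one decimal place."""
    return round(results[method][0] - results[baseline][0], 1)


# DS gain of DriveMLM over every other method in the table.
gains = {
    name: ds_gain("DriveMLM", name)
    for name in RESULTS
    if name != "DriveMLM"
}
```

Here `gains["Apollo"]` evaluates to 4.7, which matches the margin quoted in the abstract. The same table also shows that Apollo keeps the highest IS (0.80 versus 0.78), so the DS gain comes with a higher route completion rather than fewer infractions.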

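DriveMLM achieves a Driving Score of 76.1 on CARLA Town05 Long, 4.7 points higher than Apollo (71.4).
DriveMLM reaches 98.1 Route Completion and a 0.78 Infraction Score, and its MPI of 0.96 suggests that little human intervention was needed.
DriveMLM scores higher than Apollo and earlier LLM baselines on decision prediction accuracy and per-decision-type F1.
DriveMLM produces high-quality explanations, with BLEU-4 0.89, CIDEr 0.91, and METEOR 0.61.
Ablations show that multi-view images combined with the Temporal QFormer perform best, improving path F1 and speed F1 and raising accuracy by about 18.2%.
DriveMLM demonstrates zero-shot inference on nuScenes and flexible control through instructions (for example, yielding to an ambulance or adjusting traffic rules).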
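
A reader may notice that the reported DS of 76.1 is not simply RC times IS (98.1 × 0.78 ≈ 76.5). One plausible explanation, assuming the CARLA Leaderboard convention (which this review does not spell out), is that DS is computed per route as route completion times infraction score and then averaged over routes. The sketch below illustrates that convention with hypothetical per-route values that are not from the paper.

```python
from statistics import mean


def driving_score(routes):
    """Average of per-route (route completion % x infraction score).

    `routes` is an iterable of (route_completion_percent, infraction_score)
    pairs. This follows the CARLA Leaderboard style of averaging per-route
    products, which is an assumption about how the reviewed numbers arise.
    """
    return mean(rc * inf for rc, inf in routes)


# Hypothetical per-route results for illustration only, NOT from the paper.
routes = [(100.0, 0.5), (100.0, 1.0), (90.0, 1.0)]

ds = driving_score(routes)              # mean of per-route products
avg_rc = mean(rc for rc, _ in routes)   # average route completion
avg_is = mean(inf for _, inf in routes) # average infraction score
product_of_means = avg_rc * avg_is      # generally differs from ds
```

With these made-up routes, `ds` is 80.0 while `product_of_means` is roughly 80.6, so the three aggregate columns in a results table need not multiply out exactly.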

[Figure, panel (b): End-to-End Autonomous Driving System [25, 27, 57]]


This review was generated by AI and reviewed by a human editor.