QUICK REVIEW

[Paper Review] LiDAR-LLM: Exploring the Potential of Large Language Models for 3D LiDAR Understanding

Senqiao Yang, Jiaming Liu|arXiv (Cornell University)|Dec 21, 2023

Multimodal Machine Learning Applications | Citations: 13

One-Line Summary

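LiDAR-LLM is a framework that uses a View-Aware Transformer to align 3D LiDAR data with a frozen large language model and, through a three-stage training strategy, performs 3D captioning, grounding, and high-level instruction following on outdoor LiDAR scenes.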

ABSTRACT

Recently, Large Language Models (LLMs) and Multimodal Large Language Models (MLLMs) have shown promise in instruction following and 2D image understanding. While these models are powerful, they have not yet been developed to comprehend the more challenging 3D physical scenes, especially when it comes to the sparse outdoor LiDAR data. In this paper, we introduce LiDAR-LLM, which takes raw LiDAR data as input and harnesses the remarkable reasoning capabilities of LLMs to gain a comprehensive understanding of outdoor 3D scenes. The central insight of our LiDAR-LLM is the reformulation of 3D outdoor scene cognition as a language modeling problem, encompassing tasks such as 3D captioning, 3D grounding, 3D question answering, etc. Specifically, due to the scarcity of 3D LiDAR-text pairing data, we introduce a three-stage training strategy and generate relevant datasets, progressively aligning the 3D modality with the language embedding space of LLM. Furthermore, we design a View-Aware Transformer (VAT) to connect the 3D encoder with the LLM, which effectively bridges the modality gap and enhances the LLM's spatial orientation comprehension of visual features. Our experiments show that LiDAR-LLM possesses favorable capabilities to comprehend various instructions regarding 3D scenes and engage in complex spatial reasoning. LiDAR-LLM attains a 40.9 BLEU-1 on the 3D captioning task and achieves a 63.1% classification accuracy and a 14.3% BEV mIoU on the 3D grounding task. Web page: https://sites.google.com/view/lidar-llm

Motivation and Objectives

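Reformulate outdoor 3D scene understanding as a language modeling problem so that the reasoning ability of LLMs can be applied to it.
Build a cross-modal bridge between LiDAR features and text embeddings.
Create LiDAR-text paired datasets that allow the two modalities to be aligned progressively.
Enable 3D captioning, grounding, and high-level instruction following on outdoor LiDAR data.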

Proposed Method

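Introduces a View-Aware Transformer (VAT) that connects the 3D LiDAR encoder to the LLM through six view position embeddings.
Proposes a three-stage training strategy: cross-modal alignment (3D captioning), perception (visual grounding and grounded captioning), and high-level instruction fine-tuning.
Generates LiDAR-text paired datasets (420K captioning samples, 280K grounding samples) and fine-tunes adapters while keeping the main modules frozen.
Uses the VAT with learnable queries (K=576) to project LiDAR BEV features into the word embedding space of a pretrained LLM (LLaMA-7B).
Uses a 3D feature extractor (CenterPoint-Voxel) and flattens its features along the z-axis into a BEV representation for processing efficiency.
Evaluates on 3D captioning, 3D grounding, and NuScenes-QA-style high-level instruction tasks, with ablations over the VAT components and the training stages.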
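The z-axis flattening and the six view position embeddings described above can be illustrated with a small pure-Python sketch. It is not the authors' implementation: folding z into the channel dimension is one common way to obtain a BEV map, and assigning BEV cells to six equal angular sectors around the grid centre is an assumption made here for illustration. The names `flatten_z`, `view_index`, and `add_view_embeddings` are hypothetical.

```python
import math

NUM_VIEWS = 6  # the review mentions six view position embeddings


def flatten_z(voxels):
    """Fold the z axis into channels: [C][Z][H][W] -> [C*Z][H][W] (assumed scheme)."""
    return [plane for channel in voxels for plane in channel]


def view_index(row, col, height, width, num_views=NUM_VIEWS):
    """Map a BEV cell to one of num_views equal angular sectors (assumed layout)."""
    dy = row + 0.5 - height / 2
    dx = col + 0.5 - width / 2
    angle = math.atan2(dy, dx) % (2 * math.pi)
    return min(int(angle / (2 * math.pi / num_views)), num_views - 1)


def add_view_embeddings(bev, view_embeddings):
    """Return a copy of bev with each cell shifted by its sector's embedding vector."""
    channels, height, width = len(bev), len(bev[0]), len(bev[0][0])
    out = [[row[:] for row in plane] for plane in bev]
    for r in range(height):
        for c in range(width):
            v = view_index(r, c, height, width)
            for ch in range(channels):
                out[ch][r][c] += view_embeddings[v][ch]
    return out


# Toy input: C=3 channels, Z=2 height bins, 4x4 BEV grid, all ones.
voxels = [[[[1.0] * 4 for _ in range(4)] for _ in range(2)] for _ in range(3)]
bev = flatten_z(voxels)  # 6 BEV channels
# Toy embeddings: view v contributes the constant v on every channel.
embeddings = [[float(v)] * len(bev) for v in range(NUM_VIEWS)]
tokens = add_view_embeddings(bev, embeddings)
```

In a trained model the embeddings would be learned parameters, and the resulting view-tagged BEV features would be attended to by the K=576 learnable queries before being passed to the LLM; that attention step is omitted here.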

Experimental Results

Research Questions

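RQ1: Can sparse outdoor LiDAR data be interpreted effectively by aligning 3D representations with the language space of an LLM?
RQ2: Does introducing a View-Aware Transformer with view-based position embeddings improve spatial reasoning and localization in LiDAR-to-text alignment?
RQ3: How does the three-stage training pipeline (alignment, perception, high-level instruction) affect performance on 3D captioning, grounding, and VQA-style tasks?
RQ4: What are the benefits of LiDAR-text paired data for cross-modal learning on outdoor 3D scenes?
RQ5: Can the system exhibit planning-like reasoning for autonomous driving without planning-specific data?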

Key Findings

Task	Model	BLEU-1	BLEU-2	BLEU-3	BLEU-4	BERTScore
3D Captioning	Mini-GPT4	14.97	6.76	3.74	2.63	84.38
3D Captioning	LLaVA1.5	19.92	12.10	8.57	5.37	85.01
3D Captioning	Instruct-BLIP	18.67	13.38	7.41	5.20	85.89
3D Captioning	LLaMA-AdapterV2	30.17	17.34	10.40	7.45	86.45
3D Captioning	Ours	40.98	29.96	23.43	19.26	91.32

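LiDAR-LLM achieves a BLEU-1 of 40.9 on 3D captioning on nu-Caption.
On 3D grounding (Car, BEV), it achieves 63.1% classification accuracy and 14.3% BEV mIoU.
LiDAR-LLM outperforms 2D MLLMs on the 3D captioning benchmark in both BLEU and BERTScore.
Ablations show that the VAT with view position embeddings improves BLEU-4 and BERTScore over the baseline, and that three-stage training improves high-level instruction performance on nuScenes-QA.
The model exhibits zero-shot planning ability and effective cross-modal alignment through its staged training approach.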
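For readers unfamiliar with the metrics quoted above, the sketch below shows how a BLEU-1 score and a BEV IoU can be computed in principle. It is a simplified illustration and not the paper's evaluation code: BLEU-1 is computed against a single reference with whitespace tokenization, and BEV boxes are treated as axis-aligned `(x_min, y_min, x_max, y_max)` tuples, ignoring yaw. The function names are hypothetical, and the scores are on a 0 to 1 scale, whereas the table appears to report them multiplied by 100.

```python
import math
from collections import Counter


def bleu1(candidate, reference):
    """Clipped unigram precision times the brevity penalty (single reference)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    ref_counts = Counter(ref)
    # Each candidate token is credited at most as often as it occurs in the reference.
    clipped = sum(min(n, ref_counts[tok]) for tok, n in Counter(cand).items())
    precision = clipped / len(cand)
    # Penalize candidates that are not longer than the reference.
    penalty = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return precision * penalty


def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max)."""
    overlap_x = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    overlap_y = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = overlap_x * overlap_y
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def mean_bev_iou(pairs):
    """Average IoU over (predicted, ground-truth) box pairs."""
    return sum(bev_iou(p, g) for p, g in pairs) / len(pairs)


score = bleu1("a car is parked ahead", "a car is parked in front")
miou = mean_bev_iou([((0, 0, 2, 2), (1, 0, 3, 2)), ((0, 0, 1, 1), (0, 0, 1, 1))])
```

Here four of the five candidate tokens match the reference, so the unigram precision is 0.8 before the brevity penalty is applied.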

This review was generated by AI and checked by a human editor.