QUICK REVIEW

[Paper Review] 3D-LLM: Injecting the 3D World into Large Language Models

Yining Hong, Haoyu Zhen|arXiv (Cornell University)|Jul 24, 2023

Multimodal Machine Learning Applications | Cited by 39

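One-line summary

This paper proposes 3D-LLMs, which take 3D point clouds and their features as input and perform a diverse set of 3D-oriented tasks. They are trained using a 3D-language data pipeline, 2D VLM backbones, and a 3D localization mechanism.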

ABSTRACT

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs. Project Page: https://vis-www.cs.umass.edu/3dllm/.

Motivation and objectives

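Ground LLMs in the 3D world by letting them take 3D point clouds and their features as input.
Build and scale up a 3D-language dataset covering diverse tasks (captioning, QA, grounding, dialogue, navigation).
Map 3D features into the same feature space as 2D features so that pretrained 2D VLMs can serve as backbones.
Introduce a 3D localization mechanism that improves spatial reasoning in 3D scenes.
Show performance above state-of-the-art baselines on 3D vision-language benchmarks.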

Proposed method

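Generate large-scale 3D-language data (over 300k samples) with three prompting pipelines that use ChatGPT together with 3D scene information.
Extract 3D features from rendered multi-view images using Direct Reconstruction, Feature Fusion (gradslam), or Neural Field methods, and assemble them into 3D features of shape <N, D_v>.
Rather than training from scratch, use pretrained 2D VLM backbones (e.g., Flamingo, BLIP-2) to process the 3D features.
Augment the 3D features with sinusoidal position embeddings and add location tokens to the LLM vocabulary to encode 3D spatial information.
Train with a language modeling loss, and evaluate on the held-out ScanQA benchmark and on held-in 3D tasks (captioning, grounding, dialogue, task decomposition).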
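
The localization step above (sinusoidal position embeddings plus location tokens) can be illustrated with a minimal sketch. It is not the authors' implementation. The function names, the choice to add the embedding to the feature elementwise, the `<loc_k>` token format, and the 256-bin resolution are all assumptions made for illustration. It also assumes non-degenerate scene bounds and a feature size D_v divisible by 6.

```python
import math


def sinusoidal_1d(value, dim, max_period=10000.0):
    """Sin/cos embedding of one scalar coordinate; `dim` must be even."""
    half = dim // 2
    out = []
    for i in range(half):
        freq = 1.0 / (max_period ** (i / half))
        out.append(math.sin(value * freq))
        out.append(math.cos(value * freq))
    return out


def position_embedding_3d(xyz, d_v):
    """Concatenate per-axis embeddings, each of size d_v // 3."""
    per_axis = d_v // 3
    if d_v % 3 != 0 or per_axis % 2 != 0:
        raise ValueError("d_v must be divisible by 6 in this sketch")
    emb = []
    for coord in xyz:
        emb.extend(sinusoidal_1d(coord, per_axis))
    return emb


def augment_feature(feature, xyz):
    """Assumption: add the position embedding to the point feature elementwise."""
    pos = position_embedding_3d(xyz, len(feature))
    return [f + p for f, p in zip(feature, pos)]


def location_tokens(box_min, box_max, scene_min, scene_max, bins=256):
    """Discretize an axis-aligned box into six hypothetical <loc_k> tokens."""
    tokens = []
    for corner in (box_min, box_max):
        for c, lo, hi in zip(corner, scene_min, scene_max):
            frac = (c - lo) / (hi - lo)  # normalize to [0, 1] within the scene
            idx = min(bins - 1, max(0, int(frac * bins)))
            tokens.append(f"<loc_{idx}>")
    return tokens


# Usage: one 12-dimensional point feature at the origin, and one box in a 2 m cube.
augmented = augment_feature([0.0] * 12, (0.0, 0.0, 0.0))
box = location_tokens((0, 0, 0), (1, 1, 1), (0, 0, 0), (2, 2, 2))
```

In a setup like this, the six tokens would be appended to the LLM vocabulary so that a grounding answer can be emitted as ordinary text.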

Experimental results

Research questions

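RQ1: Does language-model-based reasoning improve on 3D tasks when the input is a 3D representation rather than 2D input alone?
RQ2: How can large-scale 3D-language data be generated and aligned efficiently for training 3D-LLMs?
RQ3: Does the 3D localization mechanism improve 3D spatial understanding and grounding?
RQ4: Do 3D-LLMs outperform 2D VLMs and LLM baselines on 3D-centric benchmarks such as ScanQA?
RQ5: How much do different 3D feature extraction strategies affect final 3D-LLM performance?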

Key findings

Method	BLEU-1	BLEU-2	BLEU-3	BLEU-4	METEOR	ROUGE-L	CIDEr	EM
VoteNet+MCAN*	28.0	16.7	10.8	6.2	11.4	29.8	54.7	17.3
ScanRefer+MCAN*	26.9	16.6	11.6	7.9	11.5	30	55.4	18.6
ScanQA*	30.2	20.4	15.1	10.1	13.1	33.3	64.9	21.0
LLaVA(zero-shot)	7.1	2.6	0.9	0.3	10.5	12.3	5.7	0.0
flamingo-SingleImage	23.8	14.5	9.2	8.5	10.7	29.6	52	16.9
flamingo-MultiView	25.6	15.2	9.2	8.4	11.3	31.1	55	18.0
BLIP2-flant5-SingleImage	28.6	15.1	9.0	5.1	10.6	25.8	42.6	13.3
BLIP2-flant5-MultiView	29.7	16.2	9.8	5.9	11.3	26.6	45.7	13.6
3D-LLM (flamingo)	30.3	17.8	12.0	7.2	12.2	32.3	59.2	20.4
3D-LLM (BLIP2-opt)	35.9	22.5	16.0	9.4	13.8	34.0	63.8	19.3
3D-LLM (BLIP2-flant5)	39.3	25.2	18.4	12.0	14.5	35.7	69.4	20.5

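3D-LLMs achieve state-of-the-art results on ScanQA on every reported metric except EM, with BLEU-1 about 9 points above the previous best (39.3 vs. 30.2 for ScanQA*).
On the held-in datasets (captioning, grounding, dialogue, task decomposition), 3D-LLMs outperform 2D VLMs on multiple metrics.
They reach strong performance using holistic 3D scene features, without relying on explicit object representations.
The 3D-LLM with the BLIP2-flant5 backbone reaches a BLEU-1 of 39.3 and a BLEU-4 of 12.0 on the ScanQA validation set, above all baselines in the table.
The BLIP2-flant5 and BLIP2-opt 3D-LLMs outperform the Flamingo-based variant and the single-image baselines on most metrics; EM is the exception for BLIP2-opt (19.3 vs. 20.4 for the Flamingo variant).
Qualitative results suggest broader task capabilities than existing LLMs and VLMs offer.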
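
The margins quoted above can be checked directly against the table. The sketch below copies four rows from it (the strongest prior baseline and the three 3D-LLM variants) and computes per-metric gaps and the best method per metric. The helper names `margin` and `best_per_metric` are illustrative only.

```python
METRICS = ["BLEU-1", "BLEU-2", "BLEU-3", "BLEU-4", "METEOR", "ROUGE-L", "CIDEr", "EM"]

# Rows copied from the ScanQA results table in this review.
SCORES = {
    "ScanQA*": [30.2, 20.4, 15.1, 10.1, 13.1, 33.3, 64.9, 21.0],
    "3D-LLM (flamingo)": [30.3, 17.8, 12.0, 7.2, 12.2, 32.3, 59.2, 20.4],
    "3D-LLM (BLIP2-opt)": [35.9, 22.5, 16.0, 9.4, 13.8, 34.0, 63.8, 19.3],
    "3D-LLM (BLIP2-flant5)": [39.3, 25.2, 18.4, 12.0, 14.5, 35.7, 69.4, 20.5],
}


def margin(method, baseline, metric):
    """Absolute gap in points between two methods on one metric."""
    i = METRICS.index(metric)
    return round(SCORES[method][i] - SCORES[baseline][i], 1)


def best_per_metric():
    """Name of the top-scoring method for each metric among the rows above."""
    return {
        m: max(SCORES, key=lambda name: SCORES[name][i])
        for i, m in enumerate(METRICS)
    }


bleu1_gap = margin("3D-LLM (BLIP2-flant5)", "ScanQA*", "BLEU-1")
em_gap = margin("3D-LLM (BLIP2-flant5)", "ScanQA*", "EM")
winners = best_per_metric()
```

Here `bleu1_gap` is an absolute difference in BLEU-1 points, which is how the "9%" in the abstract lines up with the table. `em_gap` comes out negative, because ScanQA* keeps the highest EM.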


This review was generated by AI and checked by a human editor.