QUICK REVIEW

[Paper Review] DeepSight: Bridging Depth Maps and Language with a Depth-Driven Multimodal Model

Hao Yang, Hongbo Zhang|arXiv (Cornell University)|Mar 6, 2026

Multimodal Machine Learning Applications | Citations: 0

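One-line summary

DeepSight introduces a depth-modality multimodal LLM that aligns depth-image encodings with text, builds depth-specific datasets, and reports state-of-the-art depth understanding on the Depth Template Benchmark.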

ABSTRACT

Multimodal large language models (MLLMs) have achieved impressive performance across various tasks such as image captioning and visual question answering (VQA); however, they often struggle to accurately interpret depth information inherent in visual data. In this work, we introduce DeepSight, the first dedicated depth MLLM designed to enhance three-dimensional scene understanding. Unlike conventional methods that align RGB image encodings with text, our approach takes advantage of the unique characteristics of depth images, which are single-channel grayscale images whose pixel values directly reflect depth cues, to improve spatial reasoning. To address challenges associated with limited depth data and the inadequacy of simple channel replication, we construct a novel depth image-text pair dataset and a depth instruction dataset. Depth maps are generated from visual images using the GLPN model, and GPT-4 is employed to curate corresponding depth instructions, an approach validated by LLaVA. Additionally, we modify the ViT encoder in CLIP to incorporate local object information, thereby capturing the subtle continuous variations of depth more effectively. To evaluate the performance of our model, we develop a comprehensive depth question-answering benchmark based on existing depth image datasets, which rigorously assesses understanding in typical depth map scenarios. Experimental results demonstrate that DeepSight significantly enhances depth perception and downstream task performance, marking a substantial step forward in multimodal three-dimensional understanding.

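Motivation and objectives

Motivates the need for a depth-aware multimodal model for 3D scene understanding that goes beyond RGB-centric MLLMs.
Builds a depth image-text pair dataset and a depth instruction dataset to enable depth-specific fine-tuning.
Proposes a modified CLIP-based depth encoder that incorporates local depth cues and object coverage information, and aligns it with a large language model for depth-aware reasoning.
Evaluates depth understanding on the new Depth Template Benchmark and shows improvements over baselines.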

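Proposed method

Modifies the CLIP ViT with a BBox convolution layer to capture local depth and object coverage information.
Trains the depth encoder and a linear projection layer, aligning depth features with a Vicuna-based LLM in two stages (alignment, then supervised fine-tuning).
Converts COCO RGB images to depth maps with GLPN, selects depth-related captions, and synthesizes instructions with GPT-3.5/GPT-4 to produce the Depth Instruction Dataset.
Creates the Depth Template Benchmark with four subtasks (Scene Classification, Recognition, Distance Judge, Security), based on real-world depth datasets and bounding-box-derived object depth.
Prepares 118k depth-text-bounding-box pairs and 22k depth instructions for training, and applies a two-stage strategy of alignment followed by SFT for depth-text alignment.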
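The review does not say how "bounding-box-derived object depth" is computed. As a minimal sketch only, one plausible reading is to average the depth-map values inside each object's bounding box and compare the results, as a Distance Judge style question would. The function names, the (x_min, y_min, x_max, y_max) box format, the toy data, and the convention that larger values mean farther away are all assumptions for illustration, not details from the paper.

```python
import numpy as np


def mean_depth_in_bbox(depth_map, bbox):
    """Mean depth inside a box given as (x_min, y_min, x_max, y_max) in pixels.

    Hypothetical helper: the box format is an assumption, with the max
    coordinates treated as exclusive.
    """
    x_min, y_min, x_max, y_max = bbox
    region = depth_map[y_min:y_max, x_min:x_max]
    if region.size == 0:
        raise ValueError("bounding box selects no pixels")
    return float(region.mean())


def closer_object(depth_map, boxes):
    """Return the label with the smallest mean depth, plus all mean depths.

    Assumes larger depth values mean the surface is farther from the camera.
    """
    depths = {label: mean_depth_in_bbox(depth_map, box) for label, box in boxes.items()}
    return min(depths, key=depths.get), depths


# Toy 4x6 depth map (values in arbitrary depth units) and two labelled boxes.
depth_map = np.array([
    [1.0, 1.0, 1.0, 4.0, 4.0, 4.0],
    [1.0, 1.0, 1.0, 4.0, 4.0, 4.0],
    [2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
    [2.0, 2.0, 2.0, 3.0, 3.0, 3.0],
])
boxes = {"chair": (0, 0, 3, 2), "table": (3, 0, 6, 4)}

nearest, depths = closer_object(depth_map, boxes)
print(nearest, depths)
```

A mean is only one choice here; a median would be less sensitive to background pixels that fall inside the box.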

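Experimental results

Research questions

RQ1: Can depth information be effectively integrated into a multimodal LLM by aligning depth encodings with text representations?
RQ2: Does a depth-focused training pipeline with alignment and SFT improve depth perception and 3D reasoning tasks compared with RGB-centric baselines?
RQ3: How accurately can the proposed Depth Template Benchmark quantify depth understanding in MLLMs?
RQ4: How do design choices such as Bbox Convolution, the data sampling strategy, and the depth instruction data affect depth reasoning performance?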

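Key findings

DeepSight's zero-shot scene classification accuracy on NYU-D and SUN-D is 67.0% and 38.4%, respectively, outperforming ImageBind and LanguageBind.
In the zero-shot setting of the Depth Template Benchmark, DeepSight achieves an average of 38.53%, outperforming the zero-shot baselines PandaGPT-7B and ImageBindLLM-7B.
With supervised fine-tuning, DeepSight-7B reaches 64.86% on Scene Classification, 40.56% on Recognition, 63.17% on Distance Judge, and 44.81% on Security, for a reported average of 53.85%.
In the ablation analysis, jointly fine-tuning the MLP and the LLM yields larger gains on Distance Judge than tuning either alone, and keeping the Bbox Convolution layer at both training and inference time improves Distance Judge to 63.17%.
The Depth Instruction Dataset and the data sampling strategy substantially improve performance across the evaluated models, with pronounced gains after fine-tuning on depth-aligned data.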
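For readers unfamiliar with zero-shot scene classification in embedding-alignment models, the sketch below shows one common formulation: embed the depth image and each candidate label's text in a shared space, then pick the label with the highest cosine similarity. Whether DeepSight uses exactly this procedure is not stated in this review. The function name, the three-dimensional toy embeddings, and the label set are hypothetical; real embeddings would come from trained depth and text encoders.

```python
import numpy as np


def zero_shot_classify(depth_embedding, text_embeddings, labels):
    """Pick the label whose text embedding is most similar to the depth embedding.

    Both inputs are L2-normalized first, so the dot product is cosine similarity.
    """
    d = depth_embedding / np.linalg.norm(depth_embedding)
    t = text_embeddings / np.linalg.norm(text_embeddings, axis=1, keepdims=True)
    scores = t @ d  # one cosine similarity per label
    return labels[int(np.argmax(scores))], scores


# Toy embeddings standing in for encoder outputs (assumed values).
labels = ["bedroom", "kitchen", "office"]
text_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
depth_embedding = np.array([0.2, 0.9, 0.1])

predicted, scores = zero_shot_classify(depth_embedding, text_embeddings, labels)
print(predicted, scores)
```

No labelled depth images are needed at inference time in this setup, which is why accuracy on datasets such as NYU-D and SUN-D can be reported as zero-shot.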

This review was generated by AI and checked by a human editor.