QUICK REVIEW

[Paper Review] Weakly Supervised 3D Open-vocabulary Segmentation

Kunhao Liu, Fangneng Zhan|arXiv (Cornell University)|May 23, 2023

Multimodal Machine Learning Applications | Cited by 7

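One-sentence summary

This paper distills the knowledge of the open-vocabulary models CLIP and DINO into a NeRF, achieving 3D open-vocabulary segmentation from multi-view images and text descriptions alone, without any segmentation annotations. In certain scenes it outperforms some fully supervised baselines.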

ABSTRACT

Open-vocabulary segmentation of 3D scenes is a fundamental function of human perception and thus a crucial objective in computer vision research. However, this task is heavily impeded by the lack of large-scale and diverse 3D open-vocabulary segmentation datasets for training robust and generalizable models. Distilling knowledge from pre-trained 2D open-vocabulary segmentation models helps but it compromises the open-vocabulary feature as the 2D models are mostly finetuned with close-vocabulary datasets. We tackle the challenges in 3D open-vocabulary segmentation by exploiting pre-trained foundation models CLIP and DINO in a weakly supervised manner. Specifically, given only the open-vocabulary text descriptions of the objects in a scene, we distill the open-vocabulary multimodal knowledge and object reasoning capability of CLIP and DINO into a neural radiance field (NeRF), which effectively lifts 2D features into view-consistent 3D segmentation. A notable aspect of our approach is that it does not require any manual segmentation annotations for either the foundation models or the distillation process. Extensive experiments show that our method even outperforms fully supervised models trained with segmentation annotations in certain scenes, suggesting that 3D open-vocabulary segmentation can be effectively learned from 2D images and text-image pairs. Code is available at https://github.com/Kunhao-Liu/3D-OVS.

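Motivation and objectives

3D datasets with diverse labels are scarce, which motivates this work on open-vocabulary 3D scene segmentation.
Propose an annotation-free framework that distills CLIP and DINO into a NeRF.
Develop mechanisms (a 3D Selection Volume and multi-scale patches) that adapt image-level CLIP features to pixel-level 3D segmentation.
Mitigate CLIP's ambiguity with Relevancy-Distribution Alignment (RDA), and distill DINO-based boundary information with Feature-Distribution Alignment (FDA).
Demonstrate strong 3D open-vocabulary segmentation performance, including on long-tail classes, without annotations.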

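Proposed method

Builds multi-scale, pixel-level CLIP features from image patches, using a 3D Selection Volume to choose the appropriate scale for each 3D point.
Renders RGB and CLIP features along each ray, and computes segmentation logits as the cosine similarity between the rendered CLIP features and the class text features.
Introduces a Relevancy-Distribution Alignment (RDA) loss that aligns the segmentation probabilities with normalized class relevancy maps.
Introduces a Feature-Distribution Alignment (FDA) loss that reflects the DINO-based scene layout and boundaries, shaping the distribution with rebalancing weights for similar and dissimilar features.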
Trains without segmentation annotations, combining an RGB reconstruction loss, a feature cosine-similarity loss, and the RDA and FDA alignment losses.
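The rendering and classification steps listed above can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' code: the function names (`render_along_ray`, `segmentation_probs`), the array shapes, the softmax temperature, and the random stand-in features are all hypothetical. A real pipeline would take densities and per-sample features from a trained NeRF and class text features from CLIP's text encoder.

```python
import numpy as np

def render_along_ray(sigmas, deltas, values):
    """NeRF-style volume rendering of per-sample vectors along one ray.

    sigmas: (S,) densities, deltas: (S,) sample spacings,
    values: (S, D) per-sample vectors (RGB or CLIP-like features).
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)            # opacity of each sample
    # Transmittance: fraction of light that reaches sample i unoccluded.
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas                           # (S,) rendering weights
    return weights @ values, weights

def segmentation_probs(rendered_feat, text_feats, temperature=0.1):
    """Cosine similarity to each class text feature, then a softmax.

    The temperature is an assumed hyperparameter for this sketch.
    """
    f = rendered_feat / np.linalg.norm(rendered_feat)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = t @ f                                     # (C,) cosine similarities
    z = logits / temperature
    z = z - z.max()                                    # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    return logits, probs

# Random stand-ins for NeRF outputs and CLIP text embeddings (assumed sizes).
rng = np.random.default_rng(0)
S, D, C = 64, 512, 3                                   # samples, feature dim, classes
sigmas = rng.uniform(0.0, 5.0, S)
deltas = np.full(S, 0.05)
feats = rng.normal(size=(S, D))
text_feats = rng.normal(size=(C, D))

rendered, weights = render_along_ray(sigmas, deltas, feats)
logits, probs = segmentation_probs(rendered, text_feats)
label = int(np.argmax(probs))                          # predicted class for this ray
```

Rendering features with the same weights used for color is what ties the per-pixel class prediction to the scene geometry, so the same 3D point contributes consistently to every view it appears in.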

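Experimental results

Research questions

RQ1: Can 3D open-vocabulary segmentation be learned from 2D image-text data without manual segmentation annotations?
RQ2: How can image-level CLIP features be lifted to pixel-level precision for 3D NeRF segmentation without fine-tuning?
RQ3: Which losses and mechanisms best align CLIP and DINO features for robust 3D segmentation?
RQ4: How does the proposed method perform on long-tail object classes in 3D scenes?
RQ5: How do a limited number of input views and scales affect segmentation quality?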

Key findings

Method	bed mIoU	bed Accuracy	sofa mIoU	sofa Accuracy	lawn mIoU	lawn Accuracy	room mIoU	room Accuracy	bench mIoU	bench Accuracy	table mIoU	table Accuracy
LSeg (2D)	56.0	87.6	4.5	16.5	17.5	77.5	19.2	46.1	6.0	42.7	7.6	29.9
ODISE	52.6	86.5	48.3	35.4	39.8	82.5	52.5	59.7	24.1	39.0	39.7	34.5
OV-Seg	79.8	40.4	66.1	69.6	81.2	92.1	71.4	49.1	88.9	89.2	80.6	65.3
FFD	56.6	86.9	3.7	9.5	42.9	82.6	25.1	51.4	6.1	42.8	7.9	30.1
Sem(ODISE)	50.3	86.5	27.7	22.2	24.2	80.5	29.5	61.5	25.6	56.4	18.4	30.8
Sem(OV-Seg)	89.3	96.7	66.3	89.0	87.6	95.4	53.8	81.9	94.2	98.5	83.8	94.6
LERF	73.5	86.9	27.0	43.8	73.7	93.5	46.6	79.8	53.2	79.7	33.4	41.0
Ours	89.5	96.7	74.0	91.6	88.2	97.3	92.8	98.9	89.3	96.3	88.8
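The review does not define the two metrics in the table. Assuming the standard definitions (mIoU as the per-class intersection-over-union averaged over classes, accuracy as the fraction of correctly labeled pixels), a small NumPy sketch looks like this. The function name `miou_and_accuracy` and the toy label maps are hypothetical, and the paper's exact evaluation protocol may differ.

```python
import numpy as np

def miou_and_accuracy(pred, gt, num_classes):
    """Mean IoU over classes and overall pixel accuracy for integer label maps."""
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    ious = []
    for c in range(num_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                      # skip classes absent from both maps
            ious.append(inter / union)
    miou = float(np.mean(ious))
    acc = float(np.mean(pred == gt))
    return miou, acc

# Toy 2x3 label maps with three classes (made-up data for illustration).
gt = np.array([[0, 0, 1],
               [1, 2, 2]])
pred = np.array([[0, 1, 1],
                 [1, 2, 0]])
miou, acc = miou_and_accuracy(pred, gt, num_classes=3)
```

On this toy input the per-class IoUs are 1/3, 2/3, and 1/2, so `miou` is 0.5, and four of six pixels match, so `acc` is about 0.667. The table's values appear to be on a 0 to 100 scale.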

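The proposed method outperforms several 2D and 3D open-vocabulary baselines across multiple scenes, without using segmentation annotations.
Lifting CLIP-derived features to 3D with the Selection Volume and multi-scale patches yields view-consistent segmentation.
The RDA and FDA losses are important for mitigating CLIP's ambiguity and for distilling DINO boundaries.
With limited input views or scales, the method remains competitive, indicating robustness.
In some scenes, it outperforms fully supervised models trained with segmentation masks.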
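To make the RDA idea more concrete, here is a minimal NumPy sketch of aligning predicted segmentation probabilities with normalized class relevancy maps. The review states only that the two are aligned. The per-class min-max normalization, the softmax temperature, and the choice of a Jensen-Shannon divergence are assumptions made for illustration, and all function names are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def normalized_relevancy(sim, temperature=0.2, eps=1e-8):
    """Turn raw similarities into a per-pixel class distribution.

    sim: (N, C) cosine similarities between pixel features and class text features.
    Min-max normalization per class (an assumption) puts classes on a common scale.
    """
    lo = sim.min(axis=0, keepdims=True)
    hi = sim.max(axis=0, keepdims=True)
    rel = (sim - lo) / (hi - lo + eps)                 # each class mapped to [0, 1]
    return softmax(rel / temperature, axis=1)          # (N, C), rows sum to 1

def js_divergence(p, q, eps=1e-8):
    """Per-row Jensen-Shannon divergence between two (N, C) distributions."""
    m = 0.5 * (p + q)
    kl_pm = np.sum(p * (np.log(p + eps) - np.log(m + eps)), axis=1)
    kl_qm = np.sum(q * (np.log(q + eps) - np.log(m + eps)), axis=1)
    return 0.5 * (kl_pm + kl_qm)

def rda_loss(seg_probs, sim):
    """Mean divergence between predicted probabilities and the relevancy target."""
    return float(np.mean(js_divergence(seg_probs, normalized_relevancy(sim))))

# Made-up similarities in a narrow range, as raw CLIP scores often are.
rng = np.random.default_rng(0)
sim = rng.uniform(0.1, 0.3, size=(100, 4))
target = normalized_relevancy(sim)
loss_match = rda_loss(target, sim)                         # prediction equals target
loss_uniform = rda_loss(np.full((100, 4), 0.25), sim)      # uninformative prediction
```

In this sketch a prediction that matches the normalized relevancy distribution gives a loss of zero, while a uniform prediction is penalized. Normalizing per class before comparing is one way to keep a class with systematically higher raw similarity from dominating every pixel.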

This review was generated by AI and checked by a human editor.