QUICK REVIEW

[Paper Review] OpenShape: Scaling Up 3D Shape Representation Towards Open-World Understanding

Minghua Liu, Ruoxi Shi|arXiv (Cornell University)|May 18, 2023

Human Pose and Action Recognition | Cited by 39

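One-line summary

OpenShape learns scalable multi-modal embeddings for 3D shapes that are aligned with language and 2D images, enabling strong zero-shot open-world 3D recognition and cross-modal tasks.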

ABSTRACT

We introduce OpenShape, a method for learning multi-modal joint representations of text, image, and point clouds. We adopt the commonly used multi-modal contrastive learning framework for representation alignment, but with a specific focus on scaling up 3D representations to enable open-world 3D shape understanding. To achieve this, we scale up training data by ensembling multiple 3D datasets and propose several strategies to automatically filter and enrich noisy text descriptions. We also explore and compare strategies for scaling 3D backbone networks and introduce a novel hard negative mining module for more efficient training. We evaluate OpenShape on zero-shot 3D classification benchmarks and demonstrate its superior capabilities for open-world recognition. Specifically, OpenShape achieves a zero-shot accuracy of 46.8% on the 1,156-category Objaverse-LVIS benchmark, compared to less than 10% for existing methods. OpenShape also achieves an accuracy of 85.3% on ModelNet40, outperforming previous zero-shot baseline methods by 20% and performing on par with some fully-supervised methods. Furthermore, we show that our learned embeddings encode a wide range of visual and semantic concepts (e.g., subcategories, color, shape, style) and facilitate fine-grained text-3D and image-3D interactions. Due to their alignment with CLIP embeddings, our learned shape representations can also be integrated with off-the-shelf CLIP-based models for various applications, such as point cloud captioning and point cloud-conditioned image generation.

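Motivation and objectives

Motivate open-world 3D shape understanding that goes beyond small, curated datasets.
Scale up 3D training data and enrich noisy text descriptions to improve semantic alignment with the CLIP space.
Investigate scaling of 3D backbones and introduce hard negative mining to address data imbalance.
Demonstrate strong zero-shot 3D classification and cross-modal capabilities using CLIP-aligned embeddings.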

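Proposed method

Ensemble four large public 3D datasets into 876k training shapes.
Use multi-modal contrastive learning to align 3D shape embeddings with CLIP's text and image embedding spaces while keeping the CLIP encoders frozen.
Filter and enrich text descriptions using GPT-4 filtering, BLIP/Azure captioning, and captions of retrieved images to strengthen 3D-text alignment.
Scale up 3D backbones (e.g., PointBERT, SparseConv) and compare their performance across training data sizes.
Introduce offline hard negative mining to construct challenging batches and improve discriminative learning.
Leverage the shared embedding space to enable cross-modal tasks and integration with CLIP-based models.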
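The contrastive alignment step can be illustrated with a small sketch. This review does not state the loss formula, so the code below assumes a standard symmetric InfoNCE objective averaged over the shape-text and shape-image pairs. The function names (`info_nce`, `alignment_loss`) and the temperature value are hypothetical choices for illustration, and the text and image embeddings stand in for the outputs of the frozen CLIP encoders.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (assumes a non-zero vector)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys.

    The temperature of 0.07 is an illustrative assumption, not a value
    taken from the review.
    """
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / len(q)

def alignment_loss(shape_emb, text_emb, image_emb, temperature=0.07):
    """Average the four directions: shape<->text and shape<->image.

    Only shape_emb would come from the trainable 3D encoder; text_emb and
    image_emb are treated as fixed targets (frozen CLIP in the review).
    """
    pairs = [
        (shape_emb, text_emb), (text_emb, shape_emb),
        (shape_emb, image_emb), (image_emb, shape_emb),
    ]
    return sum(info_nce(a, b, temperature) for a, b in pairs) / 4.0
```

In this sketch, a batch whose shape embeddings match their paired text and image embeddings yields a loss near zero, while a batch with mismatched pairs yields a larger loss. That difference is the signal used to pull the 3D encoder toward the CLIP space.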

Figure 1: Left: Zero-shot shape classification on the Objaverse-LVIS (1,156 categories) and ModelNet40 datasets. OpenShape outperforms previous methods by a large margin. We exclude shapes in Objaverse-LVIS during training, and we also retrain ULIP [75] on our ensembled training shapes for fair

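Experimental results

Research questions

RQ1: Can large-scale multi-modal pre-training of 3D shape representations achieve open-world understanding and strong zero-shot performance?
RQ2: How do data scale, text quality, backbone scaling, and hard negative mining affect zero-shot 3D recognition?
RQ3: How does aligning 3D embeddings with CLIP's text and image spaces affect cross-modal tasks?
RQ4: How do OpenShape embeddings enable cross-modal retrieval and generation tasks?
RQ5: How close can zero-shot 3D recognition come to fully supervised baselines on standard benchmarks, and can it surpass them?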

Key findings

Method	Training data	Objaverse-LVIS Top1	Objaverse-LVIS Top3	Objaverse-LVIS Top5	ModelNet40 Top1	ModelNet40 Top3	ModelNet40 Top5	ScanObjectNN Top1	ScanObjectNN Top3	ScanObjectNN Top5
OpenShape-SparseConv	Ensembled	43.4	64.8	72.4	83.4	95.6	97.8	56.7	78.9	88.6
OpenShape-PointBERT	Ensembled	46.8	69.1	77.0	84.4	96.5	98.0	52.2	79.7	88.7
ULIP-PointBERT (Retrained)	Ensembled	26.8	44.8	52.6	75.1	88.1	93.2	51.6	72.5	82.3
OpenShape-SparseConv	Ensembled (no LVIS)	37.0	58.4	66.9	82.6	95.0	97.5	54.9	76.8	87.0
OpenShape-PointBERT	Ensembled (no LVIS)	39.1	60.8	68.9	85.3	96.2	97.4	47.2	72.4	84.7
ULIP-PointBERT (Retrained)	Ensembled (no LVIS)	21.4	38.1	46.0	71.4	84.4	89.2	46.0	66.1	76.4
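The Top1/Top3/Top5 columns can be read as follows: each shape embedding is compared with one text embedding per category, and a prediction counts as correct if the true category is among the k most similar. The sketch below illustrates that metric under stated assumptions. The helper names (`rank_categories`, `top_k_accuracy`) and the toy three-dimensional embeddings are hypothetical, and in practice the category embeddings would come from the CLIP text encoder rather than being hand-written.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def rank_categories(shape_emb, category_embs):
    """Return category names sorted from most to least similar."""
    return sorted(
        category_embs,
        key=lambda name: cosine(shape_emb, category_embs[name]),
        reverse=True,
    )

def top_k_accuracy(shape_embs, labels, category_embs, k):
    """Percentage of shapes whose true label is within the top-k categories."""
    hits = 0
    for emb, label in zip(shape_embs, labels):
        if label in rank_categories(emb, category_embs)[:k]:
            hits += 1
    return 100.0 * hits / len(labels)

# Toy stand-ins for category text embeddings (hypothetical values).
categories = {
    "chair": [1.0, 0.0, 0.0],
    "table": [0.0, 1.0, 0.0],
    "lamp": [0.0, 0.0, 1.0],
}
shapes = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.5, 0.4, 0.1]]
labels = ["chair", "table", "table"]

top1 = top_k_accuracy(shapes, labels, categories, k=1)
top2 = top_k_accuracy(shapes, labels, categories, k=2)
```

In the toy data the third shape is closest to "chair" although its label is "table", so it is a miss at k=1 and a hit at k=2. The same mechanism explains why the Top3 and Top5 columns are consistently higher than Top1.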

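OpenShape achieves 46.8% zero-shot top-1 accuracy on Objaverse-LVIS (1,156 categories) with the PointBERT backbone trained on the full ensemble, compared with 26.8% for ULIP-PointBERT retrained on the same data and less than 10% for existing methods.
On ModelNet40, OpenShape reaches 85.3% zero-shot top-1 accuracy (PointBERT trained without Objaverse-LVIS shapes), outperforming previous zero-shot baselines by 20%.
OpenShape shows strong few-shot and cross-modal retrieval capabilities, including shape retrieval from images or text, and enables shape-conditioned generation.
Aligning 3D embeddings with the CLIP space allows integration with CLIP-based models for tasks such as point cloud captioning and point cloud-conditioned image generation.
Backbone scaling and text enrichment substantially improve performance, and hard negative mining is effective against data imbalance.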
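Offline hard negative mining can be pictured as a preprocessing pass that decides which shapes should share a batch. The review gives no algorithmic detail, so the sketch below is one plausible reading rather than the authors' procedure. It assumes a hard negative is a shape that is close in the current 3D embedding space but has a dissimilar text embedding. The function names, the neighbor count, and the text-similarity threshold are all hypothetical.

```python
import math
import random

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def mine_hard_negatives(shape_embs, text_embs, num_neighbors=2, max_text_sim=0.5):
    """For each shape, keep its nearest shape-space neighbors whose texts differ.

    num_neighbors and max_text_sim are illustrative assumptions.
    """
    hard = {}
    n = len(shape_embs)
    for i in range(n):
        others = [j for j in range(n) if j != i]
        others.sort(key=lambda j: cosine(shape_embs[i], shape_embs[j]), reverse=True)
        neighbors = others[:num_neighbors]
        # Neighbors with near-identical text are likely the same concept,
        # so they are not useful as negatives and are dropped.
        hard[i] = [j for j in neighbors if cosine(text_embs[i], text_embs[j]) < max_text_sim]
    return hard

def build_batch(hard, num_seeds, per_seed, rng):
    """Sample seed shapes and add up to per_seed hard negatives for each."""
    seeds = rng.sample(sorted(hard), num_seeds)
    batch = []
    for s in seeds:
        batch.append(s)
        picks = hard[s][:]
        rng.shuffle(picks)
        batch.extend(picks[:per_seed])
    return list(dict.fromkeys(batch))  # drop duplicates, keep first-seen order

# Toy data: shapes 0 and 1 look alike but are described differently.
shape_embs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
text_embs = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
hard = mine_hard_negatives(shape_embs, text_embs, num_neighbors=1)
batch = build_batch(hard, num_seeds=1, per_seed=1, rng=random.Random(0))
```

Because the mining runs offline, the neighbor search is done once between training phases instead of inside every step. This matches the review's description of the technique as a way to compose challenging batches.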

Figure 2: (a) We ensemble four public 3D shape datasets, resulting in 876k shapes that encompass diverse categories and concepts. (b) We propose three strategies to automatically filter and enrich the noisy texts in the original datasets. (c) We train a 3D point cloud encoder to align the 3D shape


This review was generated by AI and checked by a human editor.