QUICK REVIEW

[Paper Review] SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design

Ruogu Li, Sikai Li|arXiv (Cornell University)|Mar 13, 2026

3D Shape Modeling and Analysis | Citations: 0

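One-Line Summary

SldprtNet introduces a large-scale multimodal CAD dataset (242k parts) with aligned 3D models, multi-view images, parametric scripts, and natural language descriptions, along with encoder/decoder tools that enable lossless text-CAD conversion and language-driven CAD generation.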

ABSTRACT

We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part's appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. It features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.

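Motivation and Objectives

Build a large-scale CAD dataset that supports semantic-driven modeling and multimodal learning.
Provide a parametric representation and tools that convert between CAD models and text.
Align shapes, views, scripts, and descriptions so that bidirectional modeling and evaluation are possible.
Demonstrate the effectiveness of multimodal supervision for Text-to-CAD tasks.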

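Proposed Method

Collect 242k industrial CAD parts from public repositories in .sldprt and .step formats.
Render each model from seven viewpoints and merge the renders into a single image to reduce input tokens.
Develop an encoder that converts .sldprt files into structured parametric text (Encoder_txt).
Develop a decoder that reconstructs .sldprt files from Encoder_txt, giving lossless bidirectional conversion.
Use a multimodal LLM (Qwen2.5-VL-7B) to generate Des_txt descriptions from the composite image and Encoder_txt.
Manually verify the alignment of Des_txt, images, and 3D models to ensure dataset accuracy.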
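
The encoder/decoder round trip described above can be illustrated with a minimal sketch. This is not the paper's Encoder_txt format, which this review does not reproduce: the `Command` schema, the one-line-per-command text layout, and the command names (`Sketch`, `Extrude`, `Fillet`) are all hypothetical, chosen only to show what "lossless" means in practice, namely that decoding the encoded text gives back exactly the original command sequence.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """One parametric CAD command (hypothetical schema, not the paper's)."""
    name: str
    params: tuple  # sorted (key, value) pairs, so equality ignores key order


def make(name, **params):
    """Convenience constructor that normalizes parameter order."""
    return Command(name, tuple(sorted(params.items())))


def encode(commands):
    """Serialize commands to text, one 'Name {json params}' line each."""
    lines = []
    for cmd in commands:
        payload = json.dumps(dict(cmd.params), sort_keys=True)
        lines.append(f"{cmd.name} {payload}")
    return "\n".join(lines)


def decode(text):
    """Parse the text produced by encode() back into Command objects."""
    commands = []
    for line in text.splitlines():
        name, payload = line.split(" ", 1)
        params = tuple(sorted(json.loads(payload).items()))
        commands.append(Command(name, params))
    return commands


# Illustrative part: command names and parameters are made up.
part = [
    make("Sketch", plane="Front", shape="circle", radius_mm=12.5),
    make("Extrude", depth_mm=40.0),
    make("Fillet", radius_mm=1.5),
]
text = encode(part)
roundtrip_ok = decode(text) == part
print(text)
print("round trip preserved:", roundtrip_ok)
```

A round-trip equality check of this kind is one simple way to test a conversion pipeline for losslessness; the real tools operate on .sldprt files rather than on an in-memory list.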

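Experimental Results

Research Questions

RQ1: Can a large-scale multimodal CAD dataset improve language-guided CAD generation and cross-modal understanding?
RQ2: Does adding an image modality to text-only CAD modeling improve agreement with ground-truth commands and geometry?
RQ3: How effective is the encoder/decoder pipeline for bidirectional parametric CAD conversion?
RQ4: What do the distribution of CAD feature types and the model complexity look like in a real-world industrial CAD corpus?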

Key Findings

Metric	Qwen2.5-7B (text-only)	Qwen2.5-VL-7B (image + text)
Exact Match Score	0.0058	0.0099
BLEU Score	97.1827	97.9309
Command-Level F1	0.3247	0.3670
Tolerance Accuracy	0.5016	0.4630
Partial Match Rate	0.5554	0.6162
Test Samples	3644	3644
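
To make the size of each gap easier to read, the relative change from the text-only model to the vision-language model can be computed directly from the reported values. The sketch below is plain arithmetic on the table's numbers; the helper name `relative_change` is introduced here for illustration.

```python
# Values copied from the results table: (text-only, vision-language).
results = {
    "Exact Match Score": (0.0058, 0.0099),
    "BLEU Score": (97.1827, 97.9309),
    "Command-Level F1": (0.3247, 0.3670),
    "Tolerance Accuracy": (0.5016, 0.4630),
    "Partial Match Rate": (0.5554, 0.6162),
}


def relative_change(baseline, candidate):
    """Fractional change of candidate with respect to baseline."""
    return (candidate - baseline) / baseline


changes = {
    metric: relative_change(text_only, vision_language)
    for metric, (text_only, vision_language) in results.items()
}

for metric, change in changes.items():
    print(f"{metric:<20} {change:+.1%}")
```

Tolerance Accuracy is the only metric whose change is negative. The Exact Match gain is large in relative terms, but both absolute scores stay below 1%.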

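Multimodal training (image + Encoder_txt) improves Exact Match, Command-Level F1, and Partial Match Rate over the text-only model; Tolerance Accuracy is the exception, dropping from 0.5016 (text-only) to 0.4630 (VL).
Exact Match Score: 0.0099 (VL) vs. 0.0058 (text-only).
Command-Level F1: 0.3670 (VL) vs. 0.3247 (text-only).
Partial Match Rate: 0.6162 (VL) vs. 0.5554 (text-only).
The composite image, which combines six views with one isometric view, shortens the input without losing geometric information.
The dataset contains 242,606 samples, covering 13 core CAD features and a four-tier complexity distribution (Simple to Expert).
The baseline results support the value of multimodal supervision for Text-to-CAD tasks.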
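
This review does not reproduce how the paper defines its metrics, so the following is only a sketch of one common way to compute a command-level F1 and an exact-match check. It assumes each model output has already been reduced to a list of command names and treats the two lists as multisets; the function names and the example command names are hypothetical.

```python
from collections import Counter


def command_f1(predicted, reference):
    """F1 over command names, comparing the two sequences as multisets.

    Assumed definition for illustration; the paper's may differ.
    """
    if not predicted and not reference:
        return 1.0
    # Counter intersection keeps the minimum count of each shared name.
    overlap = sum((Counter(predicted) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(predicted)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def exact_match(predicted, reference):
    """True only when both sequences are identical, order included."""
    return predicted == reference


# Illustrative sequences with made-up command names.
reference = ["Sketch", "Extrude", "Fillet", "Hole"]
predicted = ["Sketch", "Extrude", "Chamfer"]

score = command_f1(predicted, reference)
print(f"command-level F1: {score:.3f}")
print("exact match:", exact_match(predicted, reference))
```

Under a definition like this one, a prediction can earn substantial F1 credit while still failing exact match, which is consistent with the pattern in the reported numbers: Command-Level F1 above 0.3 alongside Exact Match below 0.01.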


This review was generated by AI and checked by a human editor.