QUICK REVIEW

[Paper Review] SldprtNet: A Large-Scale Multimodal Dataset for CAD Generation in Language-Driven 3D Design

Ruogu Li, Sikai Li|arXiv (Cornell University)|Mar 13, 2026

3D Shape Modeling and Analysis | Cited by 0

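ONE-SENTENCE SUMMARY

SldprtNet is a large-scale multimodal CAD dataset (242k parts) that aligns 3D models, multi-view images, parametric scripts, and natural language descriptions, and ships with encoder/decoder tools that enable lossless conversion between CAD models and text as well as language-driven CAD generation.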

ABSTRACT

We introduce SldprtNet, a large-scale dataset comprising over 242,000 industrial parts, designed for semantic-driven CAD modeling, geometric deep learning, and the training and fine-tuning of multimodal models for 3D design. The dataset provides 3D models in both .step and .sldprt formats to support diverse training and testing. To enable parametric modeling and facilitate dataset scalability, we developed supporting tools, an encoder and a decoder, which support 13 types of CAD commands and enable lossless transformation between 3D models and a structured text representation. Additionally, each sample is paired with a composite image created by merging seven rendered views from different viewpoints of the 3D model, effectively reducing input token length and accelerating inference. By combining this image with the parameterized text output from the encoder, we employ the lightweight multimodal language model Qwen2.5-VL-7B to generate a natural language description of each part's appearance and functionality. To ensure accuracy, we manually verified and aligned the generated descriptions, rendered images, and 3D models. These descriptions, along with the parameterized modeling scripts, rendered images, and 3D model files, are fully aligned to construct SldprtNet. To assess its effectiveness, we fine-tuned baseline models on a dataset subset, comparing image-plus-text inputs with text-only inputs. Results confirm the necessity and value of multimodal datasets for CAD generation. It features carefully selected real-world industrial parts, supporting tools for scalable dataset expansion, diverse modalities, and ensured diversity in model complexity and geometric features, making it a comprehensive multimodal dataset built for semantic-driven CAD modeling and cross-modal learning.

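MOTIVATION AND GOALS

Create a large-scale CAD dataset that supports semantic-driven modeling and multimodal learning.
Provide a parametric representation and tools for converting between CAD models and text.
Align geometry, views, scripts, and descriptions to enable bidirectional modeling and evaluation.
Demonstrate that multimodal supervision is effective for Text-to-CAD tasks.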

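PROPOSED METHOD

Collect 242k industrial CAD parts from public repositories, in .sldprt and .step formats.
Render seven views of each model and merge them into a single composite image to reduce the number of input tokens.
Develop an encoder that converts .sldprt files into structured parametric text (Encoder_txt).
Develop a decoder that reconstructs .sldprt files from Encoder_txt, giving a lossless bidirectional transformation.
Use a multimodal large language model (Qwen2.5-VL-7B) to generate a description (Des_txt) from the composite image and Encoder_txt.
Manually verify the alignment between Des_txt, the images, and the 3D models to ensure dataset accuracy.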
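The lossless round trip between a command sequence and its text form can be illustrated with a minimal sketch. The actual Encoder_txt format is not shown in this review, so the line-based `name key=value ...` layout, the `Command` class, and the command names below are all hypothetical. Parameter values are kept as strings so that no precision is lost to number formatting, and the sketch assumes that names, keys, and values contain no whitespace.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Command:
    """One CAD command; a hypothetical stand-in for an Encoder_txt entry."""
    name: str
    params: Tuple[Tuple[str, str], ...]  # ordered (key, value) pairs, values as strings


def encode(commands: List[Command]) -> str:
    """Serialize commands to text, one command per line."""
    lines = []
    for cmd in commands:
        fields = " ".join(f"{key}={value}" for key, value in cmd.params)
        lines.append(f"{cmd.name} {fields}".rstrip())
    return "\n".join(lines)


def decode(text: str) -> List[Command]:
    """Parse the text produced by encode() back into commands."""
    commands = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        name, *fields = line.split()
        params = tuple(tuple(field.split("=", 1)) for field in fields)
        commands.append(Command(name, params))
    return commands


# Hypothetical part: a circle sketched on the XY plane, then extruded.
part = [
    Command("Sketch", (("plane", "XY"),)),
    Command("Circle", (("cx", "0.0"), ("cy", "0.0"), ("r", "12.5"))),
    Command("Extrude", (("depth", "30.0"),)),
]
text = encode(part)
assert decode(text) == part          # model -> text -> model is unchanged
assert encode(decode(text)) == text  # text -> model -> text is unchanged
```

Checking the round trip in both directions is what "lossless" means here: nothing is dropped when going from commands to text, and nothing is invented when going back.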

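EXPERIMENTAL RESULTS

Research Questions

RQ1: Does a large-scale multimodal CAD dataset improve language-guided CAD generation and cross-modal understanding?
RQ2: Does adding the image modality to text-based CAD modeling improve alignment with ground-truth commands and geometry?
RQ3: How effective is the encoder/decoder pipeline for bidirectional parametric CAD conversion?
RQ4: How are CAD feature types and model complexity distributed in a real-world industrial CAD corpus?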

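Key Findings

Multimodal training (image + Encoder_txt, the vision-language or VL setting) outperforms the text-only model on the Exact Match, Command-Level F1, and Partial Match metrics.
Exact Match Score: 0.0099 (VL) vs. 0.0058 (text-only).
Command-Level F1: 0.3670 (VL) vs. 0.3247 (text-only).
Partial Match Rate: 0.6162 (VL) vs. 0.5554 (text-only).
Merging six views plus one isometric view into a single composite image reduces input length without sacrificing geometric information.
The dataset contains 242,606 samples covering 13 core CAD feature types, spread across four complexity tiers (from simple to expert).
The baseline results confirm the value of multimodal supervision for Text-to-CAD tasks.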
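To make the reported metrics more concrete, here is a rough sketch of how an exact-match score and a command-level F1 could be computed for one predicted script against its reference. This review does not give the paper's metric definitions, so both definitions below are assumptions: exact match compares scripts after whitespace normalization, and command-level F1 counts overlapping command names (the first token of each line) as a multiset. Partial Match is left out because no definition is available here. The scripts and command names are hypothetical.

```python
from collections import Counter
from typing import List


def command_names(script: str) -> List[str]:
    """Return the first token of each non-empty line as its command name."""
    return [line.split()[0] for line in script.splitlines() if line.strip()]


def exact_match(pred: str, ref: str) -> float:
    """1.0 if both scripts are identical after whitespace normalization, else 0.0."""
    def normalize(script: str) -> List[List[str]]:
        return [line.split() for line in script.splitlines() if line.strip()]
    return float(normalize(pred) == normalize(ref))


def command_f1(pred: str, ref: str) -> float:
    """F1 over command names, treating each script as a multiset of names (assumed definition)."""
    pred_counts = Counter(command_names(pred))
    ref_counts = Counter(command_names(ref))
    overlap = sum((pred_counts & ref_counts).values())  # multiset intersection
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and prediction: two of three command names agree.
ref = "Sketch plane=XY\nCircle r=12.5\nExtrude depth=30.0"
pred = "Sketch plane=XY\nCircle r=10.0\nFillet radius=2.0"

em = exact_match(pred, ref)
f1 = command_f1(pred, ref)
```

In this example the prediction gets no exact-match credit but a command-level F1 of about 0.67. That gap mirrors the pattern in the reported numbers, where exact match stays near 0.01 while command-level F1 is above 0.3.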


This review was generated by AI and reviewed by human editors.