QUICK REVIEW

[Paper Review] Group3D: MLLM-Driven Semantic Grouping for Open-Vocabulary 3D Object Detection

Youbin Kim, Jinho Park | arXiv (Cornell University) | Mar 23, 2026

Advanced Neural Network Applications | Cited by 0

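One-Sentence Summary

Group3D integrates semantic constraints from a multimodal large language model into a multi-view open-vocabulary 3D detection pipeline, gating fragment merging with semantic compatibility groups to improve open-set detection without 3D supervision.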

ABSTRACT

Open-vocabulary 3D object detection aims to localize and recognize objects beyond a fixed training taxonomy. In multi-view RGB settings, recent approaches often decouple geometry-based instance construction from semantic labeling, generating class-agnostic fragments and assigning open-vocabulary categories post hoc. While flexible, such decoupling leaves instance construction governed primarily by geometric consistency, without semantic constraints during merging. When geometric evidence is view-dependent and incomplete, this geometry-only merging can lead to irreversible association errors, including over-merging of distinct objects or fragmentation of a single instance. We propose Group3D, a multi-view open-vocabulary 3D detection framework that integrates semantic constraints directly into the instance construction process. Group3D maintains a scene-adaptive vocabulary derived from a multimodal large language model (MLLM) and organizes it into semantic compatibility groups that encode plausible cross-view category equivalence. These groups act as merge-time constraints: 3D fragments are associated only when they satisfy both semantic compatibility and geometric consistency. This semantically gated merging mitigates geometry-driven over-merging while absorbing multi-view category variability. Group3D supports both pose-known and pose-free settings, relying only on RGB observations. Experiments on ScanNet and ARKitScenes demonstrate that Group3D achieves state-of-the-art performance in multi-view open-vocabulary 3D detection, while exhibiting strong generalization in zero-shot scenarios. The project page is available at https://ubin108.github.io/Group3D/.

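Motivation and Objectives

Tackle open-vocabulary 3D object detection in indoor scenes without 3D supervision.
Mitigate geometry-driven over-merging by injecting semantic constraints at the instance construction stage.
Use an MLLM to build a scene-adaptive vocabulary and semantic compatibility groups.
Support both pose-known and pose-free operation using only RGB observations.
Demonstrate state-of-the-art performance on ScanNet and ARKitScenes, with strong zero-shot generalization.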

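Proposed Method

Build a scene vocabulary memory by querying the MLLM on each view to assemble a scene-adaptive set of categories.
Lift category-aware 2D masks (obtained via SAM) into 3D using multi-view geometry to build a 3D fragment memory.
Use the MLLM to partition the scene vocabulary into semantic compatibility groups that capture plausible cross-view category equivalences.
Merge 3D fragments only when both semantic compatibility (same group) and voxel-level geometric overlap (IoU or containment-based overlap) hold.
Aggregate multi-view evidence to assign the final open-vocabulary label and compute the 3D bounding box.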
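
The merge rule in the last two steps can be illustrated with a minimal sketch. This is not the authors' implementation. The fragment layout (a dict with `label` and `voxels`), the `group_of` mapping, the thresholds `iou_thr` and `contain_thr`, the union-find clustering, and the majority-vote labeling are all assumptions made for illustration. The summary only states that fragments merge when they share a semantic group and overlap at the voxel level.

```python
from collections import Counter

def voxel_overlap(a, b):
    """Return (IoU, containment) for two sets of integer voxel coordinates."""
    inter = len(a & b)
    if inter == 0:
        return 0.0, 0.0
    iou = inter / len(a | b)
    containment = inter / min(len(a), len(b))  # share of the smaller fragment
    return iou, containment

def can_merge(frag_a, frag_b, group_of, iou_thr=0.3, contain_thr=0.8):
    """Semantic gate first, then the geometric check (thresholds are assumed)."""
    group_a = group_of.get(frag_a["label"])
    group_b = group_of.get(frag_b["label"])
    if group_a is None or group_a != group_b:
        return False  # labels are not in the same compatibility group
    iou, containment = voxel_overlap(frag_a["voxels"], frag_b["voxels"])
    return iou >= iou_thr or containment >= contain_thr

def merge_fragments(fragments, group_of, iou_thr=0.3, contain_thr=0.8):
    """Cluster fragments with union-find and summarize each cluster."""
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            if can_merge(fragments[i], fragments[j], group_of, iou_thr, contain_thr):
                parent[find(j)] = find(i)

    clusters = {}
    for i, frag in enumerate(fragments):
        clusters.setdefault(find(i), []).append(frag)

    instances = []
    for members in clusters.values():
        voxels = set().union(*(m["voxels"] for m in members))
        # Majority vote over member labels stands in for evidence aggregation.
        label = Counter(m["label"] for m in members).most_common(1)[0][0]
        xs, ys, zs = zip(*voxels)
        bbox = (min(xs), min(ys), min(zs), max(xs), max(ys), max(zs))
        instances.append({"label": label, "voxels": voxels, "bbox": bbox})
    return instances
```

In this sketch, a "pillow" fragment fully contained in a "sofa" fragment passes the geometric check but stays separate because the two labels sit in different groups, while "sofa" and "couch" fragments from different views merge once they overlap enough.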

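Experimental Results

Research Questions

RQ1: How do semantic priors from an MLLM improve cross-view fragment association in open-vocabulary 3D detection?
RQ2: Does semantically gated merging reduce geometry-driven over-merging when geometric evidence is view-dependent or incomplete?
RQ3: Can a multi-view RGB pipeline achieve competitive open-vocabulary 3D detection without 3D supervision or ground-truth depth?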

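Key Findings

Group3D achieves state-of-the-art performance among multi-view open-vocabulary 3D detectors on ScanNet and ARKitScenes.
Semantic compatibility grouping improves robustness to cross-view label variation and reduces over-merging relative to geometry-only merging.
The method works in both pose-known and pose-free settings using only RGB observations, and generalizes zero-shot.
Ablations indicate that semantic grouping is the key component: removing it degrades performance, whereas varying the number of category hypotheses per view has limited effect.
The method extends to the long-tail vocabulary of ScanNet200 and transfers semantic priors across datasets.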

This review was generated by AI and reviewed by human editors.