QUICK REVIEW

[Paper Review] Segment Any Point Cloud Sequences by Distilling Vision Foundation Models

Youquan Liu, Lingdong Kong|arXiv (Cornell University)|Jun 15, 2023

3D Shape Modeling and Analysis | Cited by 29

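One-Sentence Summary

Seal distills vision foundation models into self-supervised, semantically aware 3D representations learned from automotive point clouds, supporting scalable, consistent, and generalizable segmentation across diverse datasets.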

ABSTRACT

Recent advancements in vision foundation models (VFMs) have opened up new possibilities for versatile and efficient visual perception. In this work, we introduce Seal, a novel framework that harnesses VFMs for segmenting diverse automotive point cloud sequences. Seal exhibits three appealing properties: i) Scalability: VFMs are directly distilled into point clouds, obviating the need for annotations in either 2D or 3D during pretraining. ii) Consistency: Spatial and temporal relationships are enforced at both the camera-to-LiDAR and point-to-segment regularization stages, facilitating cross-modal representation learning. iii) Generalizability: Seal enables knowledge transfer in an off-the-shelf manner to downstream tasks involving diverse point clouds, including those from real/synthetic, low/high-resolution, large/small-scale, and clean/corrupted datasets. Extensive experiments conducted on eleven different point cloud datasets showcase the effectiveness and superiority of Seal. Notably, Seal achieves a remarkable 45.0% mIoU on nuScenes after linear probing, surpassing random initialization by 36.9% mIoU and outperforming prior arts by 6.1% mIoU. Moreover, Seal demonstrates significant performance gains over existing methods across 20 different few-shot fine-tuning tasks on all eleven tested point cloud datasets.

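Research Motivation and Objectives

Eliminate the need for annotations by using VFMs to pretrain directly on raw point clouds.
Enforce spatial and temporal consistency across the camera-to-LiDAR and point-to-segment relationships.
Use semantic information from 2D VFMs to guide 3D representation learning.
Achieve strong cross-dataset generalization across real/synthetic, low/high-resolution, and corrupted data.
Provide an off-the-shelf transfer mechanism for downstream 3D segmentation tasks.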

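Proposed Method

Generate semantic superpixels in the camera views using a vision foundation model.
Align LiDAR point features with image superpixel features through a cross-modal contrastive loss (L^vfm), thereby distilling knowledge from 2D to 3D.
Project the 3D point features and 2D image features into a shared embedding space using trainable heads (Q and K), followed by normalization.
Enforce semantic consistency through a temporal superpoint regularization loss (L^tmp) between adjacent timestamps.
Apply point-to-segment regularization (L^p2s), which pulls point features toward the mean of their corresponding segment.
Combine L^vfm, L^tmp, and L^p2s into the final objective.
Add robust geometric strategies that handle imperfect camera-LiDAR synchronization by using temporal aggregation and clusters of non-ground segments.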
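
The alignment and point-to-segment steps listed above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The function names, the temperature value of 0.07, the InfoNCE form of the contrastive loss, and the squared-distance form of the point-to-segment term are all assumptions made for illustration; the summary only states that a contrastive loss aligns the two modalities and that point features are pulled toward their segment mean. The trainable Q and K heads are omitted, so the inputs are treated as already projected features.

```python
import math
from collections import defaultdict


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mean_pool(features, segment_ids):
    """Average the feature vectors that share a segment id: {id: mean vector}."""
    groups = defaultdict(list)
    for feat, seg in zip(features, segment_ids):
        groups[seg].append(feat)
    return {seg: [sum(col) / len(rows) for col in zip(*rows)]
            for seg, rows in groups.items()}


def contrastive_loss(point_emb, pixel_emb, temperature=0.07):
    """InfoNCE-style loss (assumed form): row i of both lists is the same segment."""
    q = [l2_normalize(v) for v in point_emb]
    k = [l2_normalize(v) for v in pixel_emb]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def point_to_segment_loss(point_feats, segment_ids):
    """Mean squared distance (assumed form) from each point to its segment mean."""
    means = mean_pool(point_feats, segment_ids)
    total = 0.0
    for feat, seg in zip(point_feats, segment_ids):
        total += sum((a - b) ** 2 for a, b in zip(feat, means[seg]))
    return total / len(point_feats)


# Toy data (hypothetical): four points and four pixels covering two superpixels.
point_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
pixel_feats = [[1.0, 0.1], [0.8, 0.0], [0.1, 1.0], [0.0, 0.8]]
segment_ids = [0, 0, 1, 1]

point_seg = mean_pool(point_feats, segment_ids)
pixel_seg = mean_pool(pixel_feats, segment_ids)
ids = sorted(point_seg)
loss_vfm = contrastive_loss([point_seg[s] for s in ids],
                            [pixel_seg[s] for s in ids])
loss_p2s = point_to_segment_loss(point_feats, segment_ids)
print(loss_vfm, loss_p2s)
```

In this sketch the contrastive term is small when each pooled point feature is most similar to the pooled pixel feature of the same superpixel, and it grows when the pairs are mismatched. A real pipeline would compute these quantities on batched tensors in a deep learning framework and add the temporal term L^tmp, which is not shown here.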

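Experimental Results

Research Questions

RQ1: Can vision foundation models provide semantic supervision for 3D point cloud segmentation without 3D annotations?
RQ2: Does cross-modal (2D-3D) distillation improve representation learning for automotive LiDAR data across diverse datasets?
RQ3: Does temporal consistency of semantic superpoints improve robustness and generalization across different sensors and conditions?
RQ4: Do the learned representations transfer to downstream tasks and to datasets with varying fidelity and noise?
RQ5: How do different vision foundation models affect cross-modal distillation and final segmentation performance?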

Key Findings

Seal reaches 45.0% mIoU on nuScenes under linear probing, surpassing random initialization by 36.9% mIoU and prior work by 6.1% mIoU.
Seal consistently outperforms prior methods across 20 few-shot fine-tuning tasks on all 11 datasets.
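In robustness tests on nuScenes-C, Seal is more robust, with higher overall mIoU under a range of corruption conditions.
Gains vary across VFMs; SEEM and SAM generally yield larger improvements than the SLIC-based baseline, and Seal consistently outperforms SLidR.
Semi-supervised variants trained with partial annotations retain strong performance, sometimes even surpassing certain fully supervised methods.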
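
The findings above are reported in mIoU (mean intersection-over-union). As a reminder of what that number measures, here is a minimal sketch in plain Python. The function name `mean_iou` and the toy labels are hypothetical, and the choice to skip classes that appear in neither the prediction nor the ground truth is an assumption; benchmark toolkits may apply their own rules, such as ignoring specific labels.

```python
def mean_iou(pred, target, num_classes):
    """Average per-class IoU over classes present in pred or target."""
    ious = []
    for c in range(num_classes):
        intersection = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both (assumed convention)
            ious.append(intersection / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy per-point labels (hypothetical): five points, three classes.
pred = [0, 0, 1, 1, 2]
target = [0, 1, 1, 1, 2]
# Class IoUs are 1/2, 2/3, and 1, so the mean is 13/18 (about 0.722).
print(mean_iou(pred, target, num_classes=3))
```

Under this definition, a gap such as 36.9% mIoU between two models is a difference in the class-averaged overlap, so rare classes count as much as frequent ones.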

This review was generated by AI and checked by a human editor.