QUICK REVIEW

[Paper Review] P4Contrast: Contrastive Learning with Pairs of Point-Pixel Pairs for RGB-D Scene Understanding

Yunze Liu, Yi Li|arXiv (Cornell University)|Dec 24, 2020

Domain Adaptation and Few-Shot Learning | References: 59 | Cited by: 34

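One-Sentence Summary

P4Contrast introduces a contrastive pretraining task over pairs of point-pixel pairs that fuses RGB and geometric information for RGB-D scene understanding, improving semantic segmentation and 3D object detection on ScanNet, SUN RGB-D, and 3RScan.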

ABSTRACT

Self-supervised representation learning is a critical problem in computer vision, as it provides a way to pretrain feature extractors on large unlabeled datasets that can be used as an initialization for more efficient and effective training on downstream tasks. A promising approach is to use contrastive learning to learn a latent space where features are close for similar data samples and far apart for dissimilar ones. This approach has demonstrated tremendous success for pretraining both image and point cloud feature extractors, but it has been barely investigated for multi-modal RGB-D scans, especially with the goal of facilitating high-level scene understanding. To solve this problem, we propose contrasting "pairs of point-pixel pairs", where positives include pairs of RGB-D points in correspondence, and negatives include pairs where one of the two modalities has been disturbed and/or the two RGB-D points are not in correspondence. This provides extra flexibility in making hard negatives and helps networks to learn features from both modalities, not just the more discriminating one of the two. Experiments show that this proposed approach yields better performance on three large-scale RGB-D scene understanding benchmarks (ScanNet, SUN RGB-D, and 3RScan) than previous pretraining approaches.

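Motivation and Objectives

Advocate self-supervised learning of dense RGB-D representations suited to 3D scene understanding.
Propose a novel pretraining task that uses pairs of point-pixel pairs to fuse RGB and geometric information.
Demonstrate gains over previous pretraining approaches on multiple RGB-D benchmarks.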

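Proposed Method

Define a point-pixel pair as a dense RGB-D token built from synchronized RGB and depth observations.
Create two views of each scene and build anchor, positive, and negative point-pixel pairs; the negatives include disturbed pairs, which force the network to learn RGB and geometry jointly.
Use the PairInfoNCE loss to pull anchor-positive pairs together and push anchor-negative pairs apart.
Adopt a 2D-3D context backbone that combines SR-UNet (3D) and FuseNet (2D) to obtain fused RGB-D representations.
Apply a progressive hardness schedule to the partially disturbed negatives to balance learning difficulty.
Train with RGB-D data augmentation, including point jitter for the 3D input and Gaussian noise for the RGB input.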
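
The contrastive step in the list above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's PairInfoNCE implementation: `encode` is a hypothetical stand-in for the fused backbone that simply concatenates a geometry feature and a color feature, `info_nce` is the generic InfoNCE loss for a single anchor, and the temperature of 0.1 and all feature values are made up for the example.

```python
import math


def encode(geometry, color):
    # Hypothetical stand-in for the fused RGB-D backbone: concatenate features.
    return list(geometry) + list(color)


def _unit(vector):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]


def info_nce(anchor, positive, negatives, temperature=0.1):
    """Generic InfoNCE for one anchor: -log softmax probability of the positive."""
    a = _unit(anchor)
    candidates = [positive] + list(negatives)
    logits = [sum(x * y for x, y in zip(a, _unit(c))) / temperature
              for c in candidates]
    peak = max(logits)  # subtract the max for numerical stability
    log_partition = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_partition - logits[0]


# Toy features (assumed values): the same RGB-D point seen in two views.
geo_view1, rgb_view1 = [1.0, 0.0], [0.0, 1.0]
geo_view2, rgb_view2 = [0.9, 0.1], [0.1, 0.9]
# A different RGB-D point in the second view.
other_geo, other_rgb = [0.0, 1.0], [1.0, 0.0]

anchor = encode(geo_view1, rgb_view1)
positive = encode(geo_view2, rgb_view2)
# Negative 1: a pair that is not in correspondence with the anchor.
non_corresponding = encode(other_geo, other_rgb)
# Negative 2: a disturbed pair, correct geometry but color from another point.
disturbed_color = encode(geo_view2, other_rgb)

loss_easy = info_nce(anchor, positive, [non_corresponding])
loss_disturbed = info_nce(anchor, positive, [non_corresponding, disturbed_color])
print(f"easy negatives only: {loss_easy:.6f}")
print(f"with disturbed pair: {loss_disturbed:.6f}")
```

In this toy setup the disturbed pair shares the anchor's geometry, so it scores closer to the anchor than the non-corresponding pair and raises the loss. An encoder that ignored color could not separate it from the positive at all, which is the intuition behind using disturbed pairs as hard negatives.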

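Experimental Results

Research Questions

RQ1: Does a contrastive objective over pairs of point-pixel pairs promote RGB-D fusion more effectively than single-modality or simple cross-modal contrast?
RQ2: Does the 2D-3D context backbone improve RGB-D feature learning over 3D-only or 2D-only baselines?
RQ3: Do disturbed (partially negative) point-pixel pairs improve the learning of jointly informative RGB-D features?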

Key Findings

Method	Input	mIoU_K5	mIoU_K3
From scratch	Geo	71.3	72.1
PointContrast [63]	Geo+RGB	N/A	74.1
PointContrast¹	Geo	72.4	73.2
PointContrast¹	Geo+RGB	72.7	73.8
P4Contrast (3D context)	Geo+RGB	73.6	74.3
P4Contrast (2D-3D context)	Geo+RGB	74.6	75.0

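P4Contrast improves downstream performance on three tasks: semantic segmentation on ScanNetV2 and 3RScan, and 3D object detection on SUN RGB-D.
On ScanNetV2 semantic segmentation, P4Contrast (2D-3D context) reaches 75.0 mIoU (K3), compared with 72.1 (K3) when training from scratch and 73.8 (K3) for one of the PointContrast variants with Geo+RGB input.
On 3RScan semantic segmentation, P4Contrast (2D-3D context) reaches 41.7 mIoU, above 38.8 (PointContrast) and 37.3 (from scratch).
On SUN RGB-D 3D object detection, P4Contrast achieves 63.5 mAP@0.25, exceeding the VoteNet, PointContrast, and ImVoteNet baselines.
With limited fine-tuning data, P4Contrast yields notable gains, for example a 4.5 mIoU improvement when only 10% of the ScanNet training data is used.
The 2D-3D context backbone with joint RGB-D fusion outperforms single-modality point methods and point methods that simply append RGB.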
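
The per-column gains behind these comparisons can be recomputed directly from the table. The snippet below is a small sketch that assumes the table reports ScanNetV2 semantic segmentation mIoU (as the 75.0 figure suggests); the dictionary layout, the row labels, and the helper name `gain_over_baseline` are illustrative choices, and only the numbers come from the table.

```python
# mIoU values from the table: (method, input) -> (mIoU_K5, mIoU_K3).
# None stands for the N/A entry.
scannet_miou = {
    ("From scratch", "Geo"): (71.3, 72.1),
    ("PointContrast [63]", "Geo+RGB"): (None, 74.1),
    ("PointContrast (footnote 1)", "Geo"): (72.4, 73.2),
    ("PointContrast (footnote 1)", "Geo+RGB"): (72.7, 73.8),
    ("P4Contrast (3D context)", "Geo+RGB"): (73.6, 74.3),
    ("P4Contrast (2D-3D context)", "Geo+RGB"): (74.6, 75.0),
}

baseline = scannet_miou[("From scratch", "Geo")]


def gain_over_baseline(scores):
    # Compare column by column; keep None where a score is unavailable.
    return tuple(
        None if score is None else round(score - base, 1)
        for score, base in zip(scores, baseline)
    )


gains = {key: gain_over_baseline(scores) for key, scores in scannet_miou.items()}
for (method, inputs), (k5, k3) in gains.items():
    print(f"{method:<28} {inputs:<8} K5: {k5}  K3: {k3}")
```

Comparing within a single column keeps the K5 and K3 settings from being mixed when reading off improvements.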

This review was generated by AI and reviewed by human editors.