QUICK REVIEW

[Paper Review] Embodied View-Contrastive 3D Feature Learning

Adam W. Harley, Fangyu Li|arXiv (Cornell University)|Jun 10, 2019

Advanced Vision and Imaging | Cited by 2

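One-Sentence Summary

This paper proposes a self-supervised 3D feature learning framework based on view-contrastive prediction, aimed at improving 3D visual recognition. Using 2.5D (color and depth) video streams from a moving camera, the model disentangles scene content from camera motion, projects its 3D features to novel viewpoints, and learns representations with a contrastive loss; the learned features benefit both semi-supervised 3D object detection and unsupervised 3D moving object detection.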

ABSTRACT

Predictive coding theories suggest that the brain learns by predicting observations at various levels of abstraction. One of the most basic prediction tasks is view prediction: how would a given scene look from an alternative viewpoint? Humans excel at this task. Our ability to imagine and fill in missing information is tightly coupled with perception: we feel as if we see the world in 3 dimensions, while in fact, information from only the front surface of the world hits our retinas. This paper explores the role of view prediction in the development of 3D visual recognition. We propose neural 3D mapping networks, which take as input 2.5D (color and depth) video streams captured by a moving camera, and lift them to stable 3D feature maps of the scene, by disentangling the scene content from the motion of the camera. The model also projects its 3D feature maps to novel viewpoints, to predict and match against target views. We propose contrastive prediction losses to replace the standard color regression loss, and show that this leads to better performance on complex photorealistic data. We show that the proposed model learns visual representations useful for (1) semi-supervised learning of 3D object detectors, and (2) unsupervised learning of 3D moving object detectors, by estimating the motion of the inferred 3D feature maps in videos of dynamic scenes. To the best of our knowledge, this is the first work that empirically shows view prediction to be a scalable self-supervised task beneficial to 3D object detection.

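Motivation and Goals

Investigate whether view prediction can serve as a scalable self-supervised pretraining task for 3D visual recognition.
Develop a neural 3D mapping network that disentangles scene content from camera motion in 2.5D (color and depth) video streams.
Improve 3D feature representation learning by replacing the standard color regression loss with contrastive prediction losses.
Evaluate the learned features on downstream tasks: semi-supervised 3D object detection and unsupervised 3D moving object detection.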

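Proposed Method

The model takes as input 2.5D video streams (RGB and depth) captured by a moving camera.
A neural 3D mapping network lifts these streams to stable 3D feature maps of the scene by disentangling the static scene content from the motion of the camera.
The model projects the learned 3D features to a novel viewpoint and compares the projection against the actual target view using a contrastive prediction loss.
The contrastive loss replaces standard pixel-level color regression, which encourages more discriminative features that generalize better.
The network is trained end to end on the contrastive prediction objective, which improves feature quality on complex photorealistic data.
In dynamic scenes, the model estimates the motion of the inferred 3D feature maps over time, which enables unsupervised detection of moving objects.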
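
The lifting step above can be illustrated with a minimal geometric sketch. It is not the authors' network: it only shows why back-projecting depth pixels and undoing the camera motion gives a stable 3D location for the same scene point seen from two views. The pinhole intrinsics (`fx`, `fy`, `cx`, `cy`), the known camera-to-reference pose, the voxel grid parameters, and all function names are assumptions made for illustration; the summary does not say how the model obtains camera motion.

```python
import math


def unproject(u, v, depth, fx, fy, cx, cy):
    """Back-project pixel (u, v) with depth along the optical axis
    into 3D camera coordinates, assuming a pinhole camera."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def to_reference(point_cam, rotation, translation):
    """Apply the rigid transform p_ref = R * p_cam + t.
    `rotation` is a 3x3 nested list, `translation` a length-3 sequence.
    The pose is assumed to be given (hypothetical input)."""
    return tuple(
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    )


def voxel_index(point, origin, voxel_size):
    """Map a 3D point to the integer index of the voxel that contains it."""
    return tuple(
        int(math.floor((point[i] - origin[i]) / voxel_size)) for i in range(3)
    )


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
INTRINSICS = dict(fx=100.0, fy=100.0, cx=50.0, cy=50.0)

# View A: the camera sits at the reference pose and sees the point at the image centre.
point_a = to_reference(unproject(50, 50, 2.0, **INTRINSICS), IDENTITY, (0.0, 0.0, 0.0))

# View B: the camera has moved 1 unit along x, so the same point appears at pixel (0, 50).
point_b = to_reference(unproject(0, 50, 2.0, **INTRINSICS), IDENTITY, (1.0, 0.0, 0.0))

# After undoing the camera motion, both observations fall into the same voxel.
voxel_a = voxel_index(point_a, origin=(-4.0, -4.0, 0.0), voxel_size=0.5)
voxel_b = voxel_index(point_b, origin=(-4.0, -4.0, 0.0), voxel_size=0.5)
```

In this toy setup the two observations agree once they are expressed in the reference frame, which is the sense in which a 3D map can stay stable while the camera moves. A learned model would store feature vectors in the voxels rather than bare occupancy.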

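Experimental Results

Research Questions

RQ1: Can view prediction serve as a scalable self-supervised pretraining task for 3D visual recognition?
RQ2: Does contrastive prediction of novel views yield better 3D feature representations than standard color regression?
RQ3: Can the learned 3D features improve semi-supervised 3D object detection?
RQ4: Can the model detect moving objects in 3D scenes without supervision by estimating motion in its 3D feature maps?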

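Key Findings

The proposed view-contrastive objective outperforms the standard color regression loss at learning 3D visual representations on complex photorealistic data.
Self-supervised pretraining with this objective yields features that are useful for semi-supervised learning of 3D object detectors.
By estimating the motion of its 3D feature maps, the model performs unsupervised 3D moving object detection, which demonstrates the practical value of the learned representations.
Disentangling scene content from camera motion makes it possible to build stable 3D feature maps from 2.5D (color and depth) video streams.
Compared with regression-based supervision, contrastive prediction losses produce features that are more discriminative and generalize better.
To the authors' knowledge, this is the first work to show empirically that view prediction is a scalable self-supervised task that benefits 3D object detection.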
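
The difference between regression and contrastive supervision can be sketched in a few lines. The summary does not give the exact form of the paper's loss, so the contrastive function below is an InfoNCE-style stand-in chosen for illustration; the function names, the `temperature` value, and the toy feature vectors are all assumptions.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def normalize(v):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def regression_loss(predicted, target):
    """Mean squared error, the kind of loss used for colour regression."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


def contrastive_loss(predicted, candidates, positive_index, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in, not the paper's exact loss):
    cross-entropy of picking the matching target feature among all candidates,
    scored by cosine similarity divided by a temperature."""
    query = normalize(predicted)
    logits = [dot(query, normalize(c)) / temperature for c in candidates]
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return log_sum - logits[positive_index]


# A predicted feature for one location in the target view (toy values).
predicted = [1.0, 0.0]
# Candidate target features: index 0 is the true match, the rest are negatives.
candidates = [[0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]

loss_matched = contrastive_loss(predicted, candidates, positive_index=0)
loss_mismatched = contrastive_loss(predicted, candidates, positive_index=1)
```

Here `loss_matched` is much smaller than `loss_mismatched`, because the prediction only has to be closer to its true match than to the negatives. A regression loss instead penalizes every deviation from the exact target values, which is one plausible reading of why the contrastive objective fares better on complex photorealistic data.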


This review was generated by AI and checked by a human editor.