QUICK REVIEW

[Paper Review] Multimodal Virtual Point 3D Detection

Tianwei Yin, Xingyi Zhou | arXiv (Cornell University) | Nov 12, 2021

Advanced Optical Sensing Technologies | 127 citations

TL;DR

The paper introduces MVP, a plug-and-play method that lifts 2D RGB detections into dense 3D virtual points to augment sparse LiDAR data, improving CenterPoint-based 3D detection on nuScenes by 6.6 mAP without ensembles.

ABSTRACT

Lidar-based sensing drives current autonomous vehicles. Despite rapid progress, current Lidar sensors still lag two decades behind traditional color cameras in terms of resolution and cost. For autonomous driving, this means that large objects close to the sensors are easily visible, but far-away or small objects comprise only one measurement or two. This is an issue, especially when these objects turn out to be driving hazards. On the other hand, these same objects are clearly visible in onboard RGB sensors. In this work, we present an approach to seamlessly fuse RGB sensors into Lidar-based 3D recognition. Our approach takes a set of 2D detections to generate dense 3D virtual points to augment an otherwise sparse 3D point cloud. These virtual points naturally integrate into any standard Lidar-based 3D detectors along with regular Lidar measurements. The resulting multi-modal detector is simple and effective. Experimental results on the large-scale nuScenes dataset show that our framework improves a strong CenterPoint baseline by a significant 6.6 mAP, and outperforms competing fusion approaches. Code and more visualizations are available at https://tianweiy.github.io/mvp/

Motivation & Objective

Improve 3D perception for autonomous driving, where LiDAR is sparse at long range but RGB is rich in detail.
Propose a simple, plug-and-play fusion scheme that augments LiDAR with dense virtual points derived from 2D detections.
Enable seamless integration with existing 3D detectors by modifying the input feature representation rather than the backbone.
Demonstrate that dense virtual points improve detection accuracy, especially for small and distant objects, on a large-scale dataset.

Proposed method

Generate τ virtual points for each detected object j, using 2D instance masks from CenterNet2.
Project LiDAR points into the RGB camera frame to form frustums Fj for each 2D detection.
Sample τ 2D points within each instance mask and assign depth from the nearest LiDAR projection within Fj.
Unproject sampled points back to 3D using depth and append the object's semantic features to form virtual points.
Combine virtual points with real LiDAR points by averaging virtual-point and real-point features separately within each voxel, then feed the combined features into a CenterPoint-style backbone.
Optionally use a second-stage refinement that leverages surface center features for improved localization.
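The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the single-focal-length pinhole camera, sampling with replacement, and the use of a class-score vector as the "semantic features" are all hypothetical choices made for the example. LiDAR points are assumed to be already transformed into the camera frame, and the frustum Fj is taken to be the projected LiDAR points that fall inside the instance mask.

```python
import random


def project(point, focal, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to (u, v, depth)."""
    x, y, z = point
    return focal * x / z + cx, focal * y / z + cy, z


def unproject(u, v, depth, focal, cx, cy):
    """Inverse of project: lift pixel (u, v) at a given depth back to 3D."""
    return (u - cx) * depth / focal, (v - cy) * depth / focal, depth


def generate_virtual_points(lidar_points, mask_pixels, class_scores,
                            tau, focal, cx, cy, rng):
    """Return tau virtual points (x, y, z, *class_scores) for one object,
    or an empty list if no LiDAR point projects into the mask.

    lidar_points: iterable of (x, y, z) already in the camera frame.
    mask_pixels:  set of integer (u, v) pixels covered by the instance mask.
    """
    # Frustum F_j: projected LiDAR points that land inside the mask
    # (assumed membership test for this sketch).
    frustum = []
    for point in lidar_points:
        if point[2] <= 0:  # behind the camera
            continue
        u, v, depth = project(point, focal, cx, cy)
        if (round(u), round(v)) in mask_pixels:
            frustum.append((u, v, depth))
    if not frustum:
        return []  # no depth evidence, so no virtual points

    # Sample tau pixels uniformly from the mask (with replacement: an assumption).
    pixels = sorted(mask_pixels)
    samples = [rng.choice(pixels) for _ in range(tau)]

    virtual_points = []
    for u, v in samples:
        # Borrow depth from the nearest projected LiDAR point in pixel space.
        nearest = min(frustum, key=lambda q: (q[0] - u) ** 2 + (q[1] - v) ** 2)
        xyz = unproject(u, v, nearest[2], focal, cx, cy)
        virtual_points.append(xyz + tuple(class_scores))
    return virtual_points


# Toy scene: a 5x5-pixel mask centred on the principal point.
mask = {(u, v) for u in range(48, 53) for v in range(48, 53)}
lidar = [(0.0, 0.0, 10.0), (0.01, 0.01, 12.0), (5.0, 0.0, 10.0)]
points = generate_virtual_points(lidar, mask, class_scores=[0.9, 0.1],
                                 tau=8, focal=100.0, cx=50.0, cy=50.0,
                                 rng=random.Random(0))
print(len(points), "virtual points; first:", points[0])
```

In the toy scene, two of the three LiDAR points project into the mask, so every virtual point inherits a depth of either 10 or 12 metres. A real pipeline would vectorize the nearest-neighbor lookup and use calibrated camera intrinsics and extrinsics.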

Experimental results

Research questions

RQ1: Can dense 3D virtual points generated from 2D detections meaningfully improve LiDAR-based 3D detectors on urban scenes?
RQ2: How does MVP interact with existing backbones (VoxelNet, PointPillars) and detectors (CenterPoint) without ensembles or TTA?
RQ3: How robust is MVP to variations in 2D detection quality and depth estimation accuracy?
RQ4: What gains are achieved across object distances (near vs far) and across object categories in nuScenes?

Key findings

Method	mAP	NDS	Car	Truck	Bus	Trailer	Constr. vehicle	Pedestrian	Motorcycle	Bicycle	Traffic cone	Barrier
PointPillars [23]	30.5	45.3	68.4	23.0	28.2	23.4	4.1	59.7	27.4	1.1	30.8	38.9
WYSIWYG [19]	35.0	41.9	79.1	30.4	46.6	40.1	7.1	65.0	18.2	0.1	28.8	34.7
3DSSD [62]	42.6	56.4	81.2	47.2	61.4	30.5	12.6	70.2	36.0	8.6	31.1	47.9
PMPNet [65]	45.4	53.1	79.7	33.6	47.1	43.1	18.1	76.5	40.7	7.9	58.8	48.8
PointPainting [52]	46.4	58.1	77.9	35.8	36.2	37.3	15.8	73.3	41.5	24.1	62.4	60.2
CBGS [76]	52.8	63.3	81.1	48.5	54.9	42.9	10.5	80.1	51.5	22.3	70.9	65.7
CVCNet [4]	55.3	64.4	82.7	46.1	46.6	49.4	22.6	79.8	59.1	31.4	65.6	69.6
HotSpotNet [5]	59.3	66.0	83.1	50.9	56.4	53.3	23.0	81.3	63.5	36.6	73.0	71.6
CenterPoint [66]	58.0	65.5	84.6	51.0	60.2	53.2	17.5	83.4	53.7	28.7	76.7	70.9
MVP (Ours)	66.4	70.5	86.8	58.5	67.4	57.3	26.1	89.1	70.0	49.3	85.0	74.8

MVP improves a strong CenterPoint baseline by 6.6 mAP on nuScenes.
MVP achieves 66.4 mAP and 70.5 NDS without ensembles, surpassing all non-ensembled methods on nuScenes at submission time.
Dense virtual points yield large gains for small objects (e.g., +11 mAP for small objects, +20.6 AP for Bicycle, and +16.3 AP for Motorcycle over CenterPoint).
In a 2D detection comparison, the image-based CenterNet outperforms the LiDAR-based CenterPoint by 9.8 mAP, highlighting the value of high-resolution RGB cues for 3D detection.
Ablation shows virtual points alone provide substantial gains (6.3 mAP with VoxelNet, 10.4 mAP with PointPillars); two-stage refinement adds further improvements (≈1.1 mAP, ≈0.8 NDS).
On KITTI, MVP provides measurable gains (0.5 mAP for Car, 2.3 mAP for Cyclist), suggesting that the approach generalizes beyond nuScenes.
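The per-class gains quoted above can be checked directly against the CenterPoint and MVP rows of the results table. The short script below is only an arithmetic check on those two rows; the variable names are illustrative, and the column names are spelled out in full.

```python
# Column order follows the results table.
columns = ["mAP", "NDS", "Car", "Truck", "Bus", "Trailer",
           "Construction vehicle", "Pedestrian", "Motorcycle",
           "Bicycle", "Traffic cone", "Barrier"]
centerpoint = [58.0, 65.5, 84.6, 51.0, 60.2, 53.2,
               17.5, 83.4, 53.7, 28.7, 76.7, 70.9]
mvp = [66.4, 70.5, 86.8, 58.5, 67.4, 57.3,
       26.1, 89.1, 70.0, 49.3, 85.0, 74.8]

# Gain of MVP over CenterPoint per column, rounded to one decimal place
# to match the precision of the table.
gains = {name: round(m - c, 1)
         for name, c, m in zip(columns, centerpoint, mvp)}

# Print columns from largest to smallest gain.
for name, gain in sorted(gains.items(), key=lambda item: -item[1]):
    print(f"{name:<22}{gain:+.1f}")
```

Bicycle (+20.6) and Motorcycle (+16.3) show the largest gains, matching the figures above. The mAP difference between these two table rows is 8.4, so the 6.6 mAP improvement cited from the abstract cannot be derived from this table and presumably refers to a different evaluation setting.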

This review was created by AI and reviewed by human editors.