QUICK REVIEW

[Paper Review] FCOS3D: Fully Convolutional One-Stage Monocular 3D Object Detection

Tai Wang, Xinge Zhu|arXiv (Cornell University)|Apr 22, 2021

Advanced Neural Network Applications|References: 38|Citations: 27

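One-Line Summary

FCOS3D adapts an anchor-free 2D detector to monocular 3D object detection: it projects 3D targets onto the image plane and relies on 3D-center-based supervision and multi-level 3D prediction. It achieves the best performance among vision-only methods on the nuScenes camera track.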

ABSTRACT

Monocular 3D object detection is an important task for autonomous driving considering its advantage of low cost. It is much more challenging than conventional 2D cases due to its inherent ill-posed property, which is mainly reflected in the lack of depth information. Recent progress on 2D detection offers opportunities to better solving this problem. However, it is non-trivial to make a general adapted 2D detector work in this 3D task. In this paper, we study this problem with a practice built on a fully convolutional single-stage detector and propose a general framework FCOS3D. Specifically, we first transform the commonly defined 7-DoF 3D targets to the image domain and decouple them as 2D and 3D attributes. Then the objects are distributed to different feature levels with consideration of their 2D scales and assigned only according to the projected 3D-center for the training procedure. Furthermore, the center-ness is redefined with a 2D Gaussian distribution based on the 3D-center to fit the 3D target formulation. All of these make this framework simple yet effective, getting rid of any 2D detection or 2D-3D correspondence priors. Our solution achieves 1st place out of all the vision-only methods in the nuScenes 3D detection challenge of NeurIPS 2020. Code and models are released at https://github.com/open-mmlab/mmdetection3d.

Motivation and Objectives

Translate 7-DoF 3D targets into image-domain representations to leverage 2D detector strengths.
Decouple 3D attributes into 2D center shifts and 3D size/pose for regression.
Distribute targets across feature pyramid levels guided by 2D scales and 3D centers.
Redefine center-ness with a 2D Gaussian based on the 3D center to reflect 3D target geometry.
Achieve monocular 3D detection without 2D-3D priors while maintaining training/inference efficiency.
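
As a rough illustration of the first two objectives, the sketch below projects a 3D box center given in camera coordinates onto the image with a pinhole model, yielding a 2.5D center (pixel position plus depth), and then expresses it as an offset from a feature-map location. The function names, the intrinsic matrix values, the zero-skew camera, and the choice to leave offsets in raw pixels are assumptions made for illustration; they are not taken from the paper or from mmdetection3d.

```python
def project_center(center_cam, K):
    """Project a 3D center (x, y, z) in camera coordinates to (u, v, depth).

    K is a 3x3 pinhole intrinsic matrix given as nested lists; zero skew is assumed.
    """
    x, y, z = center_cam
    if z <= 0:
        raise ValueError("center must lie in front of the camera (z > 0)")
    fx, fy = K[0][0], K[1][1]
    cx, cy = K[0][2], K[1][2]
    u = fx * x / z + cx  # horizontal pixel coordinate
    v = fy * y / z + cy  # vertical pixel coordinate
    return u, v, z       # the "2.5D center": image position plus depth


def decouple_offsets(center_2p5d, location):
    """Split a 2.5D center into a 2D offset from a feature location and a depth.

    Hypothetical helper: offsets are left in pixels, without stride normalization.
    """
    u, v, depth = center_2p5d
    px, py = location
    return u - px, v - py, depth


# Illustrative intrinsics and box center (made-up values).
K = [[1000.0, 0.0, 800.0],
     [0.0, 1000.0, 450.0],
     [0.0, 0.0, 1.0]]
center = project_center((2.0, 1.0, 20.0), K)          # (900.0, 500.0, 20.0)
targets = decouple_offsets(center, (896.0, 496.0))    # (4.0, 4.0, 20.0)
```

The remaining attributes (size, orientation, velocity) are not affected by the projection and would be regressed alongside these three values.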

Proposed Method

Build on FCOS with a ResNet101 backbone and FPN to create multi-scale feature maps (P3–P7).
Project 3D targets onto the image to obtain a 2.5D center, then decouple the regression targets into 2D offsets (Δx, Δy) and depth (d), plus 3D size (w, l, h), orientation (θ), and velocity (vx, vy).
Predict rotation using a 2-bin direction encoding plus an angle component to resolve opposite orientations.
Assign targets to feature levels using 2D scale guidance and 3D-center-based foreground criteria; use distance-based center sampling to mitigate ambiguity.
Use a center-ness score c modeled as a 2D Gaussian around the projected 3D center; train c with BCE loss.
Train with focal loss for classification, softmax/BCE for attributes and direction, and Smooth-L1 for regression targets with carefully chosen weights.
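
Two of the steps above lend themselves to a small sketch: the Gaussian center-ness target and the 2-bin rotation encoding. The code below is one plausible reading under stated assumptions, not the authors' implementation. The decay constant `alpha=2.5`, the unit of the offsets, and the exact split of the yaw into a direction bin plus an angle with period π are assumptions, and all function names are hypothetical.

```python
import math


def gaussian_centerness(dx, dy, alpha=2.5):
    """Center-ness target as a 2D Gaussian of the offset to the projected 3D center.

    Returns 1.0 at the center and decays toward 0 with distance.
    alpha is an assumed decay constant.
    """
    return math.exp(-alpha * (dx * dx + dy * dy))


def encode_yaw(theta):
    """Split a yaw angle into (angle, direction bin).

    The angle lies in [0, pi) up to floating-point rounding; the bin (0 or 1)
    records which of the two opposite headings is meant.
    """
    t = theta % (2 * math.pi)              # wrap to [0, 2*pi)
    direction = 1 if t >= math.pi else 0   # second half-turn -> bin 1
    return t - direction * math.pi, direction


def decode_yaw(angle, direction):
    """Invert encode_yaw and wrap the result to [-pi, pi)."""
    t = angle + direction * math.pi
    return (t + math.pi) % (2 * math.pi) - math.pi


# Two opposite headings share (almost) the same angle but differ in the bin.
a0, b0 = encode_yaw(0.3)
a1, b1 = encode_yaw(0.3 - math.pi)
```

Under this encoding the regression branch only has to resolve the angle within a half-turn, and the 2-way classifier settles which of the two opposite directions the object faces.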

Experimental Results

Research Questions

RQ1: Can a simple anchor-free 2D detector be repurposed to predict 3D attributes from monocular images without 2D-3D priors?
RQ2: How should 3D targets be reformulated and assigned to 2D feature levels to maximize recall and accuracy in monocular 3D detection?
RQ3: Does a 2D Gaussian-based center-ness tied to a projected 3D center suppress low-quality predictions better than the original FCOS center-ness in 3D settings?
RQ4: What is the impact of depth re-parameterization and disentangled regression heads on 3D orientation and overall detection score on nuScenes?
RQ5: What are the performance benefits of depth-space loss re-parameterization and distance-based target assignment for large objects?

Key Findings

Method	Dataset	Modality	mAP	mATE	mASE	mAOE	mAVE	mAAE	NDS
CenterFusion	test	Camera & Radar	0.326	0.631	0.261	0.516	0.614	0.115	0.449
PointPillars	test	LiDAR	0.305	0.517	0.290	0.500	0.316	0.368	0.453
MEGVII	test	LiDAR	0.528	0.300	0.247	0.379	0.245	0.140	0.633
LRM0	test	Camera	0.294	0.752	0.265	0.603	1.582	0.14	0.371
MonoDIS	test	Camera	0.304	0.738	0.263	0.546	1.553	0.134	0.384
CenterNet	test	Camera (HGLS)	0.338	0.658	0.255	0.629	1.629	0.142	0.4
Noah CV Lab	test	Camera	0.331	0.660	0.262	0.354	1.663	0.198	0.418
FCOS3D (Ours)	test	Camera	0.358	0.690	0.249	0.452	1.434	0.124	0.428
CenterNet	val	Camera (HGLS)	0.306	0.716	0.264	0.609	1.426	0.658	0.328
FCOS3D (Ours)	val	Camera	0.343	0.725	0.263	0.422	1.292	0.153	0.415

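FCOS3D achieves 0.358 mAP and 0.428 NDS on the nuScenes test set with camera-only (RGB) input, exceeding every camera-only baseline in the table (LRM0, MonoDIS, CenterNet, Noah CV Lab) on both metrics.
On the validation set, FCOS3D attains 0.343 mAP and 0.415 NDS, compared with 0.306 mAP and 0.328 NDS for the CenterNet baseline in the table.
Against the non-camera entries, camera-only FCOS3D exceeds CenterFusion (camera and radar) and PointPillars (LiDAR) in mAP (0.358 vs. 0.326 and 0.305) and in orientation error (mAOE 0.452 vs. 0.516 and 0.500), though both still score a higher NDS (0.449 and 0.453 vs. 0.428) and the LiDAR-based MEGVII entry remains well ahead (0.528 mAP, 0.633 NDS). The orientation result is attributed to the 2-bin direction encoding.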
Ablations show that depth loss in original space, distance-based target assignment, stronger backbones (ResNet101, DCN), and disentangled regression heads substantially improve mAP and NDS.
The final architecture benefits from test-time augmentation and more training epochs, achieving state-of-the-art among vision-only approaches in the nuScenes camera track.
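
To make the NDS column easier to read, the sketch below recomputes it from the other columns using the nuScenes detection score definition, NDS = (5 · mAP + Σ (1 − min(1, mTP))) / 10, where the sum runs over the five true-positive error metrics (mATE, mASE, mAOE, mAVE, mAAE). The function name `nds` is hypothetical; the input numbers are the FCOS3D rows from the table above.

```python
def nds(m_ap, tp_errors):
    """nuScenes detection score from mAP and the five mean TP errors.

    Each error is clipped at 1 and turned into a score (1 - error),
    so an error above 1 (such as mAVE here) contributes nothing.
    """
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors]
    return (5.0 * m_ap + sum(tp_scores)) / 10.0


# Error order: mATE, mASE, mAOE, mAVE, mAAE (values copied from the table).
fcos3d_test = nds(0.358, [0.690, 0.249, 0.452, 1.434, 0.124])
fcos3d_val = nds(0.343, [0.725, 0.263, 0.422, 1.292, 0.153])
print(f"test NDS ~ {fcos3d_test:.4f}, val NDS ~ {fcos3d_val:.4f}")
```

Both values agree with the table's 0.428 and 0.415 to within rounding. The calculation also shows why the camera-only methods lose ground on NDS: their velocity error (mAVE) exceeds 1, so that term contributes zero to the score.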

This review was generated by AI and checked by a human editor.