[Paper Review] SurroundOcc: Multi-Camera 3D Occupancy Prediction for Autonomous Driving
SurroundOcc predicts dense 3D occupancy from multi-camera images using a 2D-3D spatial attention mechanism and multi-scale 3D volumes. It includes a pipeline for generating dense occupancy ground truth and achieves state-of-the-art results on nuScenes and SemanticKITTI.
3D scene understanding plays a vital role in vision-based autonomous driving. While most existing methods focus on 3D object detection, they have difficulty describing real-world objects of arbitrary shapes and infinite classes. Towards a more comprehensive perception of a 3D scene, in this paper, we propose a SurroundOcc method to predict the 3D occupancy with multi-camera images. We first extract multi-scale features for each image and adopt spatial 2D-3D attention to lift them to the 3D volume space. Then we apply 3D convolutions to progressively upsample the volume features and impose supervision on multiple levels. To obtain dense occupancy prediction, we design a pipeline to generate dense occupancy ground truth without expensive occupancy annotations. Specifically, we fuse multi-frame LiDAR scans of dynamic objects and static scenes separately. Then we adopt Poisson Reconstruction to fill the holes and voxelize the mesh to get dense occupancy labels. Extensive experiments on nuScenes and SemanticKITTI datasets demonstrate the superiority of our method. Code and dataset are available at https://github.com/weiyithu/SurroundOcc
Motivation and Objectives
- Motivate and enable dense 3D scene understanding from multi-camera inputs beyond sparse object detection.
- Develop a framework that lifts 2D multi-view features into a 3D occupancy volume.
- Predict dense 3D occupancy through multi-scale 3D volume upsampling with effective supervision.
- Create a practical pipeline to generate dense occupancy ground truth without expensive annotations.
Proposed Method
- Extract multi-scale 2D features from each camera image using a backbone network.
- Apply 2D-3D spatial attention to lift multi-camera features into a 3D volume space instead of BEV.
- Use a multi-scale 3D UNet-like architecture to progressively upsample and fuse volume features.
- Supervise occupancy predictions at multiple levels with decayed loss weights to encourage detail preservation.
- Generate dense occupancy ground truth by stitching multi-frame LiDAR data (static and dynamic) and applying Poisson reconstruction, followed by voxelization and NN-based semantic labeling.
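The multi-level supervision step above can be illustrated with a small sketch. It assumes the loss weight halves at every coarser level (1, 1/2, 1/4, ...) and uses plain binary cross-entropy as a stand-in for whatever loss the authors actually apply; the function names `level_weights`, `binary_cross_entropy`, and `multi_level_loss` are hypothetical and not taken from the SurroundOcc code base.

```python
import math


def level_weights(num_levels):
    """Loss weight per supervision level, finest level first.

    Assumption: the weight halves for each coarser level.
    """
    return [1.0 / (2 ** j) for j in range(num_levels)]


def binary_cross_entropy(probs, labels, eps=1e-7):
    """Mean binary cross-entropy over flat lists of probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def multi_level_loss(preds_per_level, labels_per_level):
    """Weighted sum of per-level losses; inputs are ordered finest level first."""
    weights = level_weights(len(preds_per_level))
    return sum(
        w * binary_cross_entropy(p, y)
        for w, p, y in zip(weights, preds_per_level, labels_per_level)
    )


# Two levels with flattened occupancy probabilities and 0/1 labels.
loss = multi_level_loss([[0.9, 0.1, 0.8], [0.7, 0.3]], [[1, 0, 1], [1, 0]])
```

Giving the finest level the largest weight keeps the full-resolution prediction dominant while coarser levels still receive a training signal.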
Experimental Results
Research Questions
- RQ1: Can dense 3D occupancy be reliably predicted from multi-camera images using a 3D voxel representation?
- RQ2: Does 3D volume-based cross-view fusion outperform BEV-based fusion for multi-camera occupancy prediction?
- RQ3: What is the impact of multi-scale supervision and dense ground-truth occupancy on prediction quality?
- RQ4: Can a dense occupancy ground truth pipeline using multi-frame LiDAR and Poisson reconstruction provide effective supervision without manual annotations?
- RQ5: How does SurroundOcc perform on standard benchmarks like nuScenes and SemanticKITTI for 3D semantic occupancy and scene reconstruction?
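Questions like these are typically answered with intersection-over-union scores. The sketch below assumes a class-agnostic IoU for scene reconstruction and a mean IoU over semantic classes for semantic occupancy, with label 0 meaning "empty"; this review does not state the exact evaluation protocol, so the label convention and the function names `occupancy_iou` and `semantic_miou` are illustrative assumptions.

```python
def occupancy_iou(pred, gt, empty_label=0):
    """Class-agnostic IoU: any non-empty label counts as occupied."""
    inter = sum(1 for p, g in zip(pred, gt) if p != empty_label and g != empty_label)
    union = sum(1 for p, g in zip(pred, gt) if p != empty_label or g != empty_label)
    return inter / union if union else 0.0


def semantic_miou(pred, gt, num_classes, empty_label=0):
    """Mean IoU over non-empty classes that appear in pred or gt."""
    ious = []
    for c in range(num_classes):
        if c == empty_label:
            continue
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0


# Flattened voxel labels: 0 = empty, 1 and 2 = two semantic classes.
pred = [0, 1, 1, 2, 0, 2]
gt = [0, 1, 2, 2, 1, 0]
iou = occupancy_iou(pred, gt)
miou = semantic_miou(pred, gt, num_classes=3)
```

Here three of the five voxels marked occupied in either grid are occupied in both, so `iou` is 0.6, while each semantic class matches on one of three voxels, giving a `miou` of 1/3.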
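Key Findings
- SurroundOcc achieves state-of-the-art performance on the nuScenes benchmarks for 3D semantic occupancy prediction and 3D scene reconstruction.
- Although the method is not designed for monocular input, it also achieves strong monocular semantic scene completion results on SemanticKITTI.
- The 3D volume-based cross-view attention mechanism preserves 3D spatial information better than BEV-based fusion.
- Multi-scale occupancy prediction and dense ground-truth supervision substantially improve occupancy density and quality compared with sparse LiDAR supervision.
- Generating dense occupancy ground truth through multi-frame stitching and Poisson reconstruction outperforms using single-frame LiDAR points or sparse occupancy annotations.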
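The last two steps of the ground-truth pipeline, voxelization and nearest-neighbor semantic labeling, can be sketched as follows. This is a simplified illustration and not the authors' implementation: it skips multi-frame stitching and Poisson reconstruction, takes already-densified points as input, and uses a brute-force nearest-neighbor search. The helper names `voxelize`, `voxel_center`, and `label_voxels` are hypothetical.

```python
import math


def voxelize(points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Return the set of integer voxel indices containing at least one point."""
    occupied = set()
    for point in points:
        idx = tuple(
            int(math.floor((point[a] - origin[a]) / voxel_size)) for a in range(3)
        )
        occupied.add(idx)
    return occupied


def voxel_center(idx, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Metric coordinates of the center of voxel `idx`."""
    return tuple(origin[a] + (idx[a] + 0.5) * voxel_size for a in range(3))


def label_voxels(occupied, labeled_points, voxel_size, origin=(0.0, 0.0, 0.0)):
    """Give each occupied voxel the label of its nearest labeled point.

    `labeled_points` is a list of ((x, y, z), label) pairs.
    """
    labels = {}
    for idx in occupied:
        center = voxel_center(idx, voxel_size, origin)
        nearest = min(
            labeled_points,
            key=lambda item: sum((item[0][a] - center[a]) ** 2 for a in range(3)),
        )
        labels[idx] = nearest[1]
    return labels


dense_points = [(0.1, 0.1, 0.1), (0.4, 0.2, 0.3), (1.2, 0.1, 0.1)]
sparse_labeled = [((0.2, 0.2, 0.2), "road"), ((1.3, 0.2, 0.2), "car")]
occupied = voxelize(dense_points, voxel_size=0.5)
labels = label_voxels(occupied, sparse_labeled, voxel_size=0.5)
```

In this toy input the first two points fall into the same voxel, so the grid has two occupied cells, each labeled from the closest sparse annotated point. A spatial index such as a k-d tree would replace the brute-force search at realistic scene sizes.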
This review was generated by AI and checked by a human editor.