QUICK REVIEW

[Paper Review] DeepFusion: Lidar-Camera Deep Fusion for Multi-Modal 3D Object Detection

Yingwei Li, Adams Wei Yu|arXiv (Cornell University)|Mar 15, 2022

Advanced Neural Network Applications | Cited by 24

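One-Sentence Summary

This paper proposes DeepFusion, a deep feature fusion approach that integrates lidar and camera data using InverseAug and LearnableAlign, achieving state-of-the-art multi-modal 3D object detection on the Waymo Open Dataset along with improved long-range detection.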

ABSTRACT

Lidars and cameras are critical sensors that provide complementary information for 3D detection in autonomous driving. While prevalent multi-modal methods simply decorate raw lidar point clouds with camera features and feed them directly to existing 3D detection models, our study shows that fusing camera features with deep lidar features instead of raw points, can lead to better performance. However, as those features are often augmented and aggregated, a key challenge in fusion is how to effectively align the transformed features from two modalities. In this paper, we propose two novel techniques: InverseAug that inverses geometric-related augmentations, e.g., rotation, to enable accurate geometric alignment between lidar points and image pixels, and LearnableAlign that leverages cross-attention to dynamically capture the correlations between image and lidar features during fusion. Based on InverseAug and LearnableAlign, we develop a family of generic multi-modal 3D detection models named DeepFusion, which is more accurate than previous methods. For example, DeepFusion improves PointPillars, CenterPoint, and 3D-MAN baselines on Pedestrian detection for 6.7, 8.9, and 6.2 LEVEL_2 APH, respectively. Notably, our models achieve state-of-the-art performance on Waymo Open Dataset, and show strong model robustness against input corruptions and out-of-distribution data. Code will be publicly available at https://github.com/tensorflow/lingvo/tree/master/lingvo/.

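Motivation and Objectives

Advance effective multi-modal 3D object detection by fusing deep lidar and camera features rather than decorating raw points.
Propose alignment-aware fusion techniques that cope with geometry-related data augmentation.
Develop a generic DeepFusion framework that plugs into existing voxel-based 3D detectors and improves their performance.
Demonstrate robustness against input corruptions and out-of-distribution data.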

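Proposed Method

Integrate lidar features from a pillar/voxel-based backbone with camera features from a 2D image backbone in a deep feature fusion pipeline.
Introduce InverseAug, which reverses geometry-related data augmentations to enable accurate cross-modal alignment.
Introduce LearnableAlign, a cross-attention module that dynamically weights camera features so that they align with voxel-level lidar features.
Position DeepFusion as an end-to-end trainable plug-in for existing voxel-based detectors such as PointPillars, CenterPoint, and 3D-MAN.
Show that camera features fused at the deep-feature level provide high-resolution contextual information without misalignment when trained end to end.
Evaluate on the Waymo Open Dataset and compare against single-modal baselines and prior multi-modal methods.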
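
The two alignment steps above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes only two augmentations (a rotation about the z axis and a flip of the y axis), a pinhole camera whose frame coincides with the lidar frame, and a single-head cross-attention whose projection matrices are random rather than learned. All names (`augment`, `inverse_aug`, `project_to_image`, `learnable_align`, `K`, `Wq`, `Wk`, `Wv`) are hypothetical and chosen for illustration.

```python
import numpy as np

def rotation_z(theta):
    """3x3 rotation matrix about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])

def augment(points, theta, flip_y):
    """Forward augmentation (assumed): rotate about z, then optionally flip y."""
    out = points @ rotation_z(theta).T
    if flip_y:
        out = out * np.array([1.0, -1.0, 1.0])
    return out

def inverse_aug(points_aug, theta, flip_y):
    """Undo augment() by applying the inverse steps in reverse order."""
    out = points_aug.copy()
    if flip_y:
        out = out * np.array([1.0, -1.0, 1.0])  # a flip is its own inverse
    return out @ rotation_z(-theta).T

def project_to_image(points, K):
    """Pinhole projection; assumes points are in the camera frame with z > 0."""
    uvw = points @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def learnable_align(lidar_feat, cam_feats, Wq, Wk, Wv):
    """Single-head cross-attention: one lidar query over N camera features."""
    q = lidar_feat @ Wq                              # (d,)
    k = cam_feats @ Wk                               # (N, d)
    v = cam_feats @ Wv                               # (N, d)
    weights = softmax(k @ q / np.sqrt(q.shape[0]))   # (N,), sums to 1
    attended = weights @ v                           # (d,)
    # Concatenate the aggregated camera feature with the lidar feature.
    return np.concatenate([lidar_feat, attended]), weights

rng = np.random.default_rng(0)
points = rng.uniform(1.0, 10.0, size=(5, 3))         # toy lidar points, z > 0
points_aug = augment(points, theta=0.3, flip_y=True)
restored = inverse_aug(points_aug, theta=0.3, flip_y=True)

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
pixels = project_to_image(restored, K)               # pixel coordinates per point

lidar_feat = rng.normal(size=8)                      # one voxel-level lidar feature
cam_feats = rng.normal(size=(5, 6))                  # camera features at the 5 pixels
Wq, Wk, Wv = rng.normal(size=(8, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
fused, weights = learnable_align(lidar_feat, cam_feats, Wq, Wk, Wv)
print(pixels.shape, fused.shape)
```

Projecting `points_aug` directly would look up the wrong pixels, because the image is not rotated or flipped along with the point cloud. Restoring the original coordinates first is what keeps the lidar-to-pixel correspondence valid before the attention step weights the gathered camera features.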

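Experimental Results

Research Questions

RQ1: Does aligning deep camera and lidar features improve multi-modal 3D detection compared with input-level fusion?
RQ2: Can InverseAug and LearnableAlign provide robust and accurate cross-modal alignment under diverse data augmentations and real-world conditions?
RQ3: Is DeepFusion a generic plug-in fusion method that improves a range of voxel-based 3D detectors?
RQ4: How does multi-modal DeepFusion perform on long-range object detection and under distribution shift and input corruption?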

Key Findings

Method	AP/L1	APH/L1	AP/L2	APH/L2
DeepFusion-Ens (ours) ∗	84.37	83.22	79.54	78.41
InceptioLidar	83.80	82.46	79.15	77.84
AFDetV2-Ens [12]	84.07	82.63	79.04	77.64
Octopus_Noah	83.10	81.67	78.65	77.27
HorizonLiDAR3D [6] ∗	83.28	81.85	78.49	77.11
DeepFusion (ours) ∗	81.89	80.48	76.91	75.54
Cascade3D	81.17	79.63	75.84	74.36
INT	80.29	78.81	75.30	73.89
IUI	80.00	78.60	74.94	73.60
XMU	80.53	78.77	75.14	73.45
Octopus-det	79.25	77.75	74.63	73.20
AFDetV2 [12] ∗	79.77	78.21	74.60	73.12
LENOVO_LR_PCIE_Det	79.46	78.07	74.31	72.97
LENOVO_LR_PCIE_RT_Det	79.42	77.98	74.31	72.97
CenterPoint++ [44] ∗	79.41	77.96	74.22	72.82
SST_v1 [8] ∗	79.99	78.31	74.41	72.81
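
As a small worked example, the APH/L2 column above can be compared directly. The sketch below copies four values from the table and computes two differences; the variable names are illustrative and nothing beyond the tabulated numbers is assumed.

```python
# APH/L2 values copied from the table above.
aph_l2 = {
    "DeepFusion-Ens": 78.41,
    "InceptioLidar": 77.84,
    "AFDetV2-Ens": 77.64,
    "DeepFusion": 75.54,
}

# Rank the selected entries by APH/L2, best first.
ranked = sorted(aph_l2, key=aph_l2.get, reverse=True)

# Margin of the top entry over the runner-up, and the gap between the
# ensemble and the single DeepFusion model.
margin_over_runner_up = round(aph_l2[ranked[0]] - aph_l2[ranked[1]], 2)
ensemble_gain = round(aph_l2["DeepFusion-Ens"] - aph_l2["DeepFusion"], 2)

print(ranked[0], margin_over_runner_up, ensemble_gain)
```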

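DeepFusion improves multiple baseline detectors on the Waymo Open Dataset, with especially notable gains on long-range objects.
On Waymo validation, DeepFusion-Ens achieves leading performance, and DeepFusion outperforms prior multi-modal methods in LEVEL_2 APH.
Ablations show that both InverseAug and LearnableAlign contribute to performance, with InverseAug providing the larger gain.
DeepFusion consistently improves lidar-only models across the PointPillars, CenterPoint, and 3D-MAN baselines (e.g., gains of 6.7, 8.9, and 6.2 LEVEL_2 APH on Pedestrian detection, respectively).
DeepFusion achieves state-of-the-art Waymo results and is robust to input corruptions and OOD data, degrading less than single-modal models under perturbation.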


This review was generated by AI and checked by a human editor.