QUICK REVIEW

[Paper Review] AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection

Zehui Chen, Zhenyu Li|arXiv (Cornell University)|Jul 21, 2022

Advanced Neural Network Applications | Citations: 20

One-line summary

AutoAlignV2 introduces Cross-Domain DeformCAFA to fuse 2D image features with LiDAR features efficiently, and adds depth-aware data augmentation and image-level dropout training to enable dynamic multi-modal 3D detection.

ABSTRACT

Point clouds and RGB images are two general perceptional sources in autonomous driving. The former can provide accurate localization of objects, and the latter is denser and richer in semantic information. Recently, AutoAlign presents a learnable paradigm in combining these two modalities for 3D object detection. However, it suffers from high computational cost introduced by the global-wise attention. To solve the problem, we propose Cross-Domain DeformCAFA module in this work. It attends to sparse learnable sampling points for cross-modal relational modeling, which enhances the tolerance to calibration error and greatly speeds up the feature aggregation across different modalities. To overcome the complex GT-AUG under multi-modal settings, we design a simple yet effective cross-modal augmentation strategy on convex combination of image patches given their depth information. Moreover, by carrying out a novel image-level dropout training scheme, our model is able to infer in a dynamic manner. To this end, we propose AutoAlignV2, a faster and stronger multi-modal 3D detection framework, built on top of AutoAlign. Extensive experiments on nuScenes benchmark demonstrate the effectiveness and efficiency of AutoAlignV2. Notably, our best model reaches 72.4 NDS on nuScenes test leaderboard, achieving new state-of-the-art results among all published multi-modal 3D object detectors. Code will be available at https://github.com/zehuichen123/AutoAlignV2.

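Motivation and objectives

Improve the fusion of RGB images and LiDAR point clouds for 3D object detection.
Address the inefficiency of existing cross-modal fusion, in particular the cost of global attention.
Propose a deformable, sparse-sampling cross-domain fusion that reduces computation.
Simplify multi-modal data augmentation so that the modalities stay synchronized without heavy mask annotation.
Enable dynamic inference with or without images, to match real-world systems.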

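Proposed method

Proposes Cross-Domain DeformCAFA, which uses learnable sampling offsets to attend to a sparse set of image points for cross-modal fusion.
Computes reference points from voxel centers via camera-LiDAR projection, then applies deformable cross-attention with K sampling locations across M heads.
Introduces cross-domain token generation, which factorizes features into domain-specific and instance-specific components to improve cross-modal interaction.
Depth-Aware GT-AUG mixes image patches according to their depth order, keeping the modalities synchronized without complex masking or filtering.
Image-level dropout training allows inference with or without image input and improves training speed and robustness.
Evaluates on nuScenes with CenterPoint and Object DGCNN baselines and achieves state-of-the-art results on the test leaderboard.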
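The reference-point and sparse-sampling steps described above can be sketched in plain Python. This is a minimal, dependency-free illustration and not the authors' implementation: the function names, the 3x4 projection matrix layout, and the channel-slice-per-head convention are assumptions. In the actual method the offsets and attention logits would be predicted from the voxel query feature by learned layers; here they are passed in directly.

```python
import math

def project_to_image(center, proj):
    """Project a 3D voxel center (x, y, z) to pixel (u, v) with a 3x4 matrix."""
    x, y, z = center
    hom = [x, y, z, 1.0]
    u, v, w = (sum(p * h for p, h in zip(row, hom)) for row in proj)
    return u / w, v / w

def bilinear_sample(feat, u, v):
    """Bilinearly sample an H x W x C feature map at (u, v); outside is zero."""
    h, w, c = len(feat), len(feat[0]), len(feat[0][0])
    u0, v0 = math.floor(u), math.floor(v)
    out = [0.0] * c
    for dv in (0, 1):
        for du in (0, 1):
            uu, vv = u0 + du, v0 + dv
            if 0 <= uu < w and 0 <= vv < h:
                wgt = (1 - abs(u - uu)) * (1 - abs(v - vv))
                for i in range(c):
                    out[i] += wgt * feat[vv][uu][i]
    return out

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def deform_cross_attention(ref, offsets, logits, feat):
    """Aggregate image features at K sampled points for each of M heads.

    ref: (u, v) reference point. offsets: [M][K] of (du, dv). logits: [M][K].
    Assumption: head m owns the m-th contiguous slice of the C channels.
    """
    num_heads = len(offsets)
    per_head = len(feat[0][0]) // num_heads
    out = []
    for m in range(num_heads):
        weights = softmax(logits[m])  # attention over the K sampling points
        head = [0.0] * per_head
        for (du, dv), a in zip(offsets[m], weights):
            sampled = bilinear_sample(feat, ref[0] + du, ref[1] + dv)
            chunk = sampled[m * per_head:(m + 1) * per_head]
            for i in range(per_head):
                head[i] += a * chunk[i]
        out.extend(head)  # concatenate head outputs
    return out
```

Each voxel touches only M x K image locations instead of every pixel, which is where the cost saving over global attention comes from.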

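Experimental results

Research questions

RQ1: Can the deformable cross-domain attention mechanism reduce computational cost while maintaining or improving the quality of fusion between image and LiDAR features?
RQ2: Does Depth-Aware GT-AUG improve cross-modal synchronization and augmentation effectiveness without excessive annotation or filtering?
RQ3: Does the image-level dropout training strategy enable dynamic inference when the availability of image data varies?
RQ4: How does AutoAlignV2 compare with the existing state of the art across different 3D detectors on the nuScenes benchmark?
RQ5: How much does each component (DeformCAFA, Depth-Aware GT-AUG, image-level dropout) contribute to overall performance?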
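To make RQ2 concrete, the depth-ordered patch mixing behind Depth-Aware GT-AUG can be sketched as follows. This is a simplified illustration on a single-channel image, not the authors' code: the patch dictionary layout and the blending ratio `alpha` are hypothetical. Patches are blended farthest first, so nearer objects are applied last and dominate where they overlap.

```python
def depth_aware_paste(image, patches, alpha=0.7):
    """Blend patches into a copy of `image`, farthest first.

    image: H x W list of floats.
    patches: dicts with 'depth', 'top', 'left', 'pixels' (hypothetical layout).
    alpha: hypothetical blending ratio of the convex combination.
    """
    out = [row[:] for row in image]  # leave the input image untouched
    # Sort far to near so that closer objects are blended on top.
    for patch in sorted(patches, key=lambda p: p["depth"], reverse=True):
        for dy, patch_row in enumerate(patch["pixels"]):
            for dx, value in enumerate(patch_row):
                y, x = patch["top"] + dy, patch["left"] + dx
                if 0 <= y < len(out) and 0 <= x < len(out[0]):
                    # Convex combination of pasted pixel and current pixel.
                    out[y][x] = alpha * value + (1 - alpha) * out[y][x]
    return out
```

Because the mix is a convex combination rather than a hard overwrite, no per-object mask is needed to decide which pixels belong to the pasted object.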

Key findings

Method	mAP	NDS
Object DGCNN (baseline)	60.73	67.14
Object DGCNN + AutoAlignV2	64.42	69.52
CenterPoint (baseline)	62.56	68.84
CenterPoint + AutoAlignV2	67.05	71.23
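As a quick check on the figures reported for the two detectors, the absolute gains can be computed directly. The dictionary layout and key names are illustrative; the numbers are those given in this review, with the first row of each detector taken as the baseline and the second as the AutoAlignV2 result.

```python
# (mAP, NDS) pairs as reported in this review for nuScenes validation.
results = {
    "Object DGCNN": {"baseline": (60.73, 67.14), "autoalignv2": (64.42, 69.52)},
    "CenterPoint": {"baseline": (62.56, 68.84), "autoalignv2": (67.05, 71.23)},
}

def gains(entry):
    """Return (mAP gain, NDS gain) rounded to two decimals."""
    (b_map, b_nds), (a_map, a_nds) = entry["baseline"], entry["autoalignv2"]
    return round(a_map - b_map, 2), round(a_nds - b_nds, 2)

for name, entry in results.items():
    d_map, d_nds = gains(entry)
    print(f"{name}: +{d_map:.2f} mAP, +{d_nds:.2f} NDS")
```

This gives +3.69 mAP and +2.38 NDS for Object DGCNN, and +4.49 mAP and +2.39 NDS for CenterPoint.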

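AutoAlignV2 improves both base detectors on nuScenes validation: Object DGCNN rises from 60.73 to 64.42 mAP and from 67.14 to 69.52 NDS, and CenterPoint rises from 62.56 to 67.05 mAP and from 68.84 to 71.23 NDS.
On the nuScenes test leaderboard, AutoAlignV2 with CenterPoint outperforms prior methods, reaching 72.4 NDS and 68.4 mAP, including per-category gains of 13.1 to 17.4 mAP on the construction vehicle, motorcycle, and bicycle classes.
Cross-Domain DeformCAFA outperforms the fusion strategies of PointPainting, MoCa, AutoAlign, and PointAugmenting in both mAP and NDS.
In the DeformCAFA ablation, token generation with cross-domain interaction (multiplication) gave the best results.
Depth-Aware GT-AUG provides small but consistent gains, avoids complex masking, and contributes to the overall improvement.
Image-level dropout training speeds up training and enables dynamic inference while maintaining high accuracy even when images are partially missing.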
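The image-level dropout idea can be illustrated with a small sketch. It is a simplification under stated assumptions: element-wise addition stands in for the actual DeformCAFA fusion, and `drop_prob` and the function name are hypothetical. During training the image branch is randomly replaced by zeros, so the detector learns to work both with and without image features; at inference a missing image is handled the same way.

```python
import random

def fuse_with_image_dropout(lidar_feat, image_feat, drop_prob=0.5,
                            training=True, rng=random):
    """Fuse LiDAR and image features, zero-padding the image branch when dropped.

    Returns (fused_features, dropped). Addition is a placeholder for the real
    fusion module; drop_prob is a hypothetical hyperparameter.
    """
    # The image is dropped if it is unavailable, or randomly during training.
    dropped = image_feat is None or (training and rng.random() < drop_prob)
    if dropped:
        # Zero padding: the image backbone need not run for this sample,
        # which is one way such a scheme can shorten training.
        image_feat = [0.0] * len(lidar_feat)
    fused = [l + i for l, i in zip(lidar_feat, image_feat)]
    return fused, dropped
```

Calling it with `image_feat=None` and `training=False` corresponds to LiDAR-only inference, while passing image features with `training=False` always uses them.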


This review was generated by AI and checked by a human editor.