QUICK REVIEW

[Paper Review] AutoAlignV2: Deformable Feature Aggregation for Dynamic Multi-Modal 3D Object Detection

Zehui Chen, Zhenyu Li|arXiv (Cornell University)|Jul 21, 2022

Advanced Neural Network Applications|Cited by 20

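One-Sentence Summary

AutoAlignV2 introduces Cross-Domain DeformCAFA to fuse 2D image features with LiDAR features efficiently, giving a dynamic, multi-level, multi-modal 3D detector that is trained with depth-aware data augmentation and image-level dropout.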

ABSTRACT

Point clouds and RGB images are two general perceptional sources in autonomous driving. The former can provide accurate localization of objects, and the latter is denser and richer in semantic information. Recently, AutoAlign presents a learnable paradigm in combining these two modalities for 3D object detection. However, it suffers from high computational cost introduced by the global-wise attention. To solve the problem, we propose Cross-Domain DeformCAFA module in this work. It attends to sparse learnable sampling points for cross-modal relational modeling, which enhances the tolerance to calibration error and greatly speeds up the feature aggregation across different modalities. To overcome the complex GT-AUG under multi-modal settings, we design a simple yet effective cross-modal augmentation strategy on convex combination of image patches given their depth information. Moreover, by carrying out a novel image-level dropout training scheme, our model is able to infer in a dynamic manner. To this end, we propose AutoAlignV2, a faster and stronger multi-modal 3D detection framework, built on top of AutoAlign. Extensive experiments on nuScenes benchmark demonstrate the effectiveness and efficiency of AutoAlignV2. Notably, our best model reaches 72.4 NDS on nuScenes test leaderboard, achieving new state-of-the-art results among all published multi-modal 3D object detectors. Code will be available at https://github.com/zehuichen123/AutoAlignV2.

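Motivation and Goals

Improve the fusion of RGB images and LiDAR for 3D object detection.
Address the inefficiency of earlier cross-modal fusion, in particular its costly global attention.
Propose cross-domain fusion based on deformable, sparse sampling to reduce computation.
Simplify multi-modal data augmentation so that the two modalities stay synchronized without extensive masking.
Enable dynamic inference with or without image input, to suit real-world systems.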

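Proposed Method

Proposes Cross-Domain DeformCAFA, which uses learnable sampling offsets to attend to a small number of image points for cross-modal fusion.
Computes reference points from voxel centers via camera-LiDAR projection, then applies deformable cross-attention with K sampling locations on each of M heads.
Introduces cross-domain token generation, which decomposes features into domain-specific and instance-specific components for better cross-modal interaction.
Depth-aware GT-AUG blends image patches in depth order, keeping the modalities synchronized without complex masks or filtering.
Image-level dropout training allows dynamic inference with or without image input, improving training speed and robustness.
Evaluated on nuScenes with CenterPoint and Object DGCNN baselines, reaching state-of-the-art results on the test leaderboard.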
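
The projection and sparse-sampling steps listed above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the 4x4 `lidar2img` matrix, and the random weight matrices `w_off` and `w_attn` are all assumptions made for illustration, and the value/output projections and the cross-domain token generation are left out. It only shows the general pattern of deformable cross-attention: project each voxel center to a reference pixel, predict K offsets and K attention weights per head from the query, sample the image feature map bilinearly at those locations, and take the weighted sum.

```python
import numpy as np


def project_to_image(points_xyz, lidar2img):
    """Project (N, 3) points to pixels with a 4x4 matrix (assumed to be given)."""
    ones = np.ones((points_xyz.shape[0], 1))
    cam = np.concatenate([points_xyz, ones], axis=1) @ lidar2img.T  # (N, 4)
    depth = cam[:, 2]
    # Perspective divide; points behind the camera are not filtered in this sketch.
    uv = cam[:, :2] / np.clip(depth[:, None], 1e-6, None)
    return uv, depth


def bilinear_sample(feat, uv):
    """Sample feat (H, W, C) at float pixel coords uv (..., 2), ordered (x, y)."""
    h, w, _ = feat.shape
    x = np.clip(uv[..., 0], 0, w - 1)
    y = np.clip(uv[..., 1], 0, h - 1)
    x0 = np.floor(x).astype(int)
    y0 = np.floor(y).astype(int)
    x1 = np.minimum(x0 + 1, w - 1)
    y1 = np.minimum(y0 + 1, h - 1)
    wx = (x - x0)[..., None]
    wy = (y - y0)[..., None]
    top = feat[y0, x0] * (1 - wx) + feat[y0, x1] * wx
    bottom = feat[y1, x0] * (1 - wx) + feat[y1, x1] * wx
    return top * (1 - wy) + bottom * wy


def deform_cross_attention(query, ref_uv, img_feat, w_off, w_attn,
                           num_heads, num_points):
    """query: (N, C) voxel features, ref_uv: (N, 2), img_feat: (H, W, C)."""
    n, c = query.shape
    head_dim = c // num_heads
    # Offsets and attention logits are linear functions of the query (illustrative).
    offsets = (query @ w_off).reshape(n, num_heads, num_points, 2)
    logits = (query @ w_attn).reshape(n, num_heads, num_points)
    logits = logits - logits.max(axis=-1, keepdims=True)
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)  # softmax over the K points
    loc = ref_uv[:, None, None, :] + offsets  # (N, M, K, 2) sampling locations
    out = np.zeros((n, c))
    for m in range(num_heads):
        sl = slice(m * head_dim, (m + 1) * head_dim)  # channel slice of head m
        sampled = bilinear_sample(img_feat[..., sl], loc[:, m])  # (N, K, head_dim)
        out[:, sl] = (attn[:, m, :, None] * sampled).sum(axis=1)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    N, C, M, K = 5, 8, 2, 4
    lidar2img = np.array([[100.0, 0.0, 16.0, 0.0],
                          [0.0, 100.0, 12.0, 0.0],
                          [0.0, 0.0, 1.0, 0.0],
                          [0.0, 0.0, 0.0, 1.0]])
    centers = np.column_stack([rng.uniform(-1, 1, N), rng.uniform(-1, 1, N),
                               rng.uniform(5, 10, N)])
    ref_uv, _ = project_to_image(centers, lidar2img)
    fused = deform_cross_attention(rng.normal(size=(N, C)), ref_uv,
                                   rng.normal(size=(24, 32, C)),
                                   rng.normal(size=(C, M * K * 2)),
                                   rng.normal(size=(C, M * K)), M, K)
    print(fused.shape)
```

Each query touches only M x K image locations instead of every pixel, which is where the saving over global attention comes from.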

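Experimental Results

Research Questions

RQ1: Can deformable cross-domain attention maintain or improve the quality of image-LiDAR feature fusion while reducing computational cost?
RQ2: Can depth-aware GT-AUG improve cross-modal synchronization and augmentation without extensive annotation or filtering?
RQ3: Can image-level dropout training enable dynamic inference when image availability varies?
RQ4: How does AutoAlignV2 perform against existing state-of-the-art methods across different 3D detectors on the nuScenes benchmark?
RQ5: What does each component (DeformCAFA, Depth-Aware GT-AUG, image-level dropout) contribute to overall performance?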

Key Findings

Method	mAP	NDS
Object DGCNN (baseline)	60.73	67.14
Object DGCNN + AutoAlignV2	64.42	69.52
CenterPoint (baseline)	62.56	68.84
CenterPoint + AutoAlignV2	67.05	71.23
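
As a quick check on the table, the absolute gains can be recomputed from the four rows. The snippet below is only a worked calculation; it reads the second row of each detector pair as the AutoAlignV2-equipped variant, which is how the first finding below describes them, and the dictionary layout and the `gains` helper are illustrative choices.

```python
# (mAP, NDS) pairs copied from the table above.
results = {
    "Object DGCNN": {"baseline": (60.73, 67.14), "autoalignv2": (64.42, 69.52)},
    "CenterPoint": {"baseline": (62.56, 68.84), "autoalignv2": (67.05, 71.23)},
}


def gains(table):
    """Return {detector: (mAP gain, NDS gain)} rounded to two decimals."""
    out = {}
    for name, rows in table.items():
        (map0, nds0), (map1, nds1) = rows["baseline"], rows["autoalignv2"]
        out[name] = (round(map1 - map0, 2), round(nds1 - nds0, 2))
    return out


if __name__ == "__main__":
    for name, (d_map, d_nds) in gains(results).items():
        print(f"{name}: +{d_map:.2f} mAP, +{d_nds:.2f} NDS")
```

This gives +3.69 mAP and +2.38 NDS for Object DGCNN, and +4.49 mAP and +2.39 NDS for CenterPoint.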

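AutoAlignV2 improves both base detectors on the nuScenes validation set: Object DGCNN rises from 60.73 to 64.42 mAP and from 67.14 to 69.52 NDS; CenterPoint rises from 62.56 to 67.05 mAP and from 68.84 to 71.23 NDS.
On the nuScenes test leaderboard, AutoAlignV2 with CenterPoint surpasses prior methods, reaching 72.4 NDS and 68.4 mAP, with per-class gains of 13.1 to 17.4 mAP on categories such as construction vehicle, motorcycle, and bicycle.
Cross-Domain DeformCAFA outperforms fusion strategies such as PointPainting, MoCa, AutoAlign, and PointAugmenting in both mAP and NDS.
In the DeformCAFA ablation, token generation that uses cross-domain interaction with multiplication gives the best results.
Depth-aware GT-AUG yields a small but consistent gain while avoiding complex masks, and contributes to the overall improvement.
Image-level dropout training speeds up training and enables dynamic inference without sacrificing accuracy, remaining robust when some images are missing.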
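
The two training-time ideas in the last two findings can be sketched in a few lines of NumPy. Both functions are illustrative assumptions rather than the paper's code: `depth_aware_paste` blends augmentation patches into the image from far to near as a convex combination with a hypothetical ratio `alpha`, and `image_level_dropout` zeroes the image features of whole samples with probability `drop_prob`, so that a model can also be exercised without image input. The review does not state the actual blending ratio, the drop probability, or at which stage the dropout is applied; pasting the matching LiDAR points is not shown.

```python
import numpy as np


def depth_aware_paste(image, patches, alpha=0.7):
    """Blend patches into image in far-to-near order.

    patches: list of (depth, top, left, patch) with patch of shape (h, w, 3);
    each patch is assumed to lie fully inside the image.
    """
    out = image.astype(np.float64).copy()
    # Farther patches go first, so nearer ones are blended on top of them.
    for depth, top, left, patch in sorted(patches, key=lambda p: p[0], reverse=True):
        h, w = patch.shape[:2]
        region = out[top:top + h, left:left + w]
        # Convex combination of the pasted patch and the current image content.
        out[top:top + h, left:left + w] = alpha * patch + (1 - alpha) * region
    return out


def image_level_dropout(img_feats, drop_prob, rng):
    """Zero the image features of whole samples; img_feats has shape (B, ...)."""
    keep = rng.random(img_feats.shape[0]) >= drop_prob  # one decision per sample
    mask = keep.reshape((-1,) + (1,) * (img_feats.ndim - 1))
    return img_feats * mask, keep


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    image = np.zeros((8, 8, 3))
    patches = [(12.0, 2, 2, np.full((3, 3, 3), 200.0)),   # near object
               (35.0, 0, 0, np.full((4, 4, 3), 100.0))]   # far object
    mixed = depth_aware_paste(image, patches, alpha=0.7)
    feats, keep = image_level_dropout(np.ones((4, 16, 8, 8)), 0.5, rng)
    print(mixed.shape, feats.shape, keep)
```

Because occlusion is approximated by paste order and blending, no per-object mask is needed, which matches the simplification the review attributes to depth-aware GT-AUG.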


This review was generated by AI and checked by a human editor.