QUICK REVIEW

[Paper Review] Multi-View Adaptive Fusion Network for 3D Object Detection

Guojun Wang, Bin Tian|arXiv (Cornell University)|Nov 2, 2020

Advanced Neural Network Applications | References: 35 | Cited by: 24

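One-Sentence Summary

The paper proposes MVAF-Net, a single-stage 3D object detection framework that fuses the LiDAR bird's-eye view (BEV), the LiDAR range view (RV), and camera images through an attentive pointwise fusion (APF) module and an attentive pointwise weighting (APW) module. The APF module uses attention to fuse multi-view features adaptively, while the APW module strengthens feature learning through foreground classification and center regression tasks. Together they give state-of-the-art performance on the KITTI dataset with a strong speed-accuracy trade-off.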

ABSTRACT

3D object detection based on LiDAR-camera fusion is becoming an emerging research theme for autonomous driving. However, it has been surprisingly difficult to effectively fuse both modalities without information loss and interference. To solve this issue, we propose a single-stage multi-view fusion framework that takes LiDAR bird's-eye view, LiDAR range view and camera view images as inputs for 3D object detection. To effectively fuse multi-view features, we propose an attentive pointwise fusion (APF) module to estimate the importance of the three sources with attention mechanisms that can achieve adaptive fusion of multi-view features in a pointwise manner. Furthermore, an attentive pointwise weighting (APW) module is designed to help the network learn structure information and point feature importance with two extra tasks, namely, foreground classification and center regression, and the predicted foreground probability is used to reweight the point features. We design an end-to-end learnable network named MVAF-Net to integrate these two components. Our evaluations conducted on the KITTI 3D object detection datasets demonstrate that the proposed APF and APW modules offer significant performance gains. Moreover, the proposed MVAF-Net achieves the best performance among all single-stage fusion methods and outperforms most two-stage fusion methods, achieving the best trade-off between speed and accuracy on the KITTI benchmark.

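Motivation and Objectives

Address the challenge of effective multimodal fusion of LiDAR and camera data in 3D object detection, in particular avoiding information loss and interference.
Design a single-stage, end-to-end learnable network that exploits the complementary strengths of the bird's-eye view (BEV), range view (RV), and camera view (CV) representations.
Fuse features adaptively by using attention to estimate the importance of each view at the point level.
Improve feature quality through a reweighting mechanism based on foreground probability and through auxiliary tasks that learn structure information.
Achieve higher accuracy and faster inference on the KITTI benchmark than existing single-stage and two-stage fusion methods.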

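Proposed Method

The framework uses a three-stream convolutional neural network backbone to extract features from the BEV, RV, and CV inputs separately; the LiDAR points are voxelized in the BEV and RV representations.
The attentive pointwise fusion (APF) module computes attention weights across the three views for each point, giving dynamic, adaptive fusion based on feature relevance.
The attentive pointwise weighting (APW) module introduces two auxiliary tasks, foreground classification and center regression, to learn structure information, and uses the predicted foreground probability to reweight the point features.
The fused and reweighted features are voxelized again and fed to the detection head, which predicts 3D objects end to end.
The network is trained end to end with multi-task supervision, combining the detection loss with the auxiliary losses of the APW component.
Feature visualizations and ablation studies confirm that attention-based fusion and reweighting suppress noise and strengthen relevant features.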
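
To make the two pointwise operations concrete, the following is a minimal sketch in plain Python (no dependencies) of attention-weighted fusion across three views followed by foreground-probability reweighting. It is an illustration under stated assumptions, not the paper's implementation: the scoring vector `w_score`, the use of a softmax across views, the function names, and the toy inputs are all hypothetical, and the center regression task is omitted. The actual APF and APW layers in MVAF-Net may be structured differently.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attentive_pointwise_fusion(bev, rv, cv, w_score):
    """Fuse per-point features from three views with per-point attention.

    bev, rv, cv: lists of N feature vectors (each of length C), one per point.
    w_score: hypothetical scoring vector of length C shared by all views
             (an assumption for this sketch, standing in for learned layers).
    Returns the fused features (N x C) and the attention weights (N x 3).
    """
    fused, weights = [], []
    for views in zip(bev, rv, cv):  # the three feature vectors of one point
        attn = softmax([dot(v, w_score) for v in views])
        fused.append([
            sum(a * v[c] for a, v in zip(attn, views))
            for c in range(len(w_score))
        ])
        weights.append(attn)
    return fused, weights


def foreground_reweight(features, fg_logits):
    """Scale each point's features by its predicted foreground probability."""
    probs = [1.0 / (1.0 + math.exp(-z)) for z in fg_logits]
    reweighted = [[p * f for f in feat] for p, feat in zip(probs, features)]
    return reweighted, probs


# Toy example: two points, two feature channels per view (made-up values).
bev = [[1.0, 0.0], [0.5, 0.5]]
rv = [[0.0, 1.0], [0.5, 0.5]]
cv = [[1.0, 1.0], [0.5, 0.5]]
w_score = [1.0, -1.0]

fused, attn = attentive_pointwise_fusion(bev, rv, cv, w_score)
reweighted, fg_prob = foreground_reweight(fused, [2.0, -2.0])
```

In this toy setup the attention weights of each point sum to one, so the fused feature is a convex combination of the three view features, and a point with a low foreground logit has its fused feature scaled toward zero.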

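Experimental Results

Research Questions

RQ1: How can multi-view features from the LiDAR BEV, the LiDAR RV, and camera images be fused adaptively so that information loss and interference are minimized?
RQ2: What is the effect of using attention at the point level to dynamically weight the contributions of different views?
RQ3: Do auxiliary tasks such as foreground classification and center regression improve feature representation and detection accuracy in 3D object detection?
RQ4: How does the proposed fusion strategy compare with existing single-stage and two-stage LiDAR-camera fusion methods in performance and efficiency?
RQ5: How much does feature reweighting based on foreground probability improve detection of distant and small objects?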

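Key Findings

With the proposed APF module, 3D mAP for 'Car' detection on the KITTI validation set reaches 89.35%, an improvement of 1.62% over the baseline without APF.
The APW module contributes significantly: with all components enabled, mAP on the 'Hard' set improves by 1.44% over the baseline.
Ablation studies show that the BEV representation is most effective at close range, while CV and RV features are used selectively at long range, which reduces noise.
Visualizations confirm that the APF module suppresses noisy features (such as vegetation) at close range and strengthens the features of distant objects in BEV and RV.
Visualizations also show clearly that the APW module suppresses background point features while preserving and strengthening foreground features.
MVAF-Net performs best among all single-stage fusion methods and outperforms most two-stage methods, setting a new state-of-the-art speed-accuracy trade-off on KITTI.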


This review was generated by AI and checked by a human editor.