QUICK REVIEW

[Paper Review] TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers

Xuyang Bai, Zeyu Hu|arXiv (Cornell University)|Mar 22, 2022

Advanced Neural Network Applications|Cited by 28

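One-Sentence Summary

TransFusion introduces a Transformer-based LiDAR-camera fusion method that uses attention-based soft association and image-guided query initialization to detect 3D objects robustly under degraded image quality and calibration errors. It achieves state-of-the-art results on nuScenes and extends to 3D tracking.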

ABSTRACT

LiDAR and camera are two important sensors for 3D object detection in autonomous driving. Despite the increasing popularity of sensor fusion in this field, the robustness against inferior image conditions, e.g., bad illumination and sensor misalignment, is under-explored. Existing fusion methods are easily affected by such conditions, mainly due to a hard association of LiDAR points and image pixels, established by calibration matrices. We propose TransFusion, a robust solution to LiDAR-camera fusion with a soft-association mechanism to handle inferior image conditions. Specifically, our TransFusion consists of convolutional backbones and a detection head based on a transformer decoder. The first layer of the decoder predicts initial bounding boxes from a LiDAR point cloud using a sparse set of object queries, and its second decoder layer adaptively fuses the object queries with useful image features, leveraging both spatial and contextual relationships. The attention mechanism of the transformer enables our model to adaptively determine where and what information should be taken from the image, leading to a robust and effective fusion strategy. We additionally design an image-guided query initialization strategy to deal with objects that are difficult to detect in point clouds. TransFusion achieves state-of-the-art performance on large-scale datasets. We provide extensive experiments to demonstrate its robustness against degenerated image quality and calibration errors. We also extend the proposed method to the 3D tracking task and achieve the 1st place in the leaderboard of nuScenes tracking, showing its effectiveness and generalization capability.

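Motivation and Objectives

Investigate the robustness challenges of LiDAR-camera fusion under degraded image conditions and sensor calibration errors.
Propose a Transformer-based fusion detector that performs soft association between LiDAR queries and image features.
Develop input-dependent, category-aware object queries to improve initial bounding box prediction.
Introduce image-guided query initialization and a locality-biased cross-attention mechanism to strengthen fusion.
Demonstrate state-of-the-art 3D detection on nuScenes and competitive results on Waymo, along with tracking capability.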

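Proposed Method

Use a two-layer Transformer decoder as the detection head; the first layer predicts initial 3D boxes from LiDAR features using a sparse set of object queries.
Achieve soft-association fusion through cross-attention between the object queries and the image features, which serve as the attention memory, with SMCA imposing spatial locality.
Introduce image-guided query initialization, which fuses LiDAR BEV features with collapsed image features to initialize the queries.
Make the object queries input-dependent and category-aware, using category embeddings to inform contextual reasoning.
Train in two stages: first predict initial boxes from LiDAR only, then add LiDAR-camera fusion and query initialization for refinement.
Optimize with a loss based on Hungarian bipartite matching, combining classification, regression, and IoU terms.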
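
The matching-based loss in the last item depends on a one-to-one assignment between predictions and ground-truth boxes. The sketch below illustrates that assignment on toy data. It is not the authors' implementation: the cost weights, the dictionary fields (`probs`, `center`, `size`, `label`), and the axis-aligned BEV IoU are assumptions made for illustration, and a brute-force search over permutations stands in for the Hungarian algorithm, which reaches the same minimum cost in polynomial time.

```python
from itertools import permutations


def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes with center (x, y) and size (w, l).

    Assumption: real 3D boxes are rotated; axis alignment keeps the sketch short.
    """
    def bounds(box):
        (cx, cy), (w, l) = box["center"], box["size"]
        return cx - w / 2, cy - l / 2, cx + w / 2, cy + l / 2

    ax0, ay0, ax1, ay1 = bounds(a)
    bx0, by0, bx1, by1 = bounds(b)
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = a["size"][0] * a["size"][1] + b["size"][0] * b["size"][1] - inter
    return inter / union if union > 0 else 0.0


def pair_cost(pred, gt, w_cls=1.0, w_reg=1.0, w_iou=1.0):
    """Weighted sum of classification, regression, and IoU costs (weights are hypothetical)."""
    cls_cost = 1.0 - pred["probs"][gt["label"]]  # low if the GT class is likely
    reg_cost = sum(abs(p - g) for p, g in zip(pred["center"], gt["center"]))  # L1 on centers
    iou_cost = 1.0 - bev_iou(pred, gt)  # low if the boxes overlap well
    return w_cls * cls_cost + w_reg * reg_cost + w_iou * iou_cost


def match(preds, gts):
    """Return {gt_index: pred_index} with minimum total cost.

    Brute force over permutations: only practical for a handful of boxes.
    """
    if len(gts) > len(preds):
        raise ValueError("need at least as many predictions as ground truths")
    best_cost, best = float("inf"), ()
    for assignment in permutations(range(len(preds)), len(gts)):
        total = sum(pair_cost(preds[p], gts[g]) for g, p in enumerate(assignment))
        if total < best_cost:
            best_cost, best = total, assignment
    return dict(enumerate(best))


# Toy data: three predictions, two ground-truth boxes.
preds = [
    {"probs": [0.1, 0.9], "center": (10.0, 5.0), "size": (2.0, 4.0)},
    {"probs": [0.8, 0.2], "center": (0.0, 0.0), "size": (2.0, 4.0)},
    {"probs": [0.5, 0.5], "center": (30.0, 30.0), "size": (1.0, 1.0)},
]
gts = [
    {"label": 0, "center": (0.2, 0.0), "size": (2.0, 4.0)},
    {"label": 1, "center": (10.0, 5.5), "size": (2.0, 4.0)},
]
matching = match(preds, gts)
print(matching)  # {0: 1, 1: 0}
```

In this toy case the third prediction stays unmatched. In a set-prediction loss of this kind, unmatched predictions are typically supervised as background, while each matched pair contributes its classification and box regression terms.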

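Experimental Results

Research Questions

RQ1: How can LiDAR-camera fusion be made more robust to poor image quality and sensor calibration errors?
RQ2: Does a Transformer-based fusion head with soft association outperform hard-association fusion methods in 3D object detection?
RQ3: What improvements do input-dependent, category-aware object queries and image-guided initialization bring to initial proposal quality?
RQ4: How does locality-biased cross-attention (SMCA) affect the effectiveness and robustness of fusion?
RQ5: Does the method generalize to 3D tracking, beyond single-frame detection?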

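Key Findings

TransFusion surpasses prior methods in 3D detection on nuScenes, reaching state-of-the-art performance.
The two-layer Transformer decoder produces LiDAR-based initial predictions and then adaptively fuses image features, improving accuracy.
Soft-association fusion through cross-attention with SMCA improves robustness to degraded image quality and calibration errors.
Image-guided query initialization helps detect objects that are hard to find in sparse LiDAR data.
The method extends to 3D tracking and takes first place on the nuScenes tracking leaderboard.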
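
To make the idea of locality-biased soft association concrete, here is a minimal single-head sketch in plain Python. It rests on several assumptions that this review does not confirm: the locality bias is modeled as a 2D Gaussian mask centered on the projected box center, the mask is multiplied into the softmax attention weights, and the weights are then renormalized. The function names and the `radius` and `sigma` parameters are hypothetical.

```python
import math


def gaussian_mask(height, width, center, radius, sigma=1.0):
    """2D Gaussian weights peaking at `center` = (cx, cy), in pixel units.

    Assumption: `radius` (> 0) roughly covers the projected object, and
    `sigma` scales how quickly the weights decay with distance.
    """
    cx, cy = center
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (sigma * radius ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def masked_cross_attention(query, keys, values, mask):
    """One object query attends over a flattened H x W image feature map.

    `keys` and `values` hold one vector per pixel in row-major order.
    The mask biases attention toward pixels near the projected center
    without forcing a single hard pixel-to-point correspondence.
    """
    d = len(query)
    flat_mask = [m for row in mask for m in row]
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) * m for l, m in zip(logits, flat_mask)]
    total = sum(weights)
    weights = [w / total for w in weights]  # renormalize (an assumption)
    dim = len(values[0])
    fused = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return fused, weights


# Toy 4 x 4 feature map with identical keys, so only the mask shapes the weights.
H, W = 4, 4
query = [1.0, 0.5]
keys = [[0.0, 0.0] for _ in range(H * W)]
values = [[float(i)] for i in range(H * W)]
mask = gaussian_mask(H, W, center=(1, 2), radius=1.5)
fused, weights = masked_cross_attention(query, keys, values, mask)
peak = weights.index(max(weights))
print(peak)  # 9, the pixel at row 2, column 1
```

Because every pixel keeps a nonzero weight, a small shift of the projected center caused by a calibration error moves the peak of the mask but still leaves the correct image region within reach of the query. This is one intuition for why soft association could tolerate misalignment better than a hard point-to-pixel lookup.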

This review was generated by AI and reviewed by a human editor.