QUICK REVIEW

[Paper Review] VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection

Zhe Wang, Siqi Fan|arXiv (Cornell University)|Mar 20, 2023

Advanced Neural Network Applications|Cited by 8

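One-Sentence Summary

VIMI introduces an intermediate-fusion framework for VIC3D that fuses multi-view vehicle and infrastructure camera features through multi-scale cross attention and camera-aware channel masking, adds a feature compression module to lower transmission cost, and achieves state-of-the-art results on the DAIR-V2X-C benchmark.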

ABSTRACT

In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Two major challenges prevail in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss when projecting 2D features into 3D space. To address these issues, we propose a novel 3D object detection framework, Vehicles-Infrastructure Multi-view Intermediate fusion (VIMI). First, to fully exploit the holistic perspectives from both vehicles and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that fuses infrastructure and vehicle features on selective multi-scales to correct the calibration noise introduced by camera asynchrony. Then, we design a Camera-aware Channel Masking (CCM) module that uses camera parameters as priors to augment the fused features. We further introduce a Feature Compression (FC) module with channel and spatial compression blocks to reduce the size of transmitted features for enhanced efficiency. Experiments show that VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late fusion methods with comparable transmission cost.

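Motivation and Objectives

Motivate and address the challenges that time asynchrony and calibration noise pose for VIC3D when fusing multi-view features.
Propose a single end-to-end intermediate-fusion framework (VIMI) that fuses vehicle and infrastructure camera features at the feature level.
Improve transmission efficiency by compressing infrastructure features before they are transmitted for fusion.
Improve fusion robustness and accuracy through camera-aware feature reweighting and multi-scale cross attention.
Demonstrate state-of-the-art performance on the DAIR-V2X-C VIC3D benchmark at comparable transmission cost.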

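Proposed Method

Feature Compression (FC): compresses the infrastructure features and transmits them from the infrastructure side to the vehicle side.
Multi-scale Cross Attention (MCA): fuses vehicle and infrastructure features across multiple scales and mitigates calibration noise.
Camera-aware Channel Masking (CCM): reweights the fused features using camera intrinsic and extrinsic parameters as priors.
Point-sampling voxel fusion: projects the augmented features into a unified voxel volume and aggregates them into BEV for 3D detection.
3D detection head: operates on the BEV features and is trained with standard detection losses (bbox, cls, dir).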
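
To make the fusion steps above more concrete, here is a minimal single-scale sketch in plain Python. It is not the paper's implementation. It assumes that vehicle feature tokens act as queries and infrastructure tokens act as keys and values, that the attended result is added back residually, and that the channel mask is a sigmoid gate over a linear map of a camera-parameter vector. The learned projections, the multi-scale selection, and the actual CCM network are omitted, and every name and number below (`fuse`, `gate_weights`, `camera_params`) is hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross attention: each query attends over all keys."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def channel_mask(camera_params, gate_weights):
    """Per-channel gate in (0, 1): sigmoid of a linear map of camera parameters."""
    return [1.0 / (1.0 + math.exp(-sum(w * p for w, p in zip(row, camera_params))))
            for row in gate_weights]


def fuse(vehicle_tokens, infra_tokens, camera_params, gate_weights):
    """Attend from vehicle to infrastructure tokens, add residually, then gate channels."""
    attended = cross_attention(vehicle_tokens, infra_tokens, infra_tokens)
    fused = [[v + a for v, a in zip(vt, at)]
             for vt, at in zip(vehicle_tokens, attended)]
    mask = channel_mask(camera_params, gate_weights)
    return [[m * f for m, f in zip(mask, tok)] for tok in fused]


# Hypothetical toy inputs: 2 vehicle tokens and 3 infrastructure tokens, 2 channels each.
vehicle_tokens = [[1.0, 0.0], [0.0, 1.0]]
infra_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
camera_params = [0.5, -0.2, 1.0]                      # stand-in for flattened camera parameters
gate_weights = [[0.1, 0.2, 0.3], [0.0, 0.0, 0.0]]     # one row of weights per channel

fused = fuse(vehicle_tokens, infra_tokens, camera_params, gate_weights)
print(fused)
```

In this sketch the attention weights let each vehicle token draw on the infrastructure tokens most similar to it, which loosely mirrors how a cross-attention fusion can tolerate small misalignment between the two views. The gate then scales each channel by a value derived from the camera parameters.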

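Experimental Results

Research Questions

RQ1: How does intermediate fusion of vehicle and infrastructure camera features improve robustness to calibration noise and time asynchrony?
RQ2: Can multi-scale cross attention effectively select informative multi-view features for fusion?
RQ3: Can channel masking effectively incorporate camera priors to improve feature fusion?
RQ4: How much does feature compression improve transmission efficiency for VIC3D without sacrificing detection performance?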

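Key Findings

VIMI achieves state-of-the-art results on the DAIR-V2X-C VIC3D benchmark (15.61% overall AP_3D and 21.44% AP_BEV), outperforming early-fusion and late-fusion methods at comparable transmission cost.
Ablation studies show that MCA and CCM each improve the 3D and BEV detection metrics, with MCA contributing the larger gain through its scale-aware selection of infrastructure features.
FC reduces the transmission payload while also refining the features, which contributes to the overall performance gain.
Voxel-level feature fusion (IF-Voxel) outperforms BEV-level fusion, indicating that fusing in 3D space loses less information.
VIMI is robust to transmission noise: as translation noise increases, it still maintains higher AP_3D/AP_BEV than late fusion (LF).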
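
As a rough illustration of why channel and spatial compression lowers the transmission payload, the arithmetic can be sketched as follows. The feature-map shape, the compression ratios, and the float32 storage are hypothetical values chosen for the example and are not taken from the paper. The helper names `feature_bytes` and `compressed_shape` are likewise illustrative.

```python
def feature_bytes(channels, height, width, bytes_per_value=4):
    """Size in bytes of a dense C x H x W feature map (float32 by default)."""
    return channels * height * width * bytes_per_value


def compressed_shape(channels, height, width, channel_ratio, spatial_stride):
    """Shape after dividing channels by channel_ratio and each spatial side by spatial_stride."""
    return (channels // channel_ratio,
            height // spatial_stride,
            width // spatial_stride)


# Hypothetical numbers for illustration only; they are not taken from the paper.
raw_shape = (64, 96, 160)
small_shape = compressed_shape(*raw_shape, channel_ratio=4, spatial_stride=2)

raw = feature_bytes(*raw_shape)
small = feature_bytes(*small_shape)
print(f"raw: {raw} bytes, compressed: {small} bytes, reduction: {raw / small:.0f}x")
```

Under these assumptions the payload shrinks by the channel ratio times the square of the spatial stride, since the stride applies to both height and width.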

This review was generated by AI and reviewed by a human editor.