QUICK REVIEW

[Paper Review] VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection

Zhe Wang, Siqi Fan | arXiv (Cornell University) | Mar 20, 2023

Advanced Neural Network Applications | Cited by 8

One-line summary

tldr: VIMI introduces an intermediate-fusion framework for VIC3D that fuses multi-view vehicle and infrastructure camera features using Multi-scale Cross Attention and Camera-aware Channel Masking, with a Feature Compression module to reduce transmission costs, achieving state-of-the-art results on the DAIR-V2X-C benchmark.

ABSTRACT

In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Two major challenges prevail in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss when projecting 2D features into 3D space. To address these issues, we propose a novel 3D object detection framework, Vehicle-Infrastructure Multi-view Intermediate fusion (VIMI). First, to fully exploit the holistic perspectives from both vehicles and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that fuses infrastructure and vehicle features on selective multi-scales to correct the calibration noise introduced by camera asynchrony. Then, we design a Camera-aware Channel Masking (CCM) module that uses camera parameters as priors to augment the fused features. We further introduce a Feature Compression (FC) module with channel and spatial compression blocks to reduce the size of transmitted features for enhanced efficiency. Experiments show that VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late fusion methods with comparable transmission cost.
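The cross-attention idea behind MCA can be illustrated with a minimal single-scale sketch. It assumes that vehicle features act as queries and infrastructure features as keys and values, and it uses plain scaled dot-product attention on toy lists. The function names and toy tokens are hypothetical; the paper's MCA works over multiple scales and may differ in its details.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key rows."""
    dim = len(queries[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Output is a convex combination of the value rows.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical toy features: 2 vehicle tokens, 3 infrastructure tokens, dimension 2.
vehicle_tokens = [[1.0, 0.0], [0.0, 1.0]]
infra_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

fused = cross_attention(vehicle_tokens, infra_tokens, infra_tokens)
```

Each fused row weights the infrastructure tokens by their similarity to one vehicle token, which is the mechanism that lets attention favor well-aligned infrastructure features over misaligned ones.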

Research motivation and objectives

Address two challenges of VIC3D: calibration noise caused by time asynchrony across cameras, and its effect on multi-view feature fusion.
Propose a single, end-to-end intermediate fusion framework (VIMI) that fuses vehicle and infrastructure camera features at the feature level.
Improve transmission efficiency by compressing infrastructure features before fusion.
Enhance fusion robustness and accuracy via camera-aware feature reweighting and multi-scale cross-attention.
Show state-of-the-art performance on the DAIR-V2X-C VIC3D benchmark with comparable transmission cost.

Proposed method

Feature Compression (FC) to compress infrastructure features before they are transmitted to the vehicle.
Multi-scale Cross Attention (MCA) to fuse vehicle and infrastructure features across multiple scales and mitigate calibration noise.
Camera-aware Channel Masking (CCM) to reweight fused features using camera intrinsic/extrinsic priors.
Point-Sampling Voxel Fusion to project augmented features into a unified voxel volume and aggregate into BEV for 3D detection.
3D detection heads operating on BEV features with standard detection losses (bounding box, classification, direction).
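The channel-masking step can be sketched as a gate computed from camera parameters. The example below is a minimal illustration, assuming a single linear layer followed by a sigmoid, in the style of squeeze-and-excitation gating. The layer shape, the six-element camera-parameter vector, and all names are hypothetical and are not taken from the paper.

```python
import math
import random


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def channel_mask(camera_params, weights, bias):
    """Map a flattened camera-parameter vector to one gate in (0, 1) per channel."""
    return [
        sigmoid(sum(w * p for w, p in zip(row, camera_params)) + b)
        for row, b in zip(weights, bias)
    ]


def apply_mask(features, mask):
    """Scale every value of channel c by mask[c]; features has shape [C][N]."""
    return [[m * v for v in channel] for channel, m in zip(features, mask)]


# Hypothetical toy setup: 4 channels, 3 spatial positions, 6 camera parameters.
random.seed(0)
num_channels, num_params = 4, 6
camera_params = [0.5, -0.2, 0.1, 1.0, 0.0, 0.3]
weights = [
    [random.uniform(-1.0, 1.0) for _ in range(num_params)]
    for _ in range(num_channels)
]
bias = [0.0] * num_channels
features = [[1.0, 2.0, 3.0] for _ in range(num_channels)]

mask = channel_mask(camera_params, weights, bias)
masked = apply_mask(features, mask)
```

Because the gate depends only on the camera parameters, the same image content is reweighted differently for cameras with different intrinsics or extrinsics, which is the sense in which the parameters serve as a prior.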

Experimental results

Research questions

RQ1: How can intermediate fusion of vehicle and infrastructure camera features improve camera-based VIC3D robustness to calibration noise and time asynchrony?
RQ2: Does multi-scale cross-attention effectively select informative multi-view features for fusion?
RQ3: Can camera priors be effectively integrated via channel masking to enhance feature fusion?
RQ4: What are the transmission-efficiency benefits of feature compression for VIC3D without sacrificing detection performance?

Key findings

VIMI achieves state-of-the-art results on the DAIR-V2X-C VIC3D benchmark, outperforming both early fusion and late fusion methods with comparable transmission costs.
Ablation shows MCA and CCM each improve 3D and BEV detection metrics, with MCA providing a larger gain by selecting scale-aware infrastructure features.
FC reduces transmission load while providing feature refinement benefits, contributing to overall performance gains.
Voxel-level feature fusion (IF-Voxel) outperforms BEV-level fusion, indicating less information loss when fusing in 3D space.
VIMI demonstrates robustness to calibration noise, maintaining higher AP_3D/AP_BEV than late fusion (LF) under increasing translation noise.
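To see why translation noise in camera extrinsics matters for fusion, consider how it shifts the projection of a 3D point. The sketch below assumes a pinhole camera with identity rotation and made-up intrinsics; it does not reproduce the paper's noise protocol, which this review does not specify. For a lateral offset, the pixel shift equals focal length times offset divided by depth.

```python
def project(point, translation, focal, center):
    """Pinhole projection with identity rotation: camera coords = point + translation."""
    x, y, z = (p + t for p, t in zip(point, translation))
    return (focal * x / z + center[0], focal * y / z + center[1])


# Hypothetical intrinsics and a point 20 m in front of the camera.
focal, center = 1000.0, (960.0, 540.0)
point = (2.0, 1.0, 20.0)  # metres
clean_t = (0.0, 0.0, 0.0)

clean_px = project(point, clean_t, focal, center)

# Horizontal pixel shift for increasing lateral translation error (metres).
noise_levels = (0.0, 0.1, 0.2, 0.5)
shifts = []
for noise in noise_levels:
    noisy_t = (clean_t[0] + noise, clean_t[1], clean_t[2])
    u, _ = project(point, noisy_t, focal, center)
    shifts.append(abs(u - clean_px[0]))
```

The shift grows linearly with the translation error and shrinks with depth, so nearby objects are misaligned the most when features from two cameras are fused using noisy extrinsics.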

This review was generated by AI and checked by a human editor.