[Paper Review] VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection
tldr: VIMI introduces an intermediate-fusion framework for VIC3D that fuses multi-view vehicle and infrastructure camera features using Multi-scale Cross Attention and Camera-aware Channel Masking, with a Feature Compression module to reduce transmission costs, achieving state-of-the-art results on the DAIR-V2X-C benchmark.
In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Two major challenges prevail in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss when projecting 2D features into 3D space. To address these issues, we propose a novel 3D object detection framework, Vehicle-Infrastructure Multi-view Intermediate fusion (VIMI). First, to fully exploit the holistic perspectives from both vehicles and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that fuses infrastructure and vehicle features at selected scales to correct the calibration noise introduced by camera asynchrony. Then, we design a Camera-aware Channel Masking (CCM) module that uses camera parameters as priors to augment the fused features. We further introduce a Feature Compression (FC) module with channel and spatial compression blocks to reduce the size of transmitted features for enhanced efficiency. Experiments show that VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late fusion methods with comparable transmission cost.
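To make the transmission saving from channel and spatial compression concrete, the following is a minimal arithmetic sketch. The function name, the feature-map shape, the channel ratio of 4, and the spatial stride of 2 are all illustrative assumptions; none of them are values reported for VIMI.

```python
def compressed_feature_bytes(channels, height, width,
                             channel_ratio=1, spatial_stride=1,
                             bytes_per_value=4):
    """Bytes needed to transmit a (C, H, W) feature map after compression.

    channel_ratio and spatial_stride are hypothetical knobs standing in for
    channel and spatial compression blocks; they are not taken from the paper.
    """
    c = channels // channel_ratio
    h = -(-height // spatial_stride)  # ceiling division
    w = -(-width // spatial_stride)
    return c * h * w * bytes_per_value


# Assumed float32 infrastructure feature map of shape (64, 136, 240).
raw_bytes = compressed_feature_bytes(64, 136, 240)
small_bytes = compressed_feature_bytes(64, 136, 240,
                                       channel_ratio=4, spatial_stride=2)
reduction = raw_bytes / small_bytes
print(raw_bytes, small_bytes, reduction)
```

Under these assumptions, a 4x channel reduction combined with a stride-2 spatial reduction shrinks the payload by a factor of 4 x 2 x 2 = 16.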
Research Motivation and Objectives
- Address the VIC3D challenges of time asynchrony and calibration noise that arise during multi-view feature fusion.
- Propose a single, end-to-end intermediate fusion framework (VIMI) that fuses vehicle and infrastructure camera features at the feature level.
- Improve transmission efficiency by compressing infrastructure features before fusion.
- Enhance fusion robustness and accuracy via camera-aware feature reweighting and multi-scale cross-attention.
- Show state-of-the-art performance on the DAIR-V2X-C VIC3D benchmark with comparable transmission cost.
Proposed Method
- Feature Compression (FC) to transmit compressed infrastructure features from infrastructure to vehicle.
- Multi-scale Cross Attention (MCA) to fuse vehicle and infrastructure features across multiple scales and mitigate calibration noise.
- Camera-aware Channel Masking (CCM) to reweight fused features using camera intrinsic/extrinsic priors.
- Point-Sampling Voxel Fusion to project augmented features into a unified voxel volume and aggregate into BEV for 3D detection.
- 3D detection heads operating on BEV features with standard detection losses (bbox, cls, dir).
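The fusion and reweighting steps listed above can be sketched in NumPy as follows. This is a single-scale illustration only: the actual MCA module operates across multiple scales, and the function names, the two-layer MLP used to produce the channel mask, and all tensor shapes and random weights here are assumptions made for the example rather than details taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(veh, infra, w_q, w_k, w_v):
    """Vehicle tokens (Nv, C) query infrastructure tokens (Ni, C)."""
    q = veh @ w_q
    k = infra @ w_k
    v = infra @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)  # (Nv, Ni)
    # Residual connection keeps the vehicle features as the base signal.
    return veh + attn @ v, attn


def camera_channel_mask(feat, cam_params, w1, w2):
    """Reweight channels of feat (N, C) with a mask predicted from camera
    parameters (P,). The small MLP is a hypothetical stand-in for CCM."""
    hidden = np.maximum(cam_params @ w1, 0.0)        # ReLU
    mask = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # sigmoid, shape (C,)
    return feat * mask, mask


rng = np.random.default_rng(0)
C, Nv, Ni, P = 8, 6, 5, 16
veh = rng.standard_normal((Nv, C))
infra = rng.standard_normal((Ni, C))
w_q, w_k, w_v = (rng.standard_normal((C, C)) * 0.1 for _ in range(3))

fused, attn = cross_attention(veh, infra, w_q, w_k, w_v)

# Flattened intrinsics/extrinsics would go here; random values are used.
cam_params = rng.standard_normal(P)
w1 = rng.standard_normal((P, 32)) * 0.1
w2 = rng.standard_normal((32, C)) * 0.1
masked, mask = camera_channel_mask(fused, cam_params, w1, w2)
print(fused.shape, masked.shape)
```

Each row of `attn` sums to 1, so every vehicle token receives a convex combination of infrastructure values; this soft selection is one plausible way for attention to tolerate small misalignments between the two views.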
Experimental Results
Research Questions
- RQ1: How can intermediate fusion of vehicle and infrastructure camera features improve camera-based VIC3D robustness to calibration noise and time asynchrony?
- RQ2: Does multi-scale cross-attention effectively select informative multi-view features for fusion?
- RQ3: Can camera priors be effectively integrated via channel masking to enhance feature fusion?
- RQ4: What are the transmission-efficiency benefits of feature compression for VIC3D without sacrificing detection performance?
Key Findings
- VIMI achieves state-of-the-art results on the DAIR-V2X-C VIC3D benchmark, outperforming both early fusion and late fusion methods with comparable transmission costs.
- Ablation shows MCA and CCM each improve 3D and BEV detection metrics, with MCA providing a larger gain by selecting scale-aware infrastructure features.
- FC reduces transmission load while providing feature refinement benefits, contributing to overall performance gains.
- Voxel-level feature fusion (IF-Voxel) outperforms BEV-level fusion, indicating less information loss when fusing in 3D space.
- VIMI is robust to transmission noise, maintaining higher AP_3D and AP_BEV than late fusion (LF) as translation noise increases.
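To illustrate what fusing at the voxel level involves, here is a minimal NumPy sketch of point sampling: each voxel center is projected into a camera image with a pinhole model, and the feature at that pixel is copied into the voxel. The function names, nearest-pixel sampling, and averaging across cameras are assumptions chosen for brevity and are not confirmed details of VIMI.

```python
import numpy as np


def sample_voxel_features(feat, voxel_centers, K, T_cam_from_world):
    """Sample image features at projected voxel centers.

    feat: (C, H, W) feature map; voxel_centers: (N, 3) world coordinates;
    K: (3, 3) intrinsics; T_cam_from_world: (4, 4) extrinsics.
    Returns (N, C) features and an (N,) boolean visibility mask.
    """
    n = voxel_centers.shape[0]
    homo = np.concatenate([voxel_centers, np.ones((n, 1))], axis=1)
    cam = (T_cam_from_world @ homo.T).T[:, :3]       # points in camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6
    safe_depth = np.where(in_front, depth, 1.0)      # avoid divide-by-zero
    pix = (K @ cam.T).T
    u = np.floor(pix[:, 0] / safe_depth).astype(int)
    v = np.floor(pix[:, 1] / safe_depth).astype(int)
    C, H, W = feat.shape
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((n, C), dtype=feat.dtype)
    out[valid] = feat[:, v[valid], u[valid]].T       # nearest-pixel lookup
    return out, valid


def fuse_voxels(feats, masks):
    """Average per-voxel features over the cameras that see each voxel."""
    feats = np.stack(feats)                                  # (cams, N, C)
    masks = np.stack(masks)[..., None].astype(feats.dtype)   # (cams, N, 1)
    count = np.maximum(masks.sum(axis=0), 1.0)
    return (feats * masks).sum(axis=0) / count


K = np.array([[10.0, 0.0, 8.0], [0.0, 10.0, 6.0], [0.0, 0.0, 1.0]])
T = np.eye(4)
feat = np.arange(2 * 12 * 16, dtype=float).reshape(2, 12, 16)
centers = np.array([[0.0, 0.0, 5.0],     # projects to pixel (u=8, v=6)
                    [0.0, 0.0, -1.0],    # behind the camera
                    [100.0, 0.0, 1.0]])  # outside the image
sampled, valid = sample_voxel_features(feat, centers, K, T)
fused = fuse_voxels([sampled, sampled], [valid, valid])
print(valid, sampled[0])
```

Because every voxel keeps its own height coordinate until aggregation, fusing here happens before the vertical axis is collapsed into BEV, which is consistent with the finding that voxel-level fusion loses less information.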
This review was generated by AI and checked by a human editor.