[Paper Review] VIMI: Vehicle-Infrastructure Multi-view Intermediate Fusion for Camera-based 3D Object Detection

Zhe Wang, Siqi Fan | arXiv (Cornell University) | Mar 20, 2023
Advanced Neural Network Applications | Cited by 8
One-Sentence Summary

TL;DR: VIMI is an intermediate-fusion framework for Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) that fuses multi-view vehicle and infrastructure camera features using Multi-scale Cross Attention and Camera-aware Channel Masking, with a Feature Compression module to reduce transmission cost, achieving state-of-the-art results on the DAIR-V2X-C benchmark.

ABSTRACT

In autonomous driving, Vehicle-Infrastructure Cooperative 3D Object Detection (VIC3D) makes use of multi-view cameras from both vehicles and traffic infrastructure, providing a global vantage point with rich semantic context of road conditions beyond a single vehicle viewpoint. Two major challenges prevail in VIC3D: 1) inherent calibration noise when fusing multi-view images, caused by time asynchrony across cameras; 2) information loss when projecting 2D features into 3D space. To address these issues, we propose a novel 3D object detection framework, Vehicle-Infrastructure Multi-view Intermediate fusion (VIMI). First, to fully exploit the holistic perspectives from both vehicles and infrastructure, we propose a Multi-scale Cross Attention (MCA) module that fuses infrastructure and vehicle features at selected scales to correct the calibration noise introduced by camera asynchrony. Then, we design a Camera-aware Channel Masking (CCM) module that uses camera parameters as priors to augment the fused features. We further introduce a Feature Compression (FC) module with channel and spatial compression blocks to reduce the size of transmitted features for enhanced efficiency. Experiments show that VIMI achieves 15.61% overall AP_3D and 21.44% AP_BEV on the new VIC3D dataset, DAIR-V2X-C, significantly outperforming state-of-the-art early fusion and late fusion methods with comparable transmission cost.
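
The cross-attention idea behind MCA can be illustrated with plain scaled dot-product attention, in which each vehicle feature vector attends over a set of infrastructure feature vectors. This is only a minimal sketch, not the paper's MCA module: it uses a single scale and no learned projections, and the choice of vehicle features as queries, along with the names `cross_attention`, `veh`, and `inf`, are assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention (single head, no learned weights).

    queries: list of d-dim vectors (assumed here: vehicle features)
    keys, values: lists of vectors (assumed here: infrastructure features)
    Returns one fused vector per query and the attention weights used.
    """
    d = len(queries[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of the value vectors, channel by channel
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        fused.append(out)
        weights.append(w)
    return fused, weights


# Toy data (hypothetical): 2 vehicle vectors, 3 infrastructure vectors, d = 2
veh = [[1.0, 0.0], [0.0, 1.0]]
inf = [[2.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
fused, weights = cross_attention(veh, inf, inf)
```

Each row of `weights` sums to 1, so every fused vector is a convex combination of infrastructure vectors, weighted toward those most similar to the vehicle query.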

Motivation and Objectives

  • Address the challenges that time asynchrony and the resulting calibration noise pose for multi-view feature fusion in VIC3D.
  • Propose a single, end-to-end intermediate fusion framework (VIMI) that fuses vehicle and infrastructure camera features at the feature level.
  • Improve transmission efficiency by compressing infrastructure features before fusion.
  • Enhance fusion robustness and accuracy via camera-aware feature reweighting and multi-scale cross-attention.
  • Show state-of-the-art performance on the DAIR-V2X-C VIC3D benchmark with comparable transmission cost.

Proposed Method

  • Feature Compression (FC) to shrink infrastructure features, using channel and spatial compression blocks, before they are transmitted to the vehicle.
  • Multi-scale Cross Attention (MCA) to fuse vehicle and infrastructure features across multiple scales and mitigate calibration noise.
  • Camera-aware Channel Masking (CCM) to reweight fused features using camera intrinsic/extrinsic priors.
  • Point-Sampling Voxel Fusion to project augmented features into a unified voxel volume and aggregate them into a bird's-eye-view (BEV) representation for 3D detection.
  • 3D detection heads operating on BEV features with standard detection losses (bounding box, classification, direction).
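
The channel-masking step in the list above can be sketched as a gate computed from camera parameters: a flattened parameter vector is mapped to one value in (0, 1) per channel, and each feature channel is scaled by its gate. This is a hedged illustration rather than the paper's CCM module. The single linear layer, the fixed toy weights (a real module would learn them), and the names `channel_mask` and `apply_mask` are all assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def channel_mask(camera_params, weight, bias):
    """Map a flattened camera-parameter vector to one gate in (0, 1) per channel.

    weight: C rows, each with len(camera_params) entries (toy values, not learned)
    bias:   C entries
    """
    return [
        sigmoid(sum(w * p for w, p in zip(row, camera_params)) + b)
        for row, b in zip(weight, bias)
    ]


def apply_mask(features, mask):
    """Scale a feature map laid out as [C][H][W] by its per-channel gate."""
    return [
        [[value * gate for value in row] for row in channel]
        for channel, gate in zip(features, mask)
    ]


# Hypothetical inputs: 2 normalized camera parameters, 2 channels, 1x2 feature map
camera_params = [0.5, -1.0]
weight = [[4.0, 0.0], [0.0, 4.0]]
bias = [0.0, 0.0]
features = [[[1.0, 2.0]], [[3.0, 4.0]]]

mask = channel_mask(camera_params, weight, bias)
masked = apply_mask(features, mask)
```

With these toy values the first channel is mostly kept and the second is mostly suppressed, which is the reweighting behavior the masking step is meant to provide.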

Experimental Results

Research Questions

  • RQ1: How can intermediate fusion of vehicle and infrastructure camera features improve camera-based VIC3D robustness to calibration noise and time asynchrony?
  • RQ2: Does multi-scale cross-attention effectively select informative multi-view features for fusion?
  • RQ3: Can camera priors be effectively integrated via channel masking to enhance feature fusion?
  • RQ4: What are the transmission-efficiency benefits of feature compression for VIC3D without sacrificing detection performance?

Key Findings

  • VIMI achieves state-of-the-art results on the DAIR-V2X-C VIC3D benchmark, outperforming both early fusion and late fusion methods with comparable transmission costs.
  • Ablation shows MCA and CCM each improve 3D and BEV detection metrics, with MCA providing a larger gain by selecting scale-aware infrastructure features.
  • FC reduces transmission load while providing feature refinement benefits, contributing to overall performance gains.
  • Voxel-level feature fusion (IF-Voxel) outperforms BEV-level fusion, indicating less information loss when fusing in 3D space.
  • VIMI is more robust to noise than late fusion (LF), maintaining higher AP_3D and AP_BEV as translation noise increases.
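
To see why translation noise matters for camera-based fusion, the sketch below projects a 3D point with a pinhole camera model and measures how far the projection moves when the extrinsic translation is perturbed. It is an illustration only and does not reproduce the paper's noise protocol. The identity rotation, the intrinsic values, and the names `project` and `pixel_shift` are assumptions.

```python
def project(point, translation, fx, fy, cx, cy):
    """Project a 3D point to pixel coordinates with a pinhole camera.

    Assumes an identity rotation, so the extrinsic transform is a pure translation.
    """
    x, y, z = (p + t for p, t in zip(point, translation))
    if z <= 0:
        raise ValueError("point is behind the camera")
    return (fx * x / z + cx, fy * y / z + cy)


def pixel_shift(point, translation, noise, **intrinsics):
    """Euclidean pixel distance between the clean and the noisy projection."""
    u0, v0 = project(point, translation, **intrinsics)
    noisy = [t + n for t, n in zip(translation, noise)]
    u1, v1 = project(point, noisy, **intrinsics)
    return ((u1 - u0) ** 2 + (v1 - v0) ** 2) ** 0.5


# Hypothetical camera and scene: 1000 px focal length, point 20 m ahead
intrinsics = dict(fx=1000.0, fy=1000.0, cx=960.0, cy=540.0)
point = [2.0, 0.0, 20.0]
translation = [0.0, 0.0, 0.0]

# Pixel shift for growing lateral translation errors (in metres)
shifts = [
    pixel_shift(point, translation, [dx, 0.0, 0.0], **intrinsics)
    for dx in (0.0, 0.1, 0.2, 0.4)
]
```

Under these assumptions a lateral error of 0.1 m moves the projection by about 5 pixels (fx · dx / z), and the shift grows linearly with the error. Misalignment of this kind is what noise-tolerant fusion has to absorb.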

This review was generated by AI and checked by a human editor.