Skip to main content
QUICK REVIEW

[論文レビュー] Sparse4D v2: Recurrent Temporal Fusion with Sparse Model

Xuewu Lin, Tianwei Lin|arXiv (Cornell University)|May 23, 2023
Advanced Image Fusion Techniques被引用数 20
ひとこと要約

Sparse4D v2 は、Sparse4D v2 introduces a recurrent temporal fusion module for sparse multi-view 3D detection, reducing temporal fusion from O(T) to O(1) and achieving state-of-the-art results on nuScenes, with Efficient Deformable Aggregation and camera-parameter encoding improving efficiency and robustness.

ABSTRACT

Sparse algorithms offer great flexibility for multi-view temporal perception tasks. In this paper, we present an enhanced version of Sparse4D, in which we improve the temporal fusion module by implementing a recursive form of multi-frame feature sampling. By effectively decoupling image features and structured anchor features, Sparse4D enables a highly efficient transformation of temporal features, thereby facilitating temporal fusion solely through the frame-by-frame transmission of sparse features. The recurrent temporal fusion approach provides two main benefits. Firstly, it reduces the computational complexity of temporal fusion from $O(T)$ to $O(1)$, resulting in significant improvements in inference speed and memory usage. Secondly, it enables the fusion of long-term information, leading to more pronounced performance improvements due to temporal fusion. Our proposed approach, Sparse4Dv2, further enhances the performance of the sparse perception algorithm and achieves state-of-the-art results on the nuScenes 3D detection benchmark. Code will be available at \url{https://github.com/linxuewu/Sparse4D}.

研究の動機と目的

  • Motivate improved sparse-based perception for autonomous driving by enabling efficient long-term temporal fusion.
  • Develop a recurrent temporal fusion mechanism that decouples image features from instance states to reduce computation.
  • Enhance training and robustness through camera-parameter encoding and dense depth supervision.
  • Improve memory efficiency and inference speed via an optimized deformable aggregation operation.
  • Demonstrate state-of-the-art performance on nuScenes compared to BEV-based and other sparse methods.

提案手法

  • Replace multi-frame sampling with a recurrent temporal propagation of instance features across frames.
  • Decouple image features and instance states using anchor, instance feature, and anchor embedding, with a lightweight anchor encoder Psi for temporal projection.
  • Propagate anchors through ego-motion to current frame and re-encode their position embeddings for temporal cross-attention in the decoder.
  • Introduce Efficient Deformable Aggregation (EDA) to fuse multi-view, multi-scale features as a single CUDA operation, reducing memory and increasing speed.
  • Incorporate camera parameter encoding directly into the view-weight computation to improve robustness to camera variations.
  • Add dense depth supervision from LiDAR point clouds to stabilize and accelerate training, with a later removal of the depth-reweight module.
(a) Multi-frame Sampling and Fusion
(a) Multi-frame Sampling and Fusion

実験結果

リサーチクエスチョン

  • RQ1How can temporal fusion for sparse multi-view 3D detection be made independent of the number of historical frames to improve speed and memory usage?
  • RQ2Can instance-level temporal propagation with ego-motion-based anchoring replace multi-frame sampling without sacrificing accuracy?
  • RQ3Does explicitly encoding camera parameters and incorporating dense depth supervision improve robustness and detection performance across viewpoints and scenarios?
  • RQ4What are the efficiency and memory benefits of a redesigned deformable aggregation operator in sparse temporal fusion?

主な発見

  • Sparse4Dv2 achieves faster inference and lower memory usage than Sparse4Dv1, with substantial gains across frames (example: FPS and GPU memory improvements on RTX 3090).
  • The recurrent temporal fusion enables long-term information fusion without increasing the number of anchors, maintaining comparable inference speed to non-temporal models.
  • Efficient Deformable Aggregation (EDA) reduces training memory by about half and increases training batch size and overall speed, while also boosting inference FPS by about 42%.
  • Camera parameter encoding improves orientation and overall perception metrics; removing it degrades mAP and mAOE.
  • Dense depth supervision significantly boosts performance (e.g., reductions in gradient collapse and improved mAP/NDS when enabled).
  • On nuScenes validation with ResNet50 and 256x704 input, Sparse4Dv2 achieves mAP 0.439 and NDS 0.539, outperforming several BEV-based and sparse baselines; on test, Sparse4Dv2 with VovNet-99 reaches mAP 0.557 and NDS 0.638, indicating SOTA potential.
(b) Recurrent Temporal Fusion
(b) Recurrent Temporal Fusion

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。