QUICK REVIEW

[Paper Review] Single-Eye View: Monocular Real-time Perception Package for Autonomous Driving

Haixi Zhang, Aiyinsi Zuo|arXiv (Cornell University)|Mar 22, 2026
Advanced Vision and Imaging | Cited by 0
One-Sentence Summary

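TL;DR: LRHPerception is a real-time monocular perception package that combines the efficiency of end-to-end learning with the detail of local mapping. From a single camera it runs at 29 FPS on one GPU and outputs RGB, road segmentation, depth, object tracking, and trajectory prediction.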

ABSTRACT

Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.
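
The abstract describes the output as a five-channel tensor made of RGB (three channels), road segmentation, and pixel-level depth. A minimal sketch of that packing follows, in plain Python so it runs without dependencies. The channel order (R, G, B, road, depth), the function name `stack_five_channels`, and the sample values are assumptions for illustration; the source does not specify the layout.

```python
from typing import List, Sequence, Tuple

Pixel5 = Tuple[float, float, float, float, float]


def stack_five_channels(
    rgb: Sequence[Sequence[Tuple[float, float, float]]],
    road_mask: Sequence[Sequence[float]],
    depth: Sequence[Sequence[float]],
) -> List[List[Pixel5]]:
    """Pack per-pixel RGB, road mask, and depth into an H x W x 5 structure.

    Assumed channel order: (R, G, B, road, depth).
    """
    height = len(rgb)
    if len(road_mask) != height or len(depth) != height:
        raise ValueError("all inputs must have the same height")

    out: List[List[Pixel5]] = []
    for y in range(height):
        if not (len(rgb[y]) == len(road_mask[y]) == len(depth[y])):
            raise ValueError(f"row {y}: inputs must have the same width")
        row = [
            (r, g, b, road_mask[y][x], depth[y][x])
            for x, (r, g, b) in enumerate(rgb[y])
        ]
        out.append(row)
    return out


# Hypothetical 1 x 2 image: the first pixel is road at 12.5 m, the second is not road.
rgb = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]]
road = [[1.0, 0.0]]
depth = [[12.5, 30.0]]
tensor = stack_five_channels(rgb, road, depth)
```

Object detections and predicted trajectories are described as augmenting this tensor, so in this sketch they would travel alongside it rather than as extra channels.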

Motivation and Objectives

  • Motivate cost-efficient monocular autonomous driving perception on standard hardware.
  • Propose LRHPerception to unify object tracking, trajectory prediction, road segmentation, and depth estimation from a single camera.
  • Enable information sharing across modules to reduce redundant processing.
  • Showcase real-time performance improvements over state-of-the-art local-mapping approaches.

Proposed Method

  • Use a Swin Transformer backbone to extract multi-scale features from RGB input.
  • Share backbone features across modules to compute four tasks with one backbone.
  • Introduce C-BYTE for camera-motion aware object tracking to improve data association.
  • Employ a CVAE-based trajectory predictor with a GRU encoder/decoder for multi-modal futures.
  • Implement a lightweight road segmentation block based on a simplified U-Net using Phi_8 features.
  • Adopt a coarse-to-fine depth estimator with a coarse depth former and a refine depth former.
  • Train modules on multiple datasets (cross-dataset training) with module-specific losses combined as L = λ_det L_det + λ_seg L_seg + λ_depth L_depth + λ_traj L_traj.
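
The combined objective in the last bullet is a weighted sum of the four module losses. A minimal sketch of that sum follows. The function name `combined_loss` and every numeric loss and weight are hypothetical; the source does not give the λ values.

```python
from typing import Mapping


def combined_loss(losses: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Return sum_k lambda_k * L_k over the modules (det, seg, depth, traj)."""
    mismatched = set(losses) ^ set(weights)
    if mismatched:
        raise KeyError(f"losses and weights disagree on: {sorted(mismatched)}")
    return sum(weights[name] * losses[name] for name in losses)


# Hypothetical per-module losses and lambda weights, for illustration only.
losses = {"det": 0.8, "seg": 0.4, "depth": 1.2, "traj": 0.5}
weights = {"det": 1.0, "seg": 0.5, "depth": 0.25, "traj": 2.0}
total = combined_loss(losses, weights)
```

With cross-dataset training, a batch may carry labels for only some tasks. One common way to handle that, which the source does not confirm, is to set the missing task's weight to zero for that batch.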
Figure 1: Innovation and architecture blueprint. (a) Paradigm of the end-to-end solution. (b) Paradigm of the camera-fusion local-map solution. (c) Paradigm of our LRHPerception package, which extracts the essentials from a monocular camera for a cost-information trade-off.
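
To illustrate why camera-motion awareness helps data association in tracking, the sketch below shifts a track's last box by an estimated camera-induced image shift before computing IoU against a new detection. This is not the paper's C-BYTE algorithm. The summary does not describe how C-BYTE estimates or applies camera motion, so the pure 2D translation model, the function names, and the box values here are all assumptions.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def compensate(box: Box, shift: Tuple[float, float]) -> Box:
    """Move a box by the image shift attributed to camera motion (assumed translation)."""
    dx, dy = shift
    return (box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy)


# Hypothetical scene: a static object appears 30 px to the right because the camera turned.
track_box = (100.0, 100.0, 150.0, 200.0)   # box from the previous frame
detection = (130.0, 100.0, 180.0, 200.0)   # box in the current frame
camera_shift = (30.0, 0.0)                 # assumed estimate of the image shift

iou_raw = iou(track_box, detection)
iou_compensated = iou(compensate(track_box, camera_shift), detection)
```

In this toy case the uncompensated overlap is low enough that an IoU-threshold matcher could drop the association, while the compensated box overlaps the detection fully.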

Experimental Results

Research Questions

  • RQ1: Can monocular LRHPerception achieve real-time (FPS) performance on standard hardware while maintaining competitive perception accuracy across tasks (tracking, trajectory, segmentation, depth)?
  • RQ2: Does sharing a single backbone and integrated architecture reduce redundant computation compared to serial task pipelines?
  • RQ3: How do camera-motion corrections (C-BYTE) and multi-task integration affect tracking robustness and trajectory prediction accuracy?
  • RQ4: What are the gains in speed and accuracy for road segmentation and depth estimation when using the proposed lightweight blocks and coarse-refine depth design?
  • RQ5: Is cross-dataset training effective for jointly optimizing a task-agnostic backbone for monocular perception?

Key Findings

  • LRHPerception achieves 29 FPS on a single RTX 3090 GPU for monocular perception.
  • The method shows a 555% acceleration over the fastest local-mapping method.
  • C-BYTE improves object tracking robustness by correcting for camera motion during association, raising MOTA, IDF1, and IDP with negligible added delay (roughly under 4 ms).
  • The CVAE-based trajectory predictor with its GRU encoder/decoder is reported to outperform recent methods in both speed and accuracy on the JAAD and PIE datasets.
  • Road segmentation with a lightweight U-Net style block on Phi_8 features achieves high mIOU with superior speed relative to universal segmentation models.
  • Depth estimation with a coarse-refine design using modified C2f layers delivers substantial speedups (e.g., 577% faster than a leading alternative) while maintaining accuracy.
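
The speed figures above can be put in per-frame terms with a little arithmetic. The sketch below derives the time budget per frame at 29 FPS and the share of it that a 4 ms tracking overhead would take. It also shows the baseline frame rate implied by "555%" under two readings, since the source does not say whether it means 5.55 times as fast or 555% faster. The variable names are illustrative.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


REPORTED_FPS = 29.0          # from the findings above
TRACKING_OVERHEAD_MS = 4.0   # upper bound quoted for the C-BYTE delay

budget_ms = frame_budget_ms(REPORTED_FPS)
tracking_share = TRACKING_OVERHEAD_MS / budget_ms

# Two readings of "555%"; which one the paper intends is an open assumption here.
baseline_fps_if_555_percent_faster = REPORTED_FPS / (1.0 + 5.55)
baseline_fps_if_5_55_times_as_fast = REPORTED_FPS / 5.55

print(f"budget per frame: {budget_ms:.2f} ms")
print(f"tracking share of budget: {tracking_share:.1%}")
print(f"implied baseline FPS: {baseline_fps_if_555_percent_faster:.2f} "
      f"or {baseline_fps_if_5_55_times_as_fast:.2f}")
```

Under either reading, the fastest mapping-based baseline would run at roughly 4 to 5 FPS, well below real-time rates.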
Figure 2: Granular model structure. Design of the convolution decoder, object tracking, trajectory prediction, and depth estimation; magnify for details. The BTAE mechanism is given in Algorithm 1. Remaining components are shown in Fig. 3.

This review was generated by AI and reviewed by human editors.