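[Paper Digest] Single-Eye View: Monocular Real-time Perception Package for Autonomous Driving
TL;DR: LRHPerception is a real-time monocular perception package that combines the efficiency of end-to-end learning with the detail of local mapping. Using a single camera, it runs at 29 FPS on a single GPU and provides RGB, road segmentation, depth, object tracking, and trajectory prediction.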
Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.
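To make the output format concrete, here is a minimal sketch of stacking per-pixel outputs into a five-channel layout. It assumes a channel order of R, G, B, road mask, depth. The function name `stack_five_channels` and the toy values are hypothetical; the summary does not state the actual channel order or how detections and trajectories are attached to the tensor.

```python
from typing import List, Tuple

Pixel5 = Tuple[float, float, float, float, float]


def stack_five_channels(
    rgb: List[List[Tuple[float, float, float]]],
    road_mask: List[List[float]],
    depth: List[List[float]],
) -> List[List[Pixel5]]:
    """Combine per-pixel RGB, road mask, and depth into an H x W x 5 grid.

    The channel order (R, G, B, road, depth) is an assumption for illustration.
    """
    height, width = len(rgb), len(rgb[0])
    if len(road_mask) != height or len(depth) != height:
        raise ValueError("all inputs must share the same height")
    out: List[List[Pixel5]] = []
    for y in range(height):
        if not (len(rgb[y]) == len(road_mask[y]) == len(depth[y]) == width):
            raise ValueError("all inputs must share the same width")
        row: List[Pixel5] = []
        for x in range(width):
            r, g, b = rgb[y][x]
            row.append((r, g, b, road_mask[y][x], depth[y][x]))
        out.append(row)
    return out


# Toy 1 x 2 image: one road pixel at 12.5 m, one non-road pixel at 40 m.
rgb = [[(0.1, 0.2, 0.3), (0.4, 0.5, 0.6)]]
road = [[1.0, 0.0]]
depth = [[12.5, 40.0]]
tensor = stack_five_channels(rgb, road, depth)
```

Three color channels plus one segmentation channel and one depth channel give the five channels the abstract describes.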
Research Motivation and Objectives
- Motivate cost-efficient monocular autonomous driving perception on standard hardware.
- Propose LRHPerception to unify object tracking, trajectory prediction, road segmentation, and depth estimation from a single camera.
- Enable information sharing across modules to reduce redundant processing.
- Showcase real-time performance improvements over state-of-the-art local-mapping approaches.
Proposed Method
- Use a Swin Transformer backbone to extract multi-scale features from RGB input.
- Share backbone features across modules to compute four tasks with one backbone.
- Introduce C-BYTE for camera-motion aware object tracking to improve data association.
- Employ a CVAE-based trajectory predictor with a GRU encoder/decoder for multi-modal futures.
- Implement a lightweight road segmentation block based on a simplified U-Net using Phi_8 features.
- Adopt a coarse-to-fine depth estimator with a coarse depth former and a refine depth former.
- Train modules on multiple datasets (cross-dataset training) with module-specific losses combined as L = λ_det L_det + λ_seg L_seg + λ_depth L_depth + λ_traj L_traj.
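The combined objective in the last item is a weighted sum of per-task losses. The sketch below shows only that arithmetic. The loss values and λ weights are made-up placeholders, since the summary does not report the weights used, and `combined_loss` is a hypothetical helper name.

```python
from typing import Dict


def combined_loss(losses: Dict[str, float], weights: Dict[str, float]) -> float:
    """Return sum_k lambda_k * L_k over the tasks present in `losses`."""
    missing = set(losses) - set(weights)
    if missing:
        raise KeyError(f"no weight given for tasks: {sorted(missing)}")
    return sum(weights[task] * value for task, value in losses.items())


# Placeholder per-task losses and weights (assumptions, not values from the paper).
example_losses = {"det": 0.8, "seg": 0.3, "depth": 0.5, "traj": 1.2}
example_weights = {"det": 1.0, "seg": 0.5, "depth": 1.0, "traj": 0.25}

total = combined_loss(example_losses, example_weights)
print(f"combined loss: {total:.3f}")
```

In cross-dataset training, a batch from a dataset that only has labels for some tasks would contribute only those terms, which is why the helper sums over the tasks actually present.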

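The idea behind camera-motion-aware association can be illustrated with a small sketch: warp a track's previous box by the estimated camera motion before computing IoU with new detections. This is not the C-BYTE implementation. It assumes the inter-frame camera motion is already available as a 2x3 affine matrix, and `warp_box`, `iou`, and the numbers are hypothetical.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in pixels


def warp_box(box: Box, affine: Sequence[Sequence[float]]) -> Box:
    """Apply a 2x3 affine transform to the box corners and refit an axis-aligned box."""
    x1, y1, x2, y2 = box
    xs, ys = [], []
    for x, y in [(x1, y1), (x2, y1), (x2, y2), (x1, y2)]:
        xs.append(affine[0][0] * x + affine[0][1] * y + affine[0][2])
        ys.append(affine[1][0] * x + affine[1][1] * y + affine[1][2])
    return (min(xs), min(ys), max(xs), max(ys))


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


# A static pedestrian appears shifted 30 px to the right because the camera panned.
prev_box = (100.0, 100.0, 140.0, 180.0)
detection = (130.0, 100.0, 170.0, 180.0)
camera_shift = [[1.0, 0.0, 30.0], [0.0, 1.0, 0.0]]  # assumed motion estimate

raw_iou = iou(prev_box, detection)
compensated_iou = iou(warp_box(prev_box, camera_shift), detection)
print(f"IoU without compensation: {raw_iou:.3f}")
print(f"IoU with compensation:    {compensated_iou:.3f}")
```

In this toy case the uncompensated overlap is low enough that an IoU-threshold matcher could drop the track, while the compensated box aligns with the detection.
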
Experimental Results
Research Questions
- RQ1: Can monocular LRHPerception achieve real-time (FPS) performance on standard hardware while maintaining competitive perception accuracy across tasks (tracking, trajectory, segmentation, depth)?
- RQ2: Does sharing a single backbone in an integrated architecture reduce redundant computation compared to serial task pipelines?
- RQ3: How do camera-motion corrections (C-BYTE) and multi-task integration affect tracking robustness and trajectory prediction accuracy?
- RQ4: What are the gains in speed and accuracy for road segmentation and depth estimation when using the proposed lightweight blocks and coarse-refine depth design?
- RQ5: Is cross-dataset training effective for jointly optimizing a task-agnostic backbone for monocular perception?
Key Findings
- LRHPerception achieves 29 FPS on a single RTX 3090 GPU for monocular perception.
- The method shows a 555% acceleration over the fastest local-mapping method.
- C-BYTE improves object tracking robustness by correcting for camera motion during association, raising MOTA, IDF1, and IDP with negligible added delay (under about 4 ms).
- The CVAE-based trajectory predictor with its GRU encoder/decoder is reported to outperform recent methods in both speed and accuracy on the JAAD and PIE datasets.
- Road segmentation with a lightweight U-Net-style block on Phi_8 features achieves high mIoU while running faster than general-purpose segmentation models.
- Depth estimation with a coarse-refine design using modified C2f layers delivers substantial speedups (e.g., 577% faster than a leading alternative) while maintaining accuracy.
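The reported figures can be put in perspective with some quick arithmetic. The sketch below derives the per-frame time budget at 29 FPS and the share taken by a 4 ms tracking step. It also shows what baseline frame rate a "555% speedup" would imply under two readings of the percentage. The summary does not say which reading applies, so both baseline values are derived illustrations, not reported results, and the function names are hypothetical.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


def implied_baseline_fps(fps: float, speedup_percent: float, is_increase: bool) -> float:
    """Baseline FPS implied by a percentage speedup.

    is_increase=False reads 555% as a ratio (new = 5.55 x old).
    is_increase=True reads 555% as an increase (new = old + 5.55 x old).
    """
    factor = speedup_percent / 100.0 + (1.0 if is_increase else 0.0)
    return fps / factor


budget = frame_budget_ms(29.0)                               # about 34.5 ms per frame
tracking_share = 4.0 / budget                                # upper bound for a <4 ms step
baseline_ratio = implied_baseline_fps(29.0, 555.0, False)    # about 5.2 FPS
baseline_increase = implied_baseline_fps(29.0, 555.0, True)  # about 4.4 FPS

print(f"frame budget: {budget:.1f} ms")
print(f"4 ms tracking step: {tracking_share:.1%} of the budget")
print(f"implied baseline: {baseline_ratio:.1f} or {baseline_increase:.1f} FPS")
```

Under either reading, the implied baseline runs at roughly 4 to 5 FPS, well below the frame rate reported for LRHPerception.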

This digest was generated by AI and reviewed by human editors.