QUICK REVIEW

[Paper Walkthrough] Single-Eye View: Monocular Real-time Perception Package for Autonomous Driving

Haixi Zhang, Aiyinsi Zuo | arXiv (Cornell University) | Mar 22, 2026

Advanced Vision and Imaging | Cited by 0

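One-Sentence Summary

TL;DR: LRHPerception is a real-time monocular perception package that combines the efficiency of end-to-end learning with the detail of local mapping. From a single camera it runs at 29 FPS on a single GPU and outputs RGB, road segmentation, depth, object tracking, and trajectory prediction.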

ABSTRACT

Amidst the rapid advancement of camera-based autonomous driving technology, effectiveness is often prioritized with limited attention to computational efficiency. To address this issue, this paper introduces LRHPerception, a real-time monocular perception package for autonomous driving that uses single-view camera video to interpret the surrounding environment. The proposed system combines the computational efficiency of end-to-end learning with the rich representational detail of local mapping methodologies. With significant improvements in object tracking and prediction, road segmentation, and depth estimation integrated into a unified framework, LRHPerception processes monocular image data into a five-channel tensor consisting of RGB, road segmentation, and pixel-level depth estimation, augmented with object detection and trajectory prediction. Experimental results demonstrate strong performance, achieving real-time processing at 29 FPS on a single GPU, representing a 555% speedup over the fastest mapping-based approach.
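
The five-channel output described in the abstract can be pictured as three RGB channels plus one road-segmentation channel and one depth channel. The sketch below is only an illustration of that layout using NumPy: the function name `pack_five_channel`, the channel order, the `[0, 1]` scaling of RGB, and the `float32` dtype are all assumptions, not details from the paper. The abstract does not say how detections and predicted trajectories are attached to the tensor, so they are left out here.

```python
import numpy as np

def pack_five_channel(rgb, road_mask, depth):
    """Stack RGB (H, W, 3), a road mask (H, W) and a depth map (H, W)
    into one (H, W, 5) float32 array.

    Channel order [R, G, B, road, depth] is an assumption for illustration.
    """
    if rgb.ndim != 3 or rgb.shape[2] != 3:
        raise ValueError("rgb must have shape (H, W, 3)")
    h, w = rgb.shape[:2]
    if road_mask.shape != (h, w) or depth.shape != (h, w):
        raise ValueError("road_mask and depth must have shape (H, W)")

    rgb_f = rgb.astype(np.float32) / 255.0            # assumes uint8 input
    road_f = road_mask.astype(np.float32)[..., None]  # 1.0 = road, 0.0 = not road
    depth_f = depth.astype(np.float32)[..., None]     # depth kept in its input units
    return np.concatenate([rgb_f, road_f, depth_f], axis=-1)

# Tiny synthetic frame: black image, everything marked as road, constant depth.
frame = np.zeros((4, 6, 3), dtype=np.uint8)
road = np.ones((4, 6), dtype=bool)
depth = np.full((4, 6), 12.5, dtype=np.float32)
tensor = pack_five_channel(frame, road, depth)
```

Keeping the per-pixel outputs in one array means a downstream consumer can index a pixel once and read its colour, road label, and depth together.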

Motivation and Objectives

Motivate cost-efficient monocular autonomous driving perception on standard hardware.
Propose LRHPerception to unify object tracking, trajectory prediction, road segmentation, and depth estimation from a single camera.
Enable information sharing across modules to reduce redundant processing.
Showcase real-time performance improvements over state-of-the-art local-mapping approaches.

Proposed Method

Use a Swin Transformer backbone to extract multi-scale features from RGB input.
Share backbone features across modules to compute four tasks with one backbone.
Introduce C-BYTE for camera-motion aware object tracking to improve data association.
Employ a CVAE-based trajectory predictor with a GRU encoder/decoder for multi-modal futures.
Implement a lightweight road segmentation block based on a simplified U-Net using Phi_8 features.
Adopt a coarse-to-fine depth estimator with a coarse depth former and a refine depth former.
Train modules on multiple datasets (cross-dataset training) with module-specific losses combined as L = λ_det L_det + λ_seg L_seg + λ_depth L_depth + λ_traj L_traj.
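
The combined objective L = λ_det L_det + λ_seg L_seg + λ_depth L_depth + λ_traj L_traj is a plain weighted sum, which can be sketched as follows. The function name `total_loss`, the dictionary keys, and the example weights are hypothetical; the summary does not give the λ values. Skipping tasks that have no loss in a given batch is also an assumption, made here because with cross-dataset training a batch may carry labels for only some tasks.

```python
def total_loss(losses, weights):
    """Weighted sum of per-task losses.

    losses:  task name -> scalar loss for the tasks present in this batch
    weights: task name -> lambda coefficient for that task
    Tasks absent from `losses` contribute nothing (assumption for
    batches whose source dataset lacks labels for that task).
    """
    unknown = set(losses) - set(weights)
    if unknown:
        raise KeyError(f"no weight defined for tasks: {sorted(unknown)}")
    return sum(weights[task] * value for task, value in losses.items())

# Hypothetical lambda values, chosen only to make the arithmetic easy to follow.
lambdas = {"det": 1.0, "seg": 0.5, "depth": 0.5, "traj": 0.25}

# A batch with labels for all four tasks.
full_batch = total_loss({"det": 1.0, "seg": 2.0, "depth": 3.0, "traj": 4.0}, lambdas)

# A batch from a dataset that only provides depth labels.
depth_only = total_loss({"depth": 3.0}, lambdas)
```

The same function works unchanged with framework tensors in place of floats, since it relies only on multiplication and addition.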

Figure 1: Innovation and architecture blueprint. (a) Paradigm of the end-to-end solution. (b) Paradigm of the camera-fusion local-map solution. (c) Paradigm of our LRHPerception package, which extracts the essentials from a monocular camera for a cost-information trade-off.

Experimental Results

Research Questions

RQ1: Can monocular LRHPerception achieve real-time (FPS) performance on standard hardware while maintaining competitive perception accuracy across tasks (tracking, trajectory, segmentation, depth)?
RQ2: Does sharing a single backbone and integrated architecture reduce redundant computation compared to serial task pipelines?
RQ3: How do camera-motion corrections (C-BYTE) and multi-task integration affect tracking robustness and trajectory prediction accuracy?
RQ4: What are the gains in speed and accuracy for road segmentation and depth estimation when using the proposed lightweight blocks and coarse-refine depth design?
RQ5: Is cross-dataset training effective to jointly optimize a task-agnostic backbone for monocular perception?

Key Findings

LRHPerception achieves 29 FPS on a single RTX 3090 GPU for monocular perception.
The method shows a 555% acceleration over the fastest local-mapping method.
C-BYTE makes object tracking more robust by correcting for camera motion during data association, improving MOTA, IDF1, and IDP at a negligible added delay (under about 4 ms).
Trajectory prediction with the CVAE-based predictor and its GRU encoder/decoder runs faster than recent methods on the JAAD and PIE datasets while matching or exceeding their accuracy.
Road segmentation with a lightweight U-Net style block on Phi_8 features achieves high mIOU with superior speed relative to universal segmentation models.
Depth estimation with a coarse-refine design using modified C2f layers delivers substantial speedups (e.g., 577% faster than a leading alternative) while maintaining accuracy.
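
The speed figures above can be put in context with some simple arithmetic. The phrase "555% acceleration" is ambiguous: it may mean the new system is 5.55 times as fast, or 555% faster (6.55 times as fast). The sketch below computes what each reading would imply for the baseline frame rate, and how a 4 ms tracking delay compares with the per-frame time budget at 29 FPS. The helper names are hypothetical, and the derived baseline figures are implications of each reading, not numbers reported in the paper.

```python
def implied_baseline_fps(fps, percent, as_increase):
    """Baseline FPS implied by a percentage speedup claim.

    as_increase=False reads "555%" as a ratio of 5.55x;
    as_increase=True reads it as an increase, i.e. a ratio of 6.55x.
    """
    ratio = 1.0 + percent / 100.0 if as_increase else percent / 100.0
    return fps / ratio

def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps

baseline_if_ratio = implied_baseline_fps(29.0, 555.0, as_increase=False)
baseline_if_increase = implied_baseline_fps(29.0, 555.0, as_increase=True)

budget = frame_budget_ms(29.0)        # per-frame time at 29 FPS
tracking_share = 4.0 / budget         # upper bound on C-BYTE's share of a frame
```

Under either reading the fastest local-mapping baseline would run at roughly 4 to 5 FPS, and a delay below 4 ms stays under about 12% of the roughly 34.5 ms available per frame.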

Figure 2: Granular model structure. Design of the convolution decoder, object tracking, trajectory prediction, and depth estimation modules; magnify for details. The BTAE mechanism is given in Algorithm 1. The remaining components are shown in Fig. 3.

This walkthrough was generated by AI and reviewed by a human editor.