QUICK REVIEW

[Paper Review] Deep Steering: Learning End-to-End Driving Model from Spatial and Temporal Visual Cues

Lu Chi, Yadong Mu|arXiv (Cornell University)|Aug 12, 2017

Autonomous Vehicle Technology and Safety | References: 2 | Cited by: 88

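One-Sentence Summary

This paper proposes an end-to-end, vision-based steering model that fuses spatial and temporal cues using spatio-temporal convolution and Conv-LSTM, is trained on real human driving data, and provides visualizations for interpretability.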

ABSTRACT

In recent years, autonomous driving algorithms using low-cost vehicle-mounted cameras have attracted increasing endeavors from both academia and industry. There are multiple fronts to these endeavors, including object detection on roads, 3-D reconstruction etc., but in this work we focus on a vision-based model that directly maps raw input images to steering angles using deep networks. This represents a nascent research topic in computer vision. The technical contributions of this work are three-fold. First, the model is learned and evaluated on real human driving videos that are time-synchronized with other vehicle sensors. This differs from many prior models trained from synthetic data in racing games. Second, state-of-the-art models, such as PilotNet, mostly predict the wheel angles independently on each video frame, which contradicts common understanding of driving as a stateful process. Instead, our proposed model strikes a combination of spatial and temporal cues, jointly investigating instantaneous monocular camera observations and vehicle's historical states. This is in practice accomplished by inserting carefully-designed recurrent units (e.g., LSTM and Conv-LSTM) at proper network layers. Third, to facilitate the interpretability of the learned model, we utilize a visual back-propagation scheme for discovering and visualizing image regions crucially influencing the final steering prediction. Our experimental study is based on about 6 hours of human driving data provided by Udacity. Comprehensive quantitative evaluations demonstrate the effectiveness and robustness of our model, even under scenarios like drastic lighting changes and abrupt turning. The comparison with other state-of-the-art models clearly reveals its superior performance in predicting the due wheel angle for a self-driving car.

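Research Motivation and Objectives

Learn a vision-based steering model for autonomous driving from real human driving logs rather than synthetic data.
Incorporate temporal dependencies into steering prediction by placing recurrent units at multiple layers of the network.
Develop a feature-extraction sub-network that captures spatio-temporal information using spatio-temporal convolution and multi-scale residual aggregation.
Integrate the steering-prediction sub-network with temporal fusion to produce smooth and accurate wheel-angle predictions.
Provide interpretability through visual back-propagation, which identifies the image regions that influence steering decisions.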

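Proposed Method

Use a feature-extraction sub-network with spatio-temporal convolution (ST-Conv) and multi-scale residual aggregation to produce a 128-dimensional feature.
Introduce ConvLSTM to model temporal dynamics across frames while preserving spatial structure.
Apply a steering-prediction sub-network with three recurrences, including an LSTM, that aggregates previous speed, torque, and wheel angle together with the extracted features.
Train with a multi-task objective that combines steering, speed, and torque losses, with the steering term weighted more heavily (γ=10).
Standardize the wheel angles and apply mirroring data augmentation to improve generalization.
Train and evaluate on real Udacity driving data with synchronized GPS, speed, torque, and wheel-angle annotations.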
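
The data preparation and training objective listed above can be sketched in plain Python. This is a minimal illustration, not the authors' code. Only the steering weight γ=10, the standardization of wheel angles, and the mirroring augmentation come from the summary. The function names, the use of mean squared error for each term, the unit weights on the speed and torque terms, and the sign flip of the angle under mirroring are all assumptions.

```python
from statistics import mean, pstdev

GAMMA = 10.0  # steering weight reported in the summary


def standardize(angles):
    """Rescale wheel angles to zero mean and unit variance.

    Returns the rescaled angles plus the mean and standard deviation,
    so that predictions can be mapped back to the original scale.
    """
    mu = mean(angles)
    sigma = pstdev(angles)
    if sigma == 0:
        raise ValueError("angles must not all be identical")
    return [(a - mu) / sigma for a in angles], mu, sigma


def mirror(image, angle):
    """Flip an image (a list of pixel rows) left to right.

    Assumption: the mirrored frame is paired with the negated wheel
    angle, since a left turn looks like a right turn in a mirror.
    """
    return [row[::-1] for row in image], -angle


def mse(pred, target):
    """Mean squared error between two equal-length sequences."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def multitask_loss(pred, target, gamma=GAMMA):
    """Weighted sum of steering, speed, and torque errors.

    pred and target are dicts with 'steer', 'speed', and 'torque' lists.
    Assumption: speed and torque terms carry unit weight.
    """
    return (gamma * mse(pred["steer"], target["steer"])
            + mse(pred["speed"], target["speed"])
            + mse(pred["torque"], target["torque"]))
```

With γ=10, a steering error contributes ten times as much to the objective as an equally sized speed or torque error, which keeps the auxiliary tasks from dominating training.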

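Experimental Results

Research Questions

RQ1: Can a vision-based model learn accurate continuous steering angles from real, time-synchronized driving data?
RQ2: Does introducing temporal information at multiple network layers improve steering accuracy over per-frame methods?
RQ3: How do ST-Conv and ConvLSTM help capture spatio-temporal cues for steering?
RQ4: Do data augmentation (mirroring) and key-frame reduction affect model performance and generalization?
RQ5: Which visualization techniques reveal the image regions that influence steering decisions?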

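Key Findings

Among the architectures tested on the Udacity driving dataset, the proposed Deep Steering model achieves the lowest RMSE (0.0637).
Introducing temporal information through ST-Conv, ConvLSTM, and multi-layer recurrence yields smoother and more accurate steering than frame-only models such as PilotNet or VGG-16.
Mirroring data augmentation lowers RMSE at every stage, confirming a generalization gain.
A multi-task objective that includes steering, speed, and torque losses improves steering performance, with the steering term weighted more heavily (γ=10).
Visual back-propagation provides interpretable localization of the image regions that influence steering decisions.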
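
The RMSE quoted above is the root mean squared error between predicted and recorded wheel angles, so lower values are better. A minimal sketch of the metric follows, assuming both sequences are equal-length Python lists. The function name and the sample values in the usage note are illustrative and are not data from the paper.

```python
import math


def rmse(predicted, recorded):
    """Root mean squared error between predicted and recorded angles."""
    if not predicted or len(predicted) != len(recorded):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, recorded)]
    return math.sqrt(sum(squared) / len(squared))
```

For example, `rmse([0.0, 0.0], [1.0, 1.0])` evaluates to `1.0`, because every prediction is off by exactly one unit.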

This review was generated by AI and reviewed by a human editor.