QUICK REVIEW

[Paper Review] DeepDriving: Learning Affordance for Direct Perception in Autonomous Driving

Chenyi Chen, Ari Seff | arXiv (Cornell University) | May 1, 2015

Autonomous Vehicle Technology and Safety | 15 references | 154 citations

TL;DR

This paper proposes DeepDriving, a direct perception framework that uses a deep Convolutional Neural Network to estimate key driving affordances, such as distance to nearby vehicles and lane position, directly from raw images, bypassing both full scene parsing and end-to-end action regression. On KITTI car distance estimation, the method reports a mean absolute error of 5.832 m in the y-direction (forward distance), which the authors present as evidence that the approach generalizes to real-world driving scenes.

ABSTRACT

Today, there are two major paradigms for vision-based autonomous driving systems: mediated perception approaches that parse an entire scene to make a driving decision, and behavior reflex approaches that directly map an input image to a driving action by a regressor. In this paper, we propose a third paradigm: a direct perception approach to estimate the affordance for driving. We propose to map an input image to a small number of key perception indicators that directly relate to the affordance of a road/traffic state for driving. Our representation provides a set of compact yet complete descriptions of the scene to enable a simple controller to drive autonomously. Falling in between the two extremes of mediated perception and behavior reflex, we argue that our direct perception representation provides the right level of abstraction. To demonstrate this, we train a deep Convolutional Neural Network using recording from 12 hours of human driving in a video game and show that our model can work well to drive a car in a very diverse set of virtual environments. We also train a model for car distance estimation on the KITTI dataset. Results show that our direct perception approach can generalize well to real driving images. Source code and data are available on our project website.

Motivation & Objective

To address the limitations of mediated perception (excessive scene parsing) and behavior reflex (direct image-to-action mapping) in autonomous driving.
To propose a middle-ground paradigm—direct perception—that estimates key driving affordances without full scene understanding.
To develop a compact, task-specific representation that enables simple control while maintaining robustness and generalization.
To train a deep CNN on recordings of human driving in a video game so that it learns a direct mapping from images to driving-relevant indicators.
To evaluate performance on both synthetic (TORCS) and real-world (KITTI) driving datasets, showing generalization to real images.
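
To illustrate why a compact affordance representation can make control simple, here is a minimal sketch of a rule-based controller that consumes a handful of indicators. Everything in it (the `Affordances` fields, the gains, the lane width, the sign conventions) is a hypothetical illustration chosen for this example, not the controller described in the paper.

```python
from dataclasses import dataclass


@dataclass
class Affordances:
    """Hypothetical indicator set; field names are assumptions for this sketch."""
    heading_angle: float   # radians, car heading relative to the road direction
    lateral_offset: float  # meters from the lane center (sign convention is arbitrary)
    lead_distance: float   # meters to the nearest vehicle ahead


def steer_command(a: Affordances, lane_width: float = 4.0, gain: float = 1.0) -> float:
    """Proportional steering toward the lane center, clamped to [-1, 1]."""
    raw = gain * (a.heading_angle - a.lateral_offset / lane_width)
    return max(-1.0, min(1.0, raw))


def target_speed(a: Affordances, cruise: float = 20.0, safe_gap: float = 30.0) -> float:
    """Scale the cruise speed down linearly as the gap to the lead car shrinks."""
    fraction = max(0.0, min(1.0, a.lead_distance / safe_gap))
    return cruise * fraction


if __name__ == "__main__":
    state = Affordances(heading_angle=0.0, lateral_offset=2.0, lead_distance=15.0)
    print(steer_command(state), target_speed(state))
```

The point of the sketch is the interface: once perception is reduced to a few scalar indicators, the control logic fits in a few lines and can be inspected directly.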

Proposed method

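Train a deep Convolutional Neural Network (CNN) on 12 hours of recorded human driving in a racing game (TORCS) to regress key driving affordances such as lane position and distance to nearby vehicles; in the KITTI experiment, the targets are the x and y offsets to the nearest vehicle, scored together with the Euclidean distance.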
Use a fully connected layer to extract a 4,096-dimensional intermediate representation that encodes scene features relevant to driving decisions.
Visualize neuron activation patterns and response maps to interpret what features the network learns—such as lane markings, vehicle positions, and host car heading.
Compare performance against a DPM-based mediated perception baseline using projection for distance estimation, with and without false positive penalties.
Apply the same network architecture to the KITTI dataset for real-world distance estimation, using ground truth from calibrated sensors.
Use mean absolute error (MAE) to evaluate performance, with false positives penalized in some metrics to ensure fairness.
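
The evaluation metric can be sketched as follows. This assumes that predictions and ground truth are (x, y) offsets in meters and that d is the point-to-point Euclidean distance between the predicted and true positions. The review does not spell out either detail, and the false-positive penalty is not modeled here. The function name `distance_mae` is hypothetical.

```python
import math


def distance_mae(preds, truths):
    """Return MAE for x, y, and Euclidean distance d.

    preds, truths: equal-length lists of (x, y) tuples in meters.
    d is taken as the distance between predicted and true points (an assumption).
    """
    if len(preds) != len(truths) or not preds:
        raise ValueError("preds and truths must be non-empty and the same length")

    n = len(preds)
    err_x = sum(abs(p[0] - t[0]) for p, t in zip(preds, truths)) / n
    err_y = sum(abs(p[1] - t[1]) for p, t in zip(preds, truths)) / n
    err_d = sum(math.hypot(p[0] - t[0], p[1] - t[1]) for p, t in zip(preds, truths)) / n
    return {"x": err_x, "y": err_y, "d": err_d}


if __name__ == "__main__":
    preds = [(3.0, 4.0), (0.0, 20.0)]
    truths = [(0.0, 0.0), (0.0, 25.0)]
    print(distance_mae(preds, truths))
```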

Experimental results

Research questions

RQ1: Can a deep CNN learn to estimate key driving affordances (e.g., distance to nearest vehicle) directly from raw images without full scene parsing?
RQ2: Does the proposed direct perception approach generalize to real-world driving data, such as the KITTI dataset?
RQ3: How does the performance of direct perception compare to mediated perception baselines that use object detection and geometric projection?
RQ4: To what extent do the learned features in the CNN correspond to meaningful driving-relevant structures like lane markings and nearby vehicles?
RQ5: Can the model handle challenging scenarios such as partially visible vehicles or uneven terrain, where traditional projection-based methods fail?

Key findings

The proposed direct perception model achieves a mean absolute error (MAE) of 5.832 meters in predicting the y-coordinate (forward distance) to the nearest vehicle on the KITTI dataset.
The model’s MAE for x-coordinate (lateral distance) is 1.565 meters, and for Euclidean distance (d), it is 6.299 meters, showing strong performance on real-world data.
When false positives are not penalized, the model’s error drops significantly (e.g., 4.669m for d), indicating more accurate estimations on true positives than the DPM-based baseline.
Visualization of neuron activations reveals strong correlations with lane markings, vehicle positions, and host car heading, confirming the network learns task-specific features.
The response maps from the 4th convolutional layer show strong activation over nearby vehicles and lane markings, indicating the network learns to attend to relevant regions for affordance estimation.
The direct perception approach carries over to real-world images: a model trained for car distance estimation on KITTI is reported to outperform a DPM-based projection method, especially when false positives are excluded.
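
As a quick arithmetic sanity check on the reported errors, the sketch below tests whether the three MAE values are mutually consistent. It assumes d is the point-to-point Euclidean distance between predicted and true positions, which the review does not state explicitly. Under that assumption, each sample satisfies max(|ex|, |ey|) <= sqrt(ex^2 + ey^2) <= |ex| + |ey|, so after averaging the MAE for d must lie between max(MAE_x, MAE_y) and MAE_x + MAE_y. The function name is hypothetical.

```python
def consistent_with_point_distance(mae_x: float, mae_y: float, mae_d: float) -> bool:
    """Check max(mae_x, mae_y) <= mae_d <= mae_x + mae_y.

    This bound only holds if d is the distance between predicted and
    true points (an assumption of this sketch).
    """
    lower = max(mae_x, mae_y)
    upper = mae_x + mae_y
    return lower <= mae_d <= upper


if __name__ == "__main__":
    # Values reported in the key findings: x = 1.565 m, y = 5.832 m, d = 6.299 m.
    print(consistent_with_point_distance(1.565, 5.832, 6.299))  # prints True
```

The reported figures pass this check (5.832 <= 6.299 <= 7.397), which is compatible with that reading of d but does not confirm it.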

This review was created by AI and reviewed by human editors.