QUICK REVIEW

[Paper Review] TADPO: Reinforcement Learning Goes Off-road

Zhouchonghao Wu, Raymond Song|arXiv (Cornell University)|Mar 6, 2026

Reinforcement Learning in Robotics|0 citations

TL;DR

TADPO extends PPO with teacher action distillation to learn from demonstrations while exploring, enabling end-to-end vision-based off-road control and zero-shot sim-to-real transfer on a full-scale vehicle.

ABSTRACT

Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.

Motivation & Objective

Address the challenge of long-horizon, low-signal reinforcement learning for off-road autonomous driving.
Develop a teacher-guided RL framework that combines demonstrations with on-policy learning.
Enable end-to-end vision-based control capable of navigating diverse, unmapped terrains and obstacles.

Proposed method

Introduce TADPO, a policy gradient extension of PPO that learns from fixed demonstrations and on-policy rollouts concurrently.
Define the L_TADPO losses, which distill teacher actions through a constrained ratio (rho) and a positive-advantage condition, so that updates apply only when the teacher outperforms the student and not when the student already imitates the teacher.
Allow teacher and student to operate on potentially different observation spaces to accommodate privileged demonstrations.
Train with an actor-critic setup in which the gradient updates during TADPO affect only the student’s actor and feature encoder, while the critic is kept fixed.
Adopt a hierarchical off-road autonomy pipeline with a global planner providing sparse waypoints and an RL controller trained with TADPO to track them, enabling end-to-end control from high-level goals to vehicle commands.
Use a frozen vision backbone (DINOv2 ViT-S/14) and a NatureCNN-based encoder, taking proprioceptive and visual observations as input, to output throttle and steering commands.
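
To make the gating idea in the L_TADPO description concrete, here is a minimal NumPy sketch. It is not the paper's exact loss, which this review does not reproduce. It assumes that rho is the ratio of the student's current to pre-update probability of the teacher's action, that the constraint is a PPO-style upper clip at 1 + eps, and that the positive-advantage condition zeroes out non-positive advantages. The function name, arguments, and eps value are all hypothetical.

```python
import numpy as np

def teacher_distillation_term(logp_new, logp_old, teacher_adv, eps=0.2):
    """Illustrative teacher-action surrogate (an assumption, not the paper's L_TADPO).

    logp_new:    student's current log-prob of each teacher action
    logp_old:    student's log-prob of the same actions before the update
    teacher_adv: advantage estimate for each teacher action
    Returns the loss to minimize and a mask of samples that would still
    contribute a gradient.
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_old = np.asarray(logp_old, dtype=float)
    teacher_adv = np.asarray(teacher_adv, dtype=float)

    # Assumed definition of rho: probability ratio on the teacher's action.
    rho = np.exp(logp_new - logp_old)

    # Positive-advantage condition: ignore samples where the teacher's
    # action is not estimated to be better than the student's behaviour.
    gated_adv = np.maximum(teacher_adv, 0.0)

    # Upper clip: once the student has already moved toward the teacher
    # (rho >= 1 + eps), the term becomes constant and stops pushing.
    surrogate = np.minimum(rho, 1.0 + eps) * gated_adv

    active = (teacher_adv > 0.0) & (rho < 1.0 + eps)
    return -surrogate.mean(), active

if __name__ == "__main__":
    loss, active = teacher_distillation_term(
        logp_new=np.log([0.5, 0.5, 0.75]),
        logp_old=np.log([0.5, 0.5, 0.5]),
        teacher_adv=[1.0, -1.0, 1.0],
    )
    print(loss, active)
```

In this sketch only the first sample stays active: the second is dropped by the advantage gate and the third by the clip. In an autodiff framework the same expression would yield zero gradient for the inactive samples.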

Experimental results

Research questions

RQ1: Can a teacher-guided PPO extension (TADPO) effectively handle long-horizon planning in off-road autonomy?
RQ2: Does concurrent use of demonstrations and on-policy data improve exploration and final policy performance in obstacle-rich, unmapped terrains?
RQ3: To what extent can simulation-trained TADPO policies transfer zero-shot to real, full-scale off-road vehicles?
RQ4: How does TADPO compare with standard RL and imitation-learning baselines in simulation and real-world tests?

Key findings

TADPO outperforms RL and IL baselines in simulation across extreme slopes, obstacle-rich, and hybrid terrains.
In real-world deployment on a Sabercat, a policy trained with TADPO achieves high obstacle avoidance and low cross-track error without real-world finetuning.
The approach enables zero-shot sim-to-real transfer on a full-scale off-road vehicle, representing a first deployment of end-to-end RL-based policies on such a platform.
Ablation studies show that using a balanced teacher probability (p ≈ 0.5) and the designed clipping on rho leads to robust learning.
The hierarchical pipeline with sparse global planning and dense MPPI-driven teacher demonstrations facilitates long-horizon, high-speed navigation in complex terrains.
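
The ablation finding above refers to a teacher probability p without defining it. One plausible reading, which this review does not confirm, is that each update step draws its batch from teacher demonstrations with probability p and from on-policy student rollouts otherwise. The sketch below illustrates that reading only; the function name and the "teacher"/"student" labels are hypothetical.

```python
import random

def schedule_updates(num_updates, teacher_prob=0.5, seed=0):
    """Pick the data source for each update step (illustrative assumption).

    With probability teacher_prob an update uses stored teacher
    demonstrations; otherwise it uses fresh on-policy student rollouts.
    """
    if not 0.0 <= teacher_prob <= 1.0:
        raise ValueError("teacher_prob must lie in [0, 1]")
    rng = random.Random(seed)  # local generator so the schedule is reproducible
    return [
        "teacher" if rng.random() < teacher_prob else "student"
        for _ in range(num_updates)
    ]

if __name__ == "__main__":
    plan = schedule_updates(10, teacher_prob=0.5)
    print(plan.count("teacher"), "teacher updates,", plan.count("student"), "student updates")
```

Under this reading, p = 0 would reduce to on-policy updates alone and p = 1 to teacher-driven updates alone, so a value near 0.5 interleaves the two roughly evenly.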

This review was created by AI and reviewed by human editors.