[Paper Review] TADPO: Reinforcement Learning Goes Off-road
TADPO extends PPO with teacher action distillation to learn from demonstrations while exploring, enabling end-to-end vision-based off-road control and zero-shot sim-to-real transfer on a full-scale vehicle.
Off-road autonomous driving poses significant challenges such as navigating unmapped, variable terrain with uncertain and diverse dynamics. Addressing these challenges requires effective long-horizon planning and adaptable control. Reinforcement Learning (RL) offers a promising solution by learning control policies directly from interaction. However, because off-road driving is a long-horizon task with low-signal rewards, standard RL methods are challenging to apply in this setting. We introduce TADPO, a novel policy gradient formulation that extends Proximal Policy Optimization (PPO), leveraging off-policy trajectories for teacher guidance and on-policy trajectories for student exploration. Building on this, we develop a vision-based, end-to-end RL system for high-speed off-road driving, capable of navigating extreme slopes and obstacle-rich terrain. We demonstrate our performance in simulation and, importantly, zero-shot sim-to-real transfer on a full-scale off-road vehicle. To our knowledge, this work represents the first deployment of RL-based policies on a full-scale off-road platform.
Motivation & Objective
- Address the challenge of long-horizon, low-signal reinforcement learning for off-road autonomous driving.
- Develop a teacher-guided RL framework that combines demonstrations with on-policy learning.
- Enable end-to-end vision-based control capable of navigating diverse, unmapped terrains and obstacles.
Proposed method
- Introduce TADPO, a policy gradient extension of PPO that learns from fixed demonstrations and on-policy rollouts concurrently.
- Define the TADPO loss (L_TADPO), which distills teacher actions through a clipped probability ratio (rho) and a positive-advantage condition, so that updates occur only when the teacher outperforms the student and stop once the student already imitates the teacher.
- Allow teacher and student to operate on potentially different observation spaces to accommodate privileged demonstrations.
- Train with an actor-critic setup where the gradient updates during TADPO affect only the student’s actor and feature encoder while keeping the critic fixed.
- Adopt a hierarchical off-road autonomy pipeline with a global planner providing sparse waypoints and an RL controller trained with TADPO to track them, enabling end-to-end control from high-level goals to vehicle commands.
- Use a frozen vision backbone (DINOv2 ViT-S/14) and a NatureCNN-based encoder to map proprioceptive and visual observations to throttle and steering commands.
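The distillation term described above can be illustrated with a minimal sketch. This is not the paper's exact objective: the function name `tadpo_style_loss`, its arguments, the choice of reference log-probabilities, and the default `eps=0.2` are all assumptions made for illustration. It shows only the two mechanisms the summary names, a gate on positive teacher advantage and an upper cap on the ratio rho.

```python
import numpy as np

def tadpo_style_loss(logp_new, logp_ref, teacher_advantage, eps=0.2):
    """Illustrative teacher-distillation loss (assumed form, not the paper's).

    logp_new:          log-prob of the teacher's action under the current student policy
    logp_ref:          log-prob of the same action under a reference (e.g. pre-update) policy
    teacher_advantage: estimated advantage of the teacher's action for the student
    eps:               hypothetical clipping width for the ratio
    """
    logp_new = np.asarray(logp_new, dtype=float)
    logp_ref = np.asarray(logp_ref, dtype=float)
    adv = np.asarray(teacher_advantage, dtype=float)

    # Probability ratio rho for the teacher's action.
    rho = np.exp(logp_new - logp_ref)

    # Positive-advantage gate: samples where the teacher is not better contribute nothing.
    gate = np.maximum(adv, 0.0)

    # Cap rho from above: once the student already favors the teacher's action,
    # the capped term is constant and would carry no gradient.
    capped = np.minimum(rho, 1.0 + eps)

    # Negative sign because optimizers minimize; the objective itself is maximized.
    return float(-np.mean(gate * capped))
```

In an autodiff framework the same expression would be built from differentiable tensors, and, following the summary, its gradient would be applied only to the student's actor and feature encoder while the critic stays fixed.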
Experimental results
Research questions
- RQ1: Can a teacher-guided PPO extension (TADPO) effectively handle long-horizon planning in off-road autonomy?
- RQ2: Does concurrent use of demonstrations and on-policy data improve exploration and final policy performance in obstacle-rich, unmapped terrains?
- RQ3: To what extent can simulation-trained TADPO policies transfer zero-shot to real, full-scale off-road vehicles?
- RQ4: How does TADPO compare with standard RL and imitation-learning baselines in simulation and real-world tests?
Key findings
- TADPO outperforms RL and IL baselines in simulation across extreme-slope, obstacle-rich, and hybrid terrains.
- In real-world deployment on a Sabercat, a policy trained with TADPO achieves high obstacle avoidance and low cross-track error without real-world finetuning.
- The approach enables zero-shot sim-to-real transfer on a full-scale off-road vehicle, which the authors describe as, to their knowledge, the first deployment of RL-based policies on such a platform.
- Ablation studies show that using a balanced teacher probability (p ≈ 0.5) and the designed clipping on rho leads to robust learning.
- The hierarchical pipeline with sparse global planning and dense MPPI-driven teacher demonstrations facilitates long-horizon, high-speed navigation in complex terrains.
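Cross-track error, one of the metrics reported above, is commonly computed as the distance from the vehicle to the nearest point on the current path segment. The sketch below uses that standard definition; the review does not state how the paper computes the metric, so the function `cross_track_error` and its 2D point convention are assumptions.

```python
import math

def cross_track_error(position, seg_start, seg_end):
    """Distance from `position` to the segment seg_start -> seg_end (2D points)."""
    px, py = position
    ax, ay = seg_start
    bx, by = seg_end
    dx, dy = bx - ax, by - ay
    seg_len_sq = dx * dx + dy * dy

    # Degenerate segment: both waypoints coincide.
    if seg_len_sq == 0.0:
        return math.hypot(px - ax, py - ay)

    # Project the position onto the segment and clamp to its endpoints.
    t = ((px - ax) * dx + (py - ay) * dy) / seg_len_sq
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * dx, ay + t * dy
    return math.hypot(px - cx, py - cy)
```

With sparse waypoints, the segment would be the pair of waypoints bracketing the vehicle's current position along the planned route.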
This review was created by AI and reviewed by human editors.