[Paper Review] Mid-Level Visual Representations Improve Generalization and Sample Efficiency for Learning Visuomotor Policies
The paper shows that frozen, pretrained mid-level vision features improve the sample efficiency and generalization of visuomotor policies learned with RL, and it proposes a max-coverage feature selector that yields a compact feature set covering a range of downstream tasks.
How much does having visual priors about the world (e.g. the fact that the world is 3D) assist in learning to perform downstream motor tasks (e.g. delivering a package)? We study this question by integrating a generic perceptual skill set (e.g. a distance estimator, an edge detector, etc.) within a reinforcement learning framework--see Figure 1. This skill set (hereafter mid-level perception) provides the policy with a more processed state of the world compared to raw images. We find that using a mid-level perception confers significant advantages over training end-to-end from scratch (i.e. not leveraging priors) in navigation-oriented tasks. Agents are able to generalize to situations where the from-scratch approach fails and training becomes significantly more sample efficient. However, we show that realizing these gains requires careful selection of the mid-level perceptual skills. Therefore, we refine our findings into an efficient max-coverage feature set that can be adopted in lieu of raw images. We perform our study in completely separate buildings for training and testing and compare against visually blind baseline policies and state-of-the-art feature learning methods.
Motivation & Objective
- Assess whether mid-level visual features improve sample efficiency in RL-based visuomotor tasks.
- Evaluate generalization of feature-based policies to unseen environments.
- Determine whether a single fixed feature suffices for multiple tasks or whether a set of features is needed.
Proposed method
- Freeze and reuse pretrained mid-level vision encoders to transform raw observations before RL policy input.
- Train policies on the resulting feature representations using PPO with off-policy corrections.
- Evaluate 20 mid-level features across navigation, exploration, and planning tasks in Gibson environments with train/test splits in disjoint buildings.
- Quantify performance with relative reward against a blind baseline to account for task difficulty.
- Propose a max-coverage feature selector to choose a compact subset of features that minimizes worst-case transfer distance.
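The max-coverage idea in the last bullet can be illustrated with a small sketch. This is not the paper's implementation: it uses brute-force enumeration over subsets, which is only practical for small feature pools, and the function name `max_coverage_subset`, the feature names, and every number in `DIST` are hypothetical placeholders chosen for illustration. The sketch assumes a table `dist[s][t]` giving a transfer distance from source feature `s` to target feature `t`, and picks the size-`k` subset whose worst-covered target is as close as possible to some selected feature.

```python
from itertools import combinations


def max_coverage_subset(dist, k):
    """Pick k features minimizing the worst-case distance to any target.

    dist[s][t] is an assumed transfer distance from source feature s to
    target feature t (smaller means s is a better stand-in for t).
    Brute force: enumerates every size-k subset.
    """
    names = sorted(dist)
    best_subset, best_cost = None, float("inf")
    for subset in combinations(names, k):
        # Each target is covered by its nearest selected source;
        # the subset's cost is the distance of the worst-covered target.
        cost = max(min(dist[s][t] for s in subset) for t in names)
        if cost < best_cost:
            best_subset, best_cost = subset, cost
    return best_subset, best_cost


# Hypothetical placeholder distances, NOT values from the paper.
DIST = {
    "depth":   {"depth": 0.0, "normals": 0.2, "edges": 0.6, "objects": 0.9},
    "normals": {"depth": 0.3, "normals": 0.0, "edges": 0.5, "objects": 0.8},
    "edges":   {"depth": 0.7, "normals": 0.6, "edges": 0.0, "objects": 0.7},
    "objects": {"depth": 0.9, "normals": 0.9, "edges": 0.6, "objects": 0.0},
}

if __name__ == "__main__":
    for k in (1, 2, 4):
        print(k, max_coverage_subset(DIST, k))
```

With these placeholder distances, the worst-case cost shrinks as `k` grows, which is the trade-off a compact feature set has to balance: fewer features to compute versus weaker coverage of tasks that none of the selected features resembles.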
Experimental results
Research questions
- RQ1: Do mid-level vision features accelerate learning (sample efficiency) compared to learning from scratch?
- RQ2: Do mid-level features enhance generalization to unseen environments?
- RQ3: Is a single fixed feature enough for all downstream visuomotor tasks, or is a diverse feature set necessary?
- RQ4: Can a compact feature subset maintain performance while reducing data and computation?
Key findings
- Mid-level features yield faster learning across the tested tasks than policies trained from scratch.
- Several feature-based agents achieve higher final performance than policies trained from scratch in unseen test environments.
- Rank reversal indicates no universal feature; the best feature depends on the downstream task (semantic features for navigation, geometric features for exploration).
- A max-coverage feature selector can produce compact feature sets that approach or surpass the best task-specific features while using far less data.
- Policies using the feature set generalize across multiple buildings, and the results carry over to a second simulator (VizDoom), suggesting the approach is not tied to a single environment.
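The rank-reversal finding in the list above can be made concrete with a short check. The sketch below is an illustration only: the helper `has_rank_reversal` and the scores are hypothetical placeholders, not results from the paper. It tests whether any pair of features is ordered one way on one task and the opposite way on another, which is what rules out a single universally best feature.

```python
def has_rank_reversal(scores_a, scores_b):
    """Return True if some pair of features swaps order between two tasks.

    scores_a and scores_b map feature name -> performance on task A and
    task B (higher is better). Both dicts must share the same keys.
    """
    feats = list(scores_a)
    for i, f in enumerate(feats):
        for g in feats[i + 1:]:
            diff_a = scores_a[f] - scores_a[g]
            diff_b = scores_b[f] - scores_b[g]
            # Opposite signs mean f beats g on one task and loses on the other.
            if diff_a * diff_b < 0:
                return True
    return False


# Hypothetical placeholder scores, NOT numbers reported in the paper.
navigation = {"semantic": 1.0, "geometric": 0.6}
exploration = {"semantic": 0.5, "geometric": 0.9}

if __name__ == "__main__":
    print(has_rank_reversal(navigation, exploration))
```

When such a reversal exists, no single feature dominates on every task, which motivates selecting a set of features rather than one.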
This review was created by AI and reviewed by human editors.