QUICK REVIEW

[Paper Review] Improving Generalization in Reinforcement Learning with Mixture Regularization

Kaixin Wang, Bingyi Kang | arXiv (Cornell University) | Oct 21, 2020
Reinforcement Learning in Robotics | 15 references | 46 citations
TL;DR

Mixreg trains RL agents on mixtures of observations from different environments with interpolated supervision signals, improving generalization on Procgen across policy-based and value-based methods.

ABSTRACT

Deep reinforcement learning (RL) agents trained in a limited set of environments tend to suffer overfitting and fail to generalize to unseen testing environments. To improve their generalizability, data augmentation approaches (e.g. cutout and random convolution) are previously explored to increase the data diversity. However, we find these approaches only locally perturb the observations regardless of the training environments, showing limited effectiveness on enhancing the data diversity and the generalization performance. In this work, we introduce a simple approach, named mixreg, which trains agents on a mixture of observations from different training environments and imposes linearity constraints on the observation interpolations and the supervision (e.g. associated reward) interpolations. Mixreg increases the data diversity more effectively and helps learn smoother policies. We verify its effectiveness on improving generalization by conducting extensive experiments on the large-scale Procgen benchmark. Results show mixreg outperforms the well-established baselines on unseen testing environments by a large margin. Mixreg is simple, effective and general. It can be applied to both policy-based and value-based RL algorithms. Code is available at https://github.com/kaixin96/mixreg .

Motivation & Objective

  • Increase training data diversity to reduce the generalization gap in RL.
  • Introduce a simple yet effective regularization for RL via mixed observations and supervision.
  • Demonstrate applicability of mixreg to both policy-based and value-based RL algorithms.
  • Show that mixreg yields larger generalization gains than standard data augmentation methods on Procgen.

Proposed method

  • Generate augmented observations by convexly combining two observations s_i and s_j from the training batch: s̃ = λ s_i + (1−λ) s_j, with λ ∼ Beta(α, α).
  • Pair each mixed observation with supervision interpolated by the same λ: ỹ = λ y_i + (1−λ) y_j, where y is the associated supervision signal (e.g., rewards or state values).
  • Apply mixreg to policy-based methods by replacing the standard policy objective with interpolated terms (e.g., L̃^PG includes mixed states and advantages).
  • Apply mixreg to value-based methods (e.g., Rainbow) by replacing target and loss terms with interpolated observations and rewards (e.g., L̃^DQN).
  • Show that mixing supervision signals is crucial for performance gains, beyond mixing observations alone.
  • Demonstrate applicability to PPO (policy-based) and Rainbow (value-based) on Procgen benchmarks.
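
The mixing step described in the bullets above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name mixreg_batch, the use of a random permutation to pick partners, the flat-list observation format, and the default alpha = 0.2 are all hypothetical choices made for the example. How the action label is handled for a mixed sample in the policy-gradient loss is not specified in this review, so the sketch leaves it out.

```python
import random
from typing import List, Optional, Sequence, Tuple


def mixreg_batch(
    observations: Sequence[Sequence[float]],
    targets: Sequence[float],
    alpha: float = 0.2,  # assumed illustrative value; not stated in this review
    rng: Optional[random.Random] = None,
) -> Tuple[List[List[float]], List[float], List[float], List[int]]:
    """Mix every sample in a batch with a randomly chosen partner.

    Returns the mixed observations, the mixed supervision targets,
    the sampled lambda for each sample, and the partner indices.
    """
    if len(observations) != len(targets):
        raise ValueError("observations and targets must have the same length")
    if rng is None:
        rng = random.Random()

    # Assumption: partners are chosen by shuffling the batch indices.
    partners = list(range(len(observations)))
    rng.shuffle(partners)

    mixed_obs: List[List[float]] = []
    mixed_targets: List[float] = []
    lams: List[float] = []
    for i, j in enumerate(partners):
        s_i, s_j = observations[i], observations[j]
        if len(s_i) != len(s_j):
            raise ValueError("all observations must have the same size")
        lam = rng.betavariate(alpha, alpha)  # lambda ~ Beta(alpha, alpha)
        # Convex combination of the two observations, element by element.
        mixed_obs.append([lam * a + (1.0 - lam) * b for a, b in zip(s_i, s_j)])
        # The supervision signal is interpolated with the same lambda.
        mixed_targets.append(lam * targets[i] + (1.0 - lam) * targets[j])
        lams.append(lam)
    return mixed_obs, mixed_targets, lams, partners


if __name__ == "__main__":
    obs = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]
    rewards = [1.0, 0.0, 2.0]
    mixed, mixed_rewards, lams, partners = mixreg_batch(
        obs, rewards, rng=random.Random(0)
    )
    for row, r, lam, j in zip(mixed, mixed_rewards, lams, partners):
        print(f"partner={j} lambda={lam:.3f} obs={row} reward={r:.3f}")
```

In practice the observations would be image tensors and the mixing would be vectorized in the training framework. The flat lists here only keep the example dependency-free. With a small alpha, Beta(α, α) puts most of its mass near 0 and 1, so most mixed samples stay close to one of the two originals.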

Experimental results

Research questions

  • RQ1: Does mixreg improve zero-shot generalization performance on unseen testing environments?
  • RQ2: Is mixreg effective across different RL algorithm families and model sizes?
  • RQ3: What mechanisms drive the generalization gains from mixreg (e.g., smoother policies, better representations)?

Key findings

  • PPO with mixreg outperforms the PPO baseline by a large margin on unseen Procgen test levels in the 500-training-level generalization setting.
  • Mixreg provides more consistent gains than standard data augmentations and regularizations (e.g., cutout-color, random crop, batch norm, L2).
  • Mixreg improves generalization across varying model sizes and also benefits Rainbow (DQN variant) without requiring task-specific tuning.
  • Mixreg achieves further improvements when combined with other regularizers (e.g., L2).
  • Mixreg’s benefit comes from both learning smoother policies and enabling better representation learning, as shown by ablations and finetuning analyses.

This review was created by AI and reviewed by human editors.