QUICK REVIEW

[Paper Review] Teacher algorithms for curriculum learning of Deep RL in continuously parameterized environments

Rémy Portelas, Cédric Colas|arXiv (Cornell University)|2019. 10. 16.
Evolutionary Algorithms and Applications|Citations: 44
One-line summary

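The paper introduces the Continuous Teacher-Student (CTS) framework and ALP-GMM, in which a Gaussian Mixture Model-based teacher samples a curriculum for a DRL agent over a continuously parameterized distribution of environments so as to maximize learning progress.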

ABSTRACT

We consider the problem of how a teacher algorithm can enable an unknown Deep Reinforcement Learning (DRL) student to become good at a skill over a wide range of diverse environments. To do so, we study how a teacher algorithm can learn to generate a learning curriculum, whereby it sequentially samples parameters controlling a stochastic procedural generation of environments. Because it does not initially know the capacities of its student, a key challenge for the teacher is to discover which environments are easy, difficult or unlearnable, and in what order to propose them to maximize the efficiency of learning over the learnable ones. To achieve this, this problem is transformed into a surrogate continuous bandit problem where the teacher samples environments in order to maximize absolute learning progress of its student. We present a new algorithm modeling absolute learning progress with Gaussian mixture models (ALP-GMM). We also adapt existing algorithms and provide a complete study in the context of DRL. Using parameterized variants of the BipedalWalker environment, we study their efficiency to personalize a learning curriculum for different learners (embodiments), their robustness to the ratio of learnable/unlearnable environments, and their scalability to non-linear and high-dimensional parameter spaces. Videos and code are available at https://github.com/flowersteam/teachDeepRL.

Motivation and Objectives

  • Formalize a Continuous Teacher-Student (CTS) framework for ill-defined continuous parameter spaces encoding distributions of tasks.
  • Propose and evaluate ALP-GMM and RIAC-style teachers that maximize absolute learning progress to guide DRL students.
  • Demonstrate scalability to high-dimensional, non-linear, and partially unlearnable parameter spaces via parameterized BipedalWalker environments.
  • Assess robustness to unlearnable regions and irrelevant task dimensions while preserving learning efficiency.

Proposed Method

  • Formal CTS framework where a teacher samples a parameter p mapping to a task distribution T(p) and selects m tasks for the student.
  • Define objective to maximize the student’s final competence across the parameter space using an interaction history H.
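  • Introduce ALP-GMM: periodically fit GMMs on recent parameter–ALP pairs, treat each Gaussian as a bandit arm whose utility is its ALP, choose arms with EXP4 to sample high-ALP regions, and keep a share of random exploration.
  • Estimate the ALP of each sampled parameter as the absolute difference between its reward and the reward of the nearest previously sampled parameter, and use it to guide sampling.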
  • Compare ALP-GMM, Covar-GMM, RIAC against Random and Oracle baselines across two parameterized BipedalWalker environments (Stump Tracks and Hexagon Tracks).
  • Evaluate performance via a binary mastery metric (a test track counts as mastered when r_p > 230) over a fixed test set, with Soft Actor-Critic as the DRL student.
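
The ALP estimate and the arm choice described in the list above can be sketched in a few lines. This is only an illustration, not the authors' implementation: the function names, the 0.2 exploration probability, and the toy history are assumptions, the GMM fitting step is left out, and drawing a Gaussian proportionally to its mean ALP stands in for the EXP4 update mentioned above.

```python
import math
import random


def absolute_learning_progress(history, param, reward):
    """Return |reward - reward of the nearest previously sampled parameter|.

    `history` is a list of (param, reward) pairs; an empty history yields 0.0.
    """
    if not history:
        return 0.0
    _, nearest_reward = min(history, key=lambda pr: math.dist(pr[0], param))
    return abs(reward - nearest_reward)


def choose_gaussian(mean_alps, rng, explore_prob=0.2):
    """Pick a Gaussian index, or None to signal uniform random exploration.

    Simplification (assumption): arms are drawn proportionally to their mean
    ALP instead of running a full EXP4 update.
    """
    if rng.random() < explore_prob or sum(mean_alps) <= 0.0:
        return None
    return rng.choices(range(len(mean_alps)), weights=mean_alps, k=1)[0]


# Toy history of (parameter, episodic reward) pairs; values are made up.
history = [((0.0, 0.0), 10.0), ((1.0, 1.0), 50.0)]

# The nearest previous parameter to (0.9, 1.2) is (1.0, 1.0), so ALP = |80 - 50|.
print(absolute_learning_progress(history, (0.9, 1.2), 80.0))  # 30.0
```

Taking the absolute value means a region where rewards are dropping also attracts sampling, which is consistent with the key-result claim further down that LP-based teachers avoid forgetting.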

Experimental Results

Research Questions

  • RQ1: Can LP-based teacher strategies scaffold learning for DRL agents in continuously parameterized environments?
  • RQ2: How do ALP-GMM and related teachers perform relative to Random and Oracle curricula under varying proportions of unlearnable tasks and high-dimensional spaces?
  • RQ3: Are these methods robust to ill-defined parameter spaces with irrelevant dimensions and non-linear difficulty gradients?
  • RQ4: How scalable are the approaches to high-dimensional task spaces like Hexagon Tracks?

Key Results

  • ALP-GMM outperforms Covar-GMM and RIAC in final mean performance on default and harder morphologies, and can surpass Oracle in some settings.
  • LP-based teachers significantly outperform Random, and ALP-GMM remains robust as the proportion of infeasible tasks increases.
  • In high-dimensional Hexagon Tracks, ALP-GMM achieves higher mastered-track percentages than Oracle (80% vs 68%), with Covar-GMM and RIAC performing worse.
  • ALP-GMM shows better stability and lower variance than alternatives in complex spaces with irrelevant dimensions.
  • The approach remains effective across ill-defined parameter spaces, including non-linear difficulty landscapes and unlearnable subspaces.
  • Oracle performs well in some early phases, but LP-based teachers avoid forgetting and sustain progress over time.
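
The mastered-track percentages quoted above follow from the binary mastery metric (r_p > 230) applied to a fixed test set. A minimal sketch of that computation is given below; the function name and the example rewards are hypothetical and are not taken from the paper.

```python
MASTERY_THRESHOLD = 230  # reward threshold quoted in the review (r_p > 230)


def mastered_percentage(test_rewards, threshold=MASTERY_THRESHOLD):
    """Percentage of test tracks whose episodic reward exceeds the threshold."""
    if not test_rewards:
        raise ValueError("test_rewards must not be empty")
    mastered = sum(1 for r in test_rewards if r > threshold)
    return 100.0 * mastered / len(test_rewards)


# Made-up rewards for illustration only; these are not results from the paper.
example_rewards = [250.0, 231.5, 230.0, 120.0, -40.0]
print(mastered_percentage(example_rewards))  # 40.0 (a reward of exactly 230 is not mastered)
```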

This review was generated by AI and reviewed by a human editor.