QUICK REVIEW

[Paper Review] Efficient Unsupervised Environment Design through Hierarchical Policy Representation Learning

Dexun Li, Sidney Tio|arXiv (Cornell University)|Feb 10, 2026

Intelligent Tutoring Systems and Adaptive Learning | Cited by 0

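One-Sentence Summary

SHED introduces a hierarchical MDP teacher that designs environments based on a representation of the student's policy and trains efficiently under a limited interaction budget by using diffusion-generated synthetic data, outperforming baselines across several domains.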

ABSTRACT

Unsupervised Environment Design (UED) has emerged as a promising approach to developing general-purpose agents through automated curriculum generation. Popular UED methods focus on Open-Endedness, where teacher algorithms rely on stochastic processes for infinite generation of useful environments. This assumption becomes impractical in resource-constrained scenarios where teacher-student interaction opportunities are limited. To address this challenge, we introduce a hierarchical Markov Decision Process (MDP) framework for environment design. Our framework features a teacher agent that leverages student policy representations derived from discovered evaluation environments, enabling it to generate training environments based on the student's capabilities. To improve efficiency, we incorporate a generative model that augments the teacher's training dataset with synthetic data, reducing the need for teacher-student interactions. In experiments across several domains, we show that our method outperforms baseline approaches while requiring fewer teacher-student interactions in a single episode. The results suggest the applicability of our approach in settings where training opportunities are limited.

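Motivation and Objectives

Develop a framework for designing training environments for reinforcement learning agents under a limited interaction budget.
Represent the student's capabilities as a policy performance vector, obtained by evaluating the student in each evaluation environment.
Use a diffusion-based world model to generate synthetic teacher experience.
Amortize the initial teacher training cost across multiple student sessions to improve efficiency.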
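
The performance-vector idea above can be illustrated with a minimal Python sketch. Everything named here is hypothetical and chosen only for illustration: the toy environments, the `make_toy_env` helper, and the averaging over a fixed number of episodes are assumptions, not the paper's implementation.

```python
from typing import Callable, List, Sequence

# Hypothetical types: a policy maps an observation to an action, and an
# evaluation environment is reduced to a function that runs one episode
# and returns its total return.
Policy = Callable[[float], float]
EvalEnv = Callable[[Policy], float]


def make_toy_env(target: float) -> EvalEnv:
    """Toy stand-in for an evaluation environment (an assumption, not from the paper).

    The episode return is the negative distance between the policy's action
    and a fixed target, so a return of 0 is a perfect score.
    """
    def run_episode(policy: Policy) -> float:
        observation = target
        action = policy(observation)
        return -abs(action - target)

    return run_episode


def performance_vector(
    policy: Policy, eval_envs: Sequence[EvalEnv], episodes: int = 5
) -> List[float]:
    """Return one average episode return per evaluation environment."""
    if episodes < 1:
        raise ValueError("episodes must be at least 1")
    vector = []
    for env in eval_envs:
        returns = [env(policy) for _ in range(episodes)]
        vector.append(sum(returns) / len(returns))
    return vector
```

With m evaluation environments the result is a length-m list, which a teacher could consume as a fixed-size description of the student. Averaging over several episodes only matters when the environments are stochastic; the toy environments here are deterministic.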

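Proposed Method

Model environment design as a two-level hierarchical MDP with an upper-level teacher and a lower-level student.
Represent the student's capabilities by the performance vector p(π) over m evaluation environments, which guides environment generation.
Use a conditional diffusion model to generate synthetic transitions (s^u, a^u, s^u′) for offline teacher training.
Define a teacher reward that combines learning progress with fairness across the evaluation environments.
Discretize the evaluation environments to create a finite, stable representation of the student's policy (see the proof of Theorem 4.1).
Evaluate SHED against strong baselines (ACCEL, edited ACCEL, PAIRED, h-MDP) under a strict interaction budget.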
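
One way to picture a teacher reward that combines learning progress with fairness is sketched below. The summary does not give the paper's formula, so the specific choices here are assumptions: progress is taken as the mean improvement of the performance vector, unfairness as the population standard deviation of the current per-environment scores, and `fairness_weight` is a hypothetical trade-off parameter.

```python
import statistics
from typing import Sequence


def teacher_reward(
    prev_perf: Sequence[float],
    curr_perf: Sequence[float],
    fairness_weight: float = 0.5,
) -> float:
    """Illustrative teacher reward: learning progress minus an unfairness penalty.

    This is an assumed form for illustration, not the formula from the paper.
    """
    if len(prev_perf) != len(curr_perf) or len(curr_perf) == 0:
        raise ValueError("performance vectors must be non-empty and equal length")
    # Learning progress: average per-environment improvement.
    progress = statistics.fmean([c - p for c, p in zip(curr_perf, prev_perf)])
    # Unfairness: how unevenly the student performs across environments.
    unfairness = statistics.pstdev(curr_perf)
    return progress - fairness_weight * unfairness
```

Under this assumed form, two students with the same average improvement receive different rewards if one of them improved in only a single evaluation environment, which is the behavior a fairness term is meant to encourage.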

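Experimental Results

Research Questions

RQ1: Under a budget constraint, can the hierarchical MDP teacher effectively tailor environment generation to the student's evolving capabilities?
RQ2: Can diffusion-based synthetic data accelerate the teacher's offline training without sacrificing policy quality?
RQ3: How does discretizing the evaluation environments affect the stability and generalization of the student policy representation?
RQ4: Under resource constraints, does a teacher trained with SHED achieve better zero-shot transfer to unseen environments than existing UED methods?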
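
To make the discretization in RQ3 concrete, the sketch below turns a continuous environment-parameter space into a finite grid of evaluation environments. The summary does not say how the paper discretizes, so the evenly spaced grid, the function name, and the parameter names (`gravity`, `wind`) are all assumptions for illustration.

```python
from itertools import product
from typing import Dict, List, Tuple


def discretize_eval_envs(
    param_ranges: Dict[str, Tuple[float, float]], bins_per_param: int
) -> List[Dict[str, float]]:
    """Build a finite grid of environment configurations (illustrative only).

    Each parameter range (low, high) is sampled at `bins_per_param` evenly
    spaced points, and the grid is the Cartesian product of those points.
    """
    if bins_per_param < 1:
        raise ValueError("bins_per_param must be at least 1")
    names = list(param_ranges)
    axes = []
    for name in names:
        low, high = param_ranges[name]
        if bins_per_param == 1:
            points = [(low + high) / 2]  # single point: use the midpoint
        else:
            step = (high - low) / (bins_per_param - 1)
            points = [low + i * step for i in range(bins_per_param)]
        axes.append(points)
    return [dict(zip(names, combo)) for combo in product(*axes)]
```

For example, two parameters with three points each give nine evaluation environments, so the student's performance vector has a fixed length of nine regardless of how many training environments the teacher generates.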

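Key Findings

SHED outperforms the baselines on Lunar Lander, Bipedal Walker, and Maze under limited interactions.
SHED achieves higher zero-shot transfer performance on unseen test environments.
Diffusion-generated synthetic trajectories reduce the need for real student data and speed up teacher training.
Compared with baselines such as PAIRED, the method shows low variance across repeated runs (stable IQM, narrow error bands).
Ablation experiments show that diffusion-generated data improves performance beyond h-MDP without diffusion.
The initial teacher training cost is amortized over the subsequent student training stages.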
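
The IQM mentioned in the findings is the interquartile mean: the average of the middle half of the scores after sorting. The sketch below uses a simple convention (drop the lowest and highest quarter, rounding the cut down); how the paper itself computes IQM is not stated in this summary, so treat this as an assumed, common definition.

```python
from typing import Sequence


def interquartile_mean(scores: Sequence[float]) -> float:
    """Mean of the scores that remain after trimming 25% from each end.

    The number of values trimmed per side is floor(n / 4), which is an
    assumed convention for sample sizes not divisible by four.
    """
    if len(scores) == 0:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    cut = len(ordered) // 4
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)
```

Because the extremes are discarded, a single outlier run barely moves the IQM, which is why it is often reported alongside error bands when comparing reinforcement learning methods over a small number of seeds.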

This review was generated by AI and reviewed by a human editor.