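[Paper Review] AdaptDiffuser: Diffusion Models as Adaptive Self-evolving Planners
AdaptDiffuser self-evolves a diffusion planner through reward-guided diffusion paired with a discriminator-based data selection loop, improving performance on seen tasks and achieving zero-shot generalization to unseen tasks without additional expert data.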
Diffusion models have demonstrated their powerful generative capability in many tasks, with great potential to serve as a paradigm for offline reinforcement learning. However, the quality of the diffusion model is limited by the insufficient diversity of training data, which hinders the performance of planning and the generalizability to new tasks. This paper introduces AdaptDiffuser, an evolutionary planning method with diffusion that can self-evolve to improve the diffusion model and hence become a better planner, one that not only performs well on seen tasks but also adapts to unseen tasks. AdaptDiffuser enables the generation of rich synthetic expert data for goal-conditioned tasks using guidance from reward gradients. It then selects high-quality data via a discriminator to finetune the diffusion model, which improves the generalization ability to unseen tasks. Empirical experiments on two benchmark environments and two carefully designed unseen tasks in KUKA industrial robot arm and Maze2D environments demonstrate the effectiveness of AdaptDiffuser. For example, AdaptDiffuser not only outperforms the previous art Diffuser by 20.8% on Maze2D and 7.5% on MuJoCo locomotion, but also adapts better to new tasks, e.g., KUKA pick-and-place, by 27.9% without requiring additional expert data. More visualization results and demo videos can be found on our project page.
Motivation and Objectives
- Address the limited diversity of offline RL training data, which constrains diffusion-based planners.
- Propose a self-evolving diffusion framework to generate and filter synthetic demonstrations guided by reward gradients.
- Enable zero-shot adaptation to unseen tasks without additional expert data through data-driven fine-tuning.
- Demonstrate improved performance on Maze2D, MuJoCo locomotion, and KUKA/Maze2D unseen tasks.
Proposed Method
- Model planning as a conditional diffusion process with guidance from reward-to-go or task constraints (Eq. 7–8).
- Generate synthetic demonstrations via reward-guided diffusion and refine data quality with a discriminator-based selection loop (data pool).
- Ensure dynamic feasibility by recovering executable actions using an inverse dynamics model and filtering based on state predictability (Eq. 9).
- Iteratively fine-tune the diffusion model on the selected high-quality synthetic data, updating the learned denoising parameters μθ and Σ so that each round starts from a stronger planner (Eq. 10).
- Handle continuous and sparse rewards by defining appropriate reward-guided objectives, including task constraints and auxiliary rewards (Eq. 11).
- Evaluate on Maze2D, MuJoCo D4RL benchmarks, and KUKA pick-and-place/unseen tasks to demonstrate improved performance and zero-shot adaptation.
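The generate-then-filter part of the method can be sketched roughly as follows. This is a minimal toy illustration, not the paper's implementation: it assumes a Diffuser-style guided mean shift of the form μ + αΣ∇J(μ) with an isotropic Σ = σ²I, and every name here (`toy_mean`, `reward_grad`, `is_feasible`, the linear noise schedule, the step-size threshold) is hypothetical. The feasibility check is only a stand-in for the discriminator and inverse-dynamics test described above.

```python
import numpy as np

def toy_mean(traj):
    """Hypothetical stand-in for the learned denoiser mean: smooth and shrink the path."""
    smoothed = traj.copy()
    smoothed[1:-1] = (traj[:-2] + traj[1:-1] + traj[2:]) / 3.0
    return 0.9 * smoothed

def reward_grad(traj, goal):
    """Gradient of the toy reward J(traj) = -||traj[-1] - goal||^2 w.r.t. traj."""
    grad = np.zeros_like(traj)
    grad[-1] = -2.0 * (traj[-1] - goal)
    return grad

def guided_denoise_step(traj, mean_fn, sigma, goal, scale, rng):
    """One reverse step: shift the model mean along the reward gradient, then add noise."""
    mu = mean_fn(traj)
    mu_guided = mu + scale * sigma**2 * reward_grad(mu, goal)  # assumed mu + alpha*Sigma*grad
    return mu_guided + sigma * rng.standard_normal(traj.shape)

def is_feasible(traj, max_step):
    """Stand-in filter: reject paths whose consecutive states jump farther than max_step."""
    steps = np.linalg.norm(np.diff(traj, axis=0), axis=1)
    return bool(np.all(steps <= max_step))

def generate_and_filter(mean_fn, goal, n_samples, horizon, n_steps, max_step, seed=0):
    """Sample reward-guided trajectories and keep only those passing the filter."""
    rng = np.random.default_rng(seed)
    pool = []
    for _ in range(n_samples):
        traj = rng.standard_normal((horizon, 2))  # start from pure noise
        for i in range(n_steps, 0, -1):
            sigma = 0.1 * i / n_steps  # hypothetical linear noise schedule
            traj = guided_denoise_step(traj, mean_fn, sigma, goal, 1.0, rng)
        if is_feasible(traj, max_step):
            pool.append(traj)
    return pool  # in the full method, this pool would be used to fine-tune the model

if __name__ == "__main__":
    pool = generate_and_filter(toy_mean, np.array([2.0, 2.0]), n_samples=20,
                               horizon=8, n_steps=10, max_step=1.0)
    print(f"kept {len(pool)} of 20 sampled trajectories")
```

In a full self-evolution loop, the returned pool would feed a fine-tuning step and the whole procedure would repeat with the updated model; that training step is omitted here because it depends on the actual network.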
Experimental Results
Research Questions
- RQ1: Can reward-guided diffusion generate diverse synthetic demonstrations for offline RL tasks?
- RQ2: Does a discriminator-based data selection loop improve the diffusion model's planning quality and robustness to unseen tasks?
- RQ3: Can self-evolved diffusion planners generalize to unseen objectives without extra expert data?
- RQ4: How does AdaptDiffuser perform compared to Diffuser and other offline RL baselines on standard benchmarks and novel tasks?
Key Findings
| Environment | MPPI | CQL | IQL | Diffuser | AdaptDiffuser |
|---|---|---|---|---|---|
| U-Maze | 33.2 | 5.7 | 47.4 | 113.9 | 135.1 ± 5.8 |
| Medium | 10.2 | 5.0 | 34.9 | 121.5 | 129.9 ± 4.6 |
| Large | 5.1 | 12.5 | 58.6 | 123.0 | 167.9 ± 5.0 |
| Average | 16.2 | 7.7 | 47.0 | 119.5 | 144.3 |
- AdaptDiffuser outperforms Diffuser by about 20.8% on Maze2D (average score 144.3 vs. 119.5) and by 7.5% on MuJoCo locomotion.
- In MuJoCo experiments, AdaptDiffuser achieves higher average returns across multiple datasets than Diffuser and several baselines, notably in Hopper-Medium and Walker2d-Medium.
- AdaptDiffuser adapts zero-shot to unseen tasks such as KUKA pick-and-place, where the abstract reports a 27.9% improvement over Diffuser (an average gain of roughly 5-6 points in the reported setups).
- Visualization shows AdaptDiffuser producing feasible, smoother paths in hard Maze2D cases where Diffuser fails or produces collisions.
- Across Maze2D and MuJoCo benchmarks, AdaptDiffuser consistently outperforms the baseline Diffuser, indicating improved self-bootstrapping and generalization.
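As a quick sanity check, the 20.8% Maze2D figure can be reproduced from the Diffuser and AdaptDiffuser columns of the table above. The helper names below are illustrative only. The per-environment gains are computed here from the table and are not figures quoted from the paper.

```python
# Maze2D mean scores copied from the table above.
diffuser = {"U-Maze": 113.9, "Medium": 121.5, "Large": 123.0}
adaptdiffuser = {"U-Maze": 135.1, "Medium": 129.9, "Large": 167.9}

def mean(values):
    """Arithmetic mean of an iterable of numbers."""
    values = list(values)
    return sum(values) / len(values)

def relative_gain(new, old):
    """Relative improvement of `new` over `old`, in percent."""
    return 100.0 * (new - old) / old

avg_gain = relative_gain(mean(adaptdiffuser.values()), mean(diffuser.values()))
per_env_gain = {env: relative_gain(adaptdiffuser[env], diffuser[env]) for env in diffuser}

print(round(avg_gain, 1))  # 20.8
for env, gain in per_env_gain.items():
    print(f"{env}: {gain:.1f}%")
```

The headline gain comes from comparing the two averages, so it is dominated by the Large maze, where the gap between the two planners is widest.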
This review was generated by AI and reviewed by human editors.