QUICK REVIEW

[Paper Review] Is Conditional Generative Modeling all you need for Decision-Making?

Anurag Ajay, Yilun Du|arXiv (Cornell University)|Nov 28, 2022

Language and cultural evolution|Cited by 31

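One-Sentence Summary

This paper reframes offline decision-making as conditional diffusion modeling, showing that a return-conditional diffusion model (Decision Diffuser) yields competitive or even better policies without TD learning and flexibly handles constraints and skill composition.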

ABSTRACT

Recent improvements in conditional generative modeling have made it possible to generate high-quality images from language descriptions alone. We investigate whether these methods can directly address the problem of sequential decision-making. We view decision-making not through the lens of reinforcement learning (RL), but rather through conditional generative modeling. To our surprise, we find that our formulation leads to policies that can outperform existing offline RL approaches across standard benchmarks. By modeling a policy as a return-conditional diffusion model, we illustrate how we may circumvent the need for dynamic programming and subsequently eliminate many of the complexities that come with traditional offline RL. We further demonstrate the advantages of modeling policies as conditional diffusion models by considering two other conditioning variables: constraints and skills. Conditioning on a single constraint or skill during training leads to behaviors at test-time that can satisfy several constraints together or demonstrate a composition of skills. Our results illustrate that conditional generative modeling is a powerful tool for decision-making.

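Motivation and Objectives

Motivate the use of conditional generative models, rather than conventional RL, for sequential decision-making.
Show that a return-conditional diffusion model can stitch suboptimal offline trajectories into high-return plans without value function estimation.
Demonstrate that conditioning on constraints and skills produces composite behaviors at test time.
Propose classifier-free guidance with low-temperature sampling to maximize trajectory return using offline data.
Provide evidence on standard benchmarks that conditional generative modeling can outperform several offline RL baselines.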

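Proposed Method

Model trajectories as a diffusion process over states only, and recover actions with inverse dynamics.
Train a reverse diffusion model p_theta that denoises noisy state sequences conditioned on y(tau) (a return, constraint, or skill).
Use classifier-free guidance with low-temperature sampling to bias generation toward high-return or constraint-satisfying trajectories without an explicit Q-function.
Condition on returns, constraints, or skills to generate behaviors that, at test time, maximize return, satisfy several constraints at once, or compose skills.
Pair the diffusion model with an inverse dynamics model f_phi(s_t, s_{t+1}) that maps generated state transitions to executable actions.
Jointly train the diffusion model and the inverse dynamics model with a maximum-likelihood-style objective and a denoising loss in which the condition is occasionally dropped out.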
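
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names, the list-based vectors, and the 1-D `toy_inverse_dynamics` are all hypothetical stand-ins. It assumes the standard classifier-free guidance rule (unconditional noise plus a weighted difference toward the conditional noise) and assumes low-temperature sampling scales the sampling variance by a factor `alpha` in [0, 1).

```python
import math
import random


def guided_noise(eps_cond, eps_uncond, w):
    """Classifier-free guidance: eps_uncond + w * (eps_cond - eps_uncond)."""
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]


def low_temperature_noise(dim, sigma, alpha, rng):
    """Gaussian noise whose variance is scaled by alpha (assumed in [0, 1))."""
    std = math.sqrt(alpha) * sigma
    return [rng.gauss(0.0, std) for _ in range(dim)]


def maybe_drop_condition(y, p_drop, rng):
    """Condition dropout for training: return None (null condition) with probability p_drop."""
    return None if rng.random() < p_drop else y


def toy_inverse_dynamics(s_t, s_next):
    """Hypothetical stand-in for f_phi on a 1-D system where s_next = s_t + a."""
    return s_next - s_t


def first_action(planned_states, inverse_dynamics):
    """Turn the first generated state transition into an executable action."""
    return inverse_dynamics(planned_states[0], planned_states[1])


rng = random.Random(0)
eps = guided_noise(eps_cond=[1.0, 2.0], eps_uncond=[0.0, 1.0], w=2.0)
action = first_action([0.5, 1.5, 2.0], toy_inverse_dynamics)
print(eps, action)
```

With `w = 0` the guided noise reduces to the unconditional prediction and with `w = 1` to the conditional one; larger weights push samples further toward the condition. Setting `alpha` below 1 narrows the sampling distribution, which is the sense in which the sampling is "low-temperature".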

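Experimental Results

Research Questions

RQ1: Can a return-conditional diffusion model match or exceed offline RL performance without dynamic programming or Q-function estimation?
RQ2: Does conditioning on additional factors such as constraints and skills at test time enable flexible composition of behaviors?
RQ3: Does classifier-free guidance with low-temperature sampling effectively bias generation toward the high-return trajectories in the offline data?
RQ4: How do diffusion-based policies compare with TD-based offline RL methods on standard benchmarks?
RQ5: Can the method handle multi-constraint and multi-skill scenarios and compose them at inference time?

Key Findings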

Dataset	Environment	BC	CQL	IQL	DT	TT	MOReL	Diffuser	DD
Med-Expert	HalfCheetah	55.2	91.6	86.7	86.8	95	53.3	79.8	90.6 ± 1.3
Med-Expert	Hopper	52.5	105.4	91.5	107.6	110.0	108.7	107.2	111.8 ± 1.8
Med-Expert	Walker2d	107.5	108.8	109.6	108.1	101.9	95.6	108.4	108.8 ± 1.7
Medium	HalfCheetah	42.6	44.0	47.4	42.6	46.9	42.1	44.2	49.1 ± 1.0
Medium	Hopper	52.9	58.5	66.3	67.6	61.1	-	58.5	79.3 ± 3.6
Medium	Walker2d	75.3	72.5	78.3	74.0	79	77.8	79.7	82.5 ± 1.4
Med-Replay	HalfCheetah	36.6	45.5	44.2	36.6	41.9	40.2	42.2	39.3 ± 4.1
Med-Replay	Hopper	18.1	95	94.7	82.7	91.5	93.6	96.8	100 ± 0.7
Med-Replay	Walker2d	26.0	77.2	73.9	66.6	82.6	49.8	61.2	75 ± 4.3
Average	(Across tasks)	51.9	77.6	77	74.7	78.9	72.9	75.3	81.8
Mixed	Kitchen	44.8	51.2	48.7	-	-	-	-	61 ± 2.8
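
As a quick sanity check on the table, the Average row can be reproduced from the nine locomotion rows. The short script below does this for three columns; the numbers are copied from the table, while the dictionary layout and the helper name `column_average` are only illustrative.

```python
# Scores copied from the nine locomotion rows of the table
# (Med-Expert, Medium, Med-Replay x HalfCheetah, Hopper, Walker2d).
scores = {
    "CQL": [91.6, 105.4, 108.8, 44.0, 58.5, 72.5, 45.5, 95.0, 77.2],
    "Diffuser": [79.8, 107.2, 108.4, 44.2, 58.5, 79.7, 42.2, 96.8, 61.2],
    "DD": [90.6, 111.8, 108.8, 49.1, 79.3, 82.5, 39.3, 100.0, 75.0],
}


def column_average(values):
    """Mean of one method's scores, rounded to one decimal as in the table."""
    return round(sum(values) / len(values), 1)


averages = {name: column_average(vals) for name, vals in scores.items()}
print(averages)  # {'CQL': 77.6, 'Diffuser': 75.3, 'DD': 81.8}
```

The computed values match the Average row, which also shows that the Kitchen row is not part of that average.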

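Decision Diffuser matches or outperforms several offline RL baselines (TD methods) on the D4RL locomotion tasks and the Kitchen task.
Compared with a baseline diffusion model, classifier-free guidance with low-temperature sampling improves trajectory quality and return maximization.
In the evaluated environments, extracting actions with inverse dynamics performs better than diffusing over actions.
On Kuka Block Stacking, the method satisfies both single and multiple constraints effectively, outperforming BCQ and CQL, which fail on some tasks.
Skill composition experiments on Unitree-go-running show that, when conditioned on multiple skills, trajectories switch between gaits; a classifier-based analysis confirms the gait switches in the generated sequences.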
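
One way to picture how several constraints or skills might be combined at test time is to sum the guidance terms of the individual conditions. The sketch below assumes that summation rule; this review does not state the exact formula, so treat it as an illustration of the idea rather than the paper's procedure. The function name and list-based vectors are hypothetical.

```python
def compose_guided_noise(eps_uncond, eps_conds, w):
    """Assumed composition rule: eps_uncond + w * sum_i (eps_cond_i - eps_uncond).

    eps_uncond: noise predicted with the null condition.
    eps_conds:  one noise prediction per condition (constraint or skill).
    w:          guidance weight shared by all conditions.
    """
    combined = list(eps_uncond)
    for eps_c in eps_conds:
        for i, (c, u) in enumerate(zip(eps_c, eps_uncond)):
            combined[i] += w * (c - u)
    return combined


# Two conditions, each pulling along a different coordinate.
both = compose_guided_noise([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], w=1.0)
print(both)  # [1.0, 1.0]
```

With a single condition this reduces to ordinary classifier-free guidance, and with no conditions it returns the unconditional prediction unchanged.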


This review was generated by AI and reviewed by a human editor.