QUICK REVIEW

[Paper Review] Human Motion Diffusion Model

Guy Tevet, Sigal Raab|arXiv (Cornell University)|Sep 29, 2022

Human Pose and Action Recognition|Cited by 166

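ONE-SENTENCE SUMMARY

MDM is a transformer-based diffusion model trained with classifier-free guidance that directly predicts the clean motion sample x0 at each step, which makes geometric losses usable and keeps training lightweight; it achieves state-of-the-art results on text-to-motion and action-to-motion benchmarks and supports motion editing and in-betweening.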

ABSTRACT

Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models, which have already shown remarkable generative capabilities in other domains, are promising candidates for human motion due to their many-to-many nature, but they tend to be resource hungry and hard to control. In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for the human motion domain. MDM is transformer-based, combining insights from motion generation literature. A notable design-choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning, and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion. https://guytevet.github.io/mdm-page/ .

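MOTIVATION AND GOALS

Provide a lightweight yet expressive diffusion approach for human motion generation.
Use geometric losses (position, foot contact, velocity) to improve motion realism.
Support multiple conditioning modes (text-to-motion, action-to-motion, unconditioned) through classifier-free guidance.
Demonstrate motion editing and in-betweening via diffusion-based inpainting.
Demonstrate practical training efficiency (about 3 days on a mid-range GPU) together with competitive benchmark results.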

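PROPOSED METHOD

Uses a transformer-encoder backbone that operates on motion sequences, where each frame is represented as joints × D features.
Predicts the clean motion x0 at every denoising step t instead of the noise, following the simple objective L_simple = E[||x0 − G(xt, t, c)||^2].
Adds geometric losses: L_pos matches predicted joint positions, L_foot reduces foot sliding, and L_vel matches velocities.
Trains with classifier-free guidance: the condition c is randomly dropped (for about 10% of samples), so the model learns both the conditional p(x0|c) and the unconditional p(x0), which enables sampling with a guidance scale s.
Conditions on CLIP-based text embeddings for text-to-motion, or on learned action embeddings for action-to-motion; unconditioned generation is supported with an empty condition (c = ∅).
Edits motion via diffusion inpainting: parts of the motion are held fixed during sampling while the missing segments are generated or selected body parts are re-synthesized.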
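The training-side ideas in the list above (sample prediction, geometric losses, condition dropout, guided prediction) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the motion is assumed to already be joint positions of shape (N frames, J joints, 3), so no forward kinematics is applied and L_pos would coincide with L_simple; the denoiser `G`, the `contact` mask layout, and the weights `lam_vel` and `lam_foot` are hypothetical names chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

def simple_loss(x0, x0_pred):
    """L_simple: mean squared error between the clean and the predicted motion."""
    return float(np.mean((x0 - x0_pred) ** 2))

def velocity_loss(x0, x0_pred):
    """L_vel: match frame-to-frame differences along the time axis (axis 0)."""
    v_true = x0[1:] - x0[:-1]
    v_pred = x0_pred[1:] - x0_pred[:-1]
    return float(np.mean((v_true - v_pred) ** 2))

def foot_contact_loss(x0_pred, contact):
    """L_foot: penalize predicted velocity of joints flagged as touching the ground.

    contact is an assumed binary mask of shape (N-1, J), nonzero only for
    foot joints on frames where they are in contact.
    """
    v_pred = x0_pred[1:] - x0_pred[:-1]
    return float(np.mean((v_pred * contact[..., None]) ** 2))

def training_loss(x0, x0_pred, contact, lam_vel=1.0, lam_foot=1.0):
    """Weighted sum of the terms; the weights here are placeholders."""
    return (simple_loss(x0, x0_pred)
            + lam_vel * velocity_loss(x0, x0_pred)
            + lam_foot * foot_contact_loss(x0_pred, contact))

def drop_condition(c, p_uncond=0.1):
    """Replace the condition by None (the empty condition) with probability p_uncond."""
    return None if rng.random() < p_uncond else c

def guided_prediction(G, x_t, t, c, s):
    """Classifier-free guidance: move from the unconditional toward the conditional prediction."""
    uncond = G(x_t, t, None)
    cond = G(x_t, t, c)
    return uncond + s * (cond - uncond)
```

With s = 1 the guided prediction reduces to the plain conditional output; larger s pushes the sample further toward the condition, typically trading diversity for fidelity.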

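EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a lightweight, transformer-based diffusion model accurately capture the many-to-many nature of text-to-motion and action-to-motion tasks?
RQ2: Do geometric losses tailored to motion (position, foot contact, velocity) improve the quality and realism of diffusion-generated motion?
RQ3: Does classifier-free guidance balance fidelity and diversity across multiple conditioning modalities?
RQ4: Can diffusion-based editing and in-betweening be achieved through motion inpainting in joint space, without retraining?
RQ5: What are the practical training and inference requirements for reaching state-of-the-art results on standard benchmarks?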

Key Findings

Method	R Precision (top 3) ↑	FID ↓	Multimodal Dist ↓	Diversity →	Multimodality ↑
Real	0.779 ±0.006	0.031 ±0.004	2.788 ±0.012	11.08 ±0.097	-
JL2P	0.483 ±0.005	6.545 ±0.072	5.147 ±0.030	9.073 ±0.100	-
Text2Gesture	0.338 ±0.005	12.12 ±0.183	6.964 ±0.029	9.334 ±0.079	-
T2M	0.693 ±0.007	2.770 ±0.109	3.401 ±0.008	10.91 ±0.119	1.482 ±0.065
MDM (ours)	0.396 ±0.004	0.497 ±0.021	9.191 ±0.022	10.847 ±0.109	1.907 ±0.214
MDM (decoder)	0.396 ±0.004	0.767 ±0.085	5.507 ±0.020	9.176 ±0.070	2.927 ±0.125
+ input token	0.621 ±0.005	0.567 ±0.051	5.424 ±0.022	9.425 ±0.060	2.834 ±0.095
MDM (GRU)	0.645 ±0.005	4.569 ±0.150	5.325 ±0.026	7.688 ±0.082	1.264 ±0.024

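MDM achieves state-of-the-art results on the text-to-motion benchmarks HumanML3D and KIT.
In a user study, evaluators often preferred MDM over comparable methods, and in one test preferred its output over ground-truth motion 42.3% of the time.
On the action-to-motion benchmarks HumanAct12 and UESTC, MDM outperforms prior methods on FID, diversity, and multimodality (the foot contact loss improved the results).
Training the diffusion model with the transformer backbone takes about 3 days on a single RTX 2080 Ti, using about 1000 denoising steps and a cosine noise schedule.
Diffusion inpainting enables editing and in-betweening in both the temporal and spatial domains, supporting motion completion and body-part editing without retraining.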
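The cosine schedule and the inpainting-based editing mentioned above can be illustrated with a short sampling loop. This is a hedged sketch rather than the paper's code: the schedule follows the widely used cosine form with offset s = 0.008 (an assumption, since the summary only says "cosine schedule"), `G` is a hypothetical denoiser that returns a clean-motion estimate, and the re-noising step draws directly from q(x_{t-1} | x0), which is a simplification of the posterior step used in full diffusion implementations.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative signal level alpha_bar[t] for t = 0..T under a cosine schedule."""
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]  # normalized so that alpha_bar[0] == 1

def sample_with_inpainting(G, known, mask, T=1000, c=None, seed=0):
    """Generate a motion while keeping the masked entries of `known` fixed.

    known: reference motion, e.g. shape (N frames, J joints, D features)
    mask:  same shape, 1 where the motion is held fixed, 0 where it is generated
    """
    rng = np.random.default_rng(seed)
    alpha_bar = cosine_alpha_bar(T)
    x_t = rng.standard_normal(known.shape)  # start from pure noise
    for t in range(T, 0, -1):
        x0_hat = G(x_t, t, c)                       # predict the clean sample
        x0_hat = mask * known + (1 - mask) * x0_hat  # overwrite the fixed parts
        a = alpha_bar[t - 1]
        noise = rng.standard_normal(known.shape)
        # Noise the estimate back to level t-1; at t == 1 this returns x0_hat itself.
        x_t = np.sqrt(a) * x0_hat + np.sqrt(1 - a) * noise
    return x_t
```

For temporal in-betweening the mask would cover a prefix and a suffix of frames; for body-part editing it would cover the joints that should stay untouched across all frames.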


This review was generated by AI and reviewed by a human editor.