QUICK REVIEW

[Paper Review] Human Motion Diffusion Model

Guy Tevet, Sigal Raab|arXiv (Cornell University)|Sep 29, 2022

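Human Pose and Action Recognition | Citations: 166

One-line summary

MDM is a transformer-based, classifier-free diffusion model that predicts the motion sample x0 directly. This enables lightweight training with geometric losses, achieves state-of-the-art results on text-to-motion and action-to-motion benchmarks, and supports editing and in-betweening.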

ABSTRACT

Natural and expressive human motion generation is the holy grail of computer animation. It is a challenging task, due to the diversity of possible motion, human perceptual sensitivity to it, and the difficulty of accurately describing it. Therefore, current generative solutions are either low-quality or limited in expressiveness. Diffusion models, which have already shown remarkable generative capabilities in other domains, are promising candidates for human motion due to their many-to-many nature, but they tend to be resource hungry and hard to control. In this paper, we introduce Motion Diffusion Model (MDM), a carefully adapted classifier-free diffusion-based generative model for the human motion domain. MDM is transformer-based, combining insights from motion generation literature. A notable design-choice is the prediction of the sample, rather than the noise, in each diffusion step. This facilitates the use of established geometric losses on the locations and velocities of the motion, such as the foot contact loss. As we demonstrate, MDM is a generic approach, enabling different modes of conditioning, and different generation tasks. We show that our model is trained with lightweight resources and yet achieves state-of-the-art results on leading benchmarks for text-to-motion and action-to-motion. https://guytevet.github.io/mdm-page/ .

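Motivation and objectives

Motivate a lightweight yet expressive diffusion approach to human motion generation.
Leverage geometric losses (position, foot contact, velocity) to improve motion realism.
Enable multiple conditioning modes (text-to-motion, action-to-motion, unconditioned) through classifier-free guidance.
Demonstrate editing and in-betweening through diffusion-based inpainting of motion data.
Show practical training efficiency (about 3 days on a mid-range GPU) alongside competitive benchmark results.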

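Proposed method

Adopts a transformer encoder backbone that processes motion sequences (joints x D per frame).
Predicts the clean motion x0 at each denoising step t instead of the noise, following the simple loss L_simple = E[||x0 − G(xt, t, c)||^2].
Incorporates geometric losses: L_pos aligns predicted positions, L_foot suppresses foot sliding, and L_vel matches velocities.
Trains with classifier-free guidance: the conditioning c is randomly dropped (for about 10% of samples) so that the model learns both the conditional p(x0|c) and the unconditional p(x0), which allows sampling with a guidance scale s.
Conditions on CLIP-based text embeddings for text-to-motion and learns action embeddings for action-to-motion; unconditioned generation (c = empty) is also supported.
Uses diffusion inpainting for editing: part of the motion is held fixed while the missing segments are generated, or selected body parts are resynthesized during sampling.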
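
The training objective and the guidance rule described above can be sketched in a few lines of NumPy. This is only an illustrative sketch, not the authors' implementation: `toy_denoiser` is a hypothetical stand-in for the network G, the loss weight of 1.0 is an assumption, and only L_simple and L_vel are shown (L_pos and L_foot need a skeleton model that this review does not describe).

```python
import numpy as np

rng = np.random.default_rng(0)

def toy_denoiser(x_t, t, c):
    """Hypothetical stand-in for G(x_t, t, c); c=None means 'no condition'."""
    shift = 0.0 if c is None else c
    return 0.9 * x_t + shift

def simple_loss(x0, x0_pred):
    # L_simple: squared error between the clean motion and the predicted x0.
    return float(np.mean((x0 - x0_pred) ** 2))

def velocity_loss(x0, x0_pred):
    # L_vel: compare frame-to-frame differences along the time axis (axis 0).
    v_true = x0[1:] - x0[:-1]
    v_pred = x0_pred[1:] - x0_pred[:-1]
    return float(np.mean((v_true - v_pred) ** 2))

def maybe_drop_condition(c, p_drop=0.1):
    # Classifier-free training: drop the condition for roughly 10% of samples.
    return None if rng.random() < p_drop else c

def guided_prediction(x_t, t, c, s):
    # Classifier-free guidance: move from the unconditioned prediction
    # toward the conditioned one, scaled by the guidance scale s.
    uncond = toy_denoiser(x_t, t, None)
    cond = toy_denoiser(x_t, t, c)
    return uncond + s * (cond - uncond)

# Toy motion: 8 frames, 4 joints, 3 coordinates (shapes are illustrative).
x0 = rng.normal(size=(8, 4, 3))
x_t = x0 + 0.1 * rng.normal(size=x0.shape)
c = 0.05  # placeholder scalar "condition" for the toy denoiser

pred = toy_denoiser(x_t, t=10, c=maybe_drop_condition(c))
total = simple_loss(x0, pred) + 1.0 * velocity_loss(x0, pred)  # weight is assumed
```

With s = 1 the guided prediction reduces to the conditioned output, and with s = 0 to the unconditioned one; larger s pushes the sample further toward the condition, typically trading diversity for fidelity.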

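Experimental results

Research questions

RQ1: Can a lightweight, transformer-based diffusion model accurately capture the many-to-many nature of text-to-motion and action-to-motion?
RQ2: Do motion-specific geometric losses (position, foot contact, velocity) improve the quality and realism of diffusion-based motion?
RQ3: Is classifier-free guidance effective for balancing fidelity and diversity in motion generation across multiple conditioning modalities?
RQ4: Can diffusion-based editing and in-betweening be achieved without retraining by using motion inpainting in joint space?
RQ5: What are the practical training and inference requirements for achieving state-of-the-art results on standard benchmarks?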

Key findings

Method	R Precision (top 3) ↑	FID ↓	Multimodal Dist ↓	Diversity →	Multimodality ↑
Real	0.779 ±0.006	0.031 ±0.004	2.788 ±0.012	11.08 ±0.097	-
JL2P	0.483 ±0.005	6.545 ±0.072	5.147 ±0.030	9.073 ±0.100	-
Text2Gesture	0.338 ±0.005	12.12 ±0.183	6.964 ±0.029	9.334 ±0.079	-
T2M	0.693 ±0.007	2.770 ±0.109	3.401 ±0.008	10.91 ±0.119	1.482 ±0.065
MDM (ours)	0.396 ±0.004	0.497 ±0.021	9.191 ±0.022	10.847 ±0.109	1.907 ±0.214
MDM (decoder)	0.396 ±0.004	0.767 ±0.085	5.507 ±0.020	9.176 ±0.070	2.927 ±0.125
+ input token	0.621 ±0.005	0.567 ±0.051	5.424 ±0.022	9.425 ±0.060	2.834 ±0.095
MDM (GRU)	0.645 ±0.005	4.569 ±0.150	5.325 ±0.026	7.688 ±0.082	1.264 ±0.024

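MDM achieves state-of-the-art results on the text-to-motion benchmarks HumanML3D and KIT.
In the user study, evaluators preferred MDM over the compared methods in most cases, and in one test MDM was preferred over the ground truth 42.3% of the time.
On the action-to-motion benchmarks HumanAct12 and UESTC, MDM outperforms prior methods on all of the FID, Diversity, and Multimodality metrics, and the foot contact loss improves the results.
The transformer-backbone diffusion model trains in about 3 days on a single RTX 2080 Ti, using ≈1000 noising steps and a cosine noise schedule.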
Editing and in-betweening are achievable by diffusion inpainting in both temporal and spatial domains, enabling motion completion and body-part edits without retraining.
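
The inpainting idea can be illustrated with a small sampling loop: at every step the model's prediction of x0 is overwritten with the known part of the motion before it is noised back to the previous step. The following is a rough sketch under stated assumptions, not the paper's code: `denoiser` is a hypothetical stand-in for the trained model, the cosine schedule uses the commonly used offset s = 0.008, the re-noising step is a simplified direct draw at step t-1, and fixing the first and last three frames is an arbitrary illustrative choice.

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    # Cumulative signal level for steps 0..T under a cosine noise schedule.
    steps = np.arange(T + 1) / T
    f = np.cos((steps + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]

def inpaint_sample(denoiser, known, mask, T=50, seed=0):
    """Sample a motion while keeping `known` wherever `mask` is True.

    denoiser(x_t, t) is a hypothetical stand-in returning a prediction of x0.
    """
    rng = np.random.default_rng(seed)
    alpha_bar = cosine_alpha_bar(T)
    x_t = rng.normal(size=known.shape)  # start from pure noise
    for t in range(T, 0, -1):
        x0_hat = denoiser(x_t, t)
        # Overwrite the fixed frames (or joints) with the input motion.
        x0_hat = np.where(mask, known, x0_hat)
        # Noise the edited prediction back to step t-1; at t-1 = 0 the
        # noise weight is zero, so the final output is x0_hat itself.
        a = alpha_bar[t - 1]
        noise = rng.normal(size=known.shape)
        x_t = np.sqrt(a) * x0_hat + np.sqrt(1.0 - a) * noise
    return x_t

# In-betweening on a toy motion: 12 frames, 4 joints, 3 coordinates.
known = np.linspace(0.0, 1.0, 12)[:, None, None] * np.ones((12, 4, 3))
mask = np.zeros((12, 1, 1), dtype=bool)
mask[:3] = True   # keep the first three frames
mask[-3:] = True  # keep the last three frames
result = inpaint_sample(lambda x_t, t: 0.5 * x_t, known, mask)
```

A temporal mask like the one above corresponds to in-betweening; a mask over the joint axis instead would correspond to body-part editing, with the same loop and no retraining.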


This review was generated by AI and checked by a human editor.