QUICK REVIEW

[Paper Review] MotionDiffuse: Text-Driven Human Motion Generation with Diffusion Model

Mingyuan Zhang, Zhongang Cai | arXiv (Cornell University) | Aug 31, 2022

Human Motion and Animation | Citations: 110

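One-Sentence Summary

MotionDiffuse uses a diffusion model framework and a cross-modality transformer to generate diverse, controllable human motion from text, including body-part-level and time-varied control.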

ABSTRACT

Human motion modeling is important for many modern graphics applications, which typically require professional skills. In order to remove the skill barriers for laymen, recent motion generation methods can directly generate human motions conditioned on natural languages. However, it remains challenging to achieve diverse and fine-grained motion generation with various text inputs. To address this problem, we propose MotionDiffuse, the first diffusion model-based text-driven motion generation framework, which demonstrates several desired properties over existing methods. 1) Probabilistic Mapping. Instead of a deterministic language-motion mapping, MotionDiffuse generates motions through a series of denoising steps in which variations are injected. 2) Realistic Synthesis. MotionDiffuse excels at modeling complicated data distribution and generating vivid motion sequences. 3) Multi-Level Manipulation. MotionDiffuse responds to fine-grained instructions on body parts, and arbitrary-length motion synthesis with time-varied text prompts. Our experiments show MotionDiffuse outperforms existing SoTA methods by convincing margins on text-driven motion generation and action-conditioned motion generation. A qualitative analysis further demonstrates MotionDiffuse's controllability for comprehensive motion generation. Homepage: https://mingyuan-zhang.github.io/projects/MotionDiffuse.html

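Motivation and Objectives

Lower the barrier to creating realistic human motion from natural-language descriptions.
Introduce a probabilistic, diffusion-based text-to-motion generation approach to achieve high diversity.
Enable multi-level manipulation, including body-part-level control, and arbitrary-length motion synthesis.
Demonstrate state-of-the-art performance on text-driven and action-conditioned motion generation tasks.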

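Proposed Method

Generate motion sequences with a diffusion model (DDPM) conditioned on a text description.
Introduce a cross-modality linear transformer, comprising a text encoder and a motion decoder, to handle variable-length sequences.
Incorporate linear self-attention (Efficient Attention) and linear cross-attention to fuse text into motion generation.
Apply a Stylization Block that injects text and timestep (t) information at each denoising step.
Implement independent body-part control through noise interpolation across partitioned body parts, with a smoothness correction.
Enable time-varied control by denoising multiple intervals and interpolating their noise with a correction term.
Train by optimizing a single loss that predicts the noise term ε_θ of the diffusion process.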
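
The noise-prediction objective described in the last item can be sketched as follows. This is a minimal illustration in pure Python, not the authors' implementation. The linear beta schedule, the 1000-step count, the toy 8x4 motion array, and the `dummy_denoiser` stand-in for the text-conditioned network are all assumptions made for the example.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced noise variances beta_1..beta_T (an assumed schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar_t = product over s <= t of (1 - beta_s)."""
    alpha_bars, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars


def q_sample(x0, t, alpha_bars, rng):
    """Forward diffusion: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    signal = math.sqrt(alpha_bars[t])
    noise = math.sqrt(1.0 - alpha_bars[t])
    eps = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in x0]
    x_t = [
        [signal * v + noise * e for v, e in zip(frame, eps_frame)]
        for frame, eps_frame in zip(x0, eps)
    ]
    return x_t, eps


def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the sampled noise and the predicted noise."""
    squared = [
        (a - b) ** 2
        for row_true, row_pred in zip(eps_true, eps_pred)
        for a, b in zip(row_true, row_pred)
    ]
    return sum(squared) / len(squared)


def dummy_denoiser(x_t, t, text):
    """Hypothetical stand-in for eps_theta(x_t, t, text); always predicts zero noise."""
    return [[0.0 for _ in frame] for frame in x_t]


rng = random.Random(0)
alpha_bars = cumulative_alphas(linear_beta_schedule(1000))

# Toy "motion": 8 frames x 4 pose features (real pose vectors are much larger).
x0 = [[math.sin(0.1 * f + j) for j in range(4)] for f in range(8)]

t = rng.randrange(len(alpha_bars))  # one random timestep per training sample
x_t, eps = q_sample(x0, t, alpha_bars, rng)
loss = noise_prediction_loss(eps, dummy_denoiser(x_t, t, "a person walks forward"))
print(f"t={t}, loss={loss:.4f}")
```

In an actual training loop, `dummy_denoiser` would be replaced by the text-conditioned transformer, and the loss would be averaged over a batch of motions and randomly drawn timesteps.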

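Experimental Results

Research Questions

RQ1: Can a diffusion model generate diverse, high-fidelity motions from natural-language prompts?
RQ2: Can a cross-modality transformer effectively fuse text into motion generation for variable-length sequences?
RQ3: Can fine-grained body-part-level and time-varied prompts be controlled during motion synthesis without degrading quality?
RQ4: How well does MotionDiffuse perform against prior state-of-the-art methods on text-driven and action-conditioned motion generation benchmarks?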

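Key Findings

MotionDiffuse achieves notable improvements over prior state-of-the-art methods on both text-driven and action-conditioned motion generation.
The framework produces high-fidelity, diverse motion from natural-language prompts.
Multi-level manipulation enables body-part-level control and time-varied sequence generation at no additional training cost.
Qualitative analysis demonstrates MotionDiffuse's controllability and its ability to handle complex, long motion sequences.
Experiments on multiple datasets (e.g., HumanML3D, KIT-ML, HumanAct12, UESTC) show broad applicability and an advantage over existing methods.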
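
The training-free body-part control noted in the findings relies on combining noise predictions from several prompts at inference time. The sketch below shows only the mask-based combination step, under stated assumptions: each prompt owns a disjoint set of pose features, the smoothness correction term mentioned in the method is omitted, and the function name, feature layout, and toy values are hypothetical.

```python
def combine_part_noise(noise_by_prompt, masks):
    """Per feature, keep the noise prediction of the prompt whose mask owns it.

    noise_by_prompt: list of predictions, each a frames x features nested list.
    masks: one 0/1 list per prompt over the feature axis; together they must
           assign every feature to exactly one prompt.
    """
    num_features = len(masks[0])
    for j in range(num_features):
        if sum(mask[j] for mask in masks) != 1:
            raise ValueError(f"feature {j} must belong to exactly one prompt")
    num_frames = len(noise_by_prompt[0])
    return [
        [
            sum(mask[j] * pred[f][j] for pred, mask in zip(noise_by_prompt, masks))
            for j in range(num_features)
        ]
        for f in range(num_frames)
    ]


# Toy setup: 2 frames x 4 features; features 0-1 stand for the upper body,
# features 2-3 for the lower body (an illustrative partition only).
upper_pred = [[1.0, 1.0, 1.0, 1.0], [2.0, 2.0, 2.0, 2.0]]      # e.g. "waving hands"
lower_pred = [[-1.0, -1.0, -1.0, -1.0], [-2.0, -2.0, -2.0, -2.0]]  # e.g. "walking"
upper_mask = [1, 1, 0, 0]
lower_mask = [0, 0, 1, 1]

combined = combine_part_noise([upper_pred, lower_pred], [upper_mask, lower_mask])
print(combined)  # [[1.0, 1.0, -1.0, -1.0], [2.0, 2.0, -2.0, -2.0]]
```

Because the combination happens on the predicted noise at each denoising step, the same trained model can serve any body-part partition, which is consistent with the claim that this control adds no training cost.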

This review was generated by AI and checked by a human editor.