QUICK REVIEW

[Paper Review] DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

Ning Zhang, Zhengyu Li|arXiv (Cornell University)|Feb 4, 2026

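Human Motion and Animation | Citations: 0

One-Line Summary

DiMo is a unified discrete diffusion framework that handles text-to-motion and motion-to-text in both directions and also supports text-free motion tasks. It decodes by iterative masked refinement, so the number of refinement steps trades quality against latency, and it uses RVQ tokenization to improve motion token fidelity.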

ABSTRACT

Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate model ability in text-free motion completion, text-guided motion prediction and motion caption correction without architectural change. Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.

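Motivation and Objectives

Enable bidirectional text-motion understanding and generation in a single model, improving consistency and reducing engineering overhead.
Propose a diffusion-style parallel denoising framework that replaces autoregressive decoding for the T2M and M2T tasks.
Introduce RVQ for high-fidelity motion tokens, and GRPO for improved alignment and controllability.
Demonstrate a quality-latency trade-off through multi-step refinement, and extend the model to tasks such as text-free motion completion and motion caption correction.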

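Proposed Method

Treat both text and motion as discrete token sequences and apply K steps of parallel denoising.
Adopt a BERT-based masked language model as the backbone, serving as the text reasoning core.
Introduce an RVQ-based motion tokenizer, with a separate motion encoder/decoder that handles the discrete motion tokens.
Optionally apply GRPO fine-tuning to improve cross-modal alignment and controllability.
Perform masked multi-task training across three tasks: Text-to-Motion (T2M), Motion-to-Text (M2T), and Motion-to-Motion (M2M).
Use confidence-guided progressive inference, which commits high-confidence tokens first during denoising.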
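
The confidence-guided progressive inference listed above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not DiMo's implementation: `toy_predictor` is a hypothetical stand-in that returns random tokens and confidences, and the `MASK` id and the cosine unmasking schedule are assumptions borrowed from common masked-generation practice. The schedule used in the paper may differ.

```python
import math
import random

MASK = -1  # hypothetical id for the [MASK] token (assumption)


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the model: one (token, confidence) pair per position.

    A real model would produce a softmax over the vocabulary; random
    values are used here so the loop runs without any weights.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def iterative_decode(length, steps, vocab_size=512, seed=0):
    """Fill a fully masked sequence in at most `steps` parallel passes."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        preds = toy_predictor(tokens, vocab_size, rng)
        # Assumed cosine schedule: how many positions stay masked after this pass.
        keep_masked = int(length * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident predictions first; the rest stay masked.
        masked.sort(key=lambda i: preds[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = preds[i][0]
    return tokens
```

Calling `iterative_decode(16, steps=5)` and `iterative_decode(16, steps=30)` differs only in the number of forward passes spent, which is the knob behind the quality-latency trade-off. With the random stand-in predictor the output quality does not change, so only the control flow is illustrated here.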

Figure 1 : Overview of DiMo. DiMo unifies Motion-to-Text (M2T) and Text-to-Motion (T2M) within a single framework, achieving a strong balance between motion realism and semantic consistency across generation and understanding tasks.

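Experimental Results

Research Questions

RQ1: Can a unified discrete diffusion model support bidirectional T2M and M2T within a single architecture?
RQ2: Does iterative masked refinement offer better quality and editability than autoregressive decoding for long motion sequences?
RQ3: How does RVQ-based motion tokenization affect reconstruction fidelity and downstream cross-modal generation?
RQ4: How does GRPO fine-tuning improve cross-modal alignment and controllability?
RQ5: Can tasks such as text-free motion completion and caption correction be supported naturally without architectural changes?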
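
As background for RQ3, the general idea of residual vector quantization can be sketched as follows: each stage quantizes the residual left by the previous stage, and the reconstruction is the sum of the selected entries. This is a toy sketch, not the paper's tokenizer. The codebooks here are random rather than learned, the helper names are hypothetical, and a zero vector is placed in every codebook (an assumption made only so that extra stages can never increase the error in this toy setting).

```python
import random


def nearest(codebook, vec):
    """Index of the codebook entry closest to `vec` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda k: sum((c - v) ** 2 for c, v in zip(codebook[k], vec)),
    )


def rvq_encode(vec, codebooks):
    """Quantize stage by stage; each stage encodes the previous residual."""
    residual = list(vec)
    indices = []
    for codebook in codebooks:
        k = nearest(codebook, residual)
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected entry from every stage."""
    out = [0.0] * len(codebooks[0][0])
    for k, codebook in zip(indices, codebooks):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


def make_codebooks(stages, size, dim, seed=0):
    """Random toy codebooks with shrinking scale (real ones are learned)."""
    rng = random.Random(seed)
    books = []
    for s in range(stages):
        entries = [[0.0] * dim]  # zero entry: a stage may leave the residual as is
        for _ in range(size - 1):
            entries.append([rng.uniform(-1, 1) / (2 ** s) for _ in range(dim)])
        books.append(entries)
    return books
```

A vector is then represented by one index per stage, for example `rvq_encode([0.3, -0.7, 0.5, 0.1], make_codebooks(3, 16, 4))`, and using more stages gives a reconstruction that is at least as close to the input.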

Key Findings

Category	Method	T2M R@1	T2M R@2	T2M R@3	T2M FID	T2M Div →	T2M MM	M2T R@1	M2T R@3	M2T BLEU@1	M2T BLEU@4	M2T ROUGE-L	M2T CIDEr	M2T BERTScore
Text-to-Motion	Ours w/ GRPO	0.528	0.724	0.818	0.047	9.419	2.000	0.577	0.855	64.2	22.7	47.1	58.1	37.7

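DiMo achieves competitive results on both T2M and M2T on HumanML3D and KIT-ML.
Diffusion-style decoding with multiple refinement steps yields a tunable quality-latency trade-off (e.g., 5 to 30 steps).
RVQ improves motion token fidelity, reducing quantization error and improving downstream performance.
GRPO fine-tuning strengthens alignment and semantic fidelity in both directions.
DiMo supports text-free motion completion, text-guided motion prediction, and caption correction within the same framework.
In Table 1 (HumanML3D), Ours w/ GRPO reaches T2M R@1 0.528, R@2 0.724, and R@3 0.818 with an FID of 0.047, and M2T R@1 0.577 and R@3 0.855.
In Table 2 (KIT-ML), Ours shows competitive T2M and M2T results compared with the baselines.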
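
For readers unfamiliar with GRPO, its core step is to score each sample relative to the other samples drawn for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization. The reward values are made up for illustration, the function name is hypothetical, and the review does not state which reward DiMo uses.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples for the same prompt.

    Each advantage is (reward - group mean) / (group std + eps), so samples
    better than their siblings get positive values and worse ones negative.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Illustrative rewards for four motions sampled from one text prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

In this example the third sample has the highest reward in its group and therefore receives the largest positive advantage. A full GRPO update would additionally weight token log-probabilities by these advantages with a clipped ratio and a KL penalty, which is omitted here.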

Figure 2 : Overview of DiMo. Our unified framework supports text-to-motion (T2M), motion-to-text (M2T), and motion-to-motion (M2M) tasks with RVQ-based motion tokenization, multi-task masked training, confidence-guided progressive inference, and GRPO fine-tuning.


This review was generated by AI and checked by a human editor.