QUICK REVIEW

[Paper Review] DiMo: Discrete Diffusion Modeling for Motion Generation and Understanding

Ning Zhang, Zhengyu Li | arXiv (Cornell University) | 2026-02-04

Human Motion and Animation · Citations: 0

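One-Line Summary

DiMo is a unified discrete diffusion framework that handles text-to-motion and motion-to-text in both directions, plus text-free motion tasks. Iterative masked refinement lets the number of decoding steps trade quality against latency, and RVQ tokenization improves motion token fidelity.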

ABSTRACT

Prior masked modeling motion generation methods predominantly study text-to-motion. We present DiMo, a discrete diffusion-style framework, which extends masked modeling to bidirectional text-motion understanding and generation. Unlike GPT-style autoregressive approaches that tokenize motion and decode sequentially, DiMo performs iterative masked token refinement, unifying Text-to-Motion (T2M), Motion-to-Text (M2T), and text-free Motion-to-Motion (M2M) within a single model. This decoding paradigm naturally enables a quality-latency trade-off at inference via the number of refinement steps. We further improve motion token fidelity with residual vector quantization (RVQ) and enhance alignment and controllability with Group Relative Policy Optimization (GRPO). Experiments on HumanML3D and KIT-ML show strong motion quality and competitive bidirectional understanding under a unified framework. In addition, we demonstrate model ability in text-free motion completion, text-guided motion prediction and motion caption correction without architectural change. Additional qualitative results are available on our project page: https://animotionlab.github.io/DiMo/.

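Motivation and Goals

Perform text-motion understanding and generation in both directions within a single model, improving consistency and reducing engineering overhead.
Propose a diffusion-style parallel denoising framework that replaces autoregressive decoding for the T2M and M2T tasks.
Introduce residual vector quantization (RVQ) for higher-fidelity motion tokens, and GRPO for improved alignment and controllability.
Show a quality-latency trade-off through multi-step refinement, and demonstrate extension to tasks such as text-free motion completion, text-guided motion prediction, and motion caption correction.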

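Proposed Method

Treats both text and motion as discrete token sequences and applies a K-step parallel denoising process.
Uses a BERT-based masked language model backbone as the core for text reasoning.
Introduces an RVQ-based motion tokenizer and a separate motion encoder/decoder to handle discrete motion tokens.
Optionally applies GRPO to improve cross-modal alignment and controllability.
Trains with multi-task masking across three tasks (Text-to-Motion, Motion-to-Text, Motion-to-Motion).
Uses confidence-guided progressive inference, which commits high-confidence tokens first during denoising.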
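
To make the K-step parallel denoising and confidence-guided commitment concrete, here is a minimal NumPy sketch. It is not the authors' implementation: `toy_predictor` is a hypothetical stand-in for the denoiser that emits random probabilities, `MASK` is a made-up sentinel id, and the cosine unmasking schedule is an assumption borrowed from common masked-generation practice rather than something this review confirms for DiMo.

```python
import numpy as np

MASK = -1  # hypothetical sentinel id for a still-masked position


def toy_predictor(tokens, vocab_size, rng):
    """Stand-in for the denoiser: returns per-position token probabilities.

    A real model would condition on the text and on committed tokens;
    here the logits are random, purely for illustration.
    """
    logits = rng.normal(size=(len(tokens), vocab_size))
    exp = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def iterative_unmask(seq_len, vocab_size, num_steps, seed=0):
    """Start fully masked and commit the most confident tokens at each step."""
    rng = np.random.default_rng(seed)
    tokens = np.full(seq_len, MASK, dtype=np.int64)
    for step in range(1, num_steps + 1):
        masked = np.flatnonzero(tokens == MASK)
        if masked.size == 0:
            break
        probs = toy_predictor(tokens, vocab_size, rng)
        pred = probs.argmax(axis=-1)   # most likely token per position
        conf = probs.max(axis=-1)      # its probability, used as confidence
        # Assumed cosine schedule: how many positions stay masked after this step.
        keep_masked = int(np.floor(seq_len * np.cos(np.pi / 2 * step / num_steps)))
        n_commit = max(1, masked.size - keep_masked)
        # Among masked positions, commit the highest-confidence ones first.
        order = masked[np.argsort(-conf[masked])]
        chosen = order[:n_commit]
        tokens[chosen] = pred[chosen]
    return tokens


if __name__ == "__main__":
    print(iterative_unmask(seq_len=16, vocab_size=32, num_steps=5))
```

In this sketch `num_steps` is the knob behind the quality-latency trade-off: fewer steps mean fewer predictor calls but more tokens committed per call.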

Figure 1 : Overview of DiMo. DiMo unifies Motion-to-Text (M2T) and Text-to-Motion (T2M) within a single framework, achieving a strong balance between motion realism and semantic consistency across generation and understanding tasks.

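Experimental Results

Research Questions

RQ1: Can a single discrete diffusion model support bidirectional text-motion capabilities within one architecture?
RQ2: Does iterative masked refinement offer better quality and editability than autoregressive decoding for long motion sequences?
RQ3: How does RVQ-based motion tokenization affect reconstruction fidelity and downstream cross-modal generation?
RQ4: What effect does GRPO fine-tuning have on cross-modal alignment and controllability?
RQ5: Does the framework naturally support additional tasks (text-free motion completion and caption correction) without architectural changes?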
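
For context on RQ4, the core of GRPO is that each sampled output is scored relative to the other outputs sampled for the same prompt, rather than against a learned value function. The sketch below shows only that group-relative normalization. The reward values are hypothetical, and how DiMo defines its rewards is not described in this review.

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of samples drawn for the same prompt.

    Each advantage is the reward's deviation from the group mean,
    divided by the group standard deviation (eps avoids division by zero).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


if __name__ == "__main__":
    # Hypothetical rewards for four motions sampled from one text prompt.
    print(group_relative_advantages([0.2, 0.5, 0.9, 0.4]))
```

Samples that score above their group's mean receive positive advantages and are reinforced; those below the mean are discouraged.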

Key Results

Category	Method	T2M R@1	T2M R@2	T2M R@3	T2M FID	T2M Div →	T2M MM	M2T R@1	M2T R@3	M2T BLEU@1	M2T BLEU@4	M2T ROUGE-L	M2T CIDEr	M2T BERTScore
Text-to-Motion	Ours w/ GRPO	0.528	0.724	0.818	0.047	9.419	2.000	0.577	0.855	64.2	22.7	47.1	58.1	37.7

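DiMo achieves competitive motion quality on HumanML3D and KIT-ML for both T2M and M2T.
Diffusion-style decoding with multiple refinement steps provides a tunable quality-latency trade-off (e.g., 5–30 steps).
RVQ improves motion token fidelity and reduces quantization error, which improves downstream performance.
GRPO fine-tuning improves alignment across both directions and semantic fidelity.
DiMo also supports text-free motion completion, text-guided motion prediction, and caption correction within the same framework.
In Table 1 (HumanML3D), Ours w/ GRPO achieves T2M R@1 0.528, R@2 0.724, and R@3 0.818 with an FID of 0.047, and M2T R@1 0.577 and R@3 0.855.
In Table 2 (KIT-ML), Ours shows competitive T2M and M2T results compared with the baselines.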
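
The claim that RVQ reduces quantization error can be illustrated with a toy residual quantizer: each codebook layer quantizes what the previous layers left over, so the remaining error shrinks as layers are added. This is a minimal sketch, not DiMo's tokenizer. The codebooks are random rather than learned, `make_toy_codebooks` is a hypothetical helper, and a zero code is added to every layer as an assumption so that the residual norm provably never grows.

```python
import numpy as np


def make_toy_codebooks(num_layers, codebook_size, dim, seed=0):
    """Random codebooks with shrinking scale per layer (illustration only)."""
    rng = np.random.default_rng(seed)
    books = []
    for layer in range(num_layers):
        codes = rng.normal(scale=0.5 ** layer, size=(codebook_size, dim))
        codes[0] = 0.0  # zero code: a layer can always leave the residual as is
        books.append(codes)
    return books


def rvq_encode(x, codebooks):
    """Quantize x layer by layer; each layer codes the remaining residual."""
    residual = np.asarray(x, dtype=np.float64)
    indices, errors = [], []
    for codebook in codebooks:
        dists = np.linalg.norm(codebook - residual, axis=1)
        idx = int(dists.argmin())            # nearest code to the residual
        residual = residual - codebook[idx]  # pass the leftover to the next layer
        indices.append(idx)
        errors.append(float(np.linalg.norm(residual)))
    return indices, errors


def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code from every layer."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))


if __name__ == "__main__":
    books = make_toy_codebooks(num_layers=4, codebook_size=64, dim=8)
    x = np.random.default_rng(1).normal(size=8)
    indices, errors = rvq_encode(x, books)
    print(indices)
    print(errors)  # residual norm after each layer
```

A single motion frame is thus represented by one index per layer, and the decoder reconstructs it by summing the selected codes.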

Figure 2 : Overview of DiMo. Our unified framework supports text-to-motion (T2M), motion-to-text (M2T), and motion-to-motion (M2M) tasks with RVQ-based motion tokenization, multi-task masked training, confidence-guided progressive inference, and GRPO fine-tuning.


This review was generated by AI and reviewed by a human editor.