QUICK REVIEW

[Paper Review] Learning to Generate Diverse Dance Motions with Transformer

Jiaman Li, Yihang Yin|arXiv (Cornell University)|Aug 18, 2020
Human Motion and Animation | 22 references | 74 citations
TL;DR

The paper presents a Two-Stream Motion Transformer (TSMT) that conditions on music to synthesize diverse, long-range dance motions. It is trained on large-scale 3D pose data derived from YouTube videos and evaluated with new metrics for physical plausibility, beat consistency, and diversity.

ABSTRACT

With the ongoing pandemic, virtual concerts and live events using digitized performances of musicians are getting traction on massive multiplayer online worlds. However, well choreographed dance movements are extremely complex to animate and would involve an expensive and tedious production process. In addition to the use of complex motion capture systems, it typically requires a collaborative effort between animators, dancers, and choreographers. We introduce a complete system for dance motion synthesis, which can generate complex and highly diverse dance sequences given an input music sequence. As motion capture data is limited for the range of dance motions and styles, we introduce a massive dance motion data set that is created from YouTube videos. We also present a novel two-stream motion transformer generative model, which can generate motion sequences with high flexibility. We also introduce new evaluation metrics for the quality of synthesized dance motions, and demonstrate that our system can outperform state-of-the-art methods. Our system provides high-quality animations suitable for large crowds for virtual concerts and can also be used as reference for professional animation pipelines. Most importantly, we show that vast online videos can be effective in training dance motion models.

Motivation & Objective

  • Motivate a scalable, data-driven approach to dance motion synthesis that can produce diverse motions conditioned on music.
  • Leverage large-scale, web-sourced dance data to overcome limited mocap datasets.
  • Develop a two-stream Transformer architecture that captures long-term dependencies and music-dance correlations.
  • Introduce evaluation metrics for physical plausibility, beat consistency, and motion diversity.

Proposed method

  • Formulate dance synthesis as an autoregressive generative model conditioned on music and past motions.
  • Represent continuous 3D joint poses as discrete categories to enable diverse sampling.
  • Introduce the Two-Stream Motion Transformer (TSMT), which processes pose and audio with separate transformers and fuses their outputs late to predict the next pose.
  • Use Transformer blocks with multi-head self-attention and position-wise feed-forward layers to model long-range dependencies.
  • Train end-to-end on a large YouTube-derived Dance3D dataset with 3D pose estimation and beat-aware audio features.
  • Evaluate via physical plausibility in a Bullet-based humanoid simulator, beat consistency, and multiple diversity metrics.
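
The discrete pose representation and autoregressive sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bin count, the value range, the per-coordinate treatment, and the `predict_logits` callable (a stand-in for a trained model) are all assumptions made for illustration.

```python
import math
import random

def quantize(value, low, high, num_bins):
    """Map a continuous coordinate in [low, high] to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    scaled = (clipped - low) / (high - low)  # now in [0, 1]
    return min(int(scaled * num_bins), num_bins - 1)

def dequantize(index, low, high, num_bins):
    """Return the center of bin `index` as a continuous value."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width

def softmax(logits, temperature=1.0):
    """Numerically stable softmax; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_bin(logits, temperature=1.0, rng=random):
    """Draw one bin index from the categorical distribution defined by `logits`."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate(predict_logits, seed_bins, audio_frames, temperature=1.0, rng=random):
    """Autoregressively extend `seed_bins` by one sampled bin per audio frame.

    `predict_logits(history, audio)` is a hypothetical model call that returns
    one logit per bin for the next step of a single pose coordinate.
    """
    history = list(seed_bins)
    for audio in audio_frames:
        logits = predict_logits(history, audio)
        history.append(sample_bin(logits, temperature, rng))
    return history
```

Because each step samples from a categorical distribution rather than regressing a single value, repeated runs on the same music can produce different sequences, and the temperature trades diversity against fidelity to the most likely pose. The cost is a quantization error of at most half a bin width per coordinate.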

Experimental results

Research questions

  • RQ1: Can a conditional generative model produce diverse, beat-aligned dance motions given arbitrary music?
  • RQ2: Does a large-scale YouTube-derived dataset enable better generalization than traditional mocap datasets?
  • RQ3: Do two-stream Transformer architectures outperform baselines in terms of plausibility, beat tracking, and diversity?
  • RQ4: What evaluation metrics best capture physical plausibility, musicality, and diversity in synthesized dances?

Key findings

  • The proposed TSMT model yields more diverse and plausible dances than the acLSTM and ChorRNN baselines, both with and without audio input.
  • The discrete pose representation enables effective sampling of diverse poses at inference.
  • The two-stream design improves beat consistency and motion diversity, with competitive or better physical plausibility metrics.
  • The model is reported to train more efficiently than the LSTM-based baselines and to run inference in real time (24 fps).
  • New evaluation metrics and a large-scale YouTube-Dance3D dataset are introduced to assess plausibility, beat alignment, and diversity.
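
To make the beat-alignment and diversity criteria concrete, here is a small sketch of how such scores could be computed. These are simplified stand-ins, not the paper's metric definitions: the function names, the 0.1-second tolerance, and the assumption that motion beats have already been extracted as timestamps are all illustrative.

```python
import math

def beat_hit_rate(motion_beats, music_beats, tolerance=0.1):
    """Fraction of motion beats within `tolerance` seconds of some music beat."""
    if not motion_beats:
        return 0.0
    hits = sum(
        1 for t in motion_beats
        if any(abs(t - b) <= tolerance for b in music_beats)
    )
    return hits / len(motion_beats)

def pose_distance(pose_a, pose_b):
    """Euclidean distance between two flattened poses of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(pose_a, pose_b)))

def mean_pairwise_distance(sequences):
    """Average frame-wise pose distance over all pairs of equal-length sequences.

    Higher values mean the generated sequences differ more from one another.
    """
    total, pairs = 0.0, 0
    for i in range(len(sequences)):
        for j in range(i + 1, len(sequences)):
            dists = [pose_distance(p, q) for p, q in zip(sequences[i], sequences[j])]
            total += sum(dists) / len(dists)
            pairs += 1
    return total / pairs if pairs else 0.0
```

A beat score of this kind rewards motion accents that land near musical beats, while a pairwise distance over several samples generated for the same music gives a rough measure of how varied the outputs are.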

This review was created by AI and reviewed by human editors.