QUICK REVIEW

[Paper Review] Learning to Generate Diverse Dance Motions with Transformer

Jiaman Li, Yihang Yin|arXiv (Cornell University)|Aug 18, 2020

Human Motion and Animation | References: 22 | Citations: 74

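One-Sentence Summary

This paper presents the Two-Stream Motion Transformer (TSMT), which synthesizes diverse, long-range dance motion conditioned on music after training on large-scale 3D pose data derived from YouTube, and it proposes new evaluation metrics within a beat-aware, diversity-focused evaluation framework.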

ABSTRACT

With the ongoing pandemic, virtual concerts and live events using digitized performances of musicians are getting traction on massive multiplayer online worlds. However, well choreographed dance movements are extremely complex to animate and would involve an expensive and tedious production process. In addition to the use of complex motion capture systems, it typically requires a collaborative effort between animators, dancers, and choreographers. We introduce a complete system for dance motion synthesis, which can generate complex and highly diverse dance sequences given an input music sequence. As motion capture data is limited for the range of dance motions and styles, we introduce a massive dance motion data set that is created from YouTube videos. We also present a novel two-stream motion transformer generative model, which can generate motion sequences with high flexibility. We also introduce new evaluation metrics for the quality of synthesized dance motions, and demonstrate that our system can outperform state-of-the-art methods. Our system provides high-quality animations suitable for large crowds for virtual concerts and can also be used as reference for professional animation pipelines. Most importantly, we show that vast online videos can be effective in training dance motion models.

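Motivation and Objectives

Motivate a scalable, data-driven approach to dance motion synthesis that generates diverse motion conditioned on music.
Leverage large-scale, web-sourced dance data to overcome the limitations of conventional motion capture datasets.
Develop a two-stream Transformer architecture that captures long-term dependencies and the correlation between music and dance.
Introduce metrics that evaluate physical plausibility, beat consistency, and motion diversity.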

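Proposed Method

Formulate dance synthesis as an autoregressive generative model conditioned on music and past motion.
Represent continuous 3D joint poses as discrete categories, which enables diverse sampling.
Introduce the Two-Stream Motion Transformer (TSMT), with separate pose and audio transformers and late fusion for next-step pose prediction.
Use Transformer blocks with multi-head self-attention and position-wise feed-forward layers to model long-range dependencies.
Train end to end on the YouTube-derived Dance3D dataset, using 3D pose estimation and beat-aware audio features.
Evaluate physical plausibility in a Bullet-based humanoid simulator, along with beat consistency and several diversity metrics.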
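
The discrete pose representation and autoregressive sampling described above can be illustrated with a minimal sketch. It is not the authors' implementation: the bin count, the normalized coordinate range, the temperature parameter, and the `predict_logits` callback (a stand-in for a trained model that sees the pose history and the current audio frame) are all assumptions made for illustration.

```python
import math
import random

NUM_BINS = 300          # hypothetical bin count, not taken from the paper
LOW, HIGH = -1.0, 1.0   # assumed normalized coordinate range


def quantize(value, low=LOW, high=HIGH, num_bins=NUM_BINS):
    """Map a continuous coordinate to an integer bin index in [0, num_bins - 1]."""
    clipped = min(max(value, low), high)
    index = int((clipped - low) / (high - low) * num_bins)
    return min(index, num_bins - 1)  # value == high falls into the last bin


def dequantize(index, low=LOW, high=HIGH, num_bins=NUM_BINS):
    """Return the bin center as the reconstructed continuous coordinate."""
    width = (high - low) / num_bins
    return low + (index + 0.5) * width


def sample_bin(logits, temperature=1.0, rng=random):
    """Draw one bin index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - peak) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(predict_logits, seed_pose, audio_frames, temperature=1.0, rng=random):
    """Sample one discrete pose per audio frame, feeding each result back in.

    `predict_logits(history, audio)` is a hypothetical stand-in for the model;
    it must return one list of logits per pose coordinate.
    """
    history = [list(seed_pose)]
    for audio in audio_frames:
        per_coord_logits = predict_logits(history, audio)
        next_pose = [sample_bin(l, temperature, rng) for l in per_coord_logits]
        history.append(next_pose)
    return history[1:]  # drop the seed pose
```

Because each coordinate is drawn from a categorical distribution instead of being regressed to a single value, repeated calls to `generate` with the same music can yield different sequences, which is the property the discrete representation is meant to provide.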

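Experimental Results

Research Questions

RQ1: Given arbitrary music, can a conditional generative model produce diverse, beat-aligned dance motion?
RQ2: Does a large-scale YouTube-derived dataset generalize better than conventional mocap datasets?
RQ3: Does the two-stream Transformer architecture outperform the baselines in plausibility, beat following, and diversity?
RQ4: Which evaluation metrics best capture physical plausibility, musicality, and diversity in synthesized dance?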

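Key Findings

The proposed TSMT model produces more diverse and plausible dances than the acLSTM and ChorRNN baselines, both without and with audio input.
The discrete pose representation makes sampling diverse poses at inference time effective.
The two-stream design improves beat consistency and motion diversity, and is competitive with or better than the baselines on physical plausibility metrics.
Compared with LSTM-based baselines, the model shows efficient training and real-time inference (24 fps).
New evaluation metrics and the large-scale YouTube-Dance3D dataset are introduced to evaluate plausibility, beat consistency, and diversity.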
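
This review does not reproduce the paper's exact metric definitions, so the following is only a generic stand-in for what a motion diversity score could look like: the average pairwise distance between several sequences generated for the same music. The function names and the choice of per-frame Euclidean distance are assumptions, not the authors' metric.

```python
import itertools
import math


def sequence_distance(seq_a, seq_b):
    """Mean per-frame Euclidean distance between two equal-length pose sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have the same number of frames")
    if not seq_a:
        return 0.0
    total = 0.0
    for pose_a, pose_b in zip(seq_a, seq_b):
        total += math.dist(pose_a, pose_b)  # each pose is a flat coordinate list
    return total / len(seq_a)


def pairwise_diversity(sequences):
    """Average distance over all unordered pairs of generated sequences."""
    pairs = list(itertools.combinations(sequences, 2))
    if not pairs:
        return 0.0  # fewer than two sequences: no pair to compare
    return sum(sequence_distance(a, b) for a, b in pairs) / len(pairs)
```

A score of zero means every sample is identical; larger values indicate that the model produces more varied motion for the same input.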

This review was generated by AI and checked by a human editor.