QUICK REVIEW

[Paper Review] Learning to Generate Diverse Dance Motions with Transformer

Jiaman Li, Yihang Yin|arXiv (Cornell University)|Aug 18, 2020

Human Motion and Animation|References: 22|Cited by: 74

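One-Sentence Summary

This paper proposes a Two-Stream Motion Transformer (TSMT) that, conditioned on music, synthesizes diverse, long-range dance motions from large-scale, YouTube-derived 3D pose data. It also introduces new evaluation metrics and a beat-aware, diversity-focused evaluation framework.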

ABSTRACT

With the ongoing pandemic, virtual concerts and live events using digitized performances of musicians are getting traction on massive multiplayer online worlds. However, well choreographed dance movements are extremely complex to animate and would involve an expensive and tedious production process. In addition to the use of complex motion capture systems, it typically requires a collaborative effort between animators, dancers, and choreographers. We introduce a complete system for dance motion synthesis, which can generate complex and highly diverse dance sequences given an input music sequence. As motion capture data is limited for the range of dance motions and styles, we introduce a massive dance motion data set that is created from YouTube videos. We also present a novel two-stream motion transformer generative model, which can generate motion sequences with high flexibility. We also introduce new evaluation metrics for the quality of synthesized dance motions, and demonstrate that our system can outperform state-of-the-art methods. Our system provides high-quality animations suitable for large crowds for virtual concerts and can also be used as reference for professional animation pipelines. Most importantly, we show that vast online videos can be effective in training dance motion models.

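Research Motivation and Goals

Motivate a scalable, data-driven approach to dance motion synthesis that produces diverse motions conditioned on music.
Use large-scale, web-sourced dance data to overcome the limited coverage of motion capture datasets.
Develop a two-stream Transformer architecture that captures long-term dependencies and music-dance correlations.
Introduce evaluation metrics for physical plausibility, beat consistency, and motion diversity.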

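Proposed Method

Formulate dance synthesis as an autoregressive generative model conditioned on music and past motion.
Represent continuous 3D joint poses as discrete categories to enable diverse sampling.
Introduce the Two-Stream Motion Transformer (TSMT), with separate pose and audio Transformers and late fusion for next-step pose prediction.
Model long-range dependencies with Transformer blocks built from multi-head self-attention and position-wise feed-forward layers.
Train end to end on the large-scale, YouTube-derived Dance3D dataset, which contains estimated 3D poses and beat-aware audio features.
Evaluate physical plausibility with a Bullet-based humanoid simulator, and also measure beat consistency and several diversity metrics.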
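
The discrete pose representation and the sampling step it enables can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the bin count `NUM_BINS`, the coordinate range `COORD_MIN`/`COORD_MAX`, the uniform binning scheme, and the helper names `quantize`, `dequantize`, `softmax`, and `sample_bin`. The logits stand in for whatever scores a trained model would produce for the next pose.

```python
import math
import random

# Hypothetical settings: the summary does not state the bin count or range.
NUM_BINS = 300
COORD_MIN, COORD_MAX = -1.0, 1.0


def quantize(value, num_bins=NUM_BINS, lo=COORD_MIN, hi=COORD_MAX):
    """Map a continuous coordinate to a bin index in [0, num_bins - 1]."""
    clipped = min(max(value, lo), hi)
    idx = int((clipped - lo) / (hi - lo) * num_bins)
    return min(idx, num_bins - 1)  # value == hi falls into the last bin


def dequantize(idx, num_bins=NUM_BINS, lo=COORD_MIN, hi=COORD_MAX):
    """Return the center of bin `idx` as a continuous coordinate."""
    width = (hi - lo) / num_bins
    return lo + (idx + 0.5) * width


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_bin(logits, temperature=1.0, rng=random):
    """Draw one bin index from the categorical distribution over bins."""
    probs = softmax(logits, temperature)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

Because each coordinate is predicted as a distribution over bins rather than a single regressed value, drawing from that distribution (instead of always taking the most likely bin) yields different motions for the same music. In this sketch, raising `temperature` flattens the distribution and increases variety.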

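Experimental Results

Research Questions

RQ1: Given arbitrary music, can a conditional generative model produce diverse, beat-aligned dance motions?
RQ2: Does a large-scale, YouTube-derived dataset generalize better than traditional mocap datasets?
RQ3: Does the two-stream Transformer architecture outperform the baselines in plausibility, beat tracking, and diversity?
RQ4: Which evaluation metrics most accurately reflect the physical plausibility, musicality, and diversity of synthesized dance?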

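Key Findings

With and without audio input, the proposed TSMT model produces more diverse and more plausible dances than the acLSTM and ChorRNN baselines.
The discrete pose representation makes it possible to sample diverse poses effectively at inference time.
The two-stream design improves beat consistency and motion diversity, with physical plausibility scores that are competitive or better.
Compared with LSTM-based baselines, the model demonstrates efficient training and real-time inference (24 frames per second).
New evaluation metrics and the large-scale YouTube-Dance3D dataset are introduced for assessing plausibility, beat alignment, and diversity.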
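
The summary names beat consistency and diversity as metrics but does not define them. As a rough illustration of what such scores could look like, the Python sketch below treats local minima of mean joint speed as "motion beats", counts how many fall near a music beat, and measures diversity as the mean pairwise distance between generated sequences. These definitions, the `tolerance` value, and all function names are assumptions chosen for illustration, not the paper's formulas.

```python
import math


def joint_speeds(poses, fps):
    """Mean joint speed for each pair of consecutive frames.

    `poses` is a list of frames; each frame is a list of (x, y, z) joints.
    """
    speeds = []
    for prev, cur in zip(poses, poses[1:]):
        moved = sum(math.dist(p, c) for p, c in zip(prev, cur))
        speeds.append(moved / len(cur) * fps)
    return speeds


def motion_beats(speeds, fps):
    """Times (in seconds) where the speed reaches a local minimum."""
    beats = []
    for i in range(1, len(speeds) - 1):
        if speeds[i] < speeds[i - 1] and speeds[i] <= speeds[i + 1]:
            # speeds[i] spans frames i and i + 1, so use their midpoint.
            beats.append((i + 0.5) / fps)
    return beats


def beat_hit_rate(motion_beat_times, music_beat_times, tolerance=0.1):
    """Fraction of motion beats within `tolerance` seconds of a music beat."""
    if not motion_beat_times:
        return 0.0
    hits = sum(
        1
        for t in motion_beat_times
        if any(abs(t - b) <= tolerance for b in music_beat_times)
    )
    return hits / len(motion_beat_times)


def mean_pairwise_distance(sequences):
    """Average per-joint distance over all pairs of equal-length sequences."""
    pairs = [(a, b) for i, a in enumerate(sequences) for b in sequences[i + 1:]]
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        per_frame = [
            sum(math.dist(p, q) for p, q in zip(fa, fb)) / len(fa)
            for fa, fb in zip(a, b)
        ]
        total += sum(per_frame) / len(per_frame)
    return total / len(pairs)
```

In this sketch, a higher hit rate means the motion slows down near musical beats more often, and a higher pairwise distance means the sampled sequences differ more from one another.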


This review was generated by AI and checked by a human editor.