QUICK REVIEW

[Paper Digest] Single-Shot Motion Completion with Transformer

Yinglin Duan, Tianyang Shi|arXiv (Cornell University)|Mar 1, 2021

Video Analysis and Summarization|References: 36|Cited by: 29

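One-sentence summary

A Transformer-based, non-autoregressive model that fills in missing motion frames within one unified framework covering in-betweening, in-filling, and blending, and reaches state-of-the-art accuracy on LaFAN1.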

ABSTRACT

Motion completion is a challenging and long-discussed problem, which is of great significance in film and game applications. For different motion completion scenarios (in-betweening, in-filling, and blending), most previous methods deal with the completion problems with case-by-case designs. In this work, we propose a simple but effective method to solve multiple motion completion problems under a unified framework and achieves a new state of the art accuracy under multiple evaluation settings. Inspired by the recent great success of attention-based models, we consider the completion as a sequence to sequence prediction problem. Our method consists of two modules - a standard transformer encoder with self-attention that learns long-range dependencies of input motions, and a trainable mixture embedding module that models temporal information and discriminates key-frames. Our method can run in a non-autoregressive manner and predict multiple missing frames within a single forward propagation in real time. We finally show the effectiveness of our method in music-dance applications.

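Motivation and objectives

Motivate and formulate motion completion across several scenarios (in-betweening, in-filling, blending) within a single unified framework.
Propose a Transformer-based architecture with a learnable mixture embedding that models temporal information and the role of keyframes.
Predict multiple missing frames in one non-autoregressive pass to enable real-time inference.
Combine forward-kinematics (FK) and inverse-kinematics (IK) constraints to improve motion realism and consistency across coordinate systems.
Evaluate on public datasets (LaFAN1, Anidance) and a new dance dataset, demonstrating state-of-the-art performance.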

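Proposed method

Uses a standard Transformer encoder (BERT-style) as the backbone to process the masked input sequence.
Introduces a learnable mixture embedding that combines a learnable positional embedding with a keyframe embedding to annotate each frame.
Converts motion poses into sequence tokens with a Conv1d temporal operator before the Transformer.
Predicts the missing frames in a single forward pass, which allows non-autoregressive, parallel inference.
Trains with a multi-task regression loss that includes a pose reconstruction loss and kinematics losses (FK/IK) to encourage physically plausible motion.
Outputs the final predicted motion through a 1D convolutional head after the Transformer stack.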
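
The input side of the method (masking, tokenizing, and adding the mixture embedding) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: missing frames are zeroed out, the Conv1d tokenizer is simplified to a per-frame linear projection (equivalent to a kernel size of 1), the two embeddings are combined by summation, and all sizes, names (`build_inputs`, `embed`, `UNKNOWN`, `KEYFRAME`), and random weights are hypothetical stand-ins for trained parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: frames, pose features per frame, hidden width.
T, D, H = 8, 6, 16

# Hypothetical role ids used to look up the keyframe embedding.
UNKNOWN, KEYFRAME = 0, 1


def build_inputs(motion, keyframe_idx):
    """Zero out every non-keyframe and return (masked motion, role ids)."""
    roles = np.full(motion.shape[0], UNKNOWN)
    roles[keyframe_idx] = KEYFRAME
    masked = motion * (roles == KEYFRAME)[:, None]
    return masked, roles


# Random stand-ins for parameters that would be learned during training.
W_in = rng.normal(scale=0.1, size=(D, H))        # per-frame projection (assumed)
pos_table = rng.normal(scale=0.1, size=(T, H))   # learnable positional embedding
role_table = rng.normal(scale=0.1, size=(2, H))  # keyframe embedding


def embed(masked, roles):
    """Project poses to tokens and add the mixture embedding (assumed: a sum)."""
    tokens = masked @ W_in
    mixture = pos_table[: len(roles)] + role_table[roles]
    return tokens + mixture


motion = rng.normal(size=(T, D))                 # toy motion clip
masked, roles = build_inputs(motion, [0, T - 1])  # first and last frames are keyframes
x = embed(masked, roles)                         # (T, H) sequence for the encoder
```

In this sketch every unknown frame enters the encoder carrying only its position and role information, so the encoder has to recover the pose from the keyframes through self-attention, and all frames are produced in one pass.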

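Experimental results

Research questions

RQ1: Can a unified Transformer-based framework handle in-betweening, in-filling, and blending in motion completion?
RQ2: Does the learnable mixture embedding improve temporal modeling and keyframe discrimination for completion tasks?
RQ3: Can non-autoregressive inference deliver real-time multi-frame completion without sacrificing accuracy?
RQ4: How do the FK and IK losses affect accuracy under global versus local coordinates?
RQ5: How does the method perform on the standard benchmark (LaFAN1) and on real-world/creative data (Anidance, dance blending)?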
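
For readers unfamiliar with the FK loss that RQ4 refers to, the following sketch shows the general idea for a model that predicts local joint rotations: run forward kinematics on both the prediction and the ground truth, then penalize the difference in global joint positions. It is an illustration under assumptions, not the paper's code: rotations are 3x3 matrices, the distance is a mean absolute error, and the three-joint chain and the helper names (`forward_kinematics`, `fk_loss`, `rot_z`) are hypothetical.

```python
import numpy as np


def forward_kinematics(local_rot, offsets, parents, root_pos):
    """Return global joint positions, shape (J, 3).

    local_rot: (J, 3, 3) rotation of each joint relative to its parent.
    offsets:   (J, 3) offset of each joint, expressed in its parent's frame.
    parents:   parent index per joint, -1 for the root; parents come first.
    """
    num_joints = len(parents)
    global_rot = np.zeros((num_joints, 3, 3))
    global_pos = np.zeros((num_joints, 3))
    for j in range(num_joints):
        p = parents[j]
        if p < 0:
            global_rot[j] = local_rot[j]
            global_pos[j] = root_pos
        else:
            global_rot[j] = global_rot[p] @ local_rot[j]
            global_pos[j] = global_pos[p] + global_rot[p] @ offsets[j]
    return global_pos


def fk_loss(pred_rot, true_rot, offsets, parents, root_pos):
    """Mean absolute error between global positions (assumed distance)."""
    pred = forward_kinematics(pred_rot, offsets, parents, root_pos)
    true = forward_kinematics(true_rot, offsets, parents, root_pos)
    return float(np.mean(np.abs(pred - true)))


def rot_z(theta):
    """Rotation by theta radians about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])


# Toy three-joint chain lying along the x axis.
parents = [-1, 0, 1]
offsets = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
root_pos = np.zeros(3)

true_rot = np.stack([np.eye(3)] * 3)
pred_rot = true_rot.copy()
pred_rot[0] = rot_z(np.pi / 2)  # only the root rotation is wrong

loss = fk_loss(pred_rot, true_rot, offsets, parents, root_pos)
```

The toy case shows why such a loss is useful: a single error at the root leaves the other local rotations exact, yet it displaces every joint further down the chain, and the loss on global positions reflects that accumulated error.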

Key findings

LaFAN1 results. Lower is better; the number in parentheses is the number of missing frames.
Method	L2Q (5)	L2Q (15)	L2Q (30)	L2P (5)	L2P (15)	L2P (30)	NPSS (5)	NPSS (15)	NPSS (30)
Zero-Vel	0.56	1.10	1.51	1.52	3.69	6.60	0.0053	0.0522	0.2318
Interp	0.22	0.62	0.98	0.37	1.25	2.32	0.0023	0.0391	0.2013
ERD-QV [16]	0.17	0.42	0.69	0.23	0.65	1.28	0.0020	0.0258	0.1328
Ours (local w/o FK)	0.18	0.47	0.74	0.27	0.82	1.46	0.0020	0.0307	0.1487
Ours (local)	0.17	0.44	0.71	0.23	0.74	1.37	0.0019	0.0291	0.1430
Ours (global w/o ME & IK)	0.16	0.37	0.63	0.24	0.61	1.16	0.0018	0.0243	0.1284
Ours (global w/o IK)	0.14	0.36	0.61	0.21	0.57	1.11	0.0016	0.0238	0.1241
Ours* (global-full)	0.14	0.36	0.61	0.22	0.56	1.10	0.0016	0.0234	0.1222
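
To put the table in perspective, the relative error reduction of the full global model over ERD-QV at 30 missing frames can be computed directly from the reported numbers. The snippet below is only arithmetic on values copied from the table; the helper name `relative_reduction` is introduced here for illustration.

```python
# Values at 30 missing frames, copied from the table above.
erd_qv = {"L2Q": 0.69, "L2P": 1.28, "NPSS": 0.1328}
ours_full = {"L2Q": 0.61, "L2P": 1.10, "NPSS": 0.1222}


def relative_reduction(baseline, ours):
    """Percentage by which `ours` is lower than `baseline`."""
    return 100.0 * (baseline - ours) / baseline


for metric in erd_qv:
    gain = relative_reduction(erd_qv[metric], ours_full[metric])
    print(f"{metric}: {gain:.1f}% lower than ERD-QV")
```

This gives reductions of roughly 11.6% for L2Q, 14.1% for L2P, and 8.0% for NPSS.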

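The method reaches state-of-the-art accuracy on LaFAN1 under multiple settings; the full global-coordinate model improves on ERD-QV in every column of the table.
Non-autoregressive, single-pass prediction makes real-time inference possible on a CPU (e.g. about 0.025 s for a 1x30 sequence).
The mixture embedding and the FK loss improve L2Q, L2P, and NPSS in the ablations: at 30 missing frames, L2P drops from 1.46 to 1.37 when FK is added in the local setting and from 1.16 to 1.11 when the mixture embedding is added in the global setting. The IK loss adds only a marginal further gain (1.11 to 1.10).
Predicting in global coordinates with the proposed losses is generally more accurate than predicting in local coordinates.
The method generalizes to in-betweening, in-filling, and blending, including in-the-wild keyframe arrangements.
Qualitative results show dance motion that is more coherent and plausible than the linear-interpolation baseline.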

This digest was generated by AI and reviewed by human editors.