QUICK REVIEW

[Paper Review] Single-Shot Motion Completion with Transformer

Yinglin Duan, Tianyang Shi|arXiv (Cornell University)|Mar 1, 2021

Video Analysis and Summarization|References: 36|Citations: 29

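One-Sentence Summary

A transformer-based, non-autoregressive model completes missing motion frames within a unified framework that covers in-betweening, in-filling, and blending, and achieves state-of-the-art accuracy on LaFAN1.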

ABSTRACT

Motion completion is a challenging and long-discussed problem, which is of great significance in film and game applications. For different motion completion scenarios (in-betweening, in-filling, and blending), most previous methods deal with the completion problems with case-by-case designs. In this work, we propose a simple but effective method to solve multiple motion completion problems under a unified framework and achieves a new state of the art accuracy under multiple evaluation settings. Inspired by the recent great success of attention-based models, we consider the completion as a sequence to sequence prediction problem. Our method consists of two modules - a standard transformer encoder with self-attention that learns long-range dependencies of input motions, and a trainable mixture embedding module that models temporal information and discriminates key-frames. Our method can run in a non-autoregressive manner and predict multiple missing frames within a single forward propagation in real time. We finally show the effectiveness of our method in music-dance applications.

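Motivation and Objectives

Motivate and define motion completion across multiple scenarios (in-betweening, in-filling, blending) within a single framework.
Propose a transformer-based architecture with a learnable mixture embedding that models temporal information and the role of key-frames.
Predict multiple missing frames in one non-autoregressive pass to enable real-time inference.
Incorporate forward kinematics (FK) and inverse kinematics (IK) constraints to improve consistency between coordinate systems and the realism of the motion.
Evaluate on public datasets (LaFAN1, Anidance) and a new dance dataset, and show state-of-the-art performance.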

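Proposed Method

Use a standard transformer encoder (BERT-style) as the backbone that processes the masked input sequence.
Introduce a learnable mixture embedding, which combines a learnable position embedding with a key-frame embedding, to annotate each frame.
Before the transformer, convert motion poses into continuous tokens with a Conv1d temporal operator.
Predict the missing frames in a single forward pass, which allows non-autoregressive, parallel inference.
Train with a multi-task regression loss that includes pose reconstruction and kinematic (FK/IK) losses to keep the output physically consistent.
After the transformer layers, output the final predicted motion through a 1D convolution head.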
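
The mixture embedding described above can be illustrated with a minimal sketch. It assumes the two embeddings are combined by element-wise addition and that the key-frame embedding has two rows (unknown frame, given key-frame); neither detail is stated in this review. The names `make_embedding_table`, `mixture_embedding`, and `add_embedding` are hypothetical, and the random tables only stand in for parameters that would be trained.

```python
import random


def make_embedding_table(num_rows, dim, rng):
    """Random table standing in for trainable embedding parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(num_rows)]


def mixture_embedding(seq_len, keyframe_indices, pos_table, type_table):
    """Return one vector per frame: position embedding + key-frame-type embedding.

    Assumption: type_table row 0 marks a frame to be predicted,
    row 1 marks a given key-frame, and the two parts are summed.
    """
    keyframes = set(keyframe_indices)
    out = []
    for t in range(seq_len):
        kind = 1 if t in keyframes else 0
        out.append([p + k for p, k in zip(pos_table[t], type_table[kind])])
    return out


def add_embedding(tokens, embedding):
    """Add the mixture embedding to the per-frame tokens, element by element."""
    return [[x + e for x, e in zip(tok, emb)] for tok, emb in zip(tokens, embedding)]


rng = random.Random(0)
seq_len, dim = 8, 4
pos_table = make_embedding_table(seq_len, dim, rng)
type_table = make_embedding_table(2, dim, rng)

# Frames 0 and 7 are given key-frames; frames 1..6 are missing.
emb = mixture_embedding(seq_len, [0, 7], pos_table, type_table)

# Placeholder tokens; in the method these would come from the Conv1d step.
tokens = [[0.0] * dim for _ in range(seq_len)]
encoder_input = add_embedding(tokens, emb)
```

Because every frame, known or missing, receives its own token and embedding, an encoder can attend over the whole sequence at once and emit all missing frames in one pass, rather than generating them one at a time.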

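Experimental Results

Research Questions

RQ1: Can a unified Transformer-based framework handle in-betweening, in-filling, and blending in motion completion?
RQ2: Does the learnable mixture embedding improve temporal modeling and key-frame discrimination in completion tasks?
RQ3: Can non-autoregressive inference achieve real-time multi-frame completion without sacrificing accuracy?
RQ4: How do the FK and IK losses affect accuracy when operating in global versus local coordinate systems?
RQ5: How does the proposed method perform on the standard benchmark (LaFAN1) and on real-world and creative datasets (Anidance, dance blending)?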

Key Findings

Method	L2Q (5)	L2Q (15)	L2Q (30)	L2P (5)	L2P (15)	L2P (30)	NPSS (5)	NPSS (15)	NPSS (30)
Zero-Vel	0.56	1.10	1.51	1.52	3.69	6.60	0.0053	0.0522	0.2318
Interp	0.22	0.62	0.98	0.37	1.25	2.32	0.0023	0.0391	0.2013
ERD-QV [16]	0.17	0.42	0.69	0.23	0.65	1.28	0.0020	0.0258	0.1328
Ours (local w/o FK)	0.18	0.47	0.74	0.27	0.82	1.46	0.0020	0.0307	0.1487
Ours (local)	0.17	0.44	0.71	0.23	0.74	1.37	0.0019	0.0291	0.1430
Ours (global w/o ME & IK)	0.16	0.37	0.63	0.24	0.61	1.16	0.0018	0.0243	0.1284
Ours (global w/o IK)	0.14	0.36	0.61	0.21	0.57	1.11	0.0016	0.0238	0.1241
Ours* (global-full)	0.14	0.36	0.61	0.22	0.56	1.10	0.0016	0.0234	0.1222
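
To make the two simplest rows of the table concrete, here is a rough sketch of a zero-velocity baseline and a linear-interpolation baseline. It assumes the numbers in parentheses in the column headers are the counts of missing frames, treats each pose as a plain vector of positions (interpolating rotations is not covered), and uses `mean_l2` as a simplified stand-in for an L2-style error rather than the exact L2P or L2Q definition. All function names are hypothetical.

```python
import math


def zero_velocity(last_known, num_missing):
    """Zero-Vel style baseline: repeat the last known pose for every missing frame."""
    return [list(last_known) for _ in range(num_missing)]


def linear_interp(start, end, num_missing):
    """Interp style baseline: evenly spaced poses strictly between two key-frames."""
    frames = []
    for i in range(1, num_missing + 1):
        a = i / (num_missing + 1)  # interpolation weight in (0, 1)
        frames.append([(1 - a) * s + a * e for s, e in zip(start, end)])
    return frames


def mean_l2(pred, target):
    """Average Euclidean distance between predicted and target frames."""
    dists = [
        math.sqrt(sum((p - t) ** 2 for p, t in zip(pf, tf)))
        for pf, tf in zip(pred, target)
    ]
    return sum(dists) / len(dists)


start, end = [0.0, 0.0], [4.0, 8.0]
# Toy ground truth that happens to move in a straight line.
target = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]

interp = linear_interp(start, end, 3)
zero = zero_velocity(start, 3)
interp_error = mean_l2(interp, target)
zero_error = mean_l2(zero, target)
```

On this toy straight-line motion the interpolation baseline is exact, while the zero-velocity error grows with the distance travelled. Real motion is not linear, which is where the gap between the baselines and the learned models in the table comes from.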

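The method achieves state-of-the-art accuracy on LaFAN1 under multiple settings.
Non-autoregressive, single-pass forward propagation gives real-time inference on a CPU (e.g., about 0.025 s for a 1x30 sequence).
The mixture embedding and the FK loss improve L2Q, L2P, and NPSS in every column of the table above; adding the IK loss changes the results only marginally (for example, L2P (30) goes from 1.11 to 1.10).
With the proposed losses, prediction in the global coordinate system is generally more accurate than the local-coordinate setting.
The method generalizes to in-betweening, in-filling, and blending, and also handles key-frame placements found in the wild.
Qualitative results show dance motion that is more consistent and plausible than the linear interpolation baseline.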
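
The role of an FK loss can be illustrated with a toy planar skeleton. The sketch below is an assumption-laden illustration, not the paper's loss: it works in 2D with one angle per joint, places the root at the origin, and uses the mean Euclidean distance between joint positions as the penalty. `forward_kinematics_2d` and `fk_position_loss` are hypothetical names.

```python
import math


def forward_kinematics_2d(parents, offsets, local_angles):
    """Compute global joint positions from local joint angles (planar toy case).

    parents[j] is the parent index of joint j (-1 for the root); a parent
    always appears before its children. offsets[j] is the bone vector from
    the parent to joint j, expressed in the parent's frame.
    """
    global_angle = [0.0] * len(parents)
    positions = [(0.0, 0.0)] * len(parents)
    for j, p in enumerate(parents):
        if p < 0:
            global_angle[j] = local_angles[j]  # root stays at the origin
            continue
        c, s = math.cos(global_angle[p]), math.sin(global_angle[p])
        ox, oy = offsets[j]
        px, py = positions[p]
        # Rotate the bone by the parent's accumulated rotation, then translate.
        positions[j] = (px + c * ox - s * oy, py + s * ox + c * oy)
        global_angle[j] = global_angle[p] + local_angles[j]
    return positions


def fk_position_loss(parents, offsets, pred_angles, true_angles):
    """Mean distance between joint positions obtained by FK from two angle sets."""
    pred = forward_kinematics_2d(parents, offsets, pred_angles)
    true = forward_kinematics_2d(parents, offsets, true_angles)
    dists = [math.hypot(a[0] - b[0], a[1] - b[1]) for a, b in zip(pred, true)]
    return sum(dists) / len(dists)


# A three-joint chain: root -> joint 1 -> joint 2, each bone of length 1.
PARENTS = [-1, 0, 1]
OFFSETS = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0)]

rest_pose = forward_kinematics_2d(PARENTS, OFFSETS, [0.0, 0.0, 0.0])
# A 0.1 rad error at the root displaces every joint further down the chain.
loss = fk_position_loss(PARENTS, OFFSETS, [0.1, 0.0, 0.0], [0.0, 0.0, 0.0])
```

A small rotation error near the root moves the end of the chain twice as far as the middle joint here. A loss measured only on local rotations would not see that amplification, which is one plausible reason a position-level FK term helps when the model predicts in local coordinates.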

This review was generated by AI and checked by a human editor.