QUICK REVIEW

[Paper Review] Controllable Complex Human Motion Video Generation via Text-to-Skeleton Cascades

Ashkan Taghipour, Morteza Ghahremani|arXiv (Cornell University)|Mar 9, 2026

Human Motion and Animation | Cited by 0

One-Sentence Summary

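The paper proposes a two-stage cascaded framework that first generates 2D pose sequences from text and then renders pose-driven video with a deformation-aware diffusion model. It also introduces a Blender-based synthetic dataset of complex motions. The authors report that the framework outperforms the compared baselines on both the text-to-pose and pose-to-video tasks.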

ABSTRACT

Generating videos of complex human motions such as flips, cartwheels, and martial arts remains challenging for current video diffusion models. Text-only conditioning is temporally ambiguous for fine-grained motion control, while explicit pose-based controls, though effective, require users to provide complete skeleton sequences that are costly to produce for long and dynamic actions. We propose a two-stage cascaded framework that addresses both limitations. First, an autoregressive text-to-skeleton model generates 2D pose sequences from natural language descriptions by predicting each joint conditioned on previously generated poses. This design captures long-range temporal dependencies and inter-joint coordination required for complex motions. Second, a pose-conditioned video diffusion model synthesizes videos from a reference image and the generated skeleton sequence. It employs DINO-ALF (Adaptive Layer Fusion), a multi-level reference encoder that preserves appearance and clothing details under large pose changes and self-occlusions. To address the lack of publicly available datasets for complex human motion video generation, we introduce a Blender-based synthetic dataset containing 2,000 videos with diverse characters performing acrobatic and stunt-like motions. The dataset provides full control over appearance, motion, and environment. It fills an important gap because existing benchmarks significantly under-represent acrobatic motions while web-collected datasets raise copyright and privacy concerns. Experiments on our synthetic dataset and the Motion-X Fitness benchmark show that our text-to-skeleton model outperforms prior methods on FID, R-precision, and motion diversity. Our pose-to-video model also achieves the best results among all compared methods on VBench metrics for temporal consistency, motion smoothness, and subject preservation.

Research Motivation and Goals

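Advance controllable generation of non-repetitive, complex human motions (such as flips and acrobatics) beyond what text-only conditioning allows.
Decouple motion planning from appearance synthesis to enable editable, pose-accurate control.
Develop an autoregressive text-to-skeleton model that generates temporally coherent 2D pose sequences from language.
Introduce DINO-ALF, a multi-level appearance conditioning mechanism for robust video synthesis under large pose changes.
Provide a synthetic dataset of complex-motion videos for benchmarking and for training models on challenging motions.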

Proposed Method

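Stage 1: text-to-skeleton generation. An autoregressive Transformer converts natural language into a 2D pose sequence represented as discrete joint tokens.
Pose discretization maps continuous coordinates to token IDs and serializes frames and joints into a one-dimensional token stream.
Text conditioning comes from a frozen CLIP text encoder, whose embeddings are prepended to the pose tokens as a persistent conditioning prefix.
Stage 2: pose-driven video generation. A diffusion backbone is conditioned on a reference image and the generated skeleton sequence.
DINO-ALF fuses multi-layer DINOv3 patch descriptors to preserve appearance under large deformations and self-occlusions, replacing CLIP-based conditioning with DINO-based cross-attention.
Training keeps the diffusion backbone frozen, fine-tunes it through LoRA adapters, and applies condition dropout to improve robustness.
A Blender-based synthetic dataset of 2,000 complex-motion videos is introduced for training and evaluation on acrobatic and stunt-like motions.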
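
The pose discretization step in Stage 1 can be illustrated with a minimal sketch. It is not the authors' implementation: the bin count (`NUM_BINS = 256`), the shared x/y vocabulary, the normalization of coordinates to [0, 1], and the frame-major, joint-major, x-then-y ordering are all assumptions made for illustration, since this review does not state those details.

```python
from typing import List, Tuple

NUM_BINS = 256  # assumed vocabulary size per coordinate (not given in the review)

Pose = List[Tuple[float, float]]  # one frame: (x, y) per joint, normalized to [0, 1]


def quantize(value: float, num_bins: int = NUM_BINS) -> int:
    """Map a normalized coordinate in [0, 1] to an integer token ID."""
    clipped = min(max(value, 0.0), 1.0)
    # value == 1.0 would land in bin num_bins, so clamp to the last bin.
    return min(int(clipped * num_bins), num_bins - 1)


def dequantize(token: int, num_bins: int = NUM_BINS) -> float:
    """Map a token ID back to the center of its bin."""
    return (token + 0.5) / num_bins


def poses_to_tokens(frames: List[Pose], num_bins: int = NUM_BINS) -> List[int]:
    """Serialize frames into a 1D stream: frame by frame, joint by joint, x then y."""
    tokens: List[int] = []
    for pose in frames:
        for x, y in pose:
            tokens.append(quantize(x, num_bins))
            tokens.append(quantize(y, num_bins))
    return tokens


def tokens_to_poses(tokens: List[int], num_joints: int,
                    num_bins: int = NUM_BINS) -> List[Pose]:
    """Invert poses_to_tokens, up to quantization error."""
    per_frame = 2 * num_joints
    if per_frame == 0 or len(tokens) % per_frame != 0:
        raise ValueError("token stream length must be a multiple of 2 * num_joints")
    frames: List[Pose] = []
    for start in range(0, len(tokens), per_frame):
        chunk = tokens[start:start + per_frame]
        frames.append([(dequantize(chunk[i], num_bins), dequantize(chunk[i + 1], num_bins))
                       for i in range(0, per_frame, 2)])
    return frames
```

Under these assumptions, an autoregressive model would predict the stream one token at a time, and decoding recovers each coordinate to within half a bin width. A finer grid lowers that error but enlarges the vocabulary the model has to learn.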

Experimental Results

Research Questions

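RQ1: Can an autoregressive text-to-skeleton model generate reliable, controllable 2D pose sequences for highly dynamic, non-repetitive motions?
RQ2: Can a deformation-aware, skeleton-guided diffusion model preserve appearance and maintain temporal coherence under large pose changes?
RQ3: Which gaps in existing benchmarks' coverage of acrobatic and stunt motions can a synthetic complex-motion dataset fill?
RQ4: How much do multi-layer DINO-ALF appearance cues improve pose-driven video synthesis over CLIP-based conditioning?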

Key Findings

Methods	FID	Rp-top1	Rp-top2	Rp-top3	Diversity	MM-Dist
T2M-GPT	524.61	0.191	0.287	0.473	40.11	49.85
PriorMDM	585.31	0.216	0.325	0.501	42.58	44.29
MLD	467.22	0.335	0.503	0.653	41.67	47.66
HumanDreamer	322.16	0.411	0.598	0.722	45.33	41.53
Ours	255.19	0.487	0.667	0.784	48.33	38.65

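In the text-to-pose evaluation, the text-to-skeleton model outperforms the baselines on FID, R-precision, and motion diversity.
The pose-driven video model achieves the best results among the compared methods on the VBench metrics for temporal consistency, motion smoothness, and subject preservation.
The proposed two-stage cascade decouples motion planning from appearance synthesis, which makes complex motions controllable.
The Blender-based synthetic dataset provides 2,000 diverse acrobatic and stunt-like motion videos, filling a dataset gap and avoiding the copyright and privacy concerns of web-collected data.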
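
For readers unfamiliar with the Rp-top1/2/3 columns, the sketch below shows how R-precision is commonly computed in text-to-motion evaluation: each text is ranked against a pool of candidate motions by embedding distance, and the metric is the fraction of texts whose matching motion lands in the top k. The function name, the precomputed distance-matrix input, and the tie handling are assumptions for illustration. The review does not state the paper's exact protocol, such as the pool size or the embedding model.

```python
from typing import List


def r_precision(dist: List[List[float]], top_k: int = 3) -> List[float]:
    """Return [top-1, ..., top-k] retrieval accuracy.

    dist[i][j] is the distance between the embedding of text i and the
    embedding of motion j; the matching motion for text i sits at column i.
    """
    n = len(dist)
    if n == 0:
        raise ValueError("distance matrix must not be empty")
    hits = [0] * top_k
    for i, row in enumerate(dist):
        # Rank of the true motion = number of candidates strictly closer to the text.
        # Ties are resolved in favor of the true motion (an optimistic choice).
        rank = sum(1 for d in row if d < row[i])
        for k in range(rank, top_k):
            hits[k] += 1  # counts toward every top-(k+1) with k >= rank
    return [h / n for h in hits]


if __name__ == "__main__":
    example = [
        [0.1, 0.5, 0.9],  # true motion is the closest candidate
        [0.2, 0.3, 0.1],  # true motion is third closest
        [0.1, 0.2, 0.9],  # true motion is third closest
    ]
    print(r_precision(example))
```

Higher values are better, and the top-k scores are non-decreasing in k by construction, which matches the pattern in the table.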


This review was generated by AI and reviewed by human editors.