QUICK REVIEW

[Paper Review] Superman: Unifying Skeleton and Vision for Human Motion Perception and Generation

Xinshun Wang, Peiming Li|arXiv (Cornell University)|Feb 2, 2026

Human Pose and Action Recognition | Cited by 0

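One-Sentence Summary

A unified framework that grounds a cross-modal motion vocabulary in both vision and geometry, so that a single multimodal large language model (MLLM) can perform 3D pose estimation, motion prediction, and motion in-betweening between keyframes from video and skeleton inputs. It introduces a Vision-Guided Motion Tokenizer and an MLLM that can optionally be enhanced with Motion-Aware Fine-Tuning (MAFT) to improve performance on motion tasks.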

ABSTRACT

Human motion analysis tasks, such as temporal 3D pose estimation, motion prediction, and motion in-betweening, play an essential role in computer vision. However, current paradigms suffer from severe fragmentation. First, the field is split between "perception" models that understand motion from video but only output text, and "generation" models that cannot perceive from raw visual input. Second, generative MLLMs are often limited to single-frame, static poses using dense, parametric SMPL models, failing to handle temporal motion. Third, existing motion vocabularies are built from skeleton data alone, severing the link to the visual domain. To address these challenges, we introduce Superman, a unified framework that bridges visual perception with temporal, skeleton-based motion generation. Our solution is twofold. First, to overcome the modality disconnect, we propose a Vision-Guided Motion Tokenizer. Leveraging the natural geometric alignment between 3D skeletons and visual data, this module pioneers robust joint learning from both modalities, creating a unified, cross-modal motion vocabulary. Second, grounded in this motion language, a single, unified MLLM architecture is trained to handle all tasks. This module flexibly processes diverse, temporal inputs, unifying 3D skeleton pose estimation from video (perception) with skeleton-based motion prediction and in-betweening (generation). Extensive experiments on standard benchmarks, including Human3.6M, demonstrate that our unified method achieves state-of-the-art or competitive performance across all motion tasks. This showcases a more efficient and scalable path for generative motion analysis using skeletons.

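Motivation and Objectives

Bridge perception and generation for human motion analysis within a single model.
Ground motion tokens in both visual appearance and 3D geometry, connecting video input to skeleton-based tasks.
Develop a multimodal large language model that handles both perception and generation tasks in one architecture.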

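Proposed Method

Propose a VQ-VAE-based Vision-Guided Motion Tokenizer (VGMT) that fuses visual features with 3D skeleton geometry into a hybrid codebook of bimodal tokens.
Use a dual-stream encoder to extract visual (frame-based) and skeleton (joint-temporal) features, then perform joint token quantization over the hybrid codebook.
Train the tokenizer end-to-end with a VQ objective that combines reconstruction and modality-level commitment losses.
Fine-tune a decoder-only multimodal large language model (Qwen2.5-VL-7B) to autoregressively predict motion tokens for multiple tasks, optionally enhanced by a Motion-Aware Fine-Tuning (MAFT) module that injects skeleton geometry into visual features through Visual-Skeleton Attention (VSA).
Formulate the three tasks as conditional sequence generation: video-based 3D pose estimation, motion prediction from past poses, and motion in-betweening between keyframes.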
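
To make the "conditional sequence generation" framing concrete, here is a minimal sketch of how the three tasks could share one token interface. It is an illustration only: the special tokens (`<video>`, `<motion_mask>`, `<motion_k>`), the instruction strings, and the `build_prompt` helper are hypothetical and are not taken from the paper.

```python
from typing import List, Optional

# Hypothetical special tokens; the paper's actual vocabulary is not given here.
VIDEO = "<video>"
MASK = "<motion_mask>"


def motion_token(idx: int) -> str:
    """Render a codebook index as a text-like motion token (illustrative naming)."""
    return f"<motion_{idx}>"


def build_prompt(task: str,
                 motion_ids: Optional[List[Optional[int]]] = None) -> List[str]:
    """Return the conditioning sequence for one of the three tasks.

    The model would then autoregressively generate motion tokens after it.
    """
    ids = motion_ids or []
    if task == "pose_estimation":
        # Condition on video only; every motion token is generated.
        return ["Estimate the 3D poses:", VIDEO]
    if task == "prediction":
        # Condition on past motion tokens; the model continues the sequence.
        return ["Predict the future motion:"] + [motion_token(i) for i in ids]
    if task == "in_betweening":
        # Condition on keyframes; None marks a frame the model must fill in.
        return ["Fill in the missing motion:"] + [
            MASK if i is None else motion_token(i) for i in ids
        ]
    raise ValueError(f"unknown task: {task}")
```

For example, `build_prompt("in_betweening", [1, None, 2])` places a mask token between the two keyframe tokens. The only point of the sketch is that all three tasks reduce to "condition on some tokens, generate motion tokens", which is what lets a single model cover them.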

Figure 2 : Architecture of our Vision-Guided Motion Tokenizer (VGMT). VGMT creates a discrete motion vocabulary by jointly fusing information from two modalities. A Skeleton Encoder ( $E_{s}$ ) captures geometry while a Visual-Skeleton Attention (VSA) module and a subsequent Visual Encoder ( $E_{v}$
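
The quantization step behind a discrete vocabulary of this kind can be sketched as follows. This is a generic VQ-VAE-style illustration in plain Python, not the authors' implementation: it assumes a single already-fused feature vector and a single codebook, so the hybrid codebook and the modality-level split of the commitment loss are not modeled. The names `quantize` and `vq_loss` and the weight `beta=0.25` are assumptions chosen for the example.

```python
from typing import List, Tuple

Vector = List[float]


def sq_dist(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(z: Vector, codebook: List[Vector]) -> Tuple[int, Vector]:
    """Map an encoder output z to the index and entry of its nearest code."""
    idx = min(range(len(codebook)), key=lambda k: sq_dist(z, codebook[k]))
    return idx, codebook[idx]


def vq_loss(x: Vector, x_hat: Vector, z: Vector, code: Vector,
            beta: float = 0.25) -> float:
    """Forward value of reconstruction loss plus beta times commitment loss.

    In a real training setup the commitment term uses a stop-gradient on the
    code; with no autodiff here, only the numeric value is computed.
    """
    recon = sq_dist(x, x_hat) / len(x)   # mean squared reconstruction error
    commit = sq_dist(z, code) / len(z)   # pulls z toward its chosen code
    return recon + beta * commit
```

The index returned by `quantize` is what a downstream language model would treat as a motion token, and the commitment term keeps encoder outputs close to the codes they select.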

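Experimental Results

Research Questions

RQ1: Can a unified model use a cross-modal motion vocabulary grounded in vision and skeleton geometry to both perceive and generate human motion across multiple tasks?
RQ2: Does grounding motion tokens in visual input yield better performance on perception and generation tasks than skeleton-only or vision-only tokenization?
RQ3: How do model and codebook scaling, VSA/MAFT ablations, and unified multi-task training affect performance and generalization?
RQ4: How well does the proposed framework generalize on perception and generation tasks to unseen datasets (for example, from Human3.6M to 3DPW)?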

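Key Findings

The unified Superman framework achieves state-of-the-art or competitive results on standard benchmarks for 3D pose estimation, motion prediction, and motion in-betweening.
A Vision-Guided Motion Tokenizer with a hybrid visual-geometric codebook effectively enables cross-modal pose representation and improves downstream task performance.
MAFT and VSA each improve motion perception and generation, and combining them gives the best results.
When trained only on Human3.6M, the model still performs well on unseen data (such as 3DPW), outperforming existing methods in generalization tests.
Scaling up model size and codebook capacity consistently reduces pose error, indicating scalability gains.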
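
This summary does not say how pose error is measured. A common choice on Human3.6M is mean per-joint position error (MPJPE), and the sketch below assumes that metric purely for illustration. The function name `mpjpe` and the nested-list layout (frames, then joints, then xyz coordinates) are assumptions for the example.

```python
import math
from typing import List, Tuple

Joint = Tuple[float, float, float]


def mpjpe(pred: List[List[Joint]], gt: List[List[Joint]]) -> float:
    """Mean Euclidean distance between predicted and ground-truth joints.

    pred and gt are indexed as [frame][joint] and must have the same shape.
    The result is in the same unit as the inputs (often millimetres).
    """
    total = 0.0
    count = 0
    for pred_frame, gt_frame in zip(pred, gt):
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)  # per-joint 3D distance
            count += 1
    return total / count
```

With one frame of two joints, where one joint is exact and the other is off by 5 units, the function returns 2.5. Lower values mean the predicted skeleton is closer to the ground truth.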

Figure 3 : Network architecture and training paradigm. Superman fine-tunes a single LLM to integrate information from text, video, and 3D skeleton modalities. Optionally, a Motion-Aware Fine-Tuning (MAFT) module can be integrated. With $<$ 0.2% extra parameters, MAFT enhances motion perception by ena

This review was generated by AI and reviewed by a human editor.