QUICK REVIEW

[Paper Digest] Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

Siyuan Yang, Jun Liu|arXiv (Cornell University)|Mar 6, 2026

Human Pose and Action Recognition | Cited by 0

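One-Sentence Summary

S2I converts 3D skeleton sequences into image-like data by partitioning joints according to body-part semantics and stacking them over time, so that vision-pretrained models (MAE/DiffMAE) can learn skeleton representations and support cross-format and universal skeleton learning.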

ABSTRACT

Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.

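Research Motivation and Goals

Bridge the modality gap between 3D skeleton data and image-based vision models through a unified image-like representation.
Enable self-supervised skeleton representation learning on top of vision pretraining, so that large-scale visual priors can be reused.
Support cross-format and universal skeleton representation learning across heterogeneous skeleton datasets.
Demonstrate strong generalization on cross-format transfer and universal pretraining benchmarks.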

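Proposed Method

Partition the skeleton joints into five body parts (torso, left arm, right arm, left leg, right leg) and order the joints within each part by their distance from the torso.
Stack the joint coordinates over time to form a T x J x 3 spatio-temporal representation, mapping x, y, z to the R, G, B channels.
Resize the resulting image-like representation to 224 x 224 to match the standard input size of vision models.
Pretrain image-based models (MAE and DiffMAE) on the S2I representation with masked modeling (reconstruction or diffusion-based denoising).
Fine-tune or linear-probe on downstream skeleton action recognition tasks with a standard cross-entropy loss.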
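
The encoding steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the joint indices in BODY_PARTS (a 25-joint layout), the min-max normalization to [0, 255], and the nearest-neighbor resize are all assumptions made for the example, since this digest does not specify them.

```python
import numpy as np

# Hypothetical grouping for a 25-joint skeleton (0-based indices).
# Each list is meant to run from the torso outward; the indices are
# illustrative assumptions, not values taken from the paper.
BODY_PARTS = {
    "torso": [0, 1, 20, 2, 3],
    "left_arm": [4, 5, 6, 7, 21, 22],
    "right_arm": [8, 9, 10, 11, 23, 24],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}


def skeleton_to_image(seq, size=224):
    """Encode a (T, J, 3) skeleton sequence as a (size, size, 3) uint8 image."""
    seq = np.asarray(seq, dtype=np.float64)

    # Rearrange the joint axis so joints of the same body part are adjacent.
    order = [j for part in BODY_PARTS.values() for j in part]
    arranged = seq[:, order, :]  # rows = time, columns = joints, channels = x, y, z

    # Min-max normalize each coordinate axis to [0, 255] (assumed scheme).
    lo = arranged.min(axis=(0, 1), keepdims=True)
    hi = arranged.max(axis=(0, 1), keepdims=True)
    scaled = (arranged - lo) / np.maximum(hi - lo, 1e-8) * 255.0

    # Nearest-neighbor resize to size x size (assumed interpolation).
    t, j, _ = scaled.shape
    rows = np.arange(size) * t // size
    cols = np.arange(size) * j // size
    image = scaled[rows][:, cols]
    return np.round(image).astype(np.uint8)
```

Because every skeleton format ends up as the same 224 x 224 x 3 array, a different joint layout would only need its own BODY_PARTS table; the downstream vision model input stays unchanged.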

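Experimental Results

Research Questions

RQ1: Can vision-pretrained models be used effectively for skeleton analysis through a unified Skeleton-to-Image representation?
RQ2: Can S2I support robust cross-format and universal skeleton representation learning on heterogeneous skeleton datasets?
RQ3: Within the S2I framework, which masking strategy and which skeleton modality benefit self-supervised skeleton learning the most?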
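
To make the notion of a masking strategy concrete, here is a minimal sketch of MAE-style random patch masking on a 224 x 224 input. The 16 x 16 patch size and 75% mask ratio are the usual MAE defaults and are assumed here; the strategies actually compared in the paper are not listed in this digest, and the function names are hypothetical.

```python
import numpy as np


def random_patch_mask(image_size=224, patch_size=16, mask_ratio=0.75, seed=0):
    """Return a boolean (grid, grid) array; True marks a masked patch."""
    grid = image_size // patch_size
    num_patches = grid * grid
    num_masked = int(round(num_patches * mask_ratio))
    rng = np.random.default_rng(seed)
    masked_ids = rng.choice(num_patches, size=num_masked, replace=False)
    mask = np.zeros(num_patches, dtype=bool)
    mask[masked_ids] = True
    return mask.reshape(grid, grid)


def apply_mask(image, mask, patch_size=16):
    """Return a copy of an (H, W, C) image with masked patches set to zero."""
    # Expand the patch-level mask to pixel resolution.
    pixel_mask = mask.repeat(patch_size, axis=0).repeat(patch_size, axis=1)
    out = image.copy()
    out[pixel_mask] = 0
    return out
```

Since rows of an S2I image correspond to time and columns to joints, a structured variant could mask whole row bands or column bands instead of random patches; that variant is mentioned only as an illustration of what such a comparison might cover.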

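Main Findings

Skeleton-to-Image encoding enables MAE and DiffMAE backbones to learn skeleton representations, with competitive performance under both linear probing and fine-tuning.
Image-pretrained weights provide a significant boost; DiffMAE generally outperforms MAE in S2I pretraining.
Three-stream S2I fusion (joint, motion, bone) reaches state-of-the-art results in linear evaluation on NTU-60 C-sub, NTU-120 C-set, and PKU-II.
In the semi-supervised setting on NTU-60 (X-sub) with only 1% of the labels, S2I reaches 71.4% and 3s-S2I reaches 75.2%, showing strong performance when labels are scarce.
Cross-format transfer learning and universal pretraining experiments show that S2I improves generalization across heterogeneous skeleton datasets and benefits universal skeleton representation learning.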
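
For readers unfamiliar with the joint, motion, and bone streams mentioned above, the sketch below derives them from a (T, J, 3) sequence using conventions that are common in skeleton action recognition. It is an assumption-laden illustration: the parent table, the zero-padding of the first motion frame, and fusion by averaging class scores are not confirmed by this digest.

```python
import numpy as np

# Hypothetical parent index per joint for a toy 5-joint chain; joint 0 is the
# root and is listed as its own parent, so its bone vector is zero.
PARENTS = [0, 0, 1, 2, 3]


def motion_stream(seq):
    """Frame-to-frame displacement; the first frame is zero-padded."""
    motion = np.zeros_like(seq)
    motion[1:] = seq[1:] - seq[:-1]
    return motion


def bone_stream(seq, parents):
    """Vector from each joint's parent to the joint itself."""
    return seq - seq[:, parents, :]


def fuse_scores(scores):
    """Average per-stream class scores (an assumed late-fusion rule)."""
    return np.mean(np.stack(scores, axis=0), axis=0)
```

Each stream has the same shape as the joint stream, so under these assumptions all three could pass through the same skeleton-to-image encoding before their predictions are combined.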


This digest was generated by AI and reviewed by a human editor.