QUICK REVIEW

[Paper Review] Skeleton-to-Image Encoding: Enabling Skeleton Representation Learning via Vision-Pretrained Models

Siyuan Yang, Jun Liu|arXiv (Cornell University)|Mar 6, 2026

Human Pose and Action Recognition | Citations: 0

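One-Line Summary

S2I converts 3D skeleton sequences into image-like data through semantic partitioning and temporal stacking, so that vision-pretrained models (MAE/DiffMAE) can learn skeleton representations, enabling cross-format and universal skeleton learning.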

ABSTRACT

Recent advances in large-scale pretrained vision models have demonstrated impressive capabilities across a wide range of downstream tasks, including cross-modal and multi-modal scenarios. However, their direct application to 3D human skeleton data remains challenging due to fundamental differences in data format. Moreover, the scarcity of large-scale skeleton datasets and the need to incorporate skeleton data into multi-modal action recognition without introducing additional model branches present significant research opportunities. To address these challenges, we introduce Skeleton-to-Image Encoding (S2I), a novel representation that transforms skeleton sequences into image-like data by partitioning and arranging joints based on body-part semantics and resizing to standardized image dimensions. This encoding enables, for the first time, the use of powerful vision-pretrained models for self-supervised skeleton representation learning, effectively transferring rich visual-domain knowledge to skeleton analysis. While existing skeleton methods often design models tailored to specific, homogeneous skeleton formats, they overlook the structural heterogeneity that naturally arises from diverse data sources. In contrast, our S2I representation offers a unified image-like format that naturally accommodates heterogeneous skeleton data. Extensive experiments on NTU-60, NTU-120, and PKU-MMD demonstrate the effectiveness and generalizability of our method for self-supervised skeleton representation learning, including under challenging cross-format evaluation settings.

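Motivation and Objectives

Bridge the modality gap between 3D skeleton data and pretrained vision models with a unified image-like representation.
Enable self-supervised skeleton representation learning with vision-pretrained models, exploiting large-scale visual priors.
Support cross-format and universal skeleton representation learning across heterogeneous skeleton datasets.
Show strong generalization in cross-format transfer between benchmarks and in universal pretraining.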

Proposed Method

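Partition the skeleton joints into five body parts (torso, left shoulder, right shoulder, left leg, right leg) and order the joints within each part by proximity to the torso.
Stack the joint coordinates along the time axis to form a T × J × 3 spatiotemporal representation, mapping x, y, z to the RGB channels.
Resize the resulting image-like representation to 224 × 224 to match standard vision-model input.
Pretrain image-based models (MAE and DiffMAE) on the S2I representation with masked modeling (reconstruction or diffusion-based denoising).
Fine-tune or linear-probe on downstream skeleton action recognition tasks with a standard cross-entropy loss.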
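
The first three steps above (partition and order joints, stack to T × J × 3, resize to 224 × 224) can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of mean Euclidean distance to a single reference torso joint as the "proximity" measure, the per-channel min-max scaling to [0, 1], and the nearest-neighbor resize are all assumptions made here for the sake of a self-contained example. The `parts` mapping is supplied by the caller and is hypothetical; the review does not list joint indices.

```python
import numpy as np

def order_joints(seq, parts, torso_ref):
    """Joint order: parts in dict order, joints in each part sorted by
    mean distance to the reference torso joint (an assumed proximity measure).

    seq: array of shape (T, J, 3); parts: dict name -> list of joint indices.
    """
    ref = seq[:, torso_ref, :]                       # (T, 3)
    order = []
    for joint_ids in parts.values():
        ids = np.asarray(joint_ids)
        # Mean over time of the Euclidean distance to the reference joint.
        dist = np.linalg.norm(seq[:, ids, :] - ref[:, None, :], axis=-1).mean(axis=0)
        order.extend(ids[np.argsort(dist, kind="stable")].tolist())
    return order

def resize_nearest(img, out_h, out_w):
    """Nearest-neighbor resize of an (H, W, C) array using integer indexing."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols[None, :]]

def skeleton_to_image(seq, parts, torso_ref, size=224):
    """Turn a (T, J, 3) skeleton sequence into a (size, size, 3) image-like array.

    Rows index time, columns index the reordered joints, and the x, y, z
    coordinates become the three channels.
    """
    order = order_joints(seq, parts, torso_ref)
    arranged = seq[:, order, :].astype(np.float64)   # (T, J, 3)
    # Assumed normalization: scale each coordinate channel to [0, 1].
    lo = arranged.min(axis=(0, 1), keepdims=True)
    hi = arranged.max(axis=(0, 1), keepdims=True)
    scaled = (arranged - lo) / np.maximum(hi - lo, 1e-8)
    return resize_nearest(scaled, size, size)
```

Because the output shape no longer depends on the number of joints J, skeletons with different joint counts end up in the same input format, which is the property the review attributes to S2I for cross-format learning. A real pipeline would likely use an interpolating resize from an image library rather than nearest-neighbor indexing.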

Experimental Results

Research Questions

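RQ1: Can vision-pretrained models be effectively reused for skeleton analysis through a unified Skeleton-to-Image representation?
RQ2: Does S2I achieve robust cross-format and universal skeleton representation learning across heterogeneous skeleton datasets?
RQ3: Which masking strategies and skeleton modalities best support self-supervised skeleton learning in the S2I framework?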
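
To make the masking question in RQ3 concrete, here is a minimal sketch of one possible strategy: MAE-style random patch masking on a 224 × 224 × 3 image-like input, with a reconstruction loss computed only on masked patches. The patch size of 16, the mask ratio of 0.75, and all function names are assumptions borrowed from common MAE practice on images; the review does not state which strategies or ratios the paper compares.

```python
import numpy as np

def patchify(img, patch=16):
    """Split an (H, W, C) array into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), row-major
    over the patch grid.
    """
    h, w, c = img.shape
    gh, gw = h // patch, w // patch
    x = img.reshape(gh, patch, gw, patch, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(gh * gw, patch * patch * c)

def random_mask(num_patches, mask_ratio, rng):
    """Pick patches to keep uniformly at random.

    Returns (keep_indices, mask) where mask[i] is True for masked patches.
    """
    num_keep = int(num_patches * (1 - mask_ratio))
    keep = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep] = False
    return keep, mask

def masked_mse(pred, target, mask):
    """Mean squared error averaged over masked patches only."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return per_patch[mask].mean()
```

In this layout each patch covers a block of consecutive time steps and neighboring joints, so masking a patch hides a short spatiotemporal segment of the motion. The encoder would see only the patches in `keep_indices`, and a decoder would predict the rest.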

Key Findings

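Skeleton-to-Image encoding lets MAE and DiffMAE backbones learn skeleton representations, with competitive results under linear evaluation and fine-tuning.
Image-pretrained weights provide notable gains, and DiffMAE is generally a better backbone than MAE.
3-stream S2I fusion (joint, motion, bone) achieves state-of-the-art linear-evaluation results on NTU-60 C-sub, NTU-120 C-set, and PKU-II.
In the semi-supervised setting on NTU-60 with 1% labeled data, the method reaches 71.4% (S2I) and 75.2% (3s-S2I), a strong result under limited labels.
Cross-format transfer learning and universal pretraining experiments show that S2I improves generalization across heterogeneous skeleton datasets and benefits universal skeleton representation learning.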
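
The three streams named in the fusion result (joint, motion, bone) can be illustrated with the definitions commonly used in skeleton-based action recognition. The sketch below assumes that motion is the frame-to-frame difference of joint positions, that bone is the offset of each joint from its parent joint, and that fusion averages per-stream class scores. The review does not confirm any of these details, and the `parents` array and function names are hypothetical.

```python
import numpy as np

def motion_stream(seq):
    """Frame-to-frame displacement of each joint; the last frame is zero-padded.

    seq: array of shape (T, J, 3).
    """
    motion = np.zeros_like(seq)
    motion[:-1] = seq[1:] - seq[:-1]
    return motion

def bone_stream(seq, parents):
    """Offset of each joint from its parent joint.

    parents[j] is the parent index of joint j; a root joint that lists
    itself as parent gets a zero bone vector.
    """
    parents = np.asarray(parents)
    return seq - seq[:, parents, :]

def fuse_scores(score_list):
    """Late fusion by averaging class-score arrays of identical shape."""
    return np.mean(np.stack(score_list, axis=0), axis=0)
```

Each stream keeps the (T, J, 3) shape of the joint stream, so all three can pass through the same skeleton-to-image encoding before being fed to separate models whose outputs are then fused.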


This review was generated by AI and checked by a human editor.