QUICK REVIEW

[Paper Review] Structural Action Transformer for 3D Dexterous Manipulation

Xiaohan Lei, Min Wang|arXiv (Cornell University)|Mar 4, 2026

Robot Manipulation and Learning|Citations: 0

Summary in Brief

The paper introduces the Structural Action Transformer (SAT), a 3D dexterous manipulation policy that tokenizes each action chunk by joint trajectory (a Da × T layout, one token per joint) rather than by time-sliced action vectors, which improves cross-embodiment transfer and data efficiency. It uses an Embodied Joint Codebook and a continuous-time flow matching objective to generate action chunks from 3D point clouds and language inputs.

ABSTRACT

Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.

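Motivation and Objectives

Motivate cross-embodiment imitation learning for high-degree-of-freedom (DoF) dexterous hands from 3D point clouds and language inputs.
Propose a structural-centric action representation (Da × T) that allows the joint count to vary across embodiments.
Introduce an Embodied Joint Codebook that encodes each joint's functional role and kinematic properties to facilitate transfer.
Pre-train on heterogeneous datasets, then fine-tune on simulation and real-world tasks to evaluate sample efficiency and generalization.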

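Proposed Method

Represent each action chunk as a sequence of joint trajectories: A_t ∈ R^{Da × T}, where each row is one joint's future trajectory.
Model p(A_t | o_t) with a continuous normalizing flow (CNF) whose conditional velocity field ε_θ is trained with a flow matching objective.
Encode observations with a hierarchical 3D point cloud tokenizer and a T5-based language encoder, and condition the DiT on these multimodal inputs.
Incorporate the Embodied Joint Codebook, which maps each joint to an (Embodiment, Function, Rotation) triplet so that joints from morphologically different hands can be aligned.
Predict the action velocity field with a Transformer-based DiT that uses causal masking, and obtain the final action chunk through an ODE solver.
Pre-train on large-scale heterogeneous datasets (human and robot demonstrations, simulation) and fine-tune on downstream tasks. Evaluation covers Adroit, DexArt, Bi-DexHands, and real-world bimanual tasks.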
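
The flow matching target and the ODE-based sampling step listed above can be sketched in a few lines of NumPy. This is a minimal sketch rather than the authors' implementation: it assumes the common straight-line probability path between Gaussian noise and the demonstrated chunk, uses a fixed-step Euler solver, and omits the point cloud and language conditioning. All function names are hypothetical, and an oracle velocity field stands in for the learned ε_θ.

```python
import numpy as np


def flow_matching_pair(a1, rng):
    """Draw one training triple (x_t, t, v_target) on a straight noise-to-data path."""
    a0 = rng.standard_normal(a1.shape)   # noise sample at t = 0
    t = rng.uniform()                    # time drawn uniformly from [0, 1)
    x_t = (1.0 - t) * a0 + t * a1        # point on the line from a0 to a1
    v_target = a1 - a0                   # velocity of that line (constant in t)
    return x_t, t, v_target


def flow_matching_loss(velocity_fn, a1, rng):
    """Mean squared error between the predicted and the target velocity."""
    x_t, t, v_target = flow_matching_pair(a1, rng)
    return float(np.mean((velocity_fn(x_t, t) - v_target) ** 2))


def euler_sample(velocity_fn, shape, rng, num_steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 to t = 1 with Euler steps."""
    x = rng.standard_normal(shape)       # start from noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        x = x + dt * velocity_fn(x, k * dt)
    return x


# Toy check with a single demonstrated chunk of shape (Da, T) = (4, 8).
rng = np.random.default_rng(0)
demo_chunk = np.linspace(-1.0, 1.0, 32).reshape(4, 8)


def oracle_velocity(x, t):
    """Exact velocity field when the data distribution is the single chunk above."""
    return (demo_chunk - x) / (1.0 - t)


loss = flow_matching_loss(oracle_velocity, demo_chunk, rng)
sample = euler_sample(oracle_velocity, demo_chunk.shape, rng)
```

With this single-target oracle the loss is zero up to rounding and the Euler rollout lands on the demonstrated chunk. In an actual policy, a network conditioned on the observation tokens would replace `oracle_velocity`.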

Figure 1 : Conceptual illustration of action chunk tokenization. (a) The conventional temporal-centric perspective, which structures actions as a sequence of $T$ timesteps (chunk length), with each token having dimension $D_{a}$ (action dim). (b) Our proposed structural-centric perspective, which re
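
To make the structural-centric view concrete, the sketch below turns a temporal chunk of shape (T, Da) into Da joint tokens and tags each token with an (Embodiment, Function, Rotation) triplet. It is an illustration under stated assumptions, not the paper's architecture: the vocabularies, the random embedding tables, the shared projection `W_traj`, and the choice to sum the three embeddings are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_MODEL = 16, 32  # chunk length and token width (illustrative values)

# Hypothetical vocabularies for the (Embodiment, Function, Rotation) triplet.
EMBODIMENTS = ["hand_a", "hand_b"]
FUNCTIONS = ["flex", "abduct", "wrist"]
ROTATIONS = ["x", "y", "z"]

# One embedding table per triplet slot, plus a projection shared by all joints.
tables = {
    "embodiment": rng.standard_normal((len(EMBODIMENTS), D_MODEL)),
    "function": rng.standard_normal((len(FUNCTIONS), D_MODEL)),
    "rotation": rng.standard_normal((len(ROTATIONS), D_MODEL)),
}
W_traj = rng.standard_normal((T, D_MODEL)) / np.sqrt(T)


def structural_tokens(chunk, joint_triplets):
    """Turn a temporal chunk (T, Da) into Da joint tokens of width D_MODEL."""
    per_joint = chunk.T                   # (Da, T): one row per joint trajectory
    tokens = per_joint @ W_traj           # (Da, D_MODEL)
    for i, (emb, fn, rot) in enumerate(joint_triplets):
        # Assumption: the three codebook embeddings are summed onto the token.
        tokens[i] += (
            tables["embodiment"][EMBODIMENTS.index(emb)]
            + tables["function"][FUNCTIONS.index(fn)]
            + tables["rotation"][ROTATIONS.index(rot)]
        )
    return tokens


# Two embodiments with different joint counts share the same token width.
hand_a = [("hand_a", "flex", "x")] * 4 + [("hand_a", "wrist", "z")]
hand_b = [("hand_b", "flex", "x"), ("hand_b", "abduct", "y"), ("hand_b", "wrist", "z")]
tokens_a = structural_tokens(rng.standard_normal((T, len(hand_a))), hand_a)
tokens_b = structural_tokens(rng.standard_normal((T, len(hand_b))), hand_b)
print(tokens_a.shape, tokens_b.shape)  # (5, 32) (3, 32)
```

The joint count only changes the sequence length, so one Transformer can consume both hands without padding the action dimension to a fixed size.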

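Experimental Results

Research Questions

RQ1: Does the structural (Da × T) action representation improve cross-embodiment transfer for high-DoF dexterous hands compared with the conventional temporal (T × Da) representation?
RQ2: Can the Embodied Joint Codebook enable functional transfer across diverse manipulators while preserving sample efficiency?
RQ3: Can 3D point cloud observations and language conditioning support dexterous manipulation policy learning across the sim-to-real gap?
RQ4: How much does the composition of the pre-training data affect downstream dexterous manipulation performance and few-shot adaptation?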

Key Findings

Model	Params (M)	Modality	Adroit (3 tasks)	DexArt (4 tasks)	Bi-DexHands (4 tasks)	Average Success
Diffusion Policy	266.8	2D	0.32±0.03	0.49±0.04	0.42±0.05	0.42±0.04
HPT	13.99	2D	0.45±0.02	0.53±0.05	0.44±0.04	0.47±0.04
UniAct	1053	2D	0.49±0.01	0.55±0.03	0.47±0.07	0.50±0.05
3D Diffusion Policy	255.2	3D	0.68±0.03	0.69±0.02	0.55±0.14	0.63±0.06
3D ManiFlow Policy	218.9	3D	0.70±0.02	0.70±0.03	0.59±0.07	0.66±0.04
SAT (Ours)	19.36	3D	0.75±0.02	0.73±0.03	0.67±0.05	0.71±0.04
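
The review does not state how the Average Success column was computed. The reported values are consistent with a mean weighted by the number of tasks in each suite, assuming the bracketed numbers in the header are task counts. The short check below recomputes that weighted mean from the rounded per-suite scores; small differences in the last digit are expected because the inputs are already rounded.

```python
# Assumption: the bracketed header numbers are the task counts per suite.
TASK_COUNTS = {"Adroit": 3, "DexArt": 4, "Bi-DexHands": 4}

# (per-suite mean success, reported average), copied from the table above.
RESULTS = {
    "Diffusion Policy": ({"Adroit": 0.32, "DexArt": 0.49, "Bi-DexHands": 0.42}, 0.42),
    "HPT": ({"Adroit": 0.45, "DexArt": 0.53, "Bi-DexHands": 0.44}, 0.47),
    "UniAct": ({"Adroit": 0.49, "DexArt": 0.55, "Bi-DexHands": 0.47}, 0.50),
    "3D Diffusion Policy": ({"Adroit": 0.68, "DexArt": 0.69, "Bi-DexHands": 0.55}, 0.63),
    "3D ManiFlow Policy": ({"Adroit": 0.70, "DexArt": 0.70, "Bi-DexHands": 0.59}, 0.66),
    "SAT (Ours)": ({"Adroit": 0.75, "DexArt": 0.73, "Bi-DexHands": 0.67}, 0.71),
}


def weighted_average(suite_scores):
    """Average the suite scores, weighting each suite by its task count."""
    total_tasks = sum(TASK_COUNTS.values())
    weighted_sum = sum(suite_scores[suite] * n for suite, n in TASK_COUNTS.items())
    return weighted_sum / total_tasks


for model, (scores, reported) in RESULTS.items():
    print(f"{model}: recomputed {weighted_average(scores):.3f}, reported {reported:.2f}")
```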

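SAT consistently outperforms both the 2D and the 3D baselines across the 11 tasks of Adroit, DexArt, and Bi-DexHands.
SAT reaches an average success rate of 0.71 with 19.36M parameters, fewer than every baseline in the table except HPT (13.99M).
Temporal compression through the embedding dimension d_feat is robust: performance degrades only under very aggressive compression (for example, d_feat = 16).
Pre-training on the mixed Human/Robot/Simulation data performs strongly, while human-only pre-training can trail simulation-only pre-training on some tasks. The review reads the benefit of mixed data as evidence that the codebook enables cross-embodiment transfer.
Ablations show that removing the Embodied Joint Codebook or reverting to temporal-centric actions causes a marked drop in performance, indicating that the joint embeddings are important for learning.
In real-world experiments, SAT outperforms HPT and 3DDP, achieving high success rates on six bimanual tasks with demonstrations collected via teleoperation.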

Figure 2 : Our proposed model architecture. The policy takes a history of $T_{o}$ raw 3D point clouds $\mathcal{P}_{t}=(\mathbf{P}_{t-T_{o}+1},\dots,\mathbf{P}_{t})$ and a language instruction $L$ as input. Observation Tokenizer : Each point cloud $\mathbf{P}_{k}$ in the history is processed via Far

This review was generated by AI and checked by a human editor.