QUICK REVIEW

[Paper Review] Structural Action Transformer for 3D Dexterous Manipulation

Xiaohan Lei, Min Wang | arXiv (Cornell University) | Mar 4, 2026

Robot Manipulation and Learning | Cited by 0

One-Sentence Summary

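The paper proposes the Structural Action Transformer (SAT), a 3D dexterous manipulation policy that tokenizes each action chunk by joint-wise trajectories (D_a × T) rather than by per-timestep vectors, which improves cross-embodiment transfer and data efficiency. It uses an Embodied Joint Codebook and a continuous-time flow matching objective to generate action chunks from 3D point cloud and language inputs.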
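The difference between the two tokenizations can be illustrated with array shapes. The following is a minimal sketch, not the authors' code: the joint counts (6 and 24), the chunk length `T = 8`, and the padding-with-mask batching in `pad_joint_tokens` are all assumptions chosen for illustration.

```python
import numpy as np

def temporal_tokens(chunk):
    """Temporal-centric view: one token per timestep, each of width D_a."""
    # chunk has shape (D_a, T); transposing gives (T, D_a).
    return chunk.T

def structural_tokens(chunk):
    """Structural-centric view: one token per joint, each of width T."""
    # Row j of the (D_a, T) chunk is already joint j's future trajectory.
    return chunk

def pad_joint_tokens(chunks):
    """Batch (D_a, T) chunks with different D_a using zero padding and a mask.

    Padding is an illustrative assumption, not a detail taken from the paper.
    """
    max_joints = max(c.shape[0] for c in chunks)
    batch = np.zeros((len(chunks), max_joints, chunks[0].shape[1]))
    mask = np.zeros((len(chunks), max_joints), dtype=bool)
    for i, c in enumerate(chunks):
        batch[i, : c.shape[0]] = c
        mask[i, : c.shape[0]] = True
    return batch, mask

T = 8                                   # hypothetical chunk length
rng = np.random.default_rng(0)
hand_a = rng.normal(size=(6, T))        # hypothetical 6-joint embodiment
hand_b = rng.normal(size=(24, T))       # hypothetical 24-joint embodiment

# Temporal tokens: the token width changes with the embodiment (6 vs 24).
temporal_shapes = [temporal_tokens(h).shape for h in (hand_a, hand_b)]
# Structural tokens: the token width is always T; only the sequence length changes.
structural_shapes = [structural_tokens(h).shape for h in (hand_a, hand_b)]

batch, mask = pad_joint_tokens([hand_a, hand_b])
print(temporal_shapes, structural_shapes, batch.shape, mask.sum(axis=1))
```

Under the structural view, a single input projection of width `T` can serve every embodiment, and the joint count becomes an ordinary variable sequence length for the Transformer.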

ABSTRACT

Achieving human-level dexterity in robots via imitation learning from heterogeneous datasets is hindered by the challenge of cross-embodiment skill transfer, particularly for high-DoF robotic hands. Existing methods, often relying on 2D observations and temporal-centric action representation, struggle to capture 3D spatial relations and fail to handle embodiment heterogeneity. This paper proposes the Structural Action Transformer (SAT), a new 3D dexterous manipulation policy that challenges this paradigm by introducing a structural-centric perspective. We reframe each action chunk not as a temporal sequence, but as a variable-length, unordered sequence of joint-wise trajectories. This structural formulation allows a Transformer to natively handle heterogeneous embodiments, treating the joint count as a variable sequence length. To encode structural priors and resolve ambiguity, we introduce an Embodied Joint Codebook that embeds each joint's functional role and kinematic properties. Our model learns to generate these trajectories from 3D point clouds via a continuous-time flow matching objective. We validate our approach by pre-training on large-scale heterogeneous datasets and fine-tuning on simulation and real-world dexterous manipulation tasks. Our method consistently outperforms all baselines, demonstrating superior sample efficiency and effective cross-embodiment skill transfer. This structural-centric representation offers a new path toward scaling policies for high-DoF, heterogeneous manipulators.

Motivation and Objectives

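Enable cross-embodiment imitation learning for high-DoF dexterous hands from 3D point cloud and language inputs.
Propose a structural-centric action representation (D_a × T) that accommodates different joint counts across embodiments.
Introduce an Embodied Joint Codebook that encodes each joint's function and kinematics to enable transfer.
Pre-train on heterogeneous datasets and fine-tune on simulation and real-world tasks to evaluate sample efficiency and generalization.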
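One way to picture the Embodied Joint Codebook is as a set of embedding tables indexed by an (Embodiment, Function, Rotation) triple per joint. The sketch below is an assumption-laden illustration, not the paper's implementation: the class name `JointCodebook`, the table sizes, the id assignments, and the choice to sum the three embeddings (rather than, say, concatenate them) are all hypothetical.

```python
import numpy as np

class JointCodebook:
    """Hypothetical codebook: three embedding tables summed per joint."""

    def __init__(self, n_embodiments, n_functions, n_rotations, d_model, seed=0):
        rng = np.random.default_rng(seed)
        # In a real model these tables would be learned; here they are random.
        self.embodiment = rng.normal(scale=0.02, size=(n_embodiments, d_model))
        self.function = rng.normal(scale=0.02, size=(n_functions, d_model))
        self.rotation = rng.normal(scale=0.02, size=(n_rotations, d_model))

    def embed(self, triples):
        """Map an int array of shape (D_a, 3) to joint embeddings (D_a, d_model)."""
        triples = np.asarray(triples)
        return (self.embodiment[triples[:, 0]]
                + self.function[triples[:, 1]]
                + self.rotation[triples[:, 2]])

codebook = JointCodebook(n_embodiments=2, n_functions=4, n_rotations=3, d_model=16)

# Hypothetical (embodiment, function, rotation) ids for two different hands.
hand_a = [(0, 0, 0), (0, 1, 0), (0, 2, 1)]              # 3-joint hand
hand_b = [(1, 0, 0), (1, 1, 0), (1, 2, 1), (1, 3, 2)]   # 4-joint hand

emb_a = codebook.embed(hand_a)
emb_b = codebook.embed(hand_b)

# Joints with the same function and rotation ids share everything except
# the embodiment-specific component.
shared_a = emb_a[0] - codebook.embodiment[0]
shared_b = emb_b[0] - codebook.embodiment[1]
print(emb_a.shape, emb_b.shape, np.allclose(shared_a, shared_b))
```

In this toy setup, functionally matching joints on two hands receive embeddings that differ only by the embodiment term, which is one plausible reading of how a shared codebook could align joints across embodiments.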

Proposed Method

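Represent each action chunk as a sequence of joint trajectories: A_t ∈ R^{D_a × T}, where each row is the future trajectory of one joint.
Model p(A_t | o_t) with a continuous normalizing flow (CNF) defined by a conditional velocity field ε_θ, trained with a flow matching objective.
Encode observations with a hierarchical 3D point cloud tokenizer and a T5-based language encoder, and condition the DiT (Diffusion Transformer) on these multimodal inputs.
Introduce an Embodied Joint Codebook that maps each joint to a triple (Embodiment, Function, Rotation) so that joints are aligned across embodiments.
Predict the action velocity field with the Transformer-based DiT under causal masking, then integrate it with an ODE solver to obtain the final action chunk.
Pre-train on large heterogeneous datasets (human and robot demonstrations, simulation) and fine-tune on downstream tasks; evaluate on Adroit, DexArt, Bi-DexHands, and real-world bimanual tasks.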
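The flow matching training target and the ODE integration step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: it assumes a straight-line path between Gaussian noise and the demonstrated chunk, a fixed-step Euler solver, and a toy `oracle_velocity` that stands in for the learned network ε_θ by using the conditioning slot to carry a known target chunk. All function names are hypothetical.

```python
import numpy as np

def flow_matching_loss(velocity_fn, actions, cond, rng):
    """Single-sample estimate of a flow matching loss for one action chunk."""
    noise = rng.normal(size=actions.shape)       # x_0 ~ N(0, I)
    tau = rng.uniform()                           # flow time in [0, 1)
    x_tau = (1.0 - tau) * noise + tau * actions   # point on the straight path
    target = actions - noise                      # velocity of that path
    pred = velocity_fn(x_tau, tau, cond)
    return float(np.mean((pred - target) ** 2))

def sample_actions(velocity_fn, cond, shape, rng, n_steps=10):
    """Integrate dx/dtau = v(x, tau, cond) from tau=0 to tau=1 with Euler steps."""
    x = rng.normal(size=shape)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity_fn(x, k * dt, cond)
    return x

def oracle_velocity(x, tau, cond):
    """Toy stand-in for the learned field: exact velocity toward a known chunk.

    Here `cond` is the target chunk itself; a real policy would instead be
    conditioned on point cloud and language tokens.
    """
    return (cond - x) / (1.0 - tau)

rng = np.random.default_rng(0)
target_chunk = rng.normal(size=(6, 8))  # hypothetical (D_a, T) action chunk

loss = flow_matching_loss(oracle_velocity, target_chunk, target_chunk, rng)
sampled = sample_actions(oracle_velocity, target_chunk, target_chunk.shape, rng)
print(loss, np.abs(sampled - target_chunk).max())
```

With the oracle field the loss is zero up to floating-point error and Euler integration lands on the target chunk, which serves as a sanity check of the path and target definitions. A trained network would replace `oracle_velocity`, and the solver and step count are design choices this sketch does not take from the paper.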

Figure 1 : Conceptual illustration of action chunk tokenization. (a) The conventional temporal-centric perspective, which structures actions as a sequence of $T$ timesteps (chunk length), with each token having dimension $D_{a}$ (action dim). (b) Our proposed structural-centric perspective, which re

Experimental Results

Research Questions

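RQ1: Does the structural (D_a × T) action representation improve cross-embodiment transfer for high-DoF dexterous hands compared with the conventional temporal (T × D_a) representation?
RQ2: Can the Embodied Joint Codebook enable functional transfer across diverse manipulators while preserving sample efficiency?
RQ3: Do language-conditioned 3D point cloud observations support effective policy learning for dexterous manipulation and help bridge the sim-to-real gap?
RQ4: How does the composition of the pre-training data affect downstream dexterous manipulation performance and few-shot adaptation?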

Key Findings

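SAT consistently outperforms 2D and 3D baselines on 11 tasks from the Adroit, DexArt, and Bi-DexHands benchmarks.
SAT reaches a final average success rate of 0.71 with 19.36M parameters, far fewer parameters than many baselines.
Temporal compression through the embedding dimension d_feat is robust; performance degrades when compression is extreme (e.g., d_feat = 16).
Pre-training on mixed human, robot, and simulation data performs well; human-only pre-training can trail simulation-only pre-training on some tasks, which highlights the effectiveness of cross-embodiment transfer through the codebook.
Ablations show that removing the Embodied Joint Codebook or reverting to temporal action tokens significantly reduces performance; the joint embeddings are critical for learning.
In real-world experiments, SAT achieves higher success rates than HPT and 3DDP on 6 bimanual tasks, with demonstrations collected via teleoperation.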

Figure 2 : Our proposed model architecture. The policy takes a history of $T_{o}$ raw 3D point clouds $\mathcal{P}_{t}=(\mathbf{P}_{t-T_{o}+1},\dots,\mathbf{P}_{t})$ and a language instruction $L$ as input. Observation Tokenizer : Each point cloud $\mathbf{P}_{k}$ in the history is processed via Far


This review was generated by AI and reviewed by human editors.