QUICK REVIEW

[Paper Review] UniHM: Unified Dexterous Hand Manipulation with Vision Language Model

Zhenhao Zhang, Jiaxin Liu|arXiv (Cornell University)|Feb 28, 2026
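Robot Manipulation and Learning | Cited by 0
One-Sentence Summary

UniHM introduces a unified, language-conditioned framework for dynamic dexterous hand manipulation across multiple hand morphologies. It combines a morphology-agnostic token codebook, a vision-language model, and physics-based refinement to generate executable manipulation sequences from open-vocabulary instructions.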

ABSTRACT

Planning physically feasible dexterous hand manipulation is a central challenge in robotic manipulation and Embodied AI. Prior work typically relies on object-centric cues or precise hand-object interaction sequences, foregoing the rich, compositional guidance of open-vocabulary instruction. We introduce UniHM, the first framework for unified dexterous hand manipulation guided by free-form language commands. We propose a Unified Hand-Dexterous Tokenizer that maps heterogeneous dexterous-hand morphologies into a single shared codebook, improving cross-dexterous hand generalization and scalability to new morphologies. Our vision language action model is trained solely on human-object interaction data, eliminating the need for massive real-world teleoperation datasets, and demonstrates strong generalizability in producing human-like manipulation sequences from open-ended language instructions. To ensure physical realism, we introduce a physics-guided dynamic refinement module that performs segment-wise joint optimization under generative and temporal priors, yielding smooth and physically feasible manipulation sequences. Across multiple datasets and real-world evaluations, UniHM attains state-of-the-art results on both seen and unseen objects and trajectories, demonstrating strong generalization and high physical feasibility. Our project page is at https://unihm.github.io/.

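Research Motivation and Goals

  • Motivate open-vocabulary, dynamic dexterous hand manipulation that goes beyond static grasping.
  • Propose a morphology-agnostic tokenization scheme that generalizes across hands.
  • Develop a vision-language model that generates manipulation sequences conditioned on language and perception data.
  • Incorporate physics-guided dynamic refinement to ensure trajectories are physically feasible.
  • Demonstrate robust generalization to unseen objects, morphologies, and tasks through extensive evaluation.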

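Proposed Method

  • Introduce a unified dexterous-hand tokenizer that uses a shared VQ-VAE codebook to map heterogeneous hand poses onto a common discrete action grid.
  • Use a vision-language model with a CLIPort-style perception head to infer the target trajectory from RGB-D and language, then perform token-based sequence generation.
  • Train and align new hand morphologies to a reference encoder via knowledge distillation, enabling token reuse across dexterous hands, and decode with morphology-specific decoders.
  • Apply physics-guided dynamic refinement, using a Gauss-Newton framework that combines contact terms, a generative prior, and a temporal prior at each frame to optimize for physical feasibility.
  • Annotate HOI sequences from human videos, use Dex-Retargeting to map MANO poses to multiple dexterous hands, and refine the trajectories with energy constraints.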
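
The shared-codebook idea in the first and third bullets can be sketched as follows. This is a minimal illustration, not the authors' tokenizer: the codebook values, the hand names `hand_a` and `hand_b`, the linear decoders, and the function names are all hypothetical. It shows only the general vector-quantization pattern, where every hand's encoder output is snapped to the nearest entry of one codebook and each hand decodes the same tokens with its own decoder.

```python
# Hypothetical shared codebook: 4 codes, each a 2-D latent vector.
CODEBOOK = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical morphology-specific linear decoders (rows = output joints).
DECODERS = {
    "hand_a": [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]],  # 3 joint values per frame
    "hand_b": [[1.0, 1.0], [1.0, -1.0]],             # 2 joint values per frame
}


def quantize(latent, codebook=CODEBOOK):
    """Return the index of the codebook entry nearest to `latent`."""
    best_idx, best_dist = 0, float("inf")
    for idx, code in enumerate(codebook):
        dist = sum((a - b) ** 2 for a, b in zip(latent, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx


def tokenize(latents):
    """Map a sequence of encoder outputs to discrete tokens."""
    return [quantize(z) for z in latents]


def decode(tokens, hand):
    """Decode shared tokens into joint values for one hand morphology."""
    weights = DECODERS[hand]
    return [
        [sum(w * c for w, c in zip(row, CODEBOOK[t])) for row in weights]
        for t in tokens
    ]


# Encoder outputs for three frames (made-up numbers).
latents = [[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]]
tokens = tokenize(latents)
joints_a = decode(tokens, "hand_a")
joints_b = decode(tokens, "hand_b")
```

The same `tokens` list drives both hands; only the decoder differs, which is what lets a generator trained on one token vocabulary be reused across morphologies.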
Figure 1: Overview. We introduce UniHM, the first unified hand-manipulation framework conditioned on free-form language. UniHM is trained solely on closed-set HOI datasets to follow target trajectories and execute physically feasible interactions, and generalizes to open-world tasks in real-world in

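Experimental Results

Research Questions

  • RQ1: Can open-vocabulary language commands be used to generate dynamic, multi-step dexterous hand manipulation trajectories across different hand morphologies?
  • RQ2: Does the morphology-agnostic token codebook achieve cross-hand consistency and effective transfer?
  • RQ3: How does physics-guided refinement improve the temporal smoothness and physical feasibility of generated sequences?
  • RQ4: Can learning from human HOI videos remove the dependence on expensive teleoperation data while preserving generalization to unseen objects and tasks?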
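
To make RQ3 concrete, the toy sketch below runs a Gauss-Newton loop over a cost with a contact term, a generative prior (stay near the generated trajectory), and a temporal prior (neighboring frames should agree). Everything specific here is an assumption made for illustration: the single-link planar finger, the weights `w_gen` and `w_temp`, and all function names. It is not the paper's refinement module, only an instance of the general technique.

```python
import math


def solve_tridiagonal(off, diag, rhs):
    """Solve a symmetric tridiagonal system (Thomas algorithm)."""
    n = len(diag)
    d, b = list(diag), list(rhs)
    for i in range(1, n):
        m = off[i - 1] / d[i - 1]
        d[i] -= m * off[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - off[i] * x[i + 1]) / d[i]
    return x


def refine_angles(theta_gen, targets, link=1.0, w_gen=0.1, w_temp=1.0, iters=20):
    """Gauss-Newton refinement of one joint angle per frame (toy setup)."""
    n = len(theta_gen)
    theta = list(theta_gen)
    for _ in range(iters):
        diag = [0.0] * n        # diagonal of J^T J
        off = [0.0] * (n - 1)   # off-diagonal of J^T J
        grad = [0.0] * n        # J^T r
        for t in range(n):
            c, s = math.cos(theta[t]), math.sin(theta[t])
            # Contact residual: fingertip position minus target contact point.
            rx, ry = link * c - targets[t][0], link * s - targets[t][1]
            jx, jy = -link * s, link * c
            diag[t] += jx * jx + jy * jy + w_gen
            # Generative prior residual: deviation from the generated angle.
            grad[t] += jx * rx + jy * ry + w_gen * (theta[t] - theta_gen[t])
        for t in range(n - 1):
            # Temporal prior residual: difference between adjacent frames.
            diff = theta[t + 1] - theta[t]
            diag[t] += w_temp
            diag[t + 1] += w_temp
            off[t] -= w_temp
            grad[t] -= w_temp * diff
            grad[t + 1] += w_temp * diff
        step = solve_tridiagonal(off, diag, [-g for g in grad])
        theta = [a + b for a, b in zip(theta, step)]
    return theta


def roughness(theta):
    """Sum of squared frame-to-frame differences."""
    return sum((b - a) ** 2 for a, b in zip(theta, theta[1:]))


# Made-up example: a smooth ramp of contact targets and a jittery generated path.
truth = [0.0, 0.1, 0.2, 0.3, 0.4]
targets = [(math.cos(a), math.sin(a)) for a in truth]
generated = [0.0, 0.3, 0.1, 0.5, 0.4]
refined = refine_angles(generated, targets)
```

Because the temporal term only couples adjacent frames, the normal equations are tridiagonal and can be solved in linear time. A full hand with many joints and contact constraints would need a general sparse solver and probably step damping.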

Key Findings

Method | MPJPE ↓ | FOL ↓ | FPL ↓ | FID ↓ | Diversity →
Ours (DexYCB Seen) | 61.40 ± 1.93 | 23.14 ± 0.65 | 12.15 ± 0.24 | 31.24 ± 1.02 | 39.62 ± 0.66
Ours (OakInk Seen) | 52.73 ± 2.08 | 72.32 ± 0.55 | 19.86 ± 0.43 | 204.91 ± 7.64 | 165.47 ± 6.30
Ours (DexYCB Unseen) | 63.56 ± 2.08 | 27.29 ± 0.43 | 13.06 ± 0.43 | 41.03 ± 1.65 | 42.70 ± 1.19
Ours (OakInk Unseen) | 58.62 ± 2.35 | 83.27 ± 1.17 | 22.87 ± 0.52 | 253.41 ± 13.05 | 153.28 ± 9.48
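  • UniHM achieves state-of-the-art results on DexYCB and OakInk for both seen and unseen objects and trajectories.
  • The morphology-agnostic codebook enables cross-hand consistency and token reuse between MANO and multiple robot hands.
  • Physics-guided dynamic refinement produces smoother, physically feasible trajectories and improves contact handling and stability.
  • In real-world experiments, grasp success rates on seen and unseen objects are higher than those of state-of-the-art methods.
  • Ablation studies show that masked training, RGB-D input, and physics refinement each contribute to performance and feasibility.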
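
For readers unfamiliar with the MPJPE column in the table above: assuming it follows the standard definition, it is the Euclidean distance between predicted and ground-truth joint positions, averaged over all joints and frames. The sketch below uses a hypothetical function name and made-up data; the source does not state the units or the joint set used.

```python
import math


def mpjpe(pred, gt):
    """Mean per-joint position error.

    pred, gt: sequences of frames; each frame is a sequence of (x, y, z) joints.
    """
    total, count = 0.0, 0
    for pred_frame, gt_frame in zip(pred, gt):
        for p, g in zip(pred_frame, gt_frame):
            total += math.dist(p, g)  # Euclidean distance for one joint
            count += 1
    return total / count


# One frame with two joints: one exact, one off by a 3-4-5 triangle.
pred = [[(0.0, 0.0, 0.0), (3.0, 4.0, 0.0)]]
gt = [[(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]]
error = mpjpe(pred, gt)  # (0 + 5) / 2 = 2.5
```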
Figure 2: Pipeline. UniHM converts open-vocabulary instructions and RGB-D inputs into executable dexterous-hand trajectories via three stages: (1) morphology-agnostic motion tokenization; (2) language-guided generation that fuses text, perception, and token history to produce manipulation token sequ

This review was generated by AI and reviewed by human editors.