QUICK REVIEW

[Paper Review] TextOp: Real-time Interactive Text-Driven Humanoid Robot Motion Generation and Control

Weiji Xie, Jiakun Zheng|arXiv (Cornell University)|Feb 7, 2026

Robotic Locomotion and Control | Cited by 0

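One-sentence summary

TextOp provides real-time, text-driven whole-body humanoid motion generation and control by coupling a high-level autoregressive, diffusion-based motion generator with a low-level tracking policy, enabling interactive, language-based steering of a real robot's behavior.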

ABSTRACT

Recent advances in humanoid whole-body motion tracking have enabled the execution of diverse and highly coordinated motions on real hardware. However, existing controllers are commonly driven either by predefined motion trajectories, which offer limited flexibility when user intent changes, or by continuous human teleoperation, which requires constant human involvement and limits autonomy. This work addresses the problem of how to drive a universal humanoid controller in a real-time and interactive manner. We present TextOp, a real-time text-driven humanoid motion generation and control framework that supports streaming language commands and on-the-fly instruction modification during execution. TextOp adopts a two-level architecture in which a high-level autoregressive motion diffusion model continuously generates short-horizon kinematic trajectories conditioned on the current text input, while a low-level motion tracking policy executes these trajectories on a physical humanoid robot. By bridging interactive motion generation with robust whole-body control, TextOp unlocks free-form intent expression and enables smooth transitions across multiple challenging behaviors such as dancing and jumping, within a single continuous motion execution. Extensive real-robot experiments and offline evaluations demonstrate instant responsiveness, smooth whole-body motion, and precise control. The project page and the open-source code are available at https://text-op.github.io/

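Research motivation and goals

Combine interactive, language-based intent expression with real-time, physically executable humanoid control.
Develop a two-level architecture that synthesizes short-horizon reference motions from streaming text and tracks them on hardware.
Propose a robot-skeleton motion representation that aligns better with robot kinematics.
Add generator-produced motions to the training data to narrow the distribution gap between training data and deployment.
Demonstrate real-robot capabilities and offline evaluations showing responsiveness, smoothness, and precise control.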

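Proposed method

The high-level autoregressive motion generator G uses a VAE plus a latent diffusion model to produce short-horizon reference motions (T_future = 8 frames), conditioned on motion history and the current text.
The low-level tracking policy π is an MLP-based controller trained in simulation that converts reference motions into executable joint actions at 50 Hz.
The robot-skeleton motion representation encodes degree-of-freedom-based features: root orientation, yaw delta, contacts, local translation delta, height, and joint positions with their deltas.
Training data combines retargeted motions derived from AMASS with private data, adds language annotations from BABEL, and applies mirroring augmentation and a self-rollout strategy to align distributions.
Data augmentation in tracker training: motions are generated from text streams so that the tracker is exposed to deployment-time variability.
Deployment details: real-time text input is encoded with CLIP; the generator runs on a GPU at 6.25 Hz and the tracker runs onboard at 50 Hz, connected over the network with a motion buffer.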
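
The two rates above are consistent with each other if the tracker consumes one reference frame per control tick: 8 frames at 50 Hz last 0.16 s, which corresponds to 6.25 chunks per second. The following is a minimal sketch of that buffering pattern, not the authors' implementation. The names `fake_generator` and `run` are hypothetical, the generator is replaced by a placeholder that only labels frames with the active command, and the refill-when-empty policy is an assumption made for illustration.

```python
from collections import deque

CONTROL_HZ = 50.0                      # tracker rate, from the summary
T_FUTURE = 8                           # frames per generated chunk, from the summary
GENERATOR_HZ = CONTROL_HZ / T_FUTURE   # 6.25 Hz if one frame is consumed per tick


def fake_generator(text, history, n_frames=T_FUTURE):
    """Hypothetical stand-in for G: returns placeholder frames tagged with the text."""
    start = len(history)
    return [(text, start + i) for i in range(n_frames)]


def run(commands, n_ticks):
    """Simulate n_ticks control steps; commands maps a tick index to a new text command."""
    buffer = deque()   # buffer sitting between generator and tracker
    history = []       # frames generated so far (the conditioning context)
    executed = []      # frames the tracker consumed, in order
    text = ""
    for tick in range(n_ticks):
        text = commands.get(tick, text)    # streaming command update
        if not buffer:                     # assumed policy: refill when the buffer is empty
            chunk = fake_generator(text, history)
            history.extend(chunk)
            buffer.extend(chunk)
        executed.append(buffer.popleft())  # tracker consumes one frame per tick
    return executed
```

In this toy loop, a command issued mid-chunk only takes effect at the next chunk boundary. For example, `run({0: "walk", 10: "jump"}, 24)` keeps executing "walk" frames through tick 15 and switches to "jump" at tick 16. The real system's response timing depends on details the summary does not give.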

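Experimental results

Research questions

RQ1: Can TextOp achieve precise, stable, and responsive whole-body behaviors on a real humanoid robot?
RQ2: In interactive settings, can the motion generator produce high-quality, semantically aligned motions from text commands?
RQ3: Can the motion tracking policy robustly execute diverse reference motions, including those produced by the generator?
RQ4: How do the robot-skeleton motion representation and generated-motion data augmentation benefit deployment robustness?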

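Key findings

In real-robot experiments, TextOp shows instant responsiveness, smooth whole-body motion, and precise control across diverse skills.
In long-horizon 30-second trials, TextOp remains stable with high tracking fidelity, a strong success rate, and low tracking error, performing well under both random and structured command streams.
Real-time interaction latency from command to robot response averages 0.73 s; generation latency is about 29.6 ms and tracking latency about 2.15 ms.
Compared with baselines, the robot-skeleton representation improves generation quality and transition smoothness; adding generator-produced motions to tracker training improves deployment alignment.
Offline evaluation shows that TextOp's combined approach (M+G) tracks generator-produced data robustly, whereas a tracker trained only on generator data generalizes poorly to unseen motion data.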
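
To put the reported latencies in context, the short calculation below compares them with the loop periods implied by the stated rates. It is a rough sketch under assumptions: the function name `latency_budget` is hypothetical, the chunk period is taken to be 8 frames at 50 Hz, and the reported generation and tracking latencies are treated as per-call compute times. The summary does not say what makes up the rest of the 0.73 s end-to-end figure, so the code only reports it as unattributed.

```python
def latency_budget(control_hz=50.0, t_future=8, gen_ms=29.6,
                   track_ms=2.15, e2e_s=0.73):
    """Compare reported latencies with the loop periods implied by the rates."""
    chunk_period_ms = 1000.0 * t_future / control_hz   # time covered by one chunk
    control_period_ms = 1000.0 / control_hz            # time between tracker steps
    return {
        "chunk_period_ms": chunk_period_ms,
        "control_period_ms": control_period_ms,
        # fraction of each period spent computing (assumed per-call latencies)
        "gen_share": gen_ms / chunk_period_ms,
        "track_share": track_ms / control_period_ms,
        # end-to-end latency not explained by the two compute figures
        "unattributed_ms": e2e_s * 1000.0 - (gen_ms + track_ms),
    }


budget = latency_budget()
```

Under these assumptions, both compute steps fit well within their respective periods (160 ms per chunk, 20 ms per control step), and most of the end-to-end latency lies outside the two reported compute figures.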

This review was generated by AI and reviewed by human editors.