QUICK REVIEW

[Paper Review] InCoM: Intent-Driven Perception and Structured Coordination for Whole-Body Mobile Manipulation

Jiahao Liu, Cui Wenbo|arXiv (Cornell University)|Feb 26, 2026

Robot Manipulation and Learning|Cited by 0

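One-Sentence Summary

InCoM is an end-to-end whole-body mobile manipulation framework that jointly models intent-driven perception and bidirectional base-arm coordination, achieving higher success rates on ManiSkill-HAB tasks under restricted perception (no privileged perceptual information).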

ABSTRACT

Whole-body mobile manipulation is a fundamental capability for general-purpose robotic agents, requiring both coordinated control of the mobile base and manipulator and robust perception under dynamically changing viewpoints. However, existing approaches face two key challenges: strong coupling between base and arm actions complicates whole-body control optimization, and perceptual attention is often poorly allocated as viewpoints shift during mobile manipulation. We propose InCoM, an intent-driven perception and structured coordination framework for whole-body mobile manipulation. InCoM infers latent motion intent to dynamically reweight multi-scale perceptual features, enabling stage-adaptive allocation of perceptual attention. To support robust cross-modal perception, InCoM further incorporates a geometric-semantic structured alignment mechanism that enhances multimodal correspondence. On the control side, we design a decoupled coordinated flow matching action decoder that explicitly models coordinated base-arm action generation, alleviating optimization difficulties caused by control coupling. Without access to privileged perceptual information, InCoM outperforms state-of-the-art methods on three ManiSkill-HAB scenarios by 28.2%, 26.1%, and 23.6% in success rate, demonstrating strong effectiveness for whole-body mobile manipulation.

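Motivation and Objectives

Enable robust whole-body mobile manipulation under dynamic viewpoints, where base and arm control are tightly coupled.
Achieve stage awareness by inferring motion intent and using it to adaptively reweight multi-scale perceptual features.
Achieve robust cross-modal fusion through geometric-semantic alignment.
Model bidirectional coordination between the mobile base and the manipulator during action decoding.
Demonstrate higher task success rates on ManiSkill-HAB scenarios without privileged perceptual information.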

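Proposed Method

The Intent-Driven Pyramid Perception Module (IDPPM) infers latent motion intent from past actions and global context, and uses it to reweight multi-scale perceptual features according to the task stage.
The Dual-stream Affinity Refinement Module (DARM) decouples geometric affinity from semantic affinity to strengthen cross-modal fusion of 3D point clouds and 2D images, and applies geometry-guided attention regularization.
Decoupled Coordinated Flow Matching (DCFM) uses conditional flow matching with bidirectional cross-attention between the base and arm decoders to generate coordinated whole-body actions.
A unified end-to-end objective combines the flow matching loss, an intent-derived scale regularization term, and a geometry-aware alignment loss.
The framework is formulated as a POMDP, with actions split into base and arm components, and does not rely on privileged perceptual information.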
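To make the flow matching objective and the base/arm action split more concrete, here is a minimal sketch in plain Python. It assumes the common straight-line (rectified-flow style) probability path between Gaussian noise and the expert action; the review does not state which path InCoM uses. The dimensions `BASE_DIM` and `ARM_DIM`, all function names, and the `predict_velocity` callable are hypothetical, and observation conditioning and the cross-attention between decoders are left out.

```python
import random

BASE_DIM = 3  # hypothetical: e.g. planar base velocity (vx, vy, yaw rate)
ARM_DIM = 7   # hypothetical: e.g. joint targets for a 7-DoF arm


def interpolate(noise, action, t):
    """Point on the straight-line path from noise (t=0) to action (t=1)."""
    return [(1.0 - t) * n + t * a for n, a in zip(noise, action)]


def target_velocity(noise, action):
    """Velocity of the straight-line path; it does not depend on t."""
    return [a - n for n, a in zip(noise, action)]


def split_action(vec):
    """Split a whole-body vector into (base, arm) components."""
    return vec[:BASE_DIM], vec[BASE_DIM:]


def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


def flow_matching_loss(predict_velocity, action, rng):
    """One-sample flow matching loss, reported separately for base and arm.

    predict_velocity(x_t, t) stands in for the learned decoder; any
    conditioning on observations is assumed to be closed over by the caller.
    """
    noise = [rng.gauss(0.0, 1.0) for _ in action]
    t = rng.random()  # uniform in [0, 1)
    x_t = interpolate(noise, action, t)
    pred_base, pred_arm = split_action(predict_velocity(x_t, t))
    tgt_base, tgt_arm = split_action(target_velocity(noise, action))
    return mse(pred_base, tgt_base), mse(pred_arm, tgt_arm)


if __name__ == "__main__":
    rng = random.Random(0)
    expert_action = [0.1 * i for i in range(BASE_DIM + ARM_DIM)]
    # An untrained stand-in model that always predicts zero velocity.
    zero_model = lambda x_t, t: [0.0] * len(x_t)
    base_loss, arm_loss = flow_matching_loss(zero_model, expert_action, rng)
    print("base loss:", base_loss, "arm loss:", arm_loss)
```

Reporting the base and arm terms separately mirrors the decoupled decoders described above. How the two terms are weighted, and how gradients flow between the decoders, is not specified in this review.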

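Experimental Results

Research Questions

RQ1: How can latent motion intent be inferred to adapt perceptual attention across task stages in whole-body mobile manipulation?
RQ2: Can bidirectional base-arm coordination be modeled effectively within an end-to-end framework to improve stability and task success?
RQ3: Does decoupled cross-modal fusion with explicit geometric and semantic affinities improve perception under dynamic viewpoints?
RQ4: How do multi-scale perceptual representations and stage-aware weighting affect manipulation and navigation performance?
RQ5: How does InCoM compare with state-of-the-art methods under the perception-limited setting of ManiSkill-HAB?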

Key Findings

Success rate (%) per subtask; Mean is the average over the seven subtasks.
Method	Pick Apple	Place Apple	Open Fridge	Pick Bowl	Place Bowl	Open Drawer	Close Drawer	Mean
DP (Chi et al., 2024)	0.5	54.5	63.0	2.1	63.5	5.3	89.4	39.8
ACT (Zhao et al., 2023)	1.6	21.2	74.6	9.0	21.7	48.1	91.5	38.2
WB-VIMA (Jiang et al., 2025)	1.6	57.7	27.0	1.6	60.3	5.3	87.3	34.4
DSPv2 (Su et al., 2025)	1.4	65.2	73.4	1.4	85.7	29.9	98.4	50.8
AC-DiT (Chen et al., 2025)	33.3	33.3	90.7	36.0	17.3	81.3	97.3	55.6
InCoM (Ours)	59.4	84.1	87.3	84.1	82.5	88.9	100	83.8
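The Mean column and the margin over the strongest baseline can be checked directly from the rows above. The short script below copies the per-subtask success rates from the table; the dictionary keys are shortened method names chosen for this example.

```python
# Success rates (%) copied from the table, in column order:
# Pick Apple, Place Apple, Open Fridge, Pick Bowl, Place Bowl,
# Open Drawer, Close Drawer.
RESULTS = {
    "DP": [0.5, 54.5, 63.0, 2.1, 63.5, 5.3, 89.4],
    "ACT": [1.6, 21.2, 74.6, 9.0, 21.7, 48.1, 91.5],
    "WB-VIMA": [1.6, 57.7, 27.0, 1.6, 60.3, 5.3, 87.3],
    "DSPv2": [1.4, 65.2, 73.4, 1.4, 85.7, 29.9, 98.4],
    "AC-DiT": [33.3, 33.3, 90.7, 36.0, 17.3, 81.3, 97.3],
    "InCoM": [59.4, 84.1, 87.3, 84.1, 82.5, 88.9, 100.0],
}


def mean_success(rates):
    """Average success rate over subtasks, rounded to one decimal."""
    return round(sum(rates) / len(rates), 1)


means = {method: mean_success(rates) for method, rates in RESULTS.items()}
best_baseline = max((m for m in means if m != "InCoM"), key=means.get)
gap = round(means["InCoM"] - means[best_baseline], 1)

print(means)
print(f"best baseline: {best_baseline}, gap: {gap} points")
# best baseline: AC-DiT, gap: 28.2 points
```

The recomputed means agree with the Mean column, and the 28.2-point gap to AC-DiT equals the first of the three margins quoted in the abstract. This suggests that the reported improvements are absolute percentage-point differences in mean success rate, although the review does not say so explicitly.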

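InCoM outperforms state-of-the-art baselines on three ManiSkill-HAB scenarios by 28.2%, 26.1%, and 23.6% in success rate.
Ablations show that removing IDPPM or the cross-modal components substantially lowers the mean success rate; the full model reaches a mean success rate of 83.8%.
IDPPM allocates attention between global and local features in a stage-adaptive way, keeping perception aligned with the task stage.
DARM achieves robust cross-modal alignment by modeling geometric and semantic affinities separately and applying geometry-guided regularization.
DCFM achieves bidirectional base-arm coordination through parallel decoders with cross-attention and stop-gradient, which stabilizes training.
Multi-scale perception with intent-driven weighting is essential for handling dynamic viewpoints and the differing perceptual demands of navigation and manipulation.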
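The idea of intent-driven weighting over multi-scale features can be illustrated with a small sketch. It assumes the simplest possible gating, a softmax over one logit per pyramid level followed by a weighted sum; the review does not describe IDPPM's actual architecture, so the function names, the three-level pyramid, and the example logits are all hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reweight_scales(scale_features, intent_logits):
    """Fuse per-scale feature vectors using weights derived from intent.

    scale_features: one equal-length feature vector per pyramid level.
    intent_logits: one score per level (hypothetical output of an
    intent encoder fed with past actions and global context).
    """
    if len(scale_features) != len(intent_logits):
        raise ValueError("need exactly one intent logit per scale")
    weights = softmax(intent_logits)
    dim = len(scale_features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, scale_features))
        for i in range(dim)
    ]
    return weights, fused


if __name__ == "__main__":
    # Hypothetical 2-D features for coarse (global), mid, and fine (local) levels.
    features = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
    # Navigation-like intent favors the coarse level.
    print(reweight_scales(features, [2.0, 0.5, -1.0]))
    # Manipulation-like intent favors the fine level.
    print(reweight_scales(features, [-1.0, 0.5, 2.0]))
```

With this gating, a change in inferred intent shifts the fused representation smoothly between global and local features, which is one way to obtain the stage-adaptive behavior described in the findings above.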

This review was generated by AI and checked by a human editor.