QUICK REVIEW

[Paper Review] Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining

Yipeng Chen, Wentao Tan|arXiv (Cornell University)|Mar 2, 2026

Multimodal Machine Learning Applications|Cited by 0

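ONE-SENTENCE SUMMARY

Introduces Keyframe-Chaining VLA, which uses an automatic keyframe selector and progress-aware queries to build a sparse semantic history for long-horizon, non-Markovian robot manipulation tasks, reaching state-of-the-art performance on a ManiSkill-based benchmark and in real-world deployment.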

ABSTRACT

Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.

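MOTIVATION AND OBJECTIVES

Motivate and address non-Markovian, long-horizon robot manipulation, where the immediate observation alone is not enough to infer the correct action.
Develop a lightweight Keyframe Selection Module (KSM) that extracts discriminative semantic keyframes.
Integrate sparse keyframes into a Vision-Language-Action policy to ground it in long-horizon context.
Propose a task-modulated, FiLM-based query mechanism for accurate keyframe retrieval.
Establish a ManiSkill-based long-horizon memory benchmark and validate performance in simulation and in the real world.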
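
The FiLM-based query named in the objectives can be pictured with a small, dependency-free sketch. It is an illustration under stated assumptions, not the paper's implementation: the function names, the two-dimensional embeddings, and the hard-coded `gamma` and `beta` values are all hypothetical, whereas in the paper's setting the modulation parameters would presumably be predicted by a learned network conditioned on the task.

```python
import math

def film(features, gamma, beta):
    """Feature-wise linear modulation: scale and shift each channel."""
    return [g * x + b for x, g, b in zip(features, gamma, beta)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(query, frame_embeddings):
    """Scaled dot-product attention weights of one query over all frames."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, frame)) / scale
              for frame in frame_embeddings]
    return softmax(scores)

# Hypothetical values: a generic query and three candidate frame embeddings.
base_query = [1.0, 1.0]
frames = [[0.0, 3.0], [3.0, 0.0], [1.0, 1.0]]

# Assumed task conditioning: emphasise channel 0 and suppress channel 1.
gamma, beta = [2.0, 0.0], [0.0, 0.0]

query = film(base_query, gamma, beta)   # modulated query
weights = attend(query, frames)         # attention over candidate frames
best = max(range(len(frames)), key=weights.__getitem__)
print(query, best)
```

With the unmodulated query, frames 0 and 1 score equally; after modulation the attention concentrates on frame 1, which is the sense in which the task signal steers retrieval.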

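PROPOSED METHOD

A two-stage Keyframe Selection Module (KSM) learns discriminative visual embeddings with cross-phase and cross-task triplet losses.
Stage II uses a FiLM-conditioned Task-Modulated Query network to produce phase-aware cues and retrieves keyframes via Cross-Attention.
Greedy temporal smoothing stabilizes keyframe detection and robustly finalizes milestones.
The VLA backbone (GR00T-N1.5) is reformulated to consume the Sparse Semantic History ϕ{o_k1,...,o_kn,o_t}, that is, the selected keyframes together with the current observation, through a structured system prompt.
Training is decoupled into two stages: metric learning for the embeddings first, then query training for milestone detection.
The method is evaluated on a new ManiSkill-based long-horizon memory benchmark and validated in real-world experiments on a Piper robot arm.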
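
Two of the steps above, the triplet objective and the temporal smoothing, can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions: the margin, the threshold, and the "commit after `min_run` consecutive confident frames" rule are hypothetical stand-ins, since the summary does not specify the exact loss variant or smoothing rule used in the paper.

```python
import math

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge triplet loss on Euclidean distances.

    The loss is zero once the negative is farther from the anchor than
    the positive by at least `margin`.
    """
    d_pos = math.dist(anchor, positive)
    d_neg = math.dist(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

def greedy_keyframes(scores, threshold=0.5, min_run=3):
    """Assumed smoothing rule: commit one keyframe per confident run.

    A keyframe is committed as soon as `min_run` consecutive scores exceed
    `threshold`; the index of the first frame of that run is returned, and
    the same run never produces a second keyframe.
    """
    keyframes = []
    run_start, run_len, committed = None, 0, False
    for t, score in enumerate(scores):
        if score > threshold:
            if run_len == 0:
                run_start, committed = t, False
            run_len += 1
            if run_len >= min_run and not committed:
                keyframes.append(run_start)
                committed = True
        else:
            run_len = 0  # the run is broken; isolated spikes are discarded
    return keyframes

# Hypothetical per-frame milestone scores with one isolated spike at t=1.
scores = [0.1, 0.9, 0.2, 0.8, 0.9, 0.7, 0.95, 0.1, 0.6, 0.7, 0.8]
print(greedy_keyframes(scores))

# A well-separated triplet incurs no loss; a violated one is penalised.
print(triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0]))
print(triplet_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0]))
```

The single high score at index 1 never becomes a keyframe because it is not sustained, which is the kind of flicker a smoothing step is meant to suppress.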

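EXPERIMENTAL RESULTS

Research Questions

RQ1: Do sparse semantic keyframes capture long-horizon dependencies in non-Markovian tasks better than dense history?
RQ2: How effective is the KSM at semantic milestone detection across tasks and scenarios?
RQ3: Does integrating keyframes into the VLA policy improve memory-dependent manipulation in simulation and in the real world?
RQ4: How do prompt design and the training paradigm affect milestone detection and overall policy performance?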

Key Findings

Model / Configuration	Sampling	Nh	I	Spatial (%)	Temporal (%)	Identity (%)	Counting (%)	Average (%)
π0 (Black et al., 2024)	Dense	0	-	2.0	0.0	0.0	60.0	15.5
Diffusion Policy (Chi et al., 2025)		1	1	22.0	10.0	0.0	30.0	15.5
GR00T-N1.5 (Bjorck et al., 2025) (No History)		0	-	20.0	0.0	28.0	16.0	16.0
GR00T-N1.5 (Short-term)	Dense	1	1	8.0	16.0	30.0	4.0	14.5
GR00T-N1.5 (Long-term)	Fixed Stride	3	5	20.0	80.0	32.0	30.0	40.5
Keyframe-Chaining VLA (ours)	Keyframes	-	-	70.0	98.0	100.0	100.0	92.0
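
The Average column appears to be the unweighted mean of the four per-task success rates. The short check below recomputes it from the table values; the dictionary keys are shortened labels chosen here for illustration.

```python
# Per-task success rates (%) copied from the table above, in the order
# Spatial, Temporal, Identity, Counting.
rows = {
    "pi0": [2.0, 0.0, 0.0, 60.0],
    "Diffusion Policy": [22.0, 10.0, 0.0, 30.0],
    "GR00T-N1.5 (No History)": [20.0, 0.0, 28.0, 16.0],
    "GR00T-N1.5 (Short-term)": [8.0, 16.0, 30.0, 4.0],
    "GR00T-N1.5 (Long-term)": [20.0, 80.0, 32.0, 30.0],
    "Keyframe-Chaining VLA": [70.0, 98.0, 100.0, 100.0],
}

# Unweighted mean over the four tasks for every row.
averages = {name: sum(scores) / len(scores) for name, scores in rows.items()}

ours = averages["Keyframe-Chaining VLA"]
best_baseline = max(avg for name, avg in averages.items()
                    if name != "Keyframe-Chaining VLA")
gap = ours - best_baseline  # margin over the strongest baseline, in points

for name, avg in averages.items():
    print(f"{name}: {avg:.1f}")
print(f"gap over best baseline: {gap:.1f} points")
```

Every recomputed mean matches the reported Average, and the margin over the strongest baseline, GR00T-N1.5 (Long-term), comes to 51.5 percentage points.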

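Keyframe-Chaining VLA reaches a 92.0% average success rate on the proposed ManiSkill long-horizon tasks, well above the baselines; the strongest one in the table above, GR00T-N1.5 (Long-term), averages 40.5%.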
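Dense short-horizon baselines mostly stay below 30% success on the non-Markovian tasks, and fixed-stride history loses performance when task dynamics vary.
The two-stage KSM, combining metric learning with context-aware prompting, achieves high milestone precision and recall (P 97.5%, R 97.5%, F1 97.5%), with low false-positive and false-negative rates (2.5% each).
Prompt refinement and context-aware prompting improve performance markedly, especially on Spatial Reconfiguration (from 56% to 70%).
In real-world experiments, Keyframe-Chaining VLA achieves higher completion rates (CR) and success rates (SR) than the Diffusion Policy and GR00T baselines on the Spatial, Temporal, Counting, and Identity tasks (e.g., Counting: 80% SR, 90% CR).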
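
The reported milestone-detection figures are mutually consistent under the standard definitions, assuming F1 is the harmonic mean of precision and recall and the false-negative rate is the complement of recall:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \times 0.975 \times 0.975}{0.975 + 0.975}
    = 0.975,
\qquad
\text{false-negative rate} = 1 - R = 0.025
```

When precision and recall are equal, the harmonic mean reduces to that common value, which is why all three metrics read 97.5%.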

This review was generated by AI and reviewed by a human editor.