QUICK REVIEW

[Paper Review] Non-Markovian Long-Horizon Robot Manipulation via Keyframe Chaining

Yipeng Chen, Wentao Tan|arXiv (Cornell University)|Mar 2, 2026

Multimodal Machine Learning Applications|Citations: 0

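One-line summary

Keyframe-Chaining VLA uses an automatic keyframe selector and progress-aware queries to build a sparse semantic history for long-horizon, non-Markovian robot manipulation tasks, achieving state-of-the-art results on a ManiSkill-based benchmark and in real-world deployment.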

ABSTRACT

Existing Vision-Language-Action (VLA) models often struggle to generalize to long-horizon tasks due to their heavy reliance on immediate observations. While recent studies incorporate retrieval mechanisms or extend context windows to handle procedural tasks, they often struggle to capture Non-Markovian dependencies, where optimal actions rely solely on specific past states rather than the current observation. To address this, we introduce Keyframe-Chaining VLA, a framework that extracts and links key historical frames to model long-horizon dependencies. Specifically, we propose an automatic keyframe selector that learns a discriminative embedding space, effectively identifying distinct state transitions. To capture task-critical information, we design a progress-aware query mechanism that dynamically retrieves historical frames based on their temporal relevance to the current execution phase. These selected keyframes are integrated into the VLA as interleaved visual tokens, explicitly grounding the policy in the long-horizon temporal context. Finally, we introduce a suite of four Non-Markovian manipulation tasks built upon the ManiSkill simulator to measure task success rates. Experimental results demonstrate that our method achieves superior performance, effectively tackling robot manipulation tasks characterized by long-horizon temporal dependencies. Code is available at https://github.com/cytoplastm/KC-VLA.

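Motivation and objectives

Motivate and address non-Markovian, long-horizon robot manipulation, where immediate observations alone are insufficient.
Develop a lightweight Keyframe Selection Module (KSM) that extracts discriminative semantic keyframes.
Integrate sparse keyframes into the Vision-Language-Action policy to ground it in long-horizon context.
Propose a task-modulated, FiLM-based query mechanism for accurate keyframe retrieval.
Establish a ManiSkill-based long-horizon memory benchmark and validate performance in both simulation and the real world.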

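Proposed method

A two-stage Keyframe Selection Module (KSM) learns discriminative visual embeddings with a triplet loss applied across phases and tasks.
Stage II retrieves keyframes with a Task-Modulated Query network that uses FiLM to form phase-aware queries, followed by cross-attention.
Greedy temporal smoothing stabilizes keyframe detection and robustly confirms milestones.
The VLA backbone (GR00T-N1.5) is reformulated to consume the Sparse Semantic History {o_k1, ..., o_kn, o_t} through a structured system prompt.
Training follows a decoupled two-stage regime that separates metric learning of the embeddings from query learning for milestone detection.
The method is evaluated on a new ManiSkill-based long-horizon memory benchmark and in real-world experiments with a Piper robot arm.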
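
The building blocks named above (triplet loss, FiLM modulation, greedy temporal smoothing, and the sparse history {o_k1, ..., o_kn, o_t}) can be sketched in a few lines of plain Python. This is an illustrative sketch only, not the authors' implementation: the function names, the `margin`, `threshold`, and `patience` parameters, and the rule "commit a keyframe after `patience` consecutive detections" are all assumptions made for the example. In the actual method the embeddings, FiLM parameters, and detection scores would come from learned networks.

```python
import math
from typing import List, Sequence


def triplet_loss(anchor: Sequence[float], positive: Sequence[float],
                 negative: Sequence[float], margin: float = 0.2) -> float:
    """Standard triplet margin loss on Euclidean distances."""
    d_pos = math.dist(anchor, positive)  # e.g. a frame from the same phase
    d_neg = math.dist(anchor, negative)  # e.g. a frame from another phase or task
    return max(0.0, d_pos - d_neg + margin)


def film(features: Sequence[float], gamma: Sequence[float],
         beta: Sequence[float]) -> List[float]:
    """FiLM: feature-wise affine modulation, gamma * x + beta.

    In a real model gamma and beta would be predicted from a task embedding;
    here they are passed in directly (assumption for illustration).
    """
    return [g * x + b for x, g, b in zip(features, gamma, beta)]


def greedy_keyframes(scores: Sequence[float], threshold: float = 0.5,
                     patience: int = 3) -> List[int]:
    """Commit a keyframe at the frame where the detection score has stayed
    at or above `threshold` for `patience` consecutive frames.

    Once committed, a keyframe is never revisited (the "greedy" part).
    """
    keyframes: List[int] = []
    run_len = 0
    for t, score in enumerate(scores):
        run_len = run_len + 1 if score >= threshold else 0
        if run_len == patience:  # fires once per run of detections
            keyframes.append(t)
    return keyframes


def sparse_history(observations: Sequence, keyframes: Sequence[int], t: int) -> list:
    """Return [o_k1, ..., o_kn, o_t]: keyframes committed before t, then the current frame."""
    return [observations[k] for k in keyframes if k < t] + [observations[t]]


# Hypothetical per-frame detection scores with one isolated spike at t=1.
scores = [0.1, 0.9, 0.2, 0.8, 0.9, 0.95, 0.3, 0.7, 0.8, 0.9, 0.9]
keyframes = greedy_keyframes(scores)
history = sparse_history([f"o{i}" for i in range(len(scores))], keyframes, t=10)
```

With these made-up scores, the isolated spike at frame 1 is ignored and only the two sustained detections are committed, so the policy input at the final step is three frames rather than all eleven.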

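Experimental results

Research questions

RQ1: On non-Markovian tasks, do sparse semantic keyframes capture long-horizon dependencies better than dense history?
RQ2: How effectively does the KSM detect semantic milestones across multiple tasks and episodes?
RQ3: Does incorporating keyframes into the VLA policy improve performance on memory-dependent manipulation tasks in simulation and in the real world?
RQ4: How do prompt design and the training paradigm affect milestone detection and overall policy performance?

Key findings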

Model / Configuration	Sampling	Nh	I	Spatial	Temporal	Identity	Counting	Average
π0 (Black et al., 2024)	Dense	0	-	2.0	0.0	0.0	60.0	15.5
Diffusion Policy (Chi et al., 2025)		1	1	22.0	10.0	0.0	30.0	15.5
GR00T-N1.5 (Bjorck et al., 2025) (No History)		0	-	20.0	0.0	28.0	16.0	16.0
GR00T-N1.5 (Short-term)	Dense	1	1	8.0	16.0	30.0	4.0	14.5
GR00T-N1.5 (Long-term)	Fixed Stride	3	5	20.0	80.0	32.0	30.0	40.5
Keyframe-Chaining VLA (ours)	Keyframes	-	-	70.0	98.0	100.0	100.0	92.0
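
As a quick consistency check, the Average column can be recomputed as the unweighted mean of the four per-task success rates. The sketch below copies the numbers from the table; the dictionary layout and variable names are assumptions made for the example.

```python
from statistics import mean

# Per-task success rates (%) from the table, in the order:
# Spatial, Temporal, Identity, Counting.
rates = {
    "π0": [2.0, 0.0, 0.0, 60.0],
    "Diffusion Policy": [22.0, 10.0, 0.0, 30.0],
    "GR00T-N1.5 (No History)": [20.0, 0.0, 28.0, 16.0],
    "GR00T-N1.5 (Short-term)": [8.0, 16.0, 30.0, 4.0],
    "GR00T-N1.5 (Long-term)": [20.0, 80.0, 32.0, 30.0],
    "Keyframe-Chaining VLA": [70.0, 98.0, 100.0, 100.0],
}

# Unweighted mean over the four tasks for each model.
averages = {name: mean(values) for name, values in rates.items()}

# Strongest baseline by average, and the gap to the proposed method.
ours = "Keyframe-Chaining VLA"
best_baseline = max((n for n in averages if n != ours), key=averages.get)
gap = averages[ours] - averages[best_baseline]

for name, avg in averages.items():
    print(f"{name}: {avg:.1f}")
print(f"Gap over {best_baseline}: {gap:.1f} points")
```

The recomputed means match the Average column for every row, and the strongest baseline by this measure is the long-term fixed-stride GR00T-N1.5 configuration.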

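Keyframe-Chaining VLA achieves a 92.0% average success rate on the proposed ManiSkill long-horizon tasks, well above every baseline in the table (the strongest, GR00T-N1.5 with long-term fixed-stride history, averages 40.5%).
Dense short-term-history baselines stay below 30% success overall on the non-Markovian tasks, and fixed-stride history degrades on some tasks depending on their dynamics.
With the two-stage KSM, metric learning, and context-aware prompting, milestone detection reaches high precision and recall (P 97.5%, R 97.5%, F1 97.5%), with FPR and FNR both low at 2.5%.
Prompt refinement and context-aware prompting improve performance substantially, most notably on Spatial reconfiguration (56% → 70%).
In real-world experiments, Keyframe-Chaining VLA achieves higher completion rates (CR) and success rates (SR) than the Diffusion Policy and GR00T baselines on the Spatial, Temporal, Counting, and Identity tasks (e.g., Counting: 80% SR, 90% CR).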
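
For readers less familiar with the milestone-detection metrics quoted above (P, R, F1, FPR, FNR), the sketch below shows how they follow from confusion-matrix counts. The counts are hypothetical, chosen only so that the standard formulas yield the same percentages as reported; they are not taken from the paper.

```python
def detection_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Standard binary detection metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),  # false positive rate
        "fnr": fn / (fn + tp),  # false negative rate, equal to 1 - recall
    }


# Hypothetical counts (assumption): 40 true milestones and 40 non-milestones,
# with one miss and one false alarm.
metrics = detection_metrics(tp=39, fp=1, fn=1, tn=39)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

Because FNR is defined as 1 minus recall, a recall of 97.5% and an FNR of 2.5% are two views of the same quantity.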

This review was generated by AI and checked by a human editor.