QUICK REVIEW

[Paper Review] RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Yumin Dai, Hongze Fu|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications|Citations: 0

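One-line summary

RoboMME introduces a large-scale benchmark and a family of memory-augmented VLA policies, and examines how different memory representations (symbolic, perceptual, recurrent) and integration strategies affect performance across four memory types (temporal, spatial, object, procedural) in long-horizon robotic manipulation.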

ABSTRACT

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

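Motivation and objectives

Develop a unified, large-scale benchmark for evaluating memory-dependent robotic manipulation policies across different memory demands (temporal, spatial, object, procedural).
Systematically compare memory representations and integration strategies within a controlled VLA framework on a shared backbone.
Provide insight into which memory designs generalize across tasks and how task characteristics affect memory effectiveness.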

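Proposed method

Proposes RoboMME, which probes temporal, spatial, object, and procedural memory with 16 tasks organized into four suites (Counting, Permanence, Reference, Imitation).
Builds 14 memory-augmented VLA variants on the π0.5 backbone to explore symbolic, perceptual, and recurrent memory representations.
Implements three integration mechanisms for perceptual and recurrent memory: memory-as-context, memory-as-modulator, and memory-as-expert.
Instantiates symbolic memory as simple or grounded subgoals generated by a model (Gemini, QwenVL, or an Oracle).
Encodes perceptual memory as a sequence of visual tokens, using token dropping or frame sampling, and couples it with the integration mechanisms.
Uses recurrent memory approaches, including test-time training (TTT) and the Recurrent Memory Transformer (RMT), together with the three integration strategies.
Evaluates under a fixed memory budget and a multi-task training setup, comparing performance across tasks.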
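
To make the idea of frame sampling under a fixed memory budget concrete, here is a minimal sketch in Python. It is an illustration only, not the authors' implementation: the function names `sample_frame_indices` and `build_perceptual_memory`, the uniform spacing rule, and the `token_budget` / `tokens_per_frame` parameters are all assumptions introduced here. The review does not say how RoboMME selects frames.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def sample_frame_indices(num_frames: int, max_frames: int) -> List[int]:
    """Pick at most `max_frames` evenly spaced indices from a history.

    Assumption (not from the paper): uniform spacing that always keeps
    the first and the most recent frame.
    """
    if num_frames <= 0 or max_frames <= 0:
        return []
    if num_frames <= max_frames:
        # The whole history already fits in the budget.
        return list(range(num_frames))
    if max_frames == 1:
        # With room for a single frame, keep the most recent one.
        return [num_frames - 1]
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


def build_perceptual_memory(
    history: Sequence[T], token_budget: int, tokens_per_frame: int
) -> List[T]:
    """Subsample `history` so the kept frames fit within `token_budget`.

    `tokens_per_frame` is a hypothetical constant: the number of visual
    tokens each frame contributes after encoding.
    """
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    max_frames = token_budget // tokens_per_frame
    return [history[i] for i in sample_frame_indices(len(history), max_frames)]


if __name__ == "__main__":
    frames = [f"frame_{i}" for i in range(10)]
    # A budget of 256 tokens at 64 tokens per frame leaves room for 4 frames.
    print(build_perceptual_memory(frames, token_budget=256, tokens_per_frame=64))
```

Token dropping would instead keep every frame and thin out the tokens within each frame. The same budget arithmetic applies, with the limit enforced per frame rather than across frames.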

Figure 2 : Framework of MME-VLA Suite. The top part illustrates three memory representations, each with two instantiations: (1) Symbolic Memory summarizes past interactions as high-level abstractions via language-based subgoals, optionally grounded to image pixels; (2) Perceptual Memory encodes hist

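Experimental results

Research questions

RQ1: Which memory representation (symbolic, perceptual, recurrent) and which integration strategy perform best across RoboMME tasks?
RQ2: How does memory effectiveness depend on task characteristics (motion-centric, time-sensitive, long-horizon, dynamic scenes)?
RQ3: Is symbolic memory alone sufficient for memory-augmented manipulation, or do certain tasks require perceptual or recurrent memory?
RQ4: How well do humans perform on RoboMME, and which errors expose the remaining gap?

Key findings

Perceptual memory methods generally give the best overall performance across tasks, with FrameSamp + Modul achieving the strongest average among the variants.
Symbolic memory is competitive on specific counting and grounding tasks, but without accurate grounding it struggles in manipulation-heavy or cluttered scenes.
Recurrent memory is generally less effective in this setting, suggesting that robust long-horizon reasoning needs deeper recurrence or better pretraining.
Integration via memory-as-modulator yields large gains for perceptual memory by conditioning the action pathway with minimal structural disruption.
No single memory representation dominates all tasks; the representations are task-dependent and complementary, suggesting potential benefits from hybrid approaches.
Humans achieve high success rates but still struggle on long-horizon and time-sensitive tasks, underscoring RoboMME's difficulty and the need for robust memory-augmented policies.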
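
The finding about memory-as-modulator can be illustrated with a small sketch of one common conditioning technique, FiLM-style scale-and-shift modulation. This review does not describe how RoboMME implements its modulator, so FiLM is swapped in here purely as an assumption. The helper names (`mean_pool`, `linear`, `modulate`) and the zero-initialized projection are likewise hypothetical. The sketch shows why such a design causes little structural disruption: when the projection weights are zero, the action features pass through unchanged.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def mean_pool(memory_tokens: List[Vector]) -> Vector:
    """Average a non-empty list of memory tokens into one summary vector."""
    count = len(memory_tokens)
    dim = len(memory_tokens[0])
    return [sum(tok[d] for tok in memory_tokens) / count for d in range(dim)]


def linear(vec: Vector, weight: Matrix, bias: Vector) -> Vector:
    """Plain matrix-vector product plus bias; `weight` has one row per output."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def modulate(features: Vector, gamma: Vector, beta: Vector) -> Vector:
    """FiLM-style modulation (an assumption): out = (1 + gamma) * x + beta."""
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma and beta must have the same length")
    return [(1.0 + g) * x + b for x, g, b in zip(features, gamma, beta)]


if __name__ == "__main__":
    memory = [[1.0, 3.0], [3.0, 5.0]]   # two toy memory tokens
    action_features = [0.5, -2.0]       # toy action-pathway features
    summary = mean_pool(memory)

    # Zero-initialized projections: the modulation is the identity, so the
    # pretrained action pathway behaves as if no memory were attached.
    zero_w = [[0.0, 0.0], [0.0, 0.0]]
    zero_b = [0.0, 0.0]
    gamma = linear(summary, zero_w, zero_b)
    beta = linear(summary, zero_w, zero_b)
    print(modulate(action_features, gamma, beta))
```

Memory-as-context, by contrast, would append the memory tokens to the model's input sequence. That changes the sequence the backbone attends over rather than rescaling existing features.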

Figure 3 : Performance comparison across task characteristics.


This review was generated by AI and checked by a human editor.