QUICK REVIEW

[Paper Review] RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Yumin Dai, Hongze Fu|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications | Cited by 0

One-Sentence Summary

tldr: RoboMME introduces a large-scale benchmark and a family of memory-augmented VLA policies to study how different memory representations (symbolic, perceptual, recurrent) and integration strategies perform across four memory types (temporal, spatial, object, procedural) in long-horizon robotic manipulation.

ABSTRACT

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

Research Motivation and Objectives

Develop a unified, large-scale benchmark to evaluate memory-dependent robotic manipulation policies across distinct memory demands (temporal, spatial, object, procedural).
Systematically compare memory representations and integration strategies within a controlled VLA framework on a common backbone.
Provide insights into which memory designs generalize across tasks and how task characteristics influence memory effectiveness.

Proposed Method

Propose RoboMME with 16 tasks organized into four suites (Counting, Permanence, Reference, Imitation) to probe temporal, spatial, object, and procedural memory.
Construct 14 memory-augmented VLA variants built on the π0.5 backbone to explore symbolic, perceptual, and recurrent memory representations.
Implement three integration mechanisms for perceptual and recurrent memory: memory-as-context, memory-as-modulator, and memory-as-expert.
Represent symbolic memory as simple or grounded language subgoals, generated by a vision-language model (Gemini or QwenVL) or supplied by an oracle.
Encode perceptual memory as sequences of visual tokens selected by token dropping or frame sampling, and feed them to the policy through the integration mechanisms.
Use recurrent memory approaches including test-time training (TTT) and recurrent memory transformers (RMT) with three integration strategies.
Evaluate under a fixed memory budget and multi-task training setup to compare performance across tasks.
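The frame-sampling idea under a fixed memory budget can be illustrated with a minimal sketch. It assumes uniform (evenly spaced) sampling over the history and a constant number of visual tokens per frame; the function name `sample_frame_indices` and its parameters are hypothetical and are not taken from the RoboMME code, which may sample frames differently.

```python
def sample_frame_indices(num_frames: int, tokens_per_frame: int,
                         token_budget: int) -> list[int]:
    """Pick evenly spaced history frames whose tokens fit in the budget.

    Hypothetical helper: assumes every frame costs `tokens_per_frame`
    tokens and that uniform spacing is an acceptable sampling rule.
    """
    if num_frames <= 0 or tokens_per_frame <= 0:
        return []
    # How many whole frames the budget can hold.
    max_frames = token_budget // tokens_per_frame
    if max_frames <= 0:
        return []
    # Short histories fit entirely, so keep every frame.
    if num_frames <= max_frames:
        return list(range(num_frames))
    # With room for a single frame, keep only the most recent one.
    if max_frames == 1:
        return [num_frames - 1]
    # Otherwise spread the kept frames from the first to the last index.
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


# 100 history frames, 64 tokens each, 512-token budget: 8 frames are kept.
indices = sample_frame_indices(100, 64, 512)
print(indices)
```

The number of kept frames depends only on the budget, not on the episode length, which is what makes comparisons across memory variants fair in this kind of setup.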

Figure 2: Framework of MME-VLA Suite. The top part illustrates three memory representations, each with two instantiations: (1) Symbolic Memory summarizes past interactions as high-level abstractions via language-based subgoals, optionally grounded to image pixels; (2) Perceptual Memory encodes hist

Experimental Results

Research Questions

RQ1: Which memory representations (symbolic, perceptual, recurrent) and which integration strategies yield the strongest performance across RoboMME tasks?
RQ2: How does memory effectiveness depend on task characteristics (motion-centric, time-sensitive, long-horizon, dynamic scenes)?
RQ3: Can symbolic memory alone suffice for memory-augmented manipulation, or are perceptual/recurrent memories necessary for certain tasks?
RQ4: How well do humans perform on RoboMME, and what do their errors reveal about the remaining gaps?

Key Findings

Perceptual memory methods generally yield the highest overall performance across tasks, with FrameSamp + Modul achieving the strongest average across variants.
Symbolic memory can be competitive on certain counting and grounding tasks, but struggles on manipulation-heavy or cluttered scenes without precise grounding.
Recurrent memory is generally less effective in this setup, suggesting that deeper recurrence or better pretraining may be required for robust long-horizon reasoning.
Memory integration via memory-as-modulator provides strong gains for perceptual memory by conditioning the action pathway with minimal architectural disruption.
No single memory representation dominates all tasks; strengths are task-dependent and complementary, indicating potential benefits from hybrid approaches.
Humans achieve high success but still face long-horizon and time-sensitive challenges, underscoring RoboMME’s difficulty and its demand for robust memory-augmented policies.
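To make the memory-as-modulator finding concrete, here is a minimal sketch of one common way to condition features on a memory vector: a feature-wise scale and shift (FiLM-style). This is an assumption chosen for illustration; the summary does not say which modulation form RoboMME uses, and the function `modulate` and its weight matrices are hypothetical.

```python
def modulate(features: list[float], memory: list[float],
             w_scale: list[list[float]],
             w_shift: list[list[float]]) -> list[float]:
    """Apply a feature-wise scale and shift predicted from a memory vector.

    Hypothetical FiLM-style sketch: `w_scale` and `w_shift` are plain
    weight matrices with one row per feature and one column per memory
    dimension.
    """
    def linear(weights: list[list[float]], vec: list[float]) -> list[float]:
        # Matrix-vector product written out with the standard library.
        return [sum(w * v for w, v in zip(row, vec)) for row in weights]

    scale = linear(w_scale, memory)
    shift = linear(w_shift, memory)
    # The (1 + scale) form means zero weights leave the features unchanged.
    return [f * (1.0 + s) + b for f, s, b in zip(features, scale, shift)]


features = [1.0, 2.0]
memory = [1.0, 0.5]
zeros = [[0.0, 0.0], [0.0, 0.0]]

# Zero-initialized modulation weights reproduce the input features.
print(modulate(features, memory, zeros, zeros))
```

In this sketch, zero-initialized weights make the modulated pathway behave exactly like the unmodified one, which is one plausible reading of "minimal architectural disruption": memory only changes the features once the modulation weights are trained away from zero.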

Figure 3: Performance comparison across task characteristics.

This review was generated by AI and checked by a human editor.