QUICK REVIEW

[Paper Review] RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Yumin Dai, Hongze Fu|arXiv (Cornell University)|Mar 4, 2026
Multimodal Machine Learning Applications|Cited by 0
One-Sentence Summary

tldr: RoboMME introduces a large-scale benchmark and a family of memory-augmented VLA policies to study how different memory representations (symbolic, perceptual, recurrent) and integration strategies perform across four memory types (temporal, spatial, object, procedural) in long-horizon robotic manipulation.

ABSTRACT

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

Motivation and Objectives

  • Develop a unified, large-scale benchmark to evaluate memory-dependent robotic manipulation policies across distinct memory demands (temporal, spatial, object, procedural).
  • Systematically compare memory representations and integration strategies within a controlled VLA framework on a common backbone.
  • Provide insights into which memory designs generalize across tasks and how task characteristics influence memory effectiveness.

Proposed Method

  • Propose RoboMME with 16 tasks organized into four suites (Counting, Permanence, Reference, Imitation) to probe temporal, spatial, object, and procedural memory.
  • Construct 14 memory-augmented VLA variants built on the π0.5 backbone to explore symbolic, perceptual, and recurrent memory representations.
  • Implement three integration mechanisms for perceptual and recurrent memory: memory-as-context, memory-as-modulator, and memory-as-expert.
  • Represent symbolic memory as language subgoals, either simple or grounded, produced by Gemini, QwenVL, or an oracle.
  • Encode perceptual memory as sequences of visual tokens, compressed by token dropping or frame sampling, and pair each encoding with the integration mechanisms.
  • Use recurrent memory approaches including test-time training (TTT) and recurrent memory transformers (RMT) with three integration strategies.
  • Evaluate under a fixed memory budget and multi-task training setup to compare performance across tasks.
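To make the idea of frame sampling under a fixed memory budget concrete, here is a minimal sketch. It assumes uniform spacing over the history and that the first and most recent frames are always kept; the summary does not specify the sampling rule the paper uses, and the function name `sample_frame_indices` and its parameters are hypothetical.

```python
def sample_frame_indices(num_frames: int, budget: int) -> list[int]:
    """Pick at most `budget` frame indices from a history of `num_frames` frames.

    Assumption (not from the paper): indices are spread uniformly over the
    history, and the most recent frame is always included.
    """
    if num_frames <= 0 or budget <= 0:
        return []
    if num_frames <= budget:
        # The whole history fits in the budget, so keep every frame.
        return list(range(num_frames))
    if budget == 1:
        # With a single slot, keep only the most recent frame.
        return [num_frames - 1]
    last = num_frames - 1
    # Evenly spaced positions from the first frame to the last, inclusive.
    return [round(i * last / (budget - 1)) for i in range(budget)]


print(sample_frame_indices(10, 4))  # [0, 3, 6, 9]
```

The memory cost stays constant as the episode grows, at the price of coarser temporal resolution for long histories. Token dropping makes a different trade: it would keep more frames but fewer tokens per frame.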
Figure 2: Framework of MME-VLA Suite. The top part illustrates three memory representations, each with two instantiations: (1) Symbolic Memory summarizes past interactions as high-level abstractions via language-based subgoals, optionally grounded to image pixels; (2) Perceptual Memory encodes hist
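The design space described above can be enumerated as a small grid. The sketch below is one reading that is consistent with this summary, not a breakdown confirmed by the paper: each of the three representations has two instantiations, and the three integration mechanisms apply only to perceptual and recurrent memory. Under that assumption the count comes to 14. All identifier names are hypothetical labels.

```python
from itertools import product

# Two instantiations per memory representation, as listed in the summary.
REPRESENTATIONS = {
    "symbolic": ["simple_subgoal", "grounded_subgoal"],
    "perceptual": ["token_dropping", "frame_sampling"],
    "recurrent": ["ttt", "rmt"],
}
# Integration mechanisms, which the summary ties to perceptual and recurrent memory.
INTEGRATIONS = ["context", "modulator", "expert"]


def enumerate_variants():
    """Return (family, instantiation, integration) tuples for the variant grid.

    Assumption: symbolic variants take no integration mechanism (None),
    because their subgoals are plain language.
    """
    variants = []
    for family, instantiations in REPRESENTATIONS.items():
        if family == "symbolic":
            variants += [(family, inst, None) for inst in instantiations]
        else:
            variants += [
                (family, inst, integ)
                for inst, integ in product(instantiations, INTEGRATIONS)
            ]
    return variants


print(len(enumerate_variants()))  # 14 = 2 + 2*3 + 2*3
```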

Experimental Results

Research Questions

  • RQ1: Which memory representations (symbolic, perceptual, recurrent) and which integration strategies yield the strongest performance across RoboMME tasks?
  • RQ2: How does memory effectiveness depend on task characteristics (motion-centric, time-sensitive, long-horizon, dynamic scenes)?
  • RQ3: Can symbolic memory alone suffice for memory-augmented manipulation, or are perceptual/recurrent memories necessary for certain tasks?
  • RQ4: How well do humans perform on RoboMME tasks, and what do their errors reveal about the remaining gaps?

Key Findings

  • Perceptual memory methods generally yield the highest overall performance across tasks, with frame sampling combined with memory-as-modulator (FrameSamp + Modul) achieving the highest average among all variants.
  • Symbolic memory can be competitive on certain counting and grounding tasks, but struggles on manipulation-heavy or cluttered scenes without precise grounding.
  • Recurrent memory is generally less effective in this setup, suggesting that deeper recurrence or better pretraining may be required for robust long-horizon reasoning.
  • Memory integration via memory-as-modulator provides strong gains for perceptual memory by conditioning the action pathway with minimal architectural disruption.
  • No single memory representation dominates all tasks; strengths are task-dependent and complementary, indicating potential benefits from hybrid approaches.
  • Humans achieve high success but still face long-horizon and time-sensitive challenges, underscoring RoboMME’s difficulty and its demand for robust memory-augmented policies.
Figure 3: Performance comparison across task characteristics.
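The finding about memory-as-modulator can be illustrated with a generic FiLM-style scale-and-shift, a common way to condition one pathway on another signal. This summary does not describe how the paper implements its modulator, so the sketch below is an assumption for illustration only; `modulate`, `w_scale`, and `w_shift` are hypothetical names. It uses plain Python lists to stay self-contained.

```python
def matvec(weights, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in weights]


def modulate(features, memory, w_scale, w_shift):
    """FiLM-style modulation (illustrative, not the paper's confirmed design).

    A pooled memory vector is projected to a per-feature scale and shift,
    and the action-pathway features become: features * (1 + scale) + shift.
    """
    scale = matvec(w_scale, memory)
    shift = matvec(w_shift, memory)
    return [f * (1.0 + s) + b for f, s, b in zip(features, scale, shift)]


features = [1.0, -2.0, 0.5]  # hidden features in the action pathway
memory = [0.3, 0.7]          # pooled memory embedding
zeros = [[0.0, 0.0] for _ in features]

# With zero-initialized projections the modulation is the identity,
# so the pretrained pathway is left unchanged at the start of training.
print(modulate(features, memory, zeros, zeros))  # [1.0, -2.0, 0.5]
```

Starting from an identity mapping is one plausible way to get "minimal architectural disruption": the backbone's behavior is preserved until the projections learn to use the memory signal.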


This review was generated by AI and reviewed by a human editor.