[Paper Explained] RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies
TL;DR: RoboMME introduces a large-scale benchmark and a family of memory-augmented VLA policies to study how different memory representations (symbolic, perceptual, recurrent) and integration strategies perform across four memory types (temporal, spatial, object, procedural) in long-horizon robotic manipulation.
Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.
Research Motivation and Goals
- Develop a unified, large-scale benchmark to evaluate memory-dependent robotic manipulation policies across distinct memory demands (temporal, spatial, object, procedural).
- Systematically compare memory representations and integration strategies within a controlled VLA framework on a common backbone.
- Provide insights into which memory designs generalize across tasks and how task characteristics influence memory effectiveness.
Proposed Method
- Propose RoboMME with 16 tasks organized into four suites (Counting, Permanence, Reference, Imitation) to probe temporal, spatial, object, and procedural memory.
- Construct 14 memory-augmented VLA variants built on the π0.5 backbone to explore symbolic, perceptual, and recurrent memory representations.
- Implement three integration mechanisms for perceptual and recurrent memory: memory-as-context, memory-as-modulator, and memory-as-expert.
- Represent symbolic memory as simple or grounded language subgoals produced by a subgoal generator (Gemini, QwenVL, or an oracle).
- Encode perceptual memory as sequences of visual tokens, reduced by token dropping or frame sampling, and pair them with the integration mechanisms above.
- Use recurrent memory approaches including test-time training (TTT) and recurrent memory transformers (RMT) with three integration strategies.
- Evaluate under a fixed memory budget and multi-task training setup to compare performance across tasks.
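The frame-sampling idea under a fixed memory budget can be pictured with a small sketch. This is an illustration only, not the authors' implementation: it assumes uniform (evenly spaced) sampling that always keeps the most recent frame, and the names `sample_frame_indices`, `frames_within_budget`, `tokens_per_frame`, and `token_budget` are hypothetical.

```python
from typing import List


def sample_frame_indices(num_frames: int, max_frames: int) -> List[int]:
    """Pick at most `max_frames` evenly spaced indices from a history of
    `num_frames` frames (assumed policy: keep first and latest frame)."""
    if num_frames <= 0 or max_frames <= 0:
        return []
    if num_frames <= max_frames:
        # History already fits: keep every frame.
        return list(range(num_frames))
    if max_frames == 1:
        # Only room for one frame: keep the most recent one.
        return [num_frames - 1]
    step = (num_frames - 1) / (max_frames - 1)
    return [round(i * step) for i in range(max_frames)]


def frames_within_budget(num_frames: int, tokens_per_frame: int,
                         token_budget: int) -> List[int]:
    """Translate a token budget into a frame count, then sample frames."""
    if tokens_per_frame <= 0:
        raise ValueError("tokens_per_frame must be positive")
    max_frames = token_budget // tokens_per_frame
    return sample_frame_indices(num_frames, max_frames)


# Example: a 10-frame history, 64 visual tokens per frame, 256-token budget.
selected = frames_within_budget(num_frames=10, tokens_per_frame=64,
                                token_budget=256)
print(selected)
```

Token dropping would instead keep every frame but discard a subset of tokens within each one. Either way the memory passed to the policy stays within the same budget, which is what makes the variants comparable.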

Experimental Results
Research Questions
- RQ1: Which memory representations (symbolic, perceptual, recurrent) and which integration strategies yield the strongest performance across RoboMME tasks?
- RQ2: How does memory effectiveness depend on task characteristics (motion-centric, time-sensitive, long-horizon, dynamic scenes)?
- RQ3: Can symbolic memory alone suffice for memory-augmented manipulation, or are perceptual/recurrent memories necessary for certain tasks?
- RQ4: How well do humans perform on RoboMME tasks, and what do their errors reveal about the remaining gaps?
Key Findings
- Perceptual memory methods generally yield the highest overall performance across tasks, with FrameSamp + Modul achieving the strongest average across variants.
- Symbolic memory can be competitive on certain counting and grounding tasks, but struggles on manipulation-heavy or cluttered scenes without precise grounding.
- Recurrent memory is generally less effective in this setup, suggesting that deeper recurrence or better pretraining may be required for robust long-horizon reasoning.
- Memory integration via memory-as-modulator provides strong gains for perceptual memory by conditioning the action pathway with minimal architectural disruption.
- No single memory representation dominates all tasks; strengths are task-dependent and complementary, indicating potential benefits from hybrid approaches.
- Humans achieve high success but still face long-horizon and time-sensitive challenges, underscoring RoboMME’s difficulty and its demand for robust memory-augmented policies.
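To make the memory-as-modulator finding concrete, here is a minimal sketch of one common form of modulation-style conditioning: a feature-wise scale and shift (FiLM-like). Whether RoboMME uses exactly this form is an assumption, not something this summary states. The function name `modulate` and the idea that `gamma` and `beta` come from a learned projection of a memory summary are hypothetical.

```python
from typing import List


def modulate(features: List[float], gamma: List[float],
             beta: List[float]) -> List[float]:
    """Apply a feature-wise scale-and-shift: (1 + gamma) * x + beta.

    `gamma` and `beta` are assumed to be predicted from a memory summary by
    a learned projection (not shown). With gamma = beta = 0 the features
    pass through unchanged, so the backbone starts out undisturbed.
    """
    if not (len(features) == len(gamma) == len(beta)):
        raise ValueError("features, gamma, and beta must have equal length")
    return [(1.0 + g) * x + b for x, g, b in zip(features, gamma, beta)]


# Zero modulation leaves the action-pathway features untouched.
print(modulate([1.0, 2.0], [0.0, 0.0], [0.0, 0.0]))
# Non-zero modulation rescales and shifts each feature independently.
print(modulate([1.0, 2.0], [1.0, -0.5], [0.5, 0.0]))
```

Under this assumed form, memory changes the behavior of existing features without adding tokens to the context, which is one way to read "minimal architectural disruption."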

This explainer was generated by AI and reviewed by a human editor.