QUICK REVIEW

[Paper Review] RoboMME: Benchmarking and Understanding Memory for Robotic Generalist Policies

Yumin Dai, Hongze Fu | arXiv (Cornell University) | 2026-03-04

Multimodal Machine Learning Applications | Citations: 0

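One-Line Summary

RoboMME introduces a large-scale benchmark and a family of memory-augmented VLA policies to study how different memory representations (symbolic, perceptual, recurrent) and integration strategies perform across four memory types (temporal, spatial, object, procedural) in long-horizon robotic manipulation.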

ABSTRACT

Memory is critical for long-horizon and history-dependent robotic manipulation. Such tasks often involve counting repeated actions or manipulating objects that become temporarily occluded. Recent vision-language-action (VLA) models have begun to incorporate memory mechanisms; however, their evaluations remain confined to narrow, non-standardized settings. This limits their systematic understanding, comparison, and progress measurement. To address these challenges, we introduce RoboMME: a large-scale standardized benchmark for evaluating and advancing VLA models in long-horizon, history-dependent scenarios. Our benchmark comprises 16 manipulation tasks constructed under a carefully designed taxonomy that evaluates temporal, spatial, object, and procedural memory. We further develop a suite of 14 memory-augmented VLA variants built on the π0.5 backbone to systematically explore different memory representations across multiple integration strategies. Experimental results show that the effectiveness of memory representations is highly task-dependent, with each design offering distinct advantages and limitations across different tasks. Videos and code can be found at our website https://robomme.github.io.

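Research Motivation and Goals

Develop a unified, large-scale benchmark that evaluates memory-dependent robotic manipulation policies across distinct memory demands (temporal, spatial, object, procedural).
Systematically compare memory representations and integration strategies within a controlled VLA framework built on a shared backbone.
Provide insight into which memory designs generalize across tasks and how task characteristics affect memory effectiveness.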

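Proposed Method

Propose RoboMME, a benchmark of 16 tasks organized into four suites (Counting, Permanence, Reference, Imitation) that probe temporal, spatial, object, and procedural memory.
Construct 14 memory-augmented VLA variants on the π0.5 backbone to explore symbolic, perceptual, and recurrent memory representations.
Implement three integration mechanisms for perceptual and recurrent memory: memory-as-context, memory-as-modulator, and memory-as-expert.
Ground symbolic memory through simple language subgoals or grounded subgoals generated by a model (Gemini, QwenVL, or an Oracle).
Encode perceptual memory as a sequence of visual tokens, selected by token dropping or frame sampling, and combine it with the integration mechanisms.
Use recurrent memory methods, including Test-Time Training (TTT) and the Recurrent Memory Transformer (RMT), each with the three integration strategies.
Evaluate under a fixed memory budget in a multi-task training setting to compare performance across tasks.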
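
To make the idea of frame sampling under a fixed memory budget concrete, here is a minimal sketch. It assumes uniform spacing over the observation history and always keeps the first and most recent frames; the function name `sample_frame_indices` and the uniform-spacing rule are illustrative assumptions, not the sampling scheme confirmed by the paper.

```python
def sample_frame_indices(num_frames: int, budget: int) -> list[int]:
    """Pick at most `budget` evenly spaced frame indices from a history.

    Hypothetical helper: the uniform-spacing rule is an assumption made for
    illustration, not the benchmark's documented procedure.
    """
    if num_frames <= 0 or budget <= 0:
        return []
    if num_frames <= budget:
        # The whole history fits in the budget, so keep every frame.
        return list(range(num_frames))
    if budget == 1:
        # With a single slot, keep only the most recent frame.
        return [num_frames - 1]
    # Integer arithmetic keeps indices strictly increasing and ends exactly
    # on the most recent frame (index num_frames - 1).
    return [(i * (num_frames - 1)) // (budget - 1) for i in range(budget)]


if __name__ == "__main__":
    # A 10-frame history compressed into 4 memory slots.
    print(sample_frame_indices(10, 4))
```

Under this scheme the memory cost stays constant as the episode grows, at the price of coarser temporal resolution for long histories.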

Figure 2 : Framework of MME-VLA Suite. The top part illustrates three memory representations, each with two instantiations: (1) Symbolic Memory summarizes past interactions as high-level abstractions via language-based subgoals, optionally grounded to image pixels; (2) Perceptual Memory encodes hist
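
The difference between memory-as-context and memory-as-modulator can be contrasted with a toy sketch. It assumes that context integration prepends memory tokens to the input sequence and that modulation takes a FiLM-style scale-and-shift form driven by a pooled memory summary. Both forms, the scalar weights `w_scale` and `w_shift`, and all function names are assumptions for illustration; the review does not specify the exact formulation, and memory-as-expert is omitted because its structure is not described here.

```python
def mean_pool(tokens):
    """Average equal-length token vectors into a single summary vector."""
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / len(tokens) for d in range(dim)]


def memory_as_context(memory_tokens, current_tokens):
    """Assumed form: prepend memory tokens, so the sequence gets longer."""
    return list(memory_tokens) + list(current_tokens)


def memory_as_modulator(memory_tokens, features, w_scale, w_shift):
    """Assumed FiLM-style form: features * (1 + scale) + shift.

    `w_scale` and `w_shift` are hypothetical scalar stand-ins for learned
    projections; the feature sequence keeps its original length and shape.
    """
    summary = mean_pool(memory_tokens)
    scale = [w_scale * s for s in summary]
    shift = [w_shift * s for s in summary]
    return [
        [f * (1 + sc) + sh for f, sc, sh in zip(vec, scale, shift)]
        for vec in features
    ]


if __name__ == "__main__":
    memory = [[1.0, 3.0], [3.0, 5.0]]
    current = [[1.0, 1.0]]
    print(len(memory_as_context(memory, current)))
    print(memory_as_modulator(memory, current, w_scale=0.5, w_shift=0.0))
```

In this toy version, zero weights leave the features untouched, which gives one intuition for why a modulation path can condition the action pathway while leaving the sequence structure unchanged.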

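Experimental Results

Research Questions

RQ1: Which memory representation (symbolic, perceptual, recurrent) and which integration strategy yield the highest performance on RoboMME tasks?
RQ2: How does memory effectiveness vary with task characteristics (motion-centric, time-sensitive, long-horizon, dynamic scenes)?
RQ3: Is symbolic memory alone sufficient for memory-augmented manipulation, or do certain tasks require perceptual or recurrent memory?
RQ4: How well do humans perform on RoboMME, and which errors reveal the remaining gap?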

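Key Results

Perceptual memory methods generally achieve the highest performance across tasks, and FrameSamp + Modul attains the strongest average among the variants.
Symbolic memory is competitive on certain counting and binding/grounding tasks, but without precise grounding it struggles in manipulation-heavy scenes and complex environments.
Recurrent memory is generally less effective in this setting, suggesting that robust long-horizon reasoning requires deeper recurrence or better pretraining.
Memory-as-modulator integration offers a strong advantage for perceptual memory by conditioning the action pathway with minimal structural disruption.
No single memory representation dominates all tasks; strengths are task-dependent and complementary, suggesting potential benefits from hybrid approaches.
Humans achieve high success rates but still face difficulty on long-horizon and time-sensitive challenges, underscoring RoboMME's difficulty and the need for strong memory-augmented policies.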

Figure 3 : Performance comparison across task characteristics.
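
A comparison by task characteristic, like the one the figure caption describes, could be tabulated along the following lines. This is a minimal sketch: the function name, the task identifiers, and the one-group-per-task mapping are hypothetical, and the toy outcomes in the usage example are placeholders, not results from the paper.

```python
from collections import defaultdict


def success_by_group(episodes, task_to_group):
    """Return {group: success_rate} from (task_id, success) episode records.

    Assumes each task maps to exactly one group, which is a simplification:
    a real task may exhibit several characteristics at once.
    """
    totals = defaultdict(lambda: [0, 0])  # group -> [successes, episodes]
    for task_id, success in episodes:
        group = task_to_group[task_id]
        totals[group][0] += int(success)
        totals[group][1] += 1
    return {group: wins / count for group, (wins, count) in totals.items()}


if __name__ == "__main__":
    # Placeholder records for illustration only.
    toy_episodes = [("task_a", True), ("task_a", False), ("task_b", True)]
    toy_groups = {"task_a": "time-sensitive", "task_b": "long-horizon"}
    print(success_by_group(toy_episodes, toy_groups))
```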


This review was generated by AI and reviewed by a human editor.