QUICK REVIEW

[Paper Review] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Guo Chen, Lidong Lü | arXiv (Cornell University) | 2026. 03. 05.

Domain Adaptation and Few-Shot Learning | Citations: 0

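One-Line Summary

MM-Lifelong presents a multimodal lifelong dataset at day, week, and month scales for studying long-horizon understanding, and proposes ReMA, a recursive multimodal agent that overcomes the memory bottleneck of lifelong streams.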

ABSTRACT

While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.

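Motivation and Objectives

Define the Lifelong Horizon for multimodal understanding by distinguishing Observational Duration from Physical Temporal Span.
Build MM-Lifelong, a multi-scale dataset spanning diverse domains, to mimic long-term, sparse real-world streams.
Characterize the failure modes of end-to-end MLLMs and agent-based baselines on lifelong timelines.
Propose ReMA, a recursive, memory-based agent that manages memory dynamically and improves long-horizon reasoning.
Provide standardized train/validation/test protocols that enable robust evaluation and generalization under temporal and domain shift.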
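The distinction between Observational Duration and Physical Temporal Span can be made concrete with a small sketch. It assumes, as a reading of the Figure 1 caption rather than the paper's formal definition, that T_dur is the total recorded footage and T_span is the real-world time between the first and last observation. The `Clip` type and its fields are hypothetical and assume non-overlapping clips.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Clip:
    """A recorded segment; the fields are hypothetical, not the dataset schema."""
    start_s: float  # wall-clock start, in seconds from an arbitrary origin
    end_s: float    # wall-clock end, same origin


def observational_duration(clips: Sequence[Clip]) -> float:
    """T_dur (assumed): total recorded footage, ignoring gaps between clips."""
    return sum(c.end_s - c.start_s for c in clips)


def physical_temporal_span(clips: Sequence[Clip]) -> float:
    """T_span (assumed): real-world time from the first start to the last end."""
    return max(c.end_s for c in clips) - min(c.start_s for c in clips)


# Two one-hour recordings made a day apart.
clips = [Clip(0.0, 3600.0), Clip(86400.0, 90000.0)]
t_dur = observational_duration(clips)    # 7200.0 seconds (2 hours)
t_span = physical_temporal_span(clips)   # 90000.0 seconds (25 hours)
coverage = t_dur / t_span                # fraction of the span that was observed
```

Under these assumptions a densely concatenated clip collection has a coverage close to 1, while a sparse lifelong stream has a coverage far below 1, which is the regime the dataset targets.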

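Proposed Method

Formalize the lifelong multimodal task, including the definition of the Lifelong Horizon and two new temporal metrics: T_dur (Observational Duration) and T_span (Physical Temporal Span).
Construct MM-Lifelong, 181.1 hours in total across the Day, Week, and Month domains, to simulate continuous life streams with varying degrees of sparsity.
Use clue-grounded annotations to enable grounding at actual temporal locations and to support two task types: Needle-in-a-Lifestream and Multi-Hop Reasoning.
Introduce the Recursive Multimodal Agent (ReMA), whose two-phase loop builds a language-augmented belief state: perception into memory, followed by recursive reasoning through memory-based control actions (Answer, MMInspect, MemSearch).
Benchmark end-to-end MLLMs and agentic baselines to expose the context bottleneck and to demonstrate ReMA's higher accuracy and grounding on lifelong tasks.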
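The two-phase loop described for ReMA can be pictured with a toy sketch. This is not the authors' implementation: only the action names (Answer, MMInspect, MemSearch) come from the review, while `BeliefState`, `perceive`, `mem_search`, `run_agent`, the keyword retrieval, and the callback signatures are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class MemoryEntry:
    start_s: float
    end_s: float
    text: str  # language description of the segment


@dataclass
class BeliefState:
    entries: List[MemoryEntry] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)  # evidence gathered so far


def perceive(segments: List[Tuple[float, float]],
             describe: Callable[[float, float], str]) -> BeliefState:
    """Phase 1 (perception to memory): caption each segment into a text entry."""
    return BeliefState([MemoryEntry(s, e, describe(s, e)) for s, e in segments])


def mem_search(belief: BeliefState, query: str) -> List[MemoryEntry]:
    """Toy keyword retrieval; a real system would likely use something richer."""
    words = set(query.lower().split())
    return [m for m in belief.entries if words & set(m.text.lower().split())]


def run_agent(question: str, belief: BeliefState,
              controller: Callable, inspect: Callable,
              max_rounds: int = 5) -> Optional[str]:
    """Phase 2 (memory-based control): act until the controller answers."""
    for _ in range(max_rounds):
        action, arg = controller(question, belief)
        if action == "Answer":
            return arg
        if action == "MemSearch":
            for hit in mem_search(belief, arg):
                belief.notes.append(f"hit {hit.start_s}-{hit.end_s}: {hit.text}")
        elif action == "MMInspect":
            start_s, end_s = arg
            # Re-examine the raw segment and record what was found.
            belief.notes.append(inspect(start_s, end_s, question))
    return None  # recursion budget exhausted without an answer
```

In this sketch the `controller` and `inspect` callbacks stand in for the controller model and the MLLM tool. The belief state is updated in place each round, so the controller always decides from the accumulated evidence rather than from the raw stream.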

Figure 1 : Physical Temporal Span vs. Scale. The x-axis represents the Physical Temporal Span ( $T_{span}$ ), while bubble size indicates Observational Duration ( $T_{dur}$ ). Unlike existing datasets clustered in the bottom-left (short clips, $T_{span}\approx T_{dur}$ ), MM-Lifelong occupies the un

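Experimental Results

Research Questions

RQ1: How do current multimodal learners perform under Lifelong Horizon constraints with sparse temporal segments and domain shift?
RQ2: Does the recursive, memory-augmented agent (ReMA) outperform end-to-end MLLMs on lifelong multimodal streams?
RQ3: What memory-update granularity and reasoning depth are effective for sustaining performance from days to weeks?
RQ4: Do clue-grounded annotations contribute to robust evaluation and grounding across multiple temporal resolutions?
RQ5: To what extent do different backbone models (the controller and the MLLM tool) affect lifelong reasoning and grounding?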

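Key Findings

End-to-end MLLMs exhibit the Working Memory Bottleneck: as context grows, performance saturates or degrades.
Agentic baselines that rely on global video localization collapse under month-scale sparsity, whereas ReMA scales through recursion and dynamic memory.
ReMA reaches 18.62% on Val@Month and achieves the highest accuracy on the Week and Day sets along with strong grounding (Ref@300 16.37%).
Finer perception resolution (e.g., Δt = 2 minutes) helps both accuracy and grounding, whereas full-video resolution degrades performance because of noise and inference cost.
Using multimodal backbones (MLLMs) for both perception and control yields better results than a text-only controller, showing the importance of multimodal alignment in lifelong reasoning.
GPT-5 shows high reliability as the judge for automatic evaluation (F1 of about 99.4%).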
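The perception resolution Δt mentioned in the findings can be read as the width of the window the agent perceives at a time. The sketch below shows one plausible way to cut a timeline into such windows. The function name, the default of 120 seconds, and the handling of the final partial window are assumptions, not details from the paper.

```python
import math
from typing import List, Tuple


def perception_windows(start_s: float, end_s: float,
                       delta_t_s: float = 120.0) -> List[Tuple[float, float]]:
    """Split [start_s, end_s) into consecutive windows of width delta_t_s.

    The last window is truncated at end_s (an assumed convention).
    """
    if delta_t_s <= 0:
        raise ValueError("delta_t_s must be positive")
    if end_s <= start_s:
        return []
    count = math.ceil((end_s - start_s) / delta_t_s)
    # Index-based bounds avoid accumulating floating-point drift.
    return [(start_s + i * delta_t_s, min(start_s + (i + 1) * delta_t_s, end_s))
            for i in range(count)]


windows = perception_windows(0, 300, 120)
```

A smaller Δt produces more windows and therefore more memory entries per hour of footage, which is the trade-off between grounding precision and inference cost that the findings describe.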

Figure 2 : Performance Scaling Analysis. As the number of input frames increases, end-to-end MLLMs initially improve but soon exhibit performance oscillation and even sharp degradation due to context saturation and noise accumulation. In contrast, ReMA consistently scales with more recursion rounds,


This review was generated by AI and reviewed by a human editor.