QUICK REVIEW

[Paper Review] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline

Guo Chen, Lidong Lü | arXiv (Cornell University) | Mar 5, 2026
Domain Adaptation and Few-Shot Learning | Citations: 0
One-Line Summary

Introduces MM-Lifelong, a multimodal lifelong dataset with day/week/month scales to study long-horizon understanding, and proposes ReMA, a recursive multimodal agent that overcomes memory bottlenecks in lifelong streams.

ABSTRACT

While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.

Research Motivation and Objectives

  • Define the Lifelong Horizon for multimodal understanding by distinguishing Observational Duration from Physical Temporal Span.
  • Create MM-Lifelong, a multi-scale dataset with diverse domains to mimic long-term, sparse, real-world streams.
  • Characterize failure modes of end-to-end MLLMs and agentic baselines on lifelong timelines.
  • Propose ReMA, a recursive memory-based agent, to manage dynamic memory and improve long-horizon reasoning.
  • Provide a standardized train/val/test protocol to enable robust evaluation and generalization under temporal/domain shifts.
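
The distinction between Observational Duration and Physical Temporal Span in the first bullet can be made concrete with a small sketch. It assumes, as a reading of the Figure 1 caption rather than the paper's formal definition, that $T_{dur}$ is the total length of recorded footage and $T_{span}$ is the wall-clock time from the first to the last recorded moment. The `Segment` class, the example stream, and the `coverage` ratio are hypothetical and serve only as illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One recorded stretch of footage; times are seconds on a shared wall clock."""
    start: float
    end: float


def observational_duration(segments):
    """T_dur (assumed): total footage length, for non-overlapping segments."""
    return sum(s.end - s.start for s in segments)


def physical_temporal_span(segments):
    """T_span (assumed): wall-clock time from the first recorded moment to the last."""
    return max(s.end for s in segments) - min(s.start for s in segments)


HOUR = 3600.0
DAY = 24 * HOUR

# Hypothetical sparse stream: one hour of footage on each of three days in a week.
stream = [
    Segment(0 * DAY, 0 * DAY + HOUR),
    Segment(3 * DAY, 3 * DAY + HOUR),
    Segment(6 * DAY, 6 * DAY + HOUR),
]

t_dur = observational_duration(stream)   # 3 hours of footage
t_span = physical_temporal_span(stream)  # 6 days and 1 hour of wall-clock time
coverage = t_dur / t_span                # hypothetical ratio: share of the span observed
```

For a single continuous clip the two quantities coincide, which matches the $T_{span}\approx T_{dur}$ regime the Figure 1 caption attributes to existing datasets. In the sparse stream above, only about 2% of the span is observed.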

Proposed Method

  • Formalize the lifelong multimodal task with two new temporal metrics (T_dur, T_span) and a Lifelong Horizon definition.
  • Construct MM-Lifelong with Day/Week/Month domains totaling 181.1 hours to simulate continuous lifespans with varying sparsity.
  • Annotate data using clue-grounded annotations to enable ground-truth temporal localization and two task types (Needle-in-a-Lifestream, Multi-Hop Reasoning).
  • Introduce Recursive Multimodal Agent (ReMA) that builds a language-augmented belief state via a two-phase loop: perception to memory, then recursive reasoning with memory-driven control actions (Answer, MMInspect, MemSearch).
  • Benchmark end-to-end MLLMs and agentic baselines, showing context bottlenecks, and demonstrating ReMA’s superior accuracy and grounding on lifelong tasks.
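
The two-phase loop described for ReMA can be sketched as a toy program. This is not the authors' implementation: the action names `Answer`, `MMInspect`, and `MemSearch` come from the review, while `BeliefState`, `toy_controller`, the keyword search, and the use of pre-written captions in place of real multimodal perception are all assumptions made for illustration. A real system would presumably use a model as controller and an MLLM as the inspection tool.

```python
import re
from dataclasses import dataclass, field
from typing import Optional

STOPWORDS = {"where", "are", "is", "the", "a", "on", "in"}


def tokens(text):
    """Lowercased content words of a string, with stopwords removed."""
    return set(re.findall(r"[a-z]+", text.lower())) - STOPWORDS


@dataclass
class BeliefState:
    """Hypothetical belief state: text memory plus what the loop has found so far."""
    notes: list = field(default_factory=list)   # (start_s, end_s, caption)
    hits: Optional[list] = None                 # result of the last MemSearch
    evidence: Optional[list] = None             # result of the last MMInspect
    trace: list = field(default_factory=list)   # actions taken, in order


def perceive(stream, state):
    """Phase 1 stand-in: store one text note per chunk (captions are given here)."""
    for start, end, caption in stream:
        state.notes.append((start, end, caption))


def mem_search(state, query):
    """MemSearch stand-in: notes sharing at least one content word with the query."""
    wanted = tokens(query)
    return [n for n in state.notes if wanted & tokens(n[2])]


def mm_inspect(stream, start, end):
    """MMInspect stand-in: re-read the raw chunks overlapping [start, end)."""
    return [text for s, e, text in stream if s < end and e > start]


def toy_controller(question, state):
    """Hypothetical rule-based controller choosing the next action."""
    if state.hits is None:
        return "MemSearch", question
    if state.hits and state.evidence is None:
        start, end, _ = state.hits[0]
        return "MMInspect", (start, end)
    return "Answer", (state.evidence[0] if state.evidence else "unknown")


def answer(question, stream, max_rounds=5):
    """Phase 2: loop over controller actions until Answer or the round budget."""
    state = BeliefState()
    perceive(stream, state)
    for _ in range(max_rounds):
        action, arg = toy_controller(question, state)
        state.trace.append(action)
        if action == "MemSearch":
            state.hits = mem_search(state, arg)
        elif action == "MMInspect":
            state.evidence = mm_inspect(stream, *arg)
        else:
            return arg, state.trace
    return "unknown", state.trace


# Hypothetical stream of (start_s, end_s, caption) chunks.
stream = [
    (0, 120, "cooking pasta in the kitchen"),
    (120, 240, "keys left on the desk"),
    (240, 360, "walking the dog"),
]
result, trace = answer("Where are the keys?", stream)
```

The point of the sketch is the control flow: the controller never sees the whole stream at once, only the belief state, and it decides each round whether to search memory, inspect a specific interval, or stop and answer.
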
Figure 1 : Physical Temporal Span vs. Scale. The x-axis represents the Physical Temporal Span ( $T_{span}$ ), while bubble size indicates Observational Duration ( $T_{dur}$ ). Unlike existing datasets clustered in the bottom-left (short clips, $T_{span}\approx T_{dur}$ ), MM-Lifelong occupies the un

Experimental Results

Research Questions

  • RQ1 How do current multimodal learners perform under Lifelong Horizon constraints with sparse temporal spans and domain shifts?
  • RQ2 Can a recursive, memory-augmented agent (ReMA) outperform end-to-end MLLMs on lifelong, multimodal streams?
  • RQ3 What are the effective memory update granularities and reasoning depths for sustaining performance across days to weeks?
  • RQ4 Does clue-grounded annotation enable robust evaluation and grounding at multiple temporal resolutions?
  • RQ5 To what extent can different backbone models (controller and MLLM tool) impact lifelong reasoning and grounding?

Key Findings

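| Model | Frames | Val@Month Acc | Val@Month Ref@300 | Test@Week Acc | Test@Week Ref@300 | Test@Day Acc | Test@Day Ref@300 |
|-------|--------|---------------|-------------------|---------------|-------------------|--------------|------------------|
| ReMA  | Full   | 18.62         | 15.46             | 18.82         | 16.37             | 16.75        | 11.51            |
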
  • End-to-end MLLMs exhibit a Working Memory Bottleneck as context grows, with performance saturating or degrading.
  • Agentic baselines relying on global video localization collapse under month-scale sparsity, while ReMA scales with recursion and dynamic memory.
  • ReMA achieves the highest accuracy and the strongest grounding among evaluated methods on the Month, Week, and Day sets, for example 18.62% accuracy on Val@Month and a Ref@300 of 16.37% on Test@Week.
  • Finer perception granularity (e.g., 2-minute Δt) improves accuracy and grounding; full-video granularity degrades performance due to noise and reasoning cost.
  • Using multimodal backbones (MLLMs) for perception and control yields better results than text-only controllers, illustrating the importance of multimodal alignment for lifelong reasoning.
  • GPT-5 demonstrates high reliability as a judge for automatic evaluation (F1 ~99.4%).
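
The perception-granularity finding can be illustrated by how a timeline is cut into windows of length Δt before each window is summarized into memory. The sketch below only shows the windowing step; `make_windows` and the 10-minute example are hypothetical, and the review does not specify how the authors segment the stream.

```python
def make_windows(total_seconds, delta_t):
    """Split [0, total_seconds) into consecutive windows of at most delta_t seconds."""
    if delta_t <= 0:
        raise ValueError("delta_t must be positive")
    windows = []
    start = 0.0
    while start < total_seconds:
        end = min(start + delta_t, total_seconds)
        windows.append((start, end))
        start = end
    return windows


# Hypothetical 10-minute recording at two granularities.
fine = make_windows(600, 120)    # 2-minute windows
coarse = make_windows(600, 600)  # one window covering the full recording
```

With Δt = 120 s the recording yields five windows, each small enough to be described separately. With Δt equal to the full length everything is summarized at once, which is the setting the review reports as degrading performance.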
Figure 2 : Performance Scaling Analysis. As the number of input frames increases, end-to-end MLLMs initially improve but soon exhibit performance oscillation and even sharp degradation due to context saturation and noise accumulation. In contrast, ReMA consistently scales with more recursion rounds,

This review was generated by AI and checked by a human editor.