[Paper Review] Towards Multimodal Lifelong Understanding: A Dataset and Agentic Baseline
Introduces MM-Lifelong, a multimodal lifelong dataset with day/week/month scales to study long-horizon understanding, and proposes ReMA, a recursive multimodal agent that overcomes memory bottlenecks in lifelong streams.
While datasets for video understanding have scaled to hour-long durations, they typically consist of densely concatenated clips that differ from natural, unscripted daily life. To bridge this gap, we introduce MM-Lifelong, a dataset designed for Multimodal Lifelong Understanding. Comprising 181.1 hours of footage, it is structured across Day, Week, and Month scales to capture varying temporal densities. Extensive evaluations reveal two critical failure modes in current paradigms: end-to-end MLLMs suffer from a Working Memory Bottleneck due to context saturation, while representative agentic baselines experience Global Localization Collapse when navigating sparse, month-long timelines. To address this, we propose the Recursive Multimodal Agent (ReMA), which employs dynamic memory management to iteratively update a recursive belief state, significantly outperforming existing methods. Finally, we establish dataset splits designed to isolate temporal and domain biases, providing a rigorous foundation for future research in supervised learning and out-of-distribution generalization.
Research Motivation and Objectives
- Define the Lifelong Horizon for multimodal understanding by distinguishing Observational Duration from Physical Temporal Span.
- Create MM-Lifelong, a multi-scale dataset with diverse domains to mimic long-term, sparse, real-world streams.
- Characterize failure modes of end-to-end MLLMs and agentic baselines on lifelong timelines.
- Propose ReMA, a recursive memory-based agent, to manage dynamic memory and improve long-horizon reasoning.
- Provide a standardized train/val/test protocol to enable robust evaluation and generalization under temporal/domain shifts.
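
The distinction between Observational Duration and Physical Temporal Span in the first objective can be illustrated with a small sketch. It assumes, hypothetically, that a stream is a list of recorded segments with wall-clock start and end times in hours, that T_dur is the total amount of recorded footage, and that T_span is the wall-clock time from the first recorded moment to the last. The `Segment` type and both function names are illustrative and not taken from the paper, whose formal definitions may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    """One contiguous recording; times are wall-clock hours (hypothetical format)."""
    start: float
    end: float


def observational_duration(segments: List[Segment]) -> float:
    """Assumed T_dur: total hours of footage actually recorded."""
    return sum(s.end - s.start for s in segments)


def physical_temporal_span(segments: List[Segment]) -> float:
    """Assumed T_span: wall-clock hours between the first and last recorded moment."""
    if not segments:
        return 0.0
    return max(s.end for s in segments) - min(s.start for s in segments)


# Toy stream: three 2-hour recordings spread over one week (168 hours).
stream = [Segment(0.0, 2.0), Segment(72.0, 74.0), Segment(166.0, 168.0)]
t_dur = observational_duration(stream)
t_span = physical_temporal_span(stream)
print(f"T_dur={t_dur} h, T_span={t_span} h, density={t_dur / t_span:.3f}")
```

In this toy stream, 6 hours of footage cover a 168-hour span, which is the kind of sparsity the Week and Month scales are meant to expose.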
Proposed Method
- Formalize the lifelong multimodal task with two new temporal metrics (T_dur, T_span) and a Lifelong Horizon definition.
- Construct MM-Lifelong with Day/Week/Month domains totaling 181.1 hours to simulate continuous lifespans with varying sparsity.
- Annotate the data with clue-grounded labels that provide ground-truth temporal localization and support two task types (Needle-in-a-Lifestream, Multi-Hop Reasoning).
- Introduce the Recursive Multimodal Agent (ReMA), which builds a language-augmented belief state via a two-phase loop: perception to memory, then recursive reasoning with memory-driven control actions (Answer, MMInspect, MemSearch).
- Benchmark end-to-end MLLMs and agentic baselines, exposing their context bottlenecks and demonstrating ReMA’s superior accuracy and grounding on lifelong tasks.
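
The two-phase loop described for ReMA can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the authors' implementation: only the action names (Answer, MMInspect, MemSearch) come from the review, while `BeliefState`, `build_memory`, `mem_search`, `run_agent`, and the rule-based `toy_*` stand-ins for the perception model and the controller are hypothetical. Chunks are plain strings here instead of video clips.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# An action is a (name, argument) pair; names follow the review's action set.
Action = Tuple[str, str]


@dataclass
class BeliefState:
    """Hypothetical belief state that the loop updates step by step."""
    question: str
    candidates: List[int] = field(default_factory=list)  # chunk ids from MemSearch
    evidence: List[str] = field(default_factory=list)    # details from MMInspect


def build_memory(chunks: List[str], perceive: Callable[[str], str]) -> Dict[int, str]:
    """Phase 1 (perception to memory): store one text summary per chunk."""
    return {i: perceive(chunk) for i, chunk in enumerate(chunks)}


def mem_search(memory: Dict[int, str], query: str) -> List[int]:
    """Toy MemSearch: case-insensitive keyword match over stored summaries."""
    return [i for i, text in memory.items() if query.lower() in text.lower()]


def run_agent(question: str, chunks: List[str], memory: Dict[int, str],
              controller: Callable[[BeliefState], Action],
              inspect: Callable[[str, str], str], max_steps: int = 8) -> str:
    """Phase 2: apply control actions until Answer or the step budget runs out."""
    belief = BeliefState(question)
    for _ in range(max_steps):
        name, arg = controller(belief)
        if name == "Answer":
            return arg
        if name == "MemSearch":
            belief.candidates = mem_search(memory, arg)
        elif name == "MMInspect":
            belief.evidence.append(inspect(chunks[int(arg)], question))
        else:
            raise ValueError(f"unknown action: {name}")
    return "unanswered"


# Rule-based stand-ins for the models (assumptions for this toy example).
def toy_perceive(chunk: str) -> str:
    return chunk.split(";")[0].strip()  # coarse summary: the location only


def toy_inspect(chunk: str, question: str) -> str:
    return chunk.split(";")[1].strip()  # fine detail: what happened there


def toy_controller(belief: BeliefState) -> Action:
    if not belief.candidates:
        return ("MemSearch", "kitchen")
    if not belief.evidence:
        return ("MMInspect", str(belief.candidates[0]))
    return ("Answer", belief.evidence[-1])


chunks = [
    "office; typing on a laptop",
    "kitchen; pouring coffee into a red mug",
    "street; waiting at a crosswalk",
]
memory = build_memory(chunks, toy_perceive)
answer = run_agent("What happened in the kitchen?", chunks, memory,
                   toy_controller, toy_inspect)
print(answer)
```

The point of the design is that the controller only ever sees the compact belief state and the text memory, and pulls raw content back in through MMInspect for the few chunks that MemSearch surfaces.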

Experimental Results
Research Questions
- RQ1: How do current multimodal learners perform under Lifelong Horizon constraints with sparse temporal spans and domain shifts?
- RQ2: Can a recursive, memory-augmented agent (ReMA) outperform end-to-end MLLMs on lifelong, multimodal streams?
- RQ3: What are the effective memory update granularities and reasoning depths for sustaining performance across days to weeks?
- RQ4: Does clue-grounded annotation enable robust evaluation and grounding at multiple temporal resolutions?
- RQ5: To what extent do different backbone models (controller and MLLM tool) affect lifelong reasoning and grounding?
Key Findings
| Model | Frames | Val@Month Acc | Val@Month Ref@300 | Test@Week Acc | Test@Week Ref@300 | Test@Day Acc | Test@Day Ref@300 |
|---|---|---|---|---|---|---|---|
| ReMA | Full | 18.62 | 15.46 | 18.82 | 16.37 | 16.75 | 11.51 |
- End-to-end MLLMs exhibit a Working Memory Bottleneck as context grows, with performance saturating or degrading.
- Agentic baselines relying on global video localization collapse under month-scale sparsity, while ReMA scales with recursion and dynamic memory.
- ReMA achieves the highest accuracy on Val@Month (18.62%) and the strongest grounding among evaluated methods across the Month, Week, and Day sets (Ref@300 of 15.46%, 16.37%, and 11.51%, respectively).
- Finer perception granularity (e.g., 2-minute Δt) improves accuracy and grounding; full-video granularity degrades performance due to noise and reasoning cost.
- Using multimodal backbones (MLLMs) for perception and control yields better results than text-only controllers, illustrating the importance of multimodal alignment for lifelong reasoning.
- GPT-5 demonstrates high reliability as a judge for automatic evaluation (F1 ~99.4%).
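
The perception-granularity finding concerns how the stream is cut into windows before perception. A minimal sketch, assuming fixed-length, non-overlapping windows measured in seconds; the function name `split_into_windows` and the windowing scheme are illustrative, and the paper's actual segmentation may differ.

```python
from typing import List, Tuple


def split_into_windows(total_seconds: float, delta_t: float) -> List[Tuple[float, float]]:
    """Cut [0, total_seconds) into consecutive windows of length delta_t.

    The final window is shortened when the stream length is not a
    multiple of delta_t.
    """
    if delta_t <= 0:
        raise ValueError("delta_t must be positive")
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + delta_t, total_seconds)
        windows.append((start, end))
        start = end
    return windows


# A 5-minute clip at 2-minute granularity versus full-video granularity.
fine = split_into_windows(300.0, 120.0)
coarse = split_into_windows(300.0, 300.0)
print(len(fine), len(coarse))
```

With a smaller Δt each window covers less content, so each perception call has less to summarize, at the cost of more calls and more memory entries.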

This review was generated by AI and checked by a human editor.