[Paper Review] Learning to Remember: End-to-End Training of Memory Agents for Long-Context Reasoning
The paper introduces Unified Memory Agent (UMA), an end-to-end reinforcement learning framework that jointly learns memory management and reasoning for long-context tasks, plus Ledger-QA as a dynamic state-tracking benchmark.
Long-context LLMs and Retrieval-Augmented Generation (RAG) systems process information passively, deferring state tracking, contradiction resolution, and evidence aggregation to query time, which becomes brittle under ultra-long streams with frequent updates. We propose the Unified Memory Agent (UMA), an end-to-end reinforcement learning framework that unifies memory operations and question answering within a single policy. UMA maintains a dual memory representation: a compact core summary for global context and a structured Memory Bank that supports explicit CRUD (create, update, delete, reorganize) over key-value entries, enabling proactive consolidation during streaming. To evaluate long-horizon memory behavior, we introduce Ledger-QA, a diagnostic benchmark for continuous state tracking where answers are latent values derived from accumulated updates rather than local span retrieval. Across 13 datasets spanning Ledger-QA, Test-Time Learning, and Accurate Retrieval, UMA substantially outperforms long-context and RAG baselines on dynamic reasoning and learning tasks while remaining competitive on standard retrieval benchmarks, underscoring the importance of learned, end-to-end memory management.
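To make "latent values derived from accumulated updates" concrete, here is a minimal sketch of a ledger-style question. The event format, the account names, and the `replay_ledger` helper are assumptions made for illustration and are not taken from Ledger-QA itself. The point is that the answer never appears verbatim in any single event, so retrieving one local span cannot produce it.

```python
from collections import defaultdict

def replay_ledger(events):
    """Accumulate signed updates per account (hypothetical event format).

    The returned balances are the latent state: no single event states them.
    """
    balances = defaultdict(int)
    for account, delta in events:
        balances[account] += delta
    return dict(balances)

# A toy stream of updates; a real stream would be spread over a very long context.
events = [("alice", 100), ("bob", 40), ("alice", -30), ("alice", 5)]
state = replay_ledger(events)

# "What is alice's balance?" requires all three of her updates: 100 - 30 + 5.
print(state["alice"])  # 75
```

A system that only looks up the most similar passage would see "100", "-30", or "5", none of which is the answer.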
Research Motivation and Objectives
- Motivate the need for active, learned memory management in ultra-long contexts beyond passive retrieval.
- Propose UMA that unifies memory operations (CRUD) with question answering in a single policy.
- Introduce Ledger-QA as a diagnostic benchmark for continuous state tracking over long horizons.
- Demonstrate that end-to-end memory optimization yields superior dynamic reasoning and competitive retrieval performance.
Proposed Method
- Formulate long-context reasoning as an MDP with a dual-memory state: a core summary and a structured Memory Bank (CRUD over key-value entries).
- Use a two-phase architecture: Phase I for sequential memory maintenance over chunks, Phase II for hybrid QA with retrieval from raw text and structured memory.
- Train with Task-Stratified Group Relative Policy Optimization (GRPO), leveraging nested trajectory sampling to estimate memory and QA advantages.
- Employ a two-stage reward design combining tool usage success and final answer correctness, with stratified normalization to credit memory and QA steps appropriately.
- Evaluate on ledger-style dynamic state tracking (Ledger-QA) and on standard Test-Time Learning (TTL) and Accurate Retrieval (AR) benchmarks, 13 datasets in total.
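The dual-memory state described above can be sketched as a small data structure. This is an illustrative sketch only: the class name `DualMemory`, the method signatures, and in particular the semantics of `reorganize` (merging several entries into one) are assumptions, since the review does not specify them. In UMA the operations would be emitted by the learned policy while it reads chunks in Phase I; here they are written by hand.

```python
from dataclasses import dataclass, field

@dataclass
class DualMemory:
    """Hypothetical dual-memory state: a core summary plus a key-value bank."""
    core_summary: str = ""
    bank: dict = field(default_factory=dict)

    def create(self, key, value):
        if key in self.bank:
            raise KeyError(f"entry already exists: {key}")
        self.bank[key] = value

    def update(self, key, value):
        if key not in self.bank:
            raise KeyError(f"no such entry: {key}")
        self.bank[key] = value

    def delete(self, key):
        del self.bank[key]  # raises KeyError if the entry is missing

    def reorganize(self, old_keys, new_key, merged_value):
        # Assumed semantics: replace several entries with one consolidated entry.
        for key in old_keys:
            self.delete(key)
        self.bank[new_key] = merged_value

# Phase I stand-in: apply hand-written operations as chunks stream in.
mem = DualMemory(core_summary="Tracking facts about alice.")
mem.create("alice.balance", "100")
mem.update("alice.balance", "70")   # a later chunk supersedes the old value
mem.create("alice.city", "Paris")
mem.reorganize(["alice.balance", "alice.city"], "alice", "balance=70; city=Paris")
print(mem.bank)  # {'alice': 'balance=70; city=Paris'}
```

Keeping the bank as explicit key-value entries means a contradiction is resolved when the update arrives rather than at query time, which is the "proactive consolidation" the paper emphasizes.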
Experimental Results
Research Questions
- RQ1: Can end-to-end optimization of memory operations improve long-horizon reasoning over ultra-long contexts?
- RQ2: Does a unified memory + QA policy outperform retrieval-centric baselines on dynamic state tracking tasks?
- RQ3: What is the contribution of the memory maintenance phase and RL training to overall performance?
- RQ4: How well does Ledger-QA probe true state-tracking capabilities versus local span retrieval?
Key Findings
- UMA substantially outperforms long-context and RAG baselines on dynamic reasoning and learning tasks, in an evaluation spanning 13 datasets.
- UMA remains competitive on standard retrieval benchmarks, indicating that learned memory management does not come at the cost of retrieval accuracy.
- Ablations show memory maintenance and RL training are both crucial to peak performance.
- Task-Stratified GRPO provides effective credit assignment for heterogeneous memory and QA objectives.
- Ledger-QA exposes the brittleness of the baselines as the horizon grows, while UMA's accuracy remains robust.
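The credit-assignment idea behind stratified normalization can be illustrated with a short sketch. Standard GRPO scores each sampled trajectory relative to its group by subtracting the group mean reward and dividing by the group standard deviation. The sketch below applies that normalization separately within each stratum, so that rewards on different scales do not swamp one another. What exactly forms a stratum in the paper, and the stratum labels "memory" and "qa" used here, are assumptions; the function names are hypothetical as well.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def stratified_advantages(samples):
    """Normalize rewards within each stratum (assumed grouping).

    samples: list of (stratum_label, reward) pairs.
    Returns advantages in the same order as the input.
    """
    by_stratum = {}
    for idx, (stratum, reward) in enumerate(samples):
        by_stratum.setdefault(stratum, []).append((idx, reward))

    advantages = [0.0] * len(samples)
    for items in by_stratum.values():
        advs = group_relative_advantages([r for _, r in items])
        for (idx, _), adv in zip(items, advs):
            advantages[idx] = adv
    return advantages

# Memory rewards are small and close together; QA rewards are 0 or 1.
samples = [("memory", 0.2), ("memory", 0.4), ("qa", 0.0), ("qa", 1.0)]
advs = stratified_advantages(samples)
```

With a single pooled normalization, the small gap between the two memory rewards would be dwarfed by the QA rewards; normalizing per stratum gives the better sample in each stratum a comparable positive advantage.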
This review was generated by AI and checked by a human editor.