QUICK REVIEW

[Paper Digest] LifeBench: A Benchmark for Long-Horizon Multi-Source Memory

Zihao Cheng, Weixin Wang|arXiv (Cornell University)|Mar 4, 2026

Multimodal Machine Learning Applications | Cited by 0

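One-Sentence Summary

LifeBench introduces a long-horizon, multi-source memory benchmark that combines declarative and non-declarative memory reasoning over diverse digital traces; the best existing memory systems reach only 55.2% accuracy on it.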

ABSTRACT

Long-term memory is fundamental for personalized agents capable of accumulating knowledge, reasoning over user experiences, and adapting across time. However, existing memory benchmarks primarily target declarative memory, specifically semantic and episodic types, where all information is explicitly presented in dialogues. In contrast, real-world actions are also governed by non-declarative memory, including habitual and procedural types, and need to be inferred from diverse digital traces. To bridge this gap, we introduce LifeBench, which features densely connected, long-horizon event simulation. It pushes AI agents beyond simple recall, requiring the integration of declarative and non-declarative memory reasoning across diverse and temporally extended contexts. Building such a benchmark presents two key challenges: ensuring data quality and scalability. We maintain data quality by employing real-world priors, including anonymized social surveys, map APIs, and holiday-integrated calendars, thus enforcing fidelity, diversity and behavioral rationality within the dataset. Towards scalability, we draw inspiration from cognitive science and structure events according to their partonomic hierarchy, enabling efficient parallel generation while maintaining global coherence. Performance results show that top-tier, state-of-the-art memory systems reach just 55.2% accuracy, highlighting the inherent difficulty of long-horizon retrieval and multi-source integration within our proposed benchmark. The dataset and data synthesis code are available at https://github.com/1754955896/LifeBench.

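Motivation and Objectives

Emulate human-level long-term memory by integrating declarative and non-declarative memory reasoning.
Build a dense, year-long dataset grounded in real-world priors from multi-source data such as chat logs, apps, and health records.
Evaluate current memory systems on complex multi-source memory tasks and identify their failure modes.
Provide a reproducibility package containing the data, the synthesis framework, and documentation under an open license.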

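Proposed Method

Proposes a cognitively inspired, five-module LLM synthesis pipeline: persona synthesis, hierarchical planning, daily activity simulation, phone data generation, and QA generation.
Decomposes events along a partonomic (part-whole) hierarchy and keeps the year-long trajectory consistent.
Implements a dual-agent daily activity simulation: an LLM handles subjective reasoning, while objective grounding relies on maps and constraints.
Generates rich phone data artifacts (contacts, SMS, calls, calendar, chats, health data) that serve as multi-source observations.
Evaluates the QA accuracy of memory systems in each memory category using a standardized pipeline and an LLM judge.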
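
The partonomic decomposition described above can be pictured as a part-whole tree whose leaves are expanded independently. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `Event` class, `split_into_days`, `simulate_day`, and the example trip are all hypothetical, and `simulate_day` is a placeholder for what would be an LLM call in a real pipeline.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from datetime import date, timedelta


@dataclass
class Event:
    """A node in a hypothetical part-whole (partonomic) event tree."""
    name: str
    start: date
    end: date  # inclusive
    parts: list["Event"] = field(default_factory=list)

    def leaves(self) -> list["Event"]:
        """Return the lowest-level events in chronological order."""
        if not self.parts:
            return [self]
        return [leaf for part in self.parts for leaf in part.leaves()]


def split_into_days(event: Event) -> Event:
    """Decompose an event into one sub-event per calendar day (illustrative rule)."""
    n_days = (event.end - event.start).days + 1
    event.parts = []
    for i in range(n_days):
        day = event.start + timedelta(days=i)
        event.parts.append(Event(f"{event.name} / day {i + 1}", day, day))
    return event


def simulate_day(leaf: Event) -> str:
    """Placeholder for an LLM call that would fill in one day's activities."""
    return f"{leaf.start.isoformat()}: {leaf.name}"


# Hypothetical high-level event spanning three days.
trip = split_into_days(Event("Trip to Chengdu", date(2025, 5, 1), date(2025, 5, 3)))

# Each leaf carries its parent's context in its name, so the leaves can be
# expanded independently and in parallel.
with ThreadPoolExecutor(max_workers=3) as pool:
    daily_logs = list(pool.map(simulate_day, trip.leaves()))
```

Because every leaf inherits context from its parent, the leaves can be sent to a worker pool while the tree preserves their global order; `pool.map` returns results in input order, so `daily_logs` stays chronological.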

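Experimental Results

Research Questions

RQ1: How can long-horizon, densely connected user trajectories driven by multiple memory systems be synthesized?
RQ2: Can multi-source, fragmented data (chats, calendars, health records, app data) support robust memory reasoning?
RQ3: What are the limits of current memory systems on long-horizon, multi-source tasks?
RQ4: How do non-declarative memory elements (habits, skills, emotions) affect retrieval and reasoning?
RQ5: Which data quality and scalability strategies make the benchmark realistic and privacy-friendly?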

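Key Findings

On LifeBench, the top memory system achieves only 55.2% overall accuracy, underscoring the benchmark's difficulty.
MemOS outperforms Hindsight and MemU in several categories but performs poorly on non-declarative memory reasoning and unanswerable queries.
Evaluations on existing benchmarks overestimate performance: LifeBench introduces dense, multi-source, long-horizon data that demands complex reasoning.
LifeBench's data scores high on reasoning demand and diversity and shows robust relationship-consistency and location-realism metrics, though diversity on fine-grained metrics leaves room for improvement.
The authors provide an Apache-2.0 reproducibility package that includes the dataset, the synthesis framework, and documentation.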
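
The per-category comparisons above presuppose that judged answers are broken down by memory category. A minimal sketch of that bookkeeping follows. It is an illustration only: the `judged` records are made-up toy data, the category labels are placeholders, `accuracy_by_category` is a hypothetical helper, and computing the overall score as a micro-average over all questions is an assumption rather than something this digest confirms.

```python
from collections import defaultdict

# Made-up toy data: (memory category, verdict from an LLM judge).
judged = [
    ("semantic", True),
    ("semantic", True),
    ("episodic", True),
    ("episodic", False),
    ("habitual", False),
    ("procedural", True),
    ("unanswerable", False),
    ("unanswerable", False),
]


def accuracy_by_category(results):
    """Return ({category: accuracy}, overall accuracy) for (category, is_correct) pairs."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for category, is_correct in results:
        total[category] += 1
        correct[category] += int(is_correct)
    per_category = {c: correct[c] / total[c] for c in total}
    # Assumption: overall accuracy is a micro-average over all questions.
    overall = sum(correct.values()) / sum(total.values())
    return per_category, overall


per_category, overall = accuracy_by_category(judged)
for category, acc in per_category.items():
    print(f"{category:>12}: {acc:.1%}")
print(f"{'overall':>12}: {overall:.1%}")
```

A breakdown like this shows how a system with a reasonable overall score can still fail on specific categories, such as non-declarative reasoning or unanswerable queries.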

This digest was generated by AI and reviewed by human editors.