[Paper Review] ES-MemEval: Benchmarking Conversational Agents on Personalized Long-Term Emotional Support
This paper introduces ES-MemEval, a benchmark for evaluating long-term memory in personalized emotional-support dialogues, plus EvoEmo, a multi-session dataset; it analyzes open-source, commercial, and retrieval-augmented LLMs across QA, summarization, and dialogue generation tasks.
Large Language Models (LLMs) have shown strong potential as conversational agents. Yet their effectiveness remains limited by the lack of robust long-term memory, particularly in complex, long-term web-based services such as online emotional support. Existing long-term dialogue benchmarks primarily focus on static and explicit fact retrieval, failing to evaluate agents in critical scenarios where user information is dispersed, implicit, and continuously evolving. To address this gap, we introduce ES-MemEval, a comprehensive benchmark that systematically evaluates five core memory capabilities (information extraction, temporal reasoning, conflict detection, abstention, and user modeling) in long-term emotional support settings, covering question answering, summarization, and dialogue generation tasks. To support the benchmark, we also propose EvoEmo, a multi-session dataset for personalized long-term emotional support that captures fragmented, implicit user disclosures and evolving user states. Extensive experiments on open-source long-context, commercial, and retrieval-augmented (RAG) LLMs show that explicit long-term memory is essential for reducing hallucinations and enabling effective personalization. At the same time, RAG improves factual consistency but struggles with temporal dynamics and evolving user states. These findings highlight both the potential and limitations of current paradigms and motivate more robust integration of memory and retrieval for long-term personalized dialogue systems.
Motivation & Objective
- Motivate the need for robust long-term memory in emotional-support agents beyond static retrieval.
- Define five core memory capabilities (information extraction, temporal reasoning, conflict detection, abstention, user modeling) in long-term ES scenarios.
- Propose ES-MemEval as a three-task benchmark (QA, summarization, dialogue generation) to evaluate memory capabilities.
- Provide EvoEmo as a multi-session dataset capturing evolving user states for personalized long-term ES.
- Offer empirical insights into strengths/limitations of open-source, commercial, and RAG LLMs for long-term personalization.
Proposed method
- Propose three benchmarking tasks (QA, summarization, dialogue generation) to probe five memory abilities.
- Construct EvoEmo by building 18 virtual users with event timelines, generating their multi-session dialogues with GPT-4o, and validating the results with human reviewers.
- Evaluate models across open-source long-context, commercial, and retrieval-augmented configurations using standardized metrics (F1, BERTScore, LLM-as-Judge, ROUGE, event-based metrics, and observation-based ratings).
- Use session-level retrieval with a dense retriever (bge-m3) over a FAISS index to supply memory for RAG configurations.
- Analyze retrieval granularity (turn, round, session) and context length effects on QA, summarization, and dialogue generation.
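The retrieval setup in the last two bullets can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the review names bge-m3 and a FAISS index, while this sketch swaps in a toy bag-of-words embedder and brute-force cosine search so that it runs with the standard library alone. The sample history, the function names (`build_units`, `embed`, `cosine`, `retrieve`), and the definition of a "round" as one user turn plus the agent reply are all hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical toy history: a list of sessions, each a list of (speaker, text) turns.
HISTORY = [
    [("user", "I started a new job last week and feel anxious."),
     ("agent", "Starting a new job can be stressful."),
     ("user", "My manager seems strict."),
     ("agent", "What makes the manager seem strict?")],
    [("user", "My sister visited and we argued about money."),
     ("agent", "That sounds painful."),
     ("user", "We have not spoken since the argument."),
     ("agent", "Would you like to reach out to her?")],
]


def build_units(history, granularity):
    """Split the history into retrieval units: 'turn', 'round', or 'session'."""
    units = []
    for s_idx, session in enumerate(history):
        texts = [f"{speaker}: {text}" for speaker, text in session]
        if granularity == "turn":
            size = 1
        elif granularity == "round":
            size = 2  # assumption: a round is a user turn plus the agent reply
        elif granularity == "session":
            size = max(len(texts), 1)
        else:
            raise ValueError(f"unknown granularity: {granularity}")
        for start in range(0, len(texts), size):
            units.append((s_idx, " ".join(texts[start:start + size])))
    return units


def embed(text):
    """Toy stand-in for a dense encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
        math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query, units, k=1):
    """Brute-force top-k search; a vector index would replace this at scale."""
    q = embed(query)
    ranked = sorted(units, key=lambda u: cosine(q, embed(u[1])), reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    query = "argument with sister about money"
    for granularity in ("turn", "round", "session"):
        units = build_units(HISTORY, granularity)
        session_idx, text = retrieve(query, units, k=1)[0]
        print(granularity, len(units), session_idx, text[:60])
```

In this toy example, turn-level retrieval returns only the turn about the argument and omits the later disclosure that the user and their sister have not spoken since, whereas session-level retrieval returns both. That mirrors, in miniature, the kind of fragmented disclosure the benchmark targets, though it says nothing about the size of the effect reported in the paper.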

Experimental results
Research questions
- RQ1: How well do LLMs maintain and utilize long-term memory across evolving, implicit user disclosures in ES scenarios?
- RQ2: What are the relative strengths and weaknesses of open-source, commercial, and retrieval-augmented models for long-term ES tasks?
- RQ3: To what extent do memory capabilities (IE, TR, CD, Abs, UM) predict performance in QA, summarization, and dialogue generation?
- RQ4: Does retrieval augmentation improve factual consistency and personalization without destabilizing temporal dynamics?
- RQ5: What retrieval granularity and context length best support long-term emotional support dialogues?
Key findings
- Explicit long-term memory is essential to reduce hallucinations and enable personalization.
- Retrieval-augmented (RAG) setups improve factual consistency but struggle with temporal dynamics and evolving user states.
- Personalization strongly correlates with long-term memory, while emotional support benefits from general strategies.
- Session-level retrieval best captures evolving user information and improves memory-aligned responses over other granularities.
- RAG narrows the gap between open-source and commercial systems by enhancing memory alignment of generated responses.
- Smaller long-context models degrade with extra-long inputs, highlighting the need for memory retrieval integration.

This review was created by AI and reviewed by human editors.