QUICK REVIEW

[Paper Review] Rethinking Evaluation in Retrieval-Augmented Personalized Dialogue: A Cognitive and Linguistic Perspective

Tianyi Zhang, David Traum|arXiv (Cornell University)|Mar 15, 2026

Topic Modeling | Citations: 0

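One-Line Summary

This paper critiques the evaluation and data practices of LAPDOG in retrieval-augmented personalized dialogue, shows that surface-level similarity metrics diverge from human and LLM judgments grounded in coherence, consistency, and shared understanding, and proposes cognitively informed directions for evaluation.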

ABSTRACT

In cognitive science and linguistic theory, dialogue is not seen as a chain of independent utterances but rather as a joint activity sustained by coherence, consistency, and shared understanding. However, many systems for open-domain and personalized dialogue use surface-level similarity metrics (e.g., BLEU, ROUGE, F1) as one of their main reporting measures, which fail to capture these deeper aspects of conversational quality. We re-examine a notable retrieval-augmented framework for personalized dialogue, LAPDOG, as a case study for evaluation methodology. Using both human and LLM-based judges, we identify limitations in current evaluation practices, including corrupted dialogue histories, contradictions between retrieved stories and persona, and incoherent response generation. Our results show that human and LLM judgments align closely but diverge from lexical similarity metrics, underscoring the need for cognitively grounded evaluation methods. Broadly, this work charts a path toward more reliable assessment frameworks for retrieval-augmented dialogue systems that better reflect the principles of natural human communication.

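Motivation and Objectives

Re-examine how current evaluation practices capture conversational quality in retrieval-augmented personalized dialogue.
Identify limitations in dialogue history, retrieval, and metrics, using LAPDOG as a case study.
Propose a framework that combines human and LLM judgments grounded in discourse and cognitive theory.
Point toward metrics that reflect coherence, engagement, and shared understanding.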

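Proposed Method

Reproduce the LAPDOG experiments on the corrupted dialogues and check the improvement over the baseline.
Retrain LAPDOG and the baseline on uncorrupted CONVAI2 data, then collect human and LLM evaluations.
Use two human annotators and two LLM judges (a ChatGPT-derived model and DeepSeek) to rate responses on a 1–5 scale and rank the candidates.
Compare surface-level similarity metrics (BLEU, ROUGE, F1) against human and LLM judgments.
Analyze correlations among human, LLM, and lexical-metric scores using Pearson correlation and the Williams test.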
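The correlation analysis in the last step can be sketched as follows. This is a minimal illustration, not the authors' code: it computes Pearson correlations by hand and applies the standard Williams formula for comparing two dependent correlations that share one variable (here, the human ratings). The score lists and the names `pearson` and `williams_t` are hypothetical and chosen only for this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def williams_t(r12, r13, r23, n):
    """Williams t statistic (n - 3 degrees of freedom) for H0: r12 == r13.

    Variable 1 is shared by both correlations (here: human ratings);
    r23 is the correlation between the two competing predictors.
    """
    k = 1 - r12**2 - r13**2 - r23**2 + 2 * r12 * r13 * r23
    numerator = (r12 - r13) * math.sqrt((n - 1) * (1 + r23))
    denominator = math.sqrt(
        2 * k * (n - 1) / (n - 3) + ((r12 + r13) ** 2 / 4) * (1 - r23) ** 3
    )
    return numerator / denominator


# Hypothetical scores for eight responses (illustrative only, not from the paper).
human = [5, 4, 2, 1, 3, 4, 2, 5]  # human ratings on a 1-5 scale
llm = [5, 4, 1, 2, 3, 5, 2, 4]    # LLM-judge ratings on a 1-5 scale
f1 = [0.10, 0.30, 0.35, 0.25, 0.20, 0.15, 0.40, 0.12]  # lexical F1 scores

r_human_llm = pearson(human, llm)
r_human_f1 = pearson(human, f1)
r_llm_f1 = pearson(llm, f1)
t_stat = williams_t(r_human_llm, r_human_f1, r_llm_f1, len(human))

print(f"r(human, LLM) = {r_human_llm:.3f}")
print(f"r(human, F1)  = {r_human_f1:.3f}")
print(f"Williams t    = {t_stat:.3f} (df = {len(human) - 3})")
```

A positive statistic favors the LLM judge over F1 as a predictor of the human ratings. A p-value would come from the t distribution with n - 3 degrees of freedom, for example via `scipy.stats.t.sf` if SciPy is available.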

Figure 1: Overview of the LAPDOG retrieval-augmented personalized dialogue framework. The model retrieves external stories (e.g., from ROCStory) based on persona and dialogue history using a dual-encoder retriever, integrates them to a generator, and evaluates responses with metrics such as BLEU and

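Experimental Results

Research Questions

RQ1: Do surface-level lexical metrics agree with human and LLM judgments of coherence and persona consistency?
RQ2: From a cognitive and linguistic perspective, what are the limitations of LAPDOG's retrieval-history-coherence pipeline?
RQ3: Can human and LLM evaluators provide reliable, cognitively grounded assessments of dialogue quality?
RQ4: Which evaluation frameworks and retrieval filtering strategies would improve coherence and shared understanding?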

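Key Findings

Human and LLM judgments agree with each other but diverge from surface-level metrics such as BLEU, ROUGE, and F1.
The gains LAPDOG shows on corrupted histories under surface-level metrics do not consistently translate into better perceived quality on uncorrupted data.
Retrieved-story content can contradict persona information, reducing reliability and consistency.
Corrupted dialogue histories disrupt discourse structure and damage coherence between utterances.
Lexical-overlap metrics correlate weakly or negatively with human and LLM judgments, underscoring the need for cognitively grounded metrics.
LLM judges appear able to approximate human judgments in this setting, which suggests that LLM-based evaluation is practical here.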
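The gap between lexical overlap and perceived quality can be illustrated with a toy example. The sketch below assumes a simple whitespace-tokenized unigram F1; the review does not specify the paper's exact F1 implementation, and the sentences are invented for illustration rather than taken from CONVAI2 or the paper.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Unigram-overlap F1 between two whitespace-tokenized strings."""
    cand_tokens = candidate.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(cand_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference and candidate responses (not from the paper's data).
reference = "i love hiking with my dog on weekends"
contradictory = "i hate hiking with my dog on weekends"     # contradicts the persona
paraphrase = "every saturday my puppy and i go trekking"    # consistent, reworded

score_contradictory = unigram_f1(contradictory, reference)
score_paraphrase = unigram_f1(paraphrase, reference)

print(f"contradictory: {score_contradictory:.3f}")
print(f"paraphrase:    {score_paraphrase:.3f}")
```

Under this assumed metric the contradictory response shares seven of eight tokens with the reference and scores far higher than the consistent paraphrase, which shares only two. A human or LLM judge attending to persona consistency would likely rank them the other way.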

This review was generated by AI and checked by a human editor.