QUICK REVIEW

[Paper Review] Recursively Summarizing Enables Long-Term Dialogue Memory in Large Language Models

Qingyue Wang, Fu, Yanhe|arXiv (Cornell University)|Aug 29, 2023

Topic Modeling | Citations: 10

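One-Line Summary

This paper proposes using LLMs to recursively generate memory summaries in order to strengthen long-term dialogue memory. Evaluated on MSC with ChatGPT and text-davinci-003, the method shows improved consistency in the later sessions.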

ABSTRACT

Recently, large language models (LLMs), such as GPT-4, stand out remarkable conversational abilities, enabling them to engage in dynamic and contextually relevant dialogues across a wide range of topics. However, given a long conversation, these chatbots fail to recall past information and tend to generate inconsistent responses. To address this, we propose to recursively generate summaries/ memory using large language models (LLMs) to enhance long-term memory ability. Specifically, our method first stimulates LLMs to memorize small dialogue contexts and then recursively produce new memory using previous memory and following contexts. Finally, the chatbot can easily generate a highly consistent response with the help of the latest memory. We evaluate our method on both open and closed LLMs, and the experiments on the widely-used public dataset show that our method can generate more consistent responses in a long-context conversation. Also, we show that our strategy could nicely complement both long-context (e.g., 8K and 16K) and retrieval-enhanced LLMs, bringing further long-term dialogue performance. Notably, our method is a potential solution to enable the LLM to model the extremely long context. The code and scripts are released.

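Motivation and Objectives

Address the forgetting problem in long-term open-domain dialogue without labeled data or additional tools.
Propose a memory management scheme that recursively updates a summary (the memory) built from short contexts.
Build a response generator that uses the latest memory to produce consistent responses over long contexts.
Demonstrate effectiveness and robustness across different LLMs, and analyze the potential benefit of few-shot prompting.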

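Proposed Method

The LLM serves as both the memory manager and the response generator.
Memory update: M_s = LLM(C_{t-1}, M_{s-1}, P_m), where C_{t-1} is a short dialogue context and P_m is the memory management prompt.
Response generation: r_t = LLM(C_t, M_s, P_r), where P_r is the response prompt.
The memory is updated recursively by combining the previous memory with the new utterances, so that it stays coherent over the long term.
Evaluation uses frozen LLMs (ChatGPT, text-davinci-003) on the MSC dataset.
The method is compared against baselines including All Context, Part Context, and Gold Memory.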
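
The two formulas above can be read as a fold over past dialogue contexts followed by a single generation call. The following is a minimal sketch of that reading, not the authors' released code: the prompt strings, the function names (`update_memory`, `build_memory`, `generate_response`), and the `llm` callable (any function mapping a prompt string to a completion string) are assumptions made for illustration.

```python
from typing import Callable, List, Sequence

# Any function that maps a prompt string to a completion string.
# In practice this would wrap a chat or completion API (assumption).
LLM = Callable[[str], str]

# Hypothetical stand-ins for the paper's prompts P_m and P_r.
MEMORY_PROMPT = "Update the memory using the previous memory and the new dialogue."
RESPONSE_PROMPT = "Reply to the last utterance, staying consistent with the memory."


def update_memory(llm: LLM, prev_memory: str, context: Sequence[str]) -> str:
    """One recursive step: M_s = LLM(C_{t-1}, M_{s-1}, P_m)."""
    prompt = (
        f"{MEMORY_PROMPT}\n\n"
        f"Previous memory:\n{prev_memory or '(empty)'}\n\n"
        "Dialogue context:\n" + "\n".join(context)
    )
    return llm(prompt)


def build_memory(llm: LLM, past_contexts: List[Sequence[str]]) -> str:
    """Fold the update step over all past contexts, starting from an empty memory."""
    memory = ""
    for context in past_contexts:
        memory = update_memory(llm, memory, context)
    return memory


def generate_response(llm: LLM, memory: str, context: Sequence[str]) -> str:
    """Response step: r_t = LLM(C_t, M_s, P_r), using only the latest memory."""
    prompt = (
        f"{RESPONSE_PROMPT}\n\n"
        f"Memory:\n{memory}\n\n"
        "Current dialogue:\n" + "\n".join(context)
    )
    return llm(prompt)
```

In this sketch only the latest memory and the current short context reach the model at response time, so the prompt length stays bounded no matter how many earlier sessions were folded into the memory.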

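Experimental Results

Research Questions

RQ1: Can LLMs acquire long-term dialogue memory by recursively summarizing past dialogue, without labeled data or additional tools?
RQ2: Does the predicted (recursively generated) memory yield more consistent and coherent responses in long-term dialogue than using the raw or partial context?
RQ3: Is the method robust across different LLMs, and can it benefit from few-shot in-context learning?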

Key Findings

Method	Session2_DIST-1/2	Session2_F1	Session2_BLEU-1/2	Session3_DIST-1/2	Session3_F1	Session3_BLEU-1/2	Session4_DIST-1/2	Session4_F1	Session4_BLEU-1/2	Session5_DIST-1/2	Session5_F1	Session5_BLEU-1/2
All Context	3.85/24.43	15.28	15.99/2.38	4.05/25.19	15.53	17.25/2.38	3.68/23.20	15.44	15.75/2.22	3.67/23.43	15.84	16.20/2.33
Part Context	3.78/24.90	14.13	13.93/1.87	3.79/24.86	14.38	14.38/1.85	3.67/24.30	14.86	15.02/1.96	3.74/24.34	14.86	15.36/2.00
Gold Memory	3.78/24.90	14.13	13.93/1.87	4.02/25.72	15.34	16.26/2.18	4.08/25.75	15.95	17.13/2.43	4.08/25.93	16.36	17.57/2.41
Predicted Memory	4.17/26.01	15.71	17.55/2.51	4.34/26.44	15.41	17.55/2.24	4.42/26.68	15.84	18.41/2.45	4.47/27.04	16.25	18.66/2.47
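
For readers unfamiliar with the DIST-n and F1 columns, the sketch below shows how such metrics are commonly computed. It is an illustration under stated assumptions, not the paper's evaluation script: whitespace tokenization, lowercasing, and corpus-level Distinct-n are choices made here, the function names are hypothetical, and the table appears to report values scaled by 100.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """All contiguous n-grams of a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def distinct_n(responses: Sequence[str], n: int) -> float:
    """Unique n-grams divided by total n-grams over all responses."""
    all_ngrams: List[Tuple[str, ...]] = []
    for response in responses:
        all_ngrams.extend(ngrams(response.lower().split(), n))
    if not all_ngrams:
        return 0.0
    return len(set(all_ngrams)) / len(all_ngrams)


def unigram_f1(prediction: str, reference: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Higher Distinct-n indicates more varied wording across generated responses, while F1 measures word overlap with the reference response.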

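Predicted Memory often achieves the best scores, especially in the later MSC sessions (Session 4 and Session 5).
The generated memory improves F1 and BLEU over the context baselines in most sessions (for example, in Session 5, F1 of 16.25 versus 15.84 and BLEU-1/2 of 18.66/2.47 versus 16.20/2.33 for All Context), and it exceeds Gold Memory on some metrics.
Using predicted memory integrates long-term information into responses better, and more consistently, than using the full or partial context.
The method is robust across different LLMs (e.g., ChatGPT and text-davinci-003).
Few-shot prompting (one labeled example) further improves memory quality and response performance.
The approach can produce hallucinations in the memory (for example, incorrect causal relations), which calls for future countermeasures.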

This review was generated by AI and checked by a human editor.