QUICK REVIEW

[論文レビュー] Do LLMs Benefit From Their Own Words?

Jenny Y. Huang, Leshem Choshen|arXiv (Cornell University)|Feb 27, 2026

AI in Service Interactions被引用数 0

ひとこと要約

要約: 本研究は、多ターン prompting で事前のアシスタント応答を省略しても、文脈長を大幅に削減しつつ応答品質を維持または改善するケースがあることを示し、文脈汚染のケースとアダプティブな戦略を提案して選択的にアシスタント履歴を省く。

ABSTRACT

Multi-turn interactions with large language models typically retain the assistant's own past responses in the conversation history. In this work, we revisit this design choice by asking whether large language models benefit from conditioning on their own prior responses. Using in-the-wild, multi-turn conversations, we compare standard (full-context) prompting with a user-turn-only prompting approach that omits all previous assistant responses, across three open reasoning models and one state-of-the-art model. To our surprise, we find that removing prior assistant responses does not affect response quality on a large fraction of turns. Omitting assistant-side history can reduce cumulative context lengths by up to 10x. To explain this result, we find that multi-turn conversations consist of a substantial proportion (36.4%) of self-contained prompts, and that many follow-up prompts provide sufficient instruction to be answered using only the current user turn and prior user turns. When analyzing cases where user-turn-only prompting substantially outperforms full context, we identify instances of context pollution, in which models over-condition on their previous responses, introducing errors, hallucinations, or stylistic artifacts that propagate across turns. Motivated by these findings, we design a context-filtering approach that selectively omits assistant-side context. Our findings suggest that selectively omitting assistant history can improve response quality while reducing memory consumption.

研究の動機と目的

実世界の多ターン会話において、過去のアシスタント出力を保持することが下流の応答品質を改善するかを調査する。
prior アシスタント応答が実際に次のターンで有用である頻度を定量化する。
過去の応答が性能を損なう現象（文脈汚染）を特定し、その髙頻度を特徴づける。
品質と効率を最適化するために、アシスタント履歴の含有・省略を適応的に決定する方法を開発する。

提案手法

WildChat および ShareLM の実世界の多ターンチャットを用いて、Full Context（過去の全ターンを含む）と Assistant-Omitted（AO） prompting を比較する。
4つのLLM（Qwen3-4B、DeepSeek-R1-Distill-Llama-8B、GPT-OSS-20B、GPT-5.2）を評価する。
AO では過去のアシスタントターンをプレースホルダーに置換して構造を保持するプロンプト設定を用いる。
回答品質とタスク遵守を、ユーザーのターンのみを見るビューと、全履歴を見るビューの2観点で、LMM-judge（GPT-5）を用いて評価する。
新規質問（New Ask）、フィードバック付きフォローアップ、フィードバックなしフォローアップのカテゴリに prompts を分類し、過去のアシスタント応答への依存を分析する。
過去の応答がパフォーマンスを低下させる文脈汚染のケースを測定する。
FC（Full Context）を優先するか AO を優先するかをターンごとに予測するロジスティック回帰分類器を用いた適応的文脈戦略を提案する。

実験結果

リサーチクエスチョン

RQ1実世界の多ターンチャットは、複数のモデルを横断して過去のアシスタント応答による条件付けから利益を得るのか。
RQ2ターンのどの程度が自己完結的で、現在と過去のユーザーターンだけで解決可能か。
RQ3過去のアシスタント応答による文脈汚染の有病率と影響はどの程度か。
RQ4品質を損なわずに文脈長を削減し、適応的にアシスタント履歴を省略できるか。

主な発見

過去のアシスタント応答を保持することが一様に有益とは限らない。AO で品質を維持するモデルもあれば、FC コンテキストで評価した場合に AO で低下するモデルもある。
ユーザーターンのみを見た評価者によって評価すると、AO は4モデルすべてで応答品質を改善することが多い。
AO プロンプトは、FC プロンプトに比べて文脈長を約5–10倍大幅に削減する。
36.4% のターンは自己完結型の新規質問（self-contained new-ask）であり、フォローアップで具体的な指示がある場合は、ユーザーターンのみで初期解決できることが多い。
文脈汚染のケースが存在し、過去のアシスタント出力が誤りや現実と異なる記述を導入し、ターン間でその誤りが伝播することがある。
分類器を用いた適応的文脈省略アプローチは、FC パフォーマンスの95％超を保持しつつトークン使用を大幅に削減できる。

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。