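[Paper Review] Lost in the Middle: How Language Models Use Long Contexts
This paper shows that current language models do not make full use of long input contexts. Performance drops when the relevant information sits in the middle of a long context. The paper also proposes evaluation protocols and presents analyses covering model architecture, prompting, and fine-tuning.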
While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.
Research Motivation and Objectives
- Investigate how state-of-the-art LMs utilize long input contexts in downstream tasks like multi-document QA and key-value retrieval.
- Examine how the position of relevant information within long contexts affects model performance.
- Evaluate differences across model architectures, prompting schemes, and instruction fine-tuning in long-context settings.
- Provide actionable insights and evaluation protocols for improving long-context usage in LMs.
Proposed Method
- Conduct controlled experiments on multi-document QA by varying input context length and the position of the relevant document.
- Use a synthetic key-value retrieval task to probe basic retrieval from long contexts.
- Compare decoder-only vs. encoder-decoder architectures to assess robustness to context position.
- Test query-aware contextualization by placing the query before/after data to assess contextualization effects.
- Analyze the impact of instruction fine-tuning on context usage patterns.
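The position-controlled setup described above can be sketched as follows. This is a minimal illustration, not the authors' code: the function names `build_qa_prompt` and `position_sweep`, the `Document [i]` layout, and the prompt wording are all assumptions made for this example. The sketch only shows the core idea of holding the document set fixed while moving the answer-bearing document, with an optional query-aware variant that repeats the question before the documents.

```python
from typing import List


def build_qa_prompt(
    question: str,
    gold_doc: str,
    distractors: List[str],
    gold_position: int,
    query_aware: bool = False,
) -> str:
    """Render a prompt with gold_doc placed at index gold_position.

    Hypothetical helper: the template below is an assumed layout,
    not the exact prompt used in the paper.
    """
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position must be in [0, len(distractors)]")

    # Insert the answer-bearing document at the requested slot;
    # the relative order of the distractors is left unchanged.
    docs = distractors[:gold_position] + [gold_doc] + distractors[gold_position:]
    doc_block = "\n".join(f"Document [{i + 1}] {d}" for i, d in enumerate(docs))

    parts = []
    if query_aware:
        # Query-aware contextualization: show the question before the documents too.
        parts.append(f"Question: {question}")
    parts.append(doc_block)
    parts.append(f"Question: {question}")
    parts.append("Answer:")
    return "\n\n".join(parts)


def position_sweep(question: str, gold_doc: str, distractors: List[str]) -> List[str]:
    """Build one prompt per possible gold position; the document set stays fixed."""
    return [
        build_qa_prompt(question, gold_doc, distractors, pos)
        for pos in range(len(distractors) + 1)
    ]
```

Feeding each prompt from `position_sweep` to the same model and scoring the answers per position gives the kind of accuracy-versus-position curve the review discusses. Because only the gold document's position changes, any difference between positions can be attributed to position rather than content.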
Experimental Results
Research Questions
- RQ1: How robust are current language models to changes in the position of relevant information within long input contexts?
- RQ2: Do longer-context models necessarily perform better, or do they exhibit position-dependent (serial) biases?
- RQ3: How do model architecture (decoder-only vs. encoder-decoder), query-aware contextualization, and instruction fine-tuning influence long-context utilization?
- RQ4: In open-domain QA, does increasing retrieved context translate to meaningful gains for readers?
Key Findings
| Model | Closed-Book Accuracy | Oracle Accuracy |
|---|---|---|
| LongChat-13B (16K) | 35.0% | 83.4% |
| MPT-30B-Instruct | 31.5% | 81.9% |
| GPT-3.5-Turbo | 56.1% | 88.3% |
| GPT-3.5-Turbo (16K) | 56.0% | 88.6% |
| Claude-1.3 | 48.3% | 76.1% |
| Claude-1.3 (100K) | 48.2% | 76.4% |
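As a small worked example, the gap between the two columns can be computed directly from the table. The numbers below are copied from the table above; the names `ACCURACY` and `oracle_gain` are hypothetical and exist only for this sketch.

```python
# (closed_book_pct, oracle_pct) per model, copied from the table above.
ACCURACY = {
    "LongChat-13B (16K)": (35.0, 83.4),
    "MPT-30B-Instruct": (31.5, 81.9),
    "GPT-3.5-Turbo": (56.1, 88.3),
    "GPT-3.5-Turbo (16K)": (56.0, 88.6),
    "Claude-1.3": (48.3, 76.1),
    "Claude-1.3 (100K)": (48.2, 76.4),
}


def oracle_gain(scores):
    """Percentage-point gain from closed-book to oracle, rounded to one decimal."""
    return {
        model: round(oracle - closed, 1)
        for model, (closed, oracle) in scores.items()
    }


if __name__ == "__main__":
    for model, gain in oracle_gain(ACCURACY).items():
        print(f"{model}: +{gain} points")
```

The gain is a rough indicator of how much a model benefits when it is handed the relevant document, compared with answering from its parameters alone.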
- Performance shows a U-shaped curve: highest when relevant information is at the beginning or end of the input context, and degraded in the middle.
- Extended-context models do not universally outperform their standard counterparts: when the input fits within both context windows, their performance is nearly identical.
- Encoder-decoder models show more robustness within training-length sequences, but exhibit U-shaped degradation on longer sequences.
- Query-aware contextualization dramatically improves key-value retrieval (near-perfect accuracy even with larger numbers of key-value pairs), with minimal impact on multi-document QA trends.
- Instruction fine-tuning does not remove the U-shaped bias; trends persist across models and scales.
- In open-domain QA, reader performance saturates well before retriever recall, indicating limited use of added retrieved documents.
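The synthetic key-value retrieval task behind these findings can be sketched as below. This is an illustration under stated assumptions rather than the paper's implementation: the helper names (`make_kv_example`, `build_kv_prompt`, `is_correct`), the instruction wording, and the substring-based scoring are all choices made for this example. Keys and values are random UUID strings, so the model cannot rely on prior knowledge and has to locate the pair in its input.

```python
import json
import random
import uuid
from typing import Dict, Tuple


def make_kv_example(
    num_pairs: int, target_index: int, seed: int = 0
) -> Tuple[Dict[str, str], str, str]:
    """Return (pairs, target_key, target_value) with the target at target_index."""
    if not 0 <= target_index < num_pairs:
        raise ValueError("target_index must be in [0, num_pairs)")
    rng = random.Random(seed)

    def rand_uuid() -> str:
        # Seeded UUIDs keep the example reproducible.
        return str(uuid.UUID(int=rng.getrandbits(128), version=4))

    pairs: Dict[str, str] = {}
    while len(pairs) < num_pairs:
        pairs[rand_uuid()] = rand_uuid()

    # Dicts preserve insertion order, so the index maps to a position in the JSON.
    target_key = list(pairs)[target_index]
    return pairs, target_key, pairs[target_key]


def build_kv_prompt(pairs: Dict[str, str], target_key: str, query_aware: bool = False) -> str:
    """Assumed prompt layout; query_aware repeats the key before the data."""
    query = f'Key: "{target_key}"'
    parts = ["Extract the value corresponding to the specified key in the JSON object below."]
    if query_aware:
        parts.append(query)
    parts.append("JSON data:\n" + json.dumps(pairs, indent=1))
    parts.append(query)
    parts.append("Corresponding value:")
    return "\n\n".join(parts)


def is_correct(model_output: str, target_value: str) -> bool:
    """Assumed scoring rule: correct if the expected value appears in the output."""
    return target_value in model_output
```

Sweeping `target_index` from `0` to `num_pairs - 1` and toggling `query_aware` gives a simple way to compare retrieval accuracy across positions with and without the key shown up front.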
This review was generated by AI and checked by a human editor.