QUICK REVIEW

[Paper Review] Lost in the Middle: How Language Models Use Long Contexts

Nelson F. Liu, Kevin Lin | arXiv (Cornell University) | Jul 6, 2023

Topic Modeling | Cited by 57

One-Sentence Summary

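The paper shows that current language models do not make robust use of long input contexts: performance degrades when the relevant information sits in the middle of a long context. It also introduces evaluation protocols and analyzes the effects of model architecture, prompting, and fine-tuning.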

ABSTRACT

While recent language models have the ability to take long contexts as input, relatively little is known about how well they use longer context. We analyze the performance of language models on two tasks that require identifying relevant information in their input contexts: multi-document question answering and key-value retrieval. We find that performance can degrade significantly when changing the position of relevant information, indicating that current language models do not robustly make use of information in long input contexts. In particular, we observe that performance is often highest when relevant information occurs at the beginning or end of the input context, and significantly degrades when models must access relevant information in the middle of long contexts, even for explicitly long-context models. Our analysis provides a better understanding of how language models use their input context and provides new evaluation protocols for future long-context language models.

Research Motivation and Objectives

Investigate how state-of-the-art LMs utilize long input contexts in downstream tasks like multi-document QA and key-value retrieval.
Examine how the position of relevant information within long contexts affects model performance.
Evaluate differences across model architectures, prompting schemes, and instruction fine-tuning in long-context settings.
Provide actionable insights and evaluation protocols for improving long-context usage in LMs.

Proposed Method

Conduct controlled experiments on multi-document QA by varying input context length and the position of the relevant document.
Use a synthetic key-value retrieval task to probe basic retrieval from long contexts.
Compare decoder-only vs. encoder-decoder architectures to assess robustness to context position.
Test query-aware contextualization by placing the query before/after data to assess contextualization effects.
Analyze the impact of instruction fine-tuning on context usage patterns.
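
The two probing setups above can be sketched as prompt builders. This is a minimal illustration, not the authors' code: the function names, the prompt wording, and the `Document [i]` layout are assumptions made for this example. It assumes the key-value task uses random UUID strings serialized as a JSON object, and that query-aware contextualization means stating the query both before and after the data.

```python
import json
import random
import uuid


def build_multidoc_prompt(question, gold_doc, distractors, gold_position):
    """Insert the gold document at index `gold_position` among the distractors.

    Hypothetical helper: the prompt template is an assumption, not the paper's.
    """
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position out of range")
    docs = distractors[:gold_position] + [gold_doc] + distractors[gold_position:]
    lines = [f"Document [{i + 1}] {doc}" for i, doc in enumerate(docs)]
    return "\n".join(lines) + f"\n\nQuestion: {question}\nAnswer:"


def build_kv_prompt(num_pairs, gold_position, query_aware=False, seed=0):
    """Build a synthetic key-value retrieval prompt.

    Returns (prompt, expected_value). The queried key is the pair stored at
    index `gold_position`; dicts keep insertion order, so the position holds
    in the serialized JSON as well.
    """
    rng = random.Random(seed)

    def random_uuid():
        return str(uuid.UUID(int=rng.getrandbits(128)))

    pairs = [(random_uuid(), random_uuid()) for _ in range(num_pairs)]
    key, value = pairs[gold_position]
    data = json.dumps(dict(pairs), indent=1)
    query = f'What is the value for key "{key}"?'
    # Query-aware contextualization (as assumed here): repeat the query
    # before the data so the model sees it while reading the pairs.
    parts = [query, data, query] if query_aware else [data, query]
    return "\n\n".join(parts), value
```

Sweeping `gold_position` from the first to the last index while holding everything else fixed isolates the effect of position, which is the controlled variable in both tasks.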

Experimental Results

Research Questions

RQ1: How robust are current language models to changes in the position of relevant information within long input contexts?
RQ2: Do longer-context models necessarily perform better, or do they exhibit position-dependent (serial) biases?
RQ3: How do model architecture (decoder-only vs. encoder-decoder), query-aware contextualization, and instruction fine-tuning influence long-context utilization?
RQ4: In open-domain QA, does increasing retrieved context translate to meaningful gains for readers?

Key Findings

Model	Closed-Book	Oracle
LongChat-13B (16K)	35.0%	83.4%
MPT-30B-Instruct	31.5%	81.9%
GPT-3.5-Turbo	56.1%	88.3%
GPT-3.5-Turbo (16K)	56.0%	88.6%
Claude-1.3	48.3%	76.1%
Claude-1.3 (100K)	48.2%	76.4%
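
One way to read the table is as a gap between the two settings: how many accuracy points each model gains when moving from the closed-book column to the oracle column. The sketch below only does that subtraction on the figures listed above; the `oracle_gain` helper is a hypothetical name introduced for this example.

```python
# Accuracies (%) copied from the table above: (closed-book, oracle).
RESULTS = {
    "LongChat-13B (16K)": (35.0, 83.4),
    "MPT-30B-Instruct": (31.5, 81.9),
    "GPT-3.5-Turbo": (56.1, 88.3),
    "GPT-3.5-Turbo (16K)": (56.0, 88.6),
    "Claude-1.3": (48.3, 76.1),
    "Claude-1.3 (100K)": (48.2, 76.4),
}


def oracle_gain(results):
    """Return oracle minus closed-book accuracy, in points, per model."""
    return {
        model: round(oracle - closed_book, 1)
        for model, (closed_book, oracle) in results.items()
    }


if __name__ == "__main__":
    gains = oracle_gain(RESULTS)
    for model, gain in sorted(gains.items(), key=lambda item: -item[1]):
        print(f"{model:<22} +{gain:.1f} points")
```

Each extended-context variant lands within half a point of its base model in both columns, which is consistent with the finding that a longer context window alone does not change performance on short inputs.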

Performance shows a U-shaped curve: highest when relevant information is at the beginning or end of the input context, and degraded in the middle.
Extended-context models do not necessarily outperform their non-extended counterparts when the input fits within the context window of both.
Encoder-decoder models show more robustness within training-length sequences, but exhibit U-shaped degradation on longer sequences.
Query-aware contextualization dramatically improves key-value retrieval (near-perfect on larger k), with minimal impact on multi-document QA trends.
Instruction fine-tuning does not remove the U-shaped bias; trends persist across models and scales.
In open-domain QA, reader performance saturates well before retriever recall, indicating limited use of added retrieved documents.
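
The U-shaped pattern described above can be summarized with a single number when comparing models. The helper below is a hypothetical metric introduced for illustration, not one defined in the paper: it subtracts the worst interior accuracy from the mean accuracy at the first and last positions, so a clearly positive value suggests a dip in the middle.

```python
def middle_drop(accuracy_by_position):
    """Return mean edge accuracy minus the worst interior accuracy.

    `accuracy_by_position` is a sequence ordered by the position of the
    relevant document, from first to last. Hypothetical summary metric:
    positive values indicate the middle performs worse than the edges.
    """
    if len(accuracy_by_position) < 3:
        raise ValueError("need at least three positions")
    edge_mean = (accuracy_by_position[0] + accuracy_by_position[-1]) / 2
    worst_interior = min(accuracy_by_position[1:-1])
    return edge_mean - worst_interior
```

A flat curve yields a value near zero, so the metric separates position-robust models from position-sensitive ones without fitting a curve.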

This review was generated by AI and checked by human editors.