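[Paper Explained] Understanding LLM Performance Degradation in Multi-Instance Processing: The Roles of Instance Count and Context Length
The paper evaluates how large language models degrade in multi-instance processing: performance declines slightly at small instance counts, collapses as the count grows, and is affected more strongly by instance count than by context length.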
Users often rely on Large Language Models (LLMs) for processing multiple documents or performing analysis over a number of instances. For example, analysing the overall sentiment of a number of movie reviews requires an LLM to process the sentiment of each review individually in order to provide a final aggregated answer. While LLM performance on such individual tasks is generally high, there has been little research on how LLMs perform when dealing with multi-instance inputs. In this paper, we perform a comprehensive evaluation of the multi-instance processing (MIP) ability of LLMs for tasks in which they excel individually. The results show that all LLMs follow a pattern of slight performance degradation for small numbers of instances (approximately 20-100), followed by a performance collapse on larger instance counts. Crucially, our analysis shows that while context length is associated with this degradation, the number of instances has a stronger effect on the final results. This finding suggests that when optimising LLM performance for MIP, attention should be paid to both context length and, in particular, instance count.
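Research Motivation and Objectives
- Motivate and understand how LLMs handle multi-instance processing (MIP) tasks that require analysing multiple documents.
- Characterise the pattern of LLM performance degradation as the number of instances increases.
- Quantify the effect of context length on MIP performance relative to that of instance count.
Proposed Method
- Comprehensively evaluate MIP tasks in which each instance is analysed individually before the results are aggregated.
- Analyse performance trends as the instance count grows from small to large.
- Examine the association between context length and degradation, and compare its effect with that of instance count.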
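The evaluation loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' harness: the prompt wording, the exact-match scoring rule, and the names `build_mip_prompt`, `gold_count`, `evaluate`, and `keyword_stub` are all assumptions introduced here. `model` stands for any callable that maps a prompt string to a response string, such as a wrapper around an LLM API.

```python
import random
from typing import Callable, Dict, List, Tuple

# One instance: (review text, gold label "positive" or "negative").
Instance = Tuple[str, str]


def build_mip_prompt(instances: List[Instance]) -> str:
    """Concatenate numbered reviews into one prompt that asks for an aggregate answer."""
    lines = [f"Review {i}: {text}" for i, (text, _) in enumerate(instances, start=1)]
    question = "How many of the reviews above are positive? Answer with a single integer."
    return "\n".join(lines + ["", question])


def gold_count(instances: List[Instance]) -> int:
    """Ground-truth aggregate: the number of instances labelled positive."""
    return sum(1 for _, label in instances if label == "positive")


def evaluate(
    model: Callable[[str], str],
    pool: List[Instance],
    instance_counts: List[int],
    trials: int = 5,
    seed: int = 0,
) -> Dict[int, float]:
    """Return exact-match accuracy of the aggregated answer for each instance count.

    Each instance count must not exceed len(pool), since samples are drawn
    without replacement.
    """
    rng = random.Random(seed)
    results: Dict[int, float] = {}
    for n in instance_counts:
        correct = 0
        for _ in range(trials):
            sample = rng.sample(pool, n)
            answer = model(build_mip_prompt(sample)).strip()
            correct += int(answer == str(gold_count(sample)))
        results[n] = correct / trials
    return results


def keyword_stub(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call: counts review lines containing 'great'."""
    hits = sum(
        1
        for line in prompt.splitlines()
        if line.startswith("Review ") and "great" in line
    )
    return str(hits)
```

Calling `evaluate` with a real model wrapper and a list such as `[20, 50, 100, 500]` would give one accuracy value per instance count, which is the kind of curve the trend analysis refers to. Recording `len(build_mip_prompt(sample))` or a token count alongside each run would additionally give a context-length measurement for the comparison in the last step.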
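Experimental Results
Research Questions
- RQ1: How does LLM performance change as the number of instances increases in MIP tasks?
- RQ2: What role does context length play in driving the degradation, relative to instance count?
- RQ3: Do LLMs show a two-stage degradation pattern (a slight initial decline followed by a collapse) across models and tasks?
- RQ4: Which factor more strongly predicts final MIP performance: instance count or context length?
Key Findings
- For small numbers of instances (approximately 20–100), all LLMs show slight performance degradation.
- At larger instance counts, performance collapses across models.
- Context length is associated with the degradation, but instance count has a stronger effect on the final results.
- When optimising LLM performance for MIP, both context length and, in particular, instance count deserve attention.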
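One way to compare how strongly two correlated factors relate to an outcome is to look at standardised regression coefficients. The sketch below is an assumption about how such a comparison could be made; the summary does not say which statistical method the paper uses. The function names are hypothetical, and the inputs are per-run measurements that the reader supplies.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("a sequence with zero variance has no defined correlation")
    return cov / sqrt(vx * vy)


def standardized_betas(
    instance_counts: Sequence[float],
    context_lengths: Sequence[float],
    accuracies: Sequence[float],
) -> Tuple[float, float]:
    """Standardised coefficients of accuracy regressed on both predictors.

    Uses the closed form for a two-predictor linear regression on z-scored
    variables. Returns (beta_instance_count, beta_context_length).
    """
    r_y1 = pearson(instance_counts, accuracies)
    r_y2 = pearson(context_lengths, accuracies)
    r_12 = pearson(instance_counts, context_lengths)
    denom = 1 - r_12 ** 2
    if denom < 1e-12:
        # Perfectly collinear predictors cannot be separated by regression.
        raise ValueError("predictors are collinear; vary them independently")
    beta_count = (r_y1 - r_y2 * r_12) / denom
    beta_length = (r_y2 - r_y1 * r_12) / denom
    return beta_count, beta_length
```

A larger absolute coefficient indicates the stronger association under a linear model. Because instance count and context length normally grow together, the comparison is only informative if the runs vary them at least partly independently, for example by mixing short and long instances at the same instance count.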
This interpretation was generated by AI and reviewed by a human editor.