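[Paper Explained] The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs
This paper examines how input order affects the performance of closed-source large language models across several tasks (paraphrasing, relevance judgment, and multiple-choice questions). Shuffling the input significantly degrades performance, and few-shot prompting offers only limited mitigation for longer prompts.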
As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in the input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in LLMs whose internal components are hidden from users (such as closed-source models or those accessed via API calls). We conduct experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness and offers partial mitigation; however, it fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.
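Research Motivation and Objectives
- Evaluate how the order of prompt inputs affects the performance of closed-source LLMs across multiple tasks.
- Quantify robustness to input reordering in zero-shot and few-shot settings.
- Identify the task characteristics that influence sensitivity to input order.
Proposed Method
- Run experiments with GPT-4o and GPT-4o mini on five tasks, comparing the original input order against a shuffled input order.
- Use both zero-shot and few-shot prompt configurations for each task.
- Analyze performance with standard metrics (such as precision, recall, and F1) and report the change in each metric caused by reordering.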
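The evaluation loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the helper names (`shuffle_options`, `precision_recall_f1`, `f1_delta`) are hypothetical, the metrics are the standard binary definitions, and the model call is left out entirely, so predictions are assumed to be collected elsewhere.

```python
import random
from typing import List, Sequence, Tuple


def shuffle_options(options: Sequence[str], answer_idx: int, seed: int) -> Tuple[List[str], int]:
    """Return the options in a shuffled order plus the new index of the correct answer."""
    rng = random.Random(seed)  # seeded so the permutation is reproducible
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_idx)


def precision_recall_f1(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float, float]:
    """Binary precision, recall, and F1, treating label 1 as the positive class."""
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def f1_delta(gold: Sequence[int], pred_original: Sequence[int], pred_shuffled: Sequence[int]) -> float:
    """F1 on shuffled inputs minus F1 on original inputs (negative means a drop)."""
    f1_original = precision_recall_f1(gold, pred_original)[2]
    f1_shuffled = precision_recall_f1(gold, pred_shuffled)[2]
    return f1_shuffled - f1_original
```

In use, each test item would be sent to the model twice, once in its original order and once after `shuffle_options`, and `f1_delta` would then summarize how much the reordering cost on that task.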
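Experimental Results
Research Questions
- RQ1: Does the order of semantically equivalent elements affect the output on the paraphrase task (MRPC)?
- RQ2: How does input order affect relevance judgment and multiple-choice question answering across diverse datasets?
- RQ3: Can few-shot prompting mitigate order sensitivity across tasks and prompt lengths?
- RQ4: Is input length correlated with the magnitude of order-induced performance changes?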
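To make RQ1 and RQ3 concrete, the sketch below builds zero-shot and few-shot prompts for a sentence-pair paraphrase item and optionally swaps the two sentences. It is one plausible reading of "reordering semantically equivalent elements", not the paper's actual setup: the instruction wording, the `Sentence 1` / `Sentence 2` template, and the function names are all assumptions made for illustration.

```python
from typing import Sequence, Tuple

# (sentence_a, sentence_b, label) used as an in-context demonstration
Shot = Tuple[str, str, str]

# Hypothetical instruction text; the paper's real prompt wording is not given here.
INSTRUCTION = "Are the two sentences paraphrases of each other? Answer yes or no."


def format_pair(a: str, b: str, swap: bool = False) -> str:
    """Render a sentence pair, optionally presenting the two sentences in swapped order."""
    first, second = (b, a) if swap else (a, b)
    return f"Sentence 1: {first}\nSentence 2: {second}"


def build_prompt(a: str, b: str, shots: Sequence[Shot] = (), swap: bool = False) -> str:
    """Build a zero-shot prompt (no shots) or a few-shot prompt (with shots).

    Only the query pair is reordered when swap=True; the demonstrations keep
    their original order so the two conditions differ in exactly one place.
    """
    parts = [INSTRUCTION]
    for shot_a, shot_b, label in shots:
        parts.append(format_pair(shot_a, shot_b) + f"\nAnswer: {label}")
    parts.append(format_pair(a, b, swap=swap) + "\nAnswer:")
    return "\n\n".join(parts)
```

Calling `build_prompt(a, b)` and `build_prompt(a, b, swap=True)` gives the original and reordered zero-shot conditions; passing the same `shots` to both gives the matching few-shot conditions.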
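Key Findings
- On both GPT-4o and GPT-4o mini, shuffling the input order causes a measurable performance drop on multiple tasks.
- Prompt length is positively correlated with vulnerability to changes in input order: longer prompts are affected more.
- Few-shot prompting generally fails to fully mitigate order sensitivity, and its effect varies by model and task.
- Input-order sensitivity persists even in strong closed-source LLMs, which raises reliability concerns.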
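A relationship like the one between prompt length and order-induced performance change is commonly checked with a Pearson correlation coefficient. The sketch below is a generic implementation under that assumption; this summary does not say which statistic the paper actually used, and the numbers in the usage lines are made-up placeholders, not results from the paper.

```python
import math
from typing import Sequence


def pearson_r(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length with at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sequence is constant")
    return cov / math.sqrt(var_x * var_y)


# Placeholder values for illustration only; they are NOT taken from the paper.
prompt_lengths = [120, 250, 480, 900]  # e.g. mean prompt length per task, in tokens
f1_drops = [0.01, 0.02, 0.04, 0.07]    # original F1 minus shuffled F1 per task
print(f"Pearson r = {pearson_r(prompt_lengths, f1_drops):.3f}")
```

A value of r close to +1 would be consistent with longer prompts suffering larger drops, though with only a handful of tasks such a coefficient should be read cautiously.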
This explainer was generated by AI and reviewed by a human editor.