QUICK REVIEW

[Paper Review] The Order Effect: Investigating Prompt Sensitivity to Input Order in LLMs

Bao Guan, Tanya Roosta | arXiv.org | February 6, 2025

Auction Theory and Applications | Citations: 3

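One-Line Summary

This paper analyzes how input order affects the performance of closed-source LLMs across several tasks (paraphrasing, relevance judgment, and multiple-choice questions). Shuffling the input substantially degrades performance, and few-shot prompting offers only limited mitigation, particularly for long prompts.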

ABSTRACT

As large language models (LLMs) become integral to diverse applications, ensuring their reliability under varying input conditions is crucial. One key issue affecting this reliability is order sensitivity, wherein slight variations in the input arrangement can lead to inconsistent or biased outputs. Although recent advances have reduced this sensitivity, the problem remains unresolved. This paper investigates the extent of order sensitivity in LLMs whose internal components are hidden from users (such as closed-source models or those accessed via API calls). We conduct experiments across multiple tasks, including paraphrasing, relevance judgment, and multiple-choice questions. Our results show that input order significantly affects performance across tasks, with shuffled inputs leading to measurable declines in output accuracy. Few-shot prompting demonstrates mixed effectiveness and offers partial mitigation; however, fails to fully resolve the problem. These findings highlight persistent risks, particularly in high-stakes applications, and point to the need for more robust LLMs or improved input-handling techniques in future development.

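Research Motivation and Objectives

Evaluate how the order of prompt inputs affects the performance of closed-source LLMs across multiple tasks.
Quantify robustness to input reordering in zero-shot and few-shot settings.
Identify the task characteristics that influence sensitivity to input order.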

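Proposed Method

Run experiments on GPT-4o and GPT-4o mini across five tasks, comparing the original input order against a shuffled input order.
Use both zero-shot and few-shot prompting configurations for each task.
Analyze performance with standard metrics (e.g., precision, recall, F1) and report the differences caused by reordering.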
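
The original-versus-shuffled comparison described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the review does not give the paper's prompt templates or shuffling procedure, so `shuffle_options`, `build_prompt`, and `order_effect` are hypothetical helpers, and the model call itself is left out (predictions are assumed to be collected separately for each ordering).

```python
import random


def shuffle_options(options, answer_index, seed=0):
    """Shuffle answer options and return them with the correct answer's new index.

    Hypothetical helper: the paper's actual shuffling procedure is not
    described in this review.
    """
    rng = random.Random(seed)  # seeded so each run produces the same order
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    return shuffled, order.index(answer_index)


def build_prompt(question, options):
    """Format a multiple-choice prompt (assumed template, up to 8 options)."""
    letters = "ABCDEFGH"
    lines = [question]
    lines += [f"{letters[i]}. {opt}" for i, opt in enumerate(options)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)


def precision_recall_f1(gold, pred, positive=1):
    """Binary precision, recall, and F1 for the given positive label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def order_effect(gold, pred_original, pred_shuffled):
    """F1 on original-order inputs minus F1 on shuffled inputs."""
    f1_original = precision_recall_f1(gold, pred_original)[2]
    f1_shuffled = precision_recall_f1(gold, pred_shuffled)[2]
    return f1_original - f1_shuffled
```

A positive `order_effect` value means the model scored lower on the shuffled inputs. For binary tasks such as paraphrase detection, the same metric functions apply directly to the two prediction lists; only the prompt-building step would differ.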

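Experimental Results

Research Questions

RQ1: Does reordering semantically equivalent elements affect LLM output on the paraphrasing task (MRPC)?
RQ2: How does input order affect relevance judgments and multiple-choice responses across different datasets?
RQ3: Can few-shot prompting mitigate order sensitivity across tasks and prompt lengths?
RQ4: Does input length correlate with the magnitude of the order-induced change in performance?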
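
One way to approach a question like RQ4 is to compute a correlation coefficient between prompt length and the score drop under shuffling. The sketch below uses Pearson's r as an assumption; the review does not say which statistic, if any, the authors used, and `pearson` is a hypothetical helper that expects per-task (or per-example) lengths and drops supplied by the caller.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```

Here `xs` would hold the mean prompt length for each task and `ys` the corresponding drop in F1 or accuracy; a value near +1 would indicate that longer prompts lose more under shuffling.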

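Key Findings

Shuffling the input order causes a measurable performance drop across multiple tasks for both GPT-4o and GPT-4o mini.
Longer prompts appear to be more vulnerable to changes in order.
Few-shot prompting shows mixed effects depending on the model and task, and often fails to fully mitigate order sensitivity.
Order sensitivity persists even in strong closed-source LLMs, raising concerns about reliability.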


This review was generated by AI and reviewed by a human editor.