QUICK REVIEW

[Paper Review] Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity

Yao Lu, Max Bartolo | arXiv (Cornell University) | 2021-04-18

Topic Modeling | Citations: 119

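One-Line Summary

This paper shows that the order of prompt samples in in-context learning substantially affects performance across models and tasks, and introduces entropy-based probing that automatically identifies well-performing orders without additional labeled data, yielding a 13% relative improvement for GPT-family models across eleven text classification tasks.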

ABSTRACT

When primed with only a handful of training samples, very large, pretrained language models such as GPT-3 have shown competitive results when compared to fully-supervised, fine-tuned, large, pretrained language models. We demonstrate that the order in which the samples are provided can make the difference between near state-of-the-art and random guess performance: essentially some permutations are "fantastic" and some not. We analyse this phenomenon in detail, establishing that: it is present across model sizes (even for the largest current models), it is not related to a specific subset of samples, and that a given good permutation for one model is not transferable to another. While one could use a development set to determine which permutations are performant, this would deviate from the true few-shot setting as it requires additional annotated data. Instead, we use the generative nature of language models to construct an artificial development set and based on entropy statistics of the candidate permutations on this set, we identify performant prompts. Our method yields a 13% relative improvement for GPT-family models across eleven different established text classification tasks.

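Motivation and Objectives

Show that sample order in in-context learning substantially affects performance across model sizes and tasks.
Show that good permutations do not transfer across models or tasks.
Propose a probing-based method that automatically identifies well-performing prompt orders without using labeled development data.
Use the generative nature of language models to construct an unlabeled probing set for evaluation.
Quantify the improvement obtained from entropy-based probing across a range of datasets and models.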

Proposed Method

Analyze order sensitivity across GPT-2/GPT-3 models using 4-shot prompts on SST-2 and other datasets.
Construct a probing set by sampling from the language model to generate unlabeled examples corresponding to training samples.
Define Global Entropy and Local Entropy metrics that rank candidate prompt orders based on the model's predictions on the probing set.
Select top-k prompt orders (k=4) with highest entropy scores and evaluate on multiple datasets.
Demonstrate that entropy-based probing improves average performance by ~13% relative across eleven tasks.
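
The ranking step above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes Global Entropy is the entropy of the predicted-label distribution over the probing set, and Local Entropy is the mean entropy of the per-example label probabilities. The names `predict`, `rank_orders`, `global_entropy`, and `local_entropy` are hypothetical; `predict` stands in for a call to the language model that returns label probabilities for one probing input under one sample order.

```python
import math
from collections import Counter
from itertools import permutations
from typing import Callable, Dict, Iterable, List, Sequence, Tuple

Probs = Dict[str, float]  # label -> predicted probability for one probing input


def entropy(dist: Iterable[float]) -> float:
    """Shannon entropy (natural log) of a probability distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


def global_entropy(preds: Sequence[Probs]) -> float:
    """Entropy of the predicted-label histogram over the whole probing set.

    Assumed definition: take the argmax label for every probing input,
    then measure how evenly those predicted labels are spread.
    """
    counts = Counter(max(p, key=p.get) for p in preds)
    return entropy(c / len(preds) for c in counts.values())


def local_entropy(preds: Sequence[Probs]) -> float:
    """Mean entropy of the per-input label probabilities (assumed definition)."""
    return sum(entropy(p.values()) for p in preds) / len(preds)


def rank_orders(
    train: Sequence[str],
    probing_set: Sequence,
    predict: Callable[[Tuple[str, ...], object], Probs],
    score: Callable[[Sequence[Probs]], float] = global_entropy,
    k: int = 4,
) -> List[Tuple[float, Tuple[str, ...]]]:
    """Score every permutation of the training samples and keep the top k.

    `predict(order, x)` is a hypothetical stand-in for querying the model
    with the samples in `order` followed by probing input `x`.
    """
    scored = []
    for order in permutations(train):  # 4 samples -> 24 candidate orders
        preds = [predict(order, x) for x in probing_set]
        scored.append((score(preds), order))
    # Higher entropy is ranked first, matching the selection rule above.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]
```

The intuition behind ranking by high entropy is that an order which pushes the model to predict a single label for nearly every probing input scores close to zero and is filtered out, without any labels being needed.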

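Experimental Results

Research Questions

RQ1: Is few-shot prompt order sensitivity universal across model sizes and tasks?
RQ2: Can well-performing prompt orders be identified automatically without labeled development data?
RQ3: Do good prompt permutations transfer across models or tasks?
RQ4: Are the entropy-based probing metrics robust across different templates and datasets?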

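Key Findings

Depending on prompt order, performance can swing from near state-of-the-art to near random guessing, and this holds across GPT-2/GPT-3 model sizes.
A single good permutation does not transfer reliably to other models or datasets.
Global Entropy and Local Entropy reliably identify well-performing prompt orders using an unlabeled probing set.
Entropy-based probing achieves an average 13% relative improvement on 11 text classification tasks (across model sizes).
The selected prompts show substantially lower performance variance than using all candidate orders.
Entropy-based probing for prompt selection outperforms simple validation-set tuning and data-splitting methods.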
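
As a reading aid for the 13% figure: a relative improvement is measured against the baseline score, not in absolute percentage points. The symbols below are assumptions introduced for illustration, with $a_{\text{base}}$ taken to be the baseline accuracy and $a_{\text{sel}}$ the accuracy with the selected orders.

```latex
\[
\text{relative improvement}
  = \frac{a_{\text{sel}} - a_{\text{base}}}{a_{\text{base}}}
\]
% Hypothetical numbers, not from the paper:
% a_base = 60.0, a_sel = 67.8  =>  (67.8 - 60.0) / 60.0 = 0.13,
% i.e. a 13% relative gain that corresponds to 7.8 absolute points.
```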

This review was generated by AI and reviewed by a human editor.