QUICK REVIEW

[Paper Review] The Effect of Sampling Temperature on Problem Solving in Large Language Models

Matthew Renze, Erhan Guven | arXiv (Cornell University) | 2024-02-07

Natural Language Processing Techniques | Citations: 14

One-Line Summary

This study empirically tests how sampling temperature (0.0 to 1.0) affects LLM problem solving across models and prompts, and finds no statistically significant effect on accuracy for MCQA tasks.

ABSTRACT

In this research study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs) on various problem-solving tasks. We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from standard LLM benchmarks. Then, we used nine popular LLMs with five prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature from 0.0 to 1.6. Despite anecdotal reports to the contrary, our empirical results indicate that changes in temperature from 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. In addition, these results appear to generalize across LLMs, prompt-engineering techniques, and problem domains. All code, data, and supplemental materials are available on GitHub at: https://github.com/matthewrenze/jhu-llm-temperature

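Research Motivation and Goals

Motivate the need to understand which sampling temperature works best for LLM problem solving.
Evaluate whether changes in temperature affect problem-solving accuracy across multiple domains.
Compare performance across different LLMs and prompt-engineering techniques.
Provide empirical evidence to inform prompt-engineering best practices and reduce reliance on subjective claims.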

Proposed Method

Construct a multi-domain MCQA exam by sampling problems from standard benchmarks.
Evaluate four LLMs (GPT-3.5, GPT-4, Llama 2 7B, Llama 2 70B) with five prompt-engineering techniques.
Vary the sampling temperature during inference from 0.0 to 1.6, with 0.0 to 1.0 as the range of primary interest.
Measure accuracy as the primary metric and compute several text-similarity metrics.
Use the Kruskal-Wallis test at alpha = 0.05 to assess the statistical significance of the temperature effect.
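The significance check in the last step can be sketched as follows. This is a minimal, standard-library-only illustration of a Kruskal-Wallis H test with tie correction and a chi-square approximation for the p-value; it is not the authors' code, and the accuracy values in the demo are hypothetical placeholders rather than results from the paper. The function names `average_ranks`, `chi2_sf`, and `kruskal_wallis` are chosen here for illustration.

```python
import math
from itertools import chain

ALPHA = 0.05  # significance level named in the review


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Chi-square survival function via the series for the lower incomplete gamma."""
    if x <= 0:
        return 1.0
    a, z = df / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if abs(term) < 1e-15 * abs(total):
            break
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return min(1.0, max(0.0, 1.0 - lower))


def kruskal_wallis(groups):
    """Return (H, p) for two or more groups of numeric observations."""
    if len(groups) < 2:
        raise ValueError("need at least two groups")
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    rank_term, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        rank_term += rank_sum * rank_sum / len(group)
        start += len(group)
    h = 12.0 / (n_total * (n_total + 1)) * rank_term - 3.0 * (n_total + 1)
    # Tie correction: divide H by 1 - sum(t^3 - t) / (N^3 - N).
    counts = {}
    for v in pooled:
        counts[v] = counts.get(v, 0) + 1
    ties = sum(c ** 3 - c for c in counts.values())
    correction = 1.0 - ties / (n_total ** 3 - n_total)
    if correction == 0:  # every observation identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


if __name__ == "__main__":
    # Hypothetical per-run accuracies at three temperatures (NOT data from the paper).
    accuracy_by_temperature = {
        0.0: [0.62, 0.60, 0.63, 0.61, 0.64],
        0.5: [0.61, 0.63, 0.59, 0.62, 0.60],
        1.0: [0.60, 0.58, 0.62, 0.61, 0.59],
    }
    h, p = kruskal_wallis(list(accuracy_by_temperature.values()))
    print(f"H = {h:.3f}, p = {p:.3f}, significant at alpha: {p < ALPHA}")
```

A rank-based test is a natural fit here because it makes no normality assumption about per-run accuracy. In practice a library routine such as `scipy.stats.kruskal` would likely replace the hand-written functions.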

Figure 1: Accuracy by temperature and prompt for GPT-3.5 with 1,000 questions. Performance remains relatively stable across all temperatures and prompts. However, there is a non-significant decrease in performance as a function of temperature.

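Experimental Results

Research Questions

RQ1: Does changing the sampling temperature within the 0.0 to 1.0 range affect LLM problem-solving accuracy on MCQA tasks?
RQ2: Is the temperature effect consistent across different models and prompt-engineering techniques?
RQ3: How does temperature affect output variability as measured by text-similarity metrics?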
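To make RQ3 concrete, output variability can be estimated by generating several responses to the same question at one temperature and averaging their pairwise similarity. The sketch below uses token-level Jaccard similarity as one illustrative metric; the review does not list which similarity metrics were used, so this choice, the function names, and the sample responses are all assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Token-set Jaccard similarity between two strings (case-insensitive)."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty outputs are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(outputs):
    """Average Jaccard similarity over all unordered pairs of outputs."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        return 1.0  # a single output has nothing to differ from
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical responses to one question (NOT outputs from the paper).
    low_temperature = ["The answer is B", "The answer is B", "The answer is B"]
    high_temperature = ["The answer is B", "I would pick option B", "Probably B here"]
    print(mean_pairwise_similarity(low_temperature))
    print(mean_pairwise_similarity(high_temperature))
```

Lower mean similarity across repeated generations indicates more variable output, which is the quantity RQ3 asks about.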

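Key Findings

On the 1,000-question exam, GPT-3.5's mean accuracy remains relatively stable across all temperatures.
The Kruskal-Wallis test shows no statistically significant difference in accuracy between temperatures for the prompts and models evaluated.
Higher temperatures increase text variability, as shown by text-similarity metrics decreasing across domains and prompts.
Some Llama models perform close to random-guess level on the 100-question exam, suggesting model- or format-related limitations.
At temperatures above 1.0, accuracy declines and may approach random guessing, consistent with the increased randomness of sampling.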
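The link between high temperature and near-random answers follows from how temperature rescales token logits before sampling. The sketch below shows the standard temperature-scaled softmax on a set of hypothetical logits for four answer tokens; treating temperature 0.0 as greedy (argmax) selection is a common convention assumed here, not something the review specifies.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature < 0:
        raise ValueError("temperature must be non-negative")
    if temperature == 0:
        # Assumed convention: temperature 0 means greedy decoding (one-hot on argmax).
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [value / temperature for value in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats; higher means closer to uniform sampling."""
    return -sum(p * math.log(p) for p in probs if p > 0)


if __name__ == "__main__":
    # Hypothetical logits for answer tokens A, B, C, D (illustrative only).
    logits = [2.0, 1.0, 0.5, 0.0]
    for t in (0.0, 0.5, 1.0, 1.6):
        probs = softmax_with_temperature(logits, t)
        print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

As the temperature grows, the distribution flattens toward uniform and its entropy rises, so the sampled answer depends less on the model's preference and more on chance.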

Figure 2: Accuracy by temperature and model. Performance remains stable across sampling temperatures for all four LLMs on the 100-question MCQA exam. However, both Llama 2 models performed no better than statistically random guesses.
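One way to check whether a score is distinguishable from random guessing is an exact one-sided binomial test against the chance rate. The sketch below assumes four answer options per question (chance accuracy 0.25); the review does not state the number of options, and the counts in the demo are hypothetical, so both are placeholders to adjust.

```python
import math

ALPHA = 0.05
CHANCE = 0.25  # assumption: four answer options per question


def p_value_above_chance(correct, total, chance=CHANCE):
    """Probability of scoring at least `correct` out of `total` by guessing alone."""
    return sum(
        math.comb(total, k) * chance ** k * (1 - chance) ** (total - k)
        for k in range(correct, total + 1)
    )


if __name__ == "__main__":
    # Hypothetical scores on a 100-question exam (NOT results from the paper).
    for correct in (27, 60):
        p = p_value_above_chance(correct, 100)
        verdict = "better than chance" if p < ALPHA else "not distinguishable from chance"
        print(f"{correct}/100 correct: p = {p:.4g} ({verdict})")
```

A score only slightly above the chance rate on 100 questions is well within what guessing produces, which matches the caption's description of the Llama 2 results.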


This review was generated by AI and reviewed by a human editor.