QUICK REVIEW

[Paper Review] The Effect of Sampling Temperature on Problem Solving in Large Language Models

Matthew Renze, Erhan Guven|arXiv (Cornell University)|Feb 7, 2024

Natural Language Processing Techniques | Citations: 14

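One-line summary

This study empirically tests how sampling temperature (0.0 to 1.0) affects the problem-solving ability of LLMs across models and prompts, and consistently finds no statistically significant effect on MCQA accuracy.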

ABSTRACT

In this research study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs) on various problem-solving tasks. We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from standard LLM benchmarks. Then, we used nine popular LLMs with five prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature from 0.0 to 1.6. Despite anecdotal reports to the contrary, our empirical results indicate that changes in temperature from 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. In addition, these results appear to generalize across LLMs, prompt-engineering techniques, and problem domains. All code, data, and supplemental materials are available on GitHub at: https://github.com/matthewrenze/jhu-llm-temperature
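For readers less familiar with the parameter under study, the sketch below shows the usual way sampling temperature reshapes a next-token distribution: logits are divided by the temperature before the softmax. It is a general illustration, not the authors' code. The logits are hypothetical, and treating temperature 0.0 as greedy decoding (all probability on the top-scoring option) is an assumption that follows common convention.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn raw logits into probabilities at a given sampling temperature."""
    if temperature <= 0.0:
        # ASSUMPTION: temperature 0.0 is treated as greedy decoding,
        # so all probability mass goes to the highest logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for four answer options A-D.
logits = [2.0, 1.0, 0.5, 0.1]
for t in (0.0, 0.5, 1.0, 1.6):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate probability on the top option, while higher temperatures flatten the distribution, which is why temperature is usually expected to trade determinism against diversity.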

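Motivation and objectives

Establish why it matters to know which sampling temperature works best for LLM problem solving.
Evaluate whether changes in temperature affect problem-solving accuracy across multiple domains.
Compare performance across a range of LLMs and prompt-engineering techniques.
Provide empirical evidence to inform prompt-engineering best practices and reduce reliance on anecdotal claims.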

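Proposed method

Build a multi-domain MCQA exam by sampling problems from standard benchmarks.
Evaluate four LLMs (GPT-3.5, GPT-4, Llama 2 7B, Llama 2 70B) with five prompt-engineering techniques.
Vary the sampling temperature from 0.0 to 1.0 at inference time.
Measure accuracy as the primary metric and compute several text-similarity metrics.
Use the Kruskal-Wallis test at a significance level of α = 0.05 to assess whether the temperature effect is statistically significant.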
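The statistical step at the end of this list can be sketched as follows. This is a minimal pure-Python Kruskal-Wallis test with tie correction, not the authors' analysis code (theirs is in the GitHub repository linked in the abstract). The accuracy values are made up for illustration, and the closed-form p-value used here only works when the number of temperature groups minus one is even.

```python
import math

def kruskal_wallis(groups):
    """Return (H, p) for a Kruskal-Wallis test with tie correction.

    The p-value uses the closed-form chi-squared tail, which this sketch
    implements only for an even number of degrees of freedom.
    """
    pooled = sorted(v for g in groups for v in g)
    n_total = len(pooled)

    # Assign each distinct value the mean of the ranks it occupies.
    rank_of = {}
    tie_term = 0
    i = 0
    while i < n_total:
        j = i
        while j < n_total and pooled[j] == pooled[i]:
            j += 1
        rank_of[pooled[i]] = (i + 1 + j) / 2.0  # ranks i+1 .. j
        tie_term += (j - i) ** 3 - (j - i)
        i = j

    rank_sum_term = sum(sum(rank_of[v] for v in g) ** 2 / len(g) for g in groups)
    h = 12.0 / (n_total * (n_total + 1)) * rank_sum_term - 3.0 * (n_total + 1)

    correction = 1.0 - tie_term / (n_total ** 3 - n_total)
    if correction == 0.0:
        return 0.0, 1.0  # every observation is identical
    h /= correction

    df = len(groups) - 1
    if df % 2 != 0:
        raise ValueError("this sketch needs an even number of degrees of freedom")
    half = h / 2.0
    p = math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))
    return h, p

# Hypothetical per-run accuracies at three temperatures (not data from the paper).
accuracy_by_temperature = {
    0.0: [0.71, 0.69, 0.72, 0.70],
    0.5: [0.70, 0.73, 0.68, 0.71],
    1.0: [0.69, 0.70, 0.72, 0.67],
}
h_stat, p_value = kruskal_wallis(list(accuracy_by_temperature.values()))
ALPHA = 0.05
print(f"H = {h_stat:.3f}, p = {p_value:.3f}, significant = {p_value < ALPHA}")
```

A rank-based test like this makes no normality assumption about the accuracy scores, which is a plausible reason to prefer it over a one-way ANOVA for this kind of comparison.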

Figure 1: Accuracy by temperature and prompt for GPT-3.5 with 1,000 questions. Performance remains relatively stable across all temperatures and prompts. However, there is a non-significant decrease in performance as a function of temperature.

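Experimental results

Research questions

RQ1: Does varying the sampling temperature between 0.0 and 1.0 affect LLM problem-solving accuracy on MCQA tasks?
RQ2: Is the temperature effect consistent across different models and prompt-engineering techniques?
RQ3: How does temperature affect output variability, as measured by text-similarity metrics?

Key findings

Mean accuracy for GPT-3.5 on the 1,000-question exam stays relatively stable across all temperatures.
The Kruskal-Wallis test shows no statistically significant difference in accuracy by temperature across the prompts and models evaluated.
Higher temperatures increase text variability, which shows up as lower text-similarity scores across prompts and domains.
Some Llama models score at or near chance level on the 100-question exam, which points to limitations in the model or the answer format.
Above a temperature of 1.0, accuracy declines and can approach random guessing.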
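The variability finding can be made concrete with a simple similarity measure. The review does not name the text-similarity metrics used, so the sketch below uses Jaccard similarity over word sets purely as an illustrative stand-in. The sample outputs are invented, not taken from the paper.

```python
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0  # two empty outputs are treated as identical
    return len(words_a & words_b) / len(words_a | words_b)

def mean_pairwise_similarity(outputs):
    """Average Jaccard similarity over all pairs of outputs for one question."""
    pairs = list(combinations(outputs, 2))
    if not pairs:
        raise ValueError("need at least two outputs to compare")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)

# Hypothetical repeated responses to the same question.
low_temperature_outputs = [
    "The answer is B because the force doubles",
    "The answer is B because the force doubles",
    "The answer is B because the force doubles",
]
high_temperature_outputs = [
    "The answer is B because the force doubles",
    "B since doubling mass doubles the force",
    "I would pick B given that force scales with mass",
]
print(mean_pairwise_similarity(low_temperature_outputs))
print(mean_pairwise_similarity(high_temperature_outputs))
```

In this toy case all three high-temperature responses still choose option B, which mirrors the reported pattern: the wording varies more while the selected answer, and therefore accuracy, can stay the same.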

Figure 2: Accuracy by temperature and model. Performance remains stable across sampling temperatures for all four LLMs on the 100-question MCQA exam. However, both Llama 2 models performed no better than statistically random guesses.
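One way to read "no better than statistically random guesses" is through an exact binomial tail probability. The sketch below is only an illustration of that idea and is not necessarily the check the authors ran. It assumes four answer options per question (so a 25% chance rate), which the review does not state, and the observed score is hypothetical.

```python
import math

def binomial_tail(correct, n_questions, p_chance):
    """P(X >= correct) if every answer is an independent guess with success rate p_chance."""
    return sum(
        math.comb(n_questions, k) * p_chance ** k * (1.0 - p_chance) ** (n_questions - k)
        for k in range(correct, n_questions + 1)
    )

N_QUESTIONS = 100      # exam size from the Figure 2 caption
P_CHANCE = 0.25        # ASSUMPTION: four answer options per question
observed_correct = 29  # hypothetical score, not a result from the paper

p_value = binomial_tail(observed_correct, N_QUESTIONS, P_CHANCE)
print(f"P(score >= {observed_correct} by guessing) = {p_value:.3f}")
```

A large tail probability means the score is easy to reach by guessing alone, so it gives no evidence that the model is doing better than chance.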

This review was written by AI and checked by a human editor.