QUICK REVIEW

[Paper Review] The Effect of Sampling Temperature on Problem Solving in Large Language Models

Matthew Renze, Erhan Guven | arXiv (Cornell University) | Feb 7, 2024

Natural Language Processing Techniques | Cited by 14

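ONE-SENTENCE SUMMARY

This study empirically tests how sampling temperature (0.0 to 1.6) affects LLM problem solving across models and prompts, and finds that changes between 0.0 and 1.0 have no statistically significant effect on accuracy on MCQA tasks.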

ABSTRACT

In this research study, we empirically investigate the effect of sampling temperature on the performance of Large Language Models (LLMs) on various problem-solving tasks. We created a multiple-choice question-and-answer (MCQA) exam by randomly sampling problems from standard LLM benchmarks. Then, we used nine popular LLMs with five prompt-engineering techniques to solve the MCQA problems while increasing the sampling temperature from 0.0 to 1.6. Despite anecdotal reports to the contrary, our empirical results indicate that changes in temperature from 0.0 to 1.0 do not have a statistically significant impact on LLM performance for problem-solving tasks. In addition, these results appear to generalize across LLMs, prompt-engineering techniques, and problem domains. All code, data, and supplemental materials are available on GitHub at: https://github.com/matthewrenze/jhu-llm-temperature

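MOTIVATION AND OBJECTIVES

Establish the need to understand which sampling temperature is optimal for LLM problem solving.
Assess whether temperature changes affect problem-solving accuracy across multiple domains.
Compare performance across diverse LLMs and prompt-engineering techniques.
Provide empirical evidence to guide prompt-engineering best practice and to replace earlier anecdotal claims.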

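PROPOSED METHOD

Build a multi-domain MCQA exam by randomly sampling problems from standard LLM benchmarks.
Evaluate the LLMs with five prompt-engineering techniques. The abstract reports nine models; the four named in this review are GPT-3.5, GPT-4, Llama 2 7B, and Llama 2 70B.
Vary the sampling temperature from 0.0 to 1.6 during inference.
Use accuracy as the primary metric and compute several text-similarity metrics.
Assess the statistical significance of temperature effects with the Kruskal-Wallis test at a significance level of α = 0.05.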
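
The significance test in the last step can be sketched with the standard library alone. This is a minimal illustration of the Kruskal-Wallis H test, not the authors' analysis code: the function names (`average_ranks`, `chi2_sf`, `kruskal_wallis`) are hypothetical, and the accuracy values are made-up numbers chosen only to exercise the code, not results from the paper.

```python
import math
from itertools import chain


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square distribution with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    total = math.erfc(math.sqrt(half))
    term = math.sqrt(half) / math.gamma(1.5)
    for i in range(1, (df - 1) // 2 + 1):
        total += math.exp(-half) * term
        term *= half / (i + 0.5)
    return total


def kruskal_wallis(groups):
    """Return (H, p) for the Kruskal-Wallis test with tie correction."""
    pooled = list(chain.from_iterable(groups))
    n_total = len(pooled)
    ranks = average_ranks(pooled)
    weighted, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        weighted += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    counts = {}
    for value in pooled:
        counts[value] = counts.get(value, 0) + 1
    ties = sum(t ** 3 - t for t in counts.values())
    correction = 1 - ties / (n_total ** 3 - n_total)
    if correction == 0:  # every observation identical: no evidence of a difference
        return 0.0, 1.0
    h /= correction
    return h, chi2_sf(h, len(groups) - 1)


# Illustrative accuracies per temperature (hypothetical values, not paper data).
accuracy_by_temperature = {
    0.0: [0.62, 0.64, 0.61, 0.63, 0.65],
    0.5: [0.63, 0.60, 0.64, 0.62, 0.61],
    1.0: [0.61, 0.63, 0.60, 0.62, 0.64],
}
h_stat, p_value = kruskal_wallis(list(accuracy_by_temperature.values()))
ALPHA = 0.05
print(f"H = {h_stat:.3f}, p = {p_value:.3f}, significant = {p_value < ALPHA}")
```

Each group holds the accuracies from repeated runs at one temperature. A p-value at or above α = 0.05 means the test does not reject the hypothesis that all temperatures share the same accuracy distribution.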

Figure 1: Accuracy by temperature and prompt for GPT-3.5 with 1,000 questions. Performance remains relatively stable across all temperatures and prompts. However, there is a non-significant decrease in performance as a function of temperature.

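EXPERIMENTAL RESULTS

Research Questions

RQ1: Does raising or lowering the sampling temperature within the 0.0 to 1.0 range affect LLM problem-solving accuracy on MCQA tasks?
RQ2: Is the temperature effect consistent across models and prompt-engineering techniques?
RQ3: How does temperature affect output variability, as measured by text-similarity metrics?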
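
To make the third question concrete, output variability can be measured as the average similarity between repeated responses to the same question. The sketch below uses token-level Jaccard similarity as a stand-in. That choice is an assumption, since this review does not name the text-similarity metrics used. The function names and sample responses are hypothetical.

```python
from itertools import combinations


def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Share of distinct lowercase tokens that two texts have in common."""
    tokens_a = set(text_a.lower().split())
    tokens_b = set(text_b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty texts are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def mean_pairwise_similarity(responses):
    """Average Jaccard similarity over all unordered pairs of responses."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        return 1.0  # a single response has nothing to differ from
    return sum(jaccard_similarity(a, b) for a, b in pairs) / len(pairs)


# Hypothetical responses to one question, sampled several times at one temperature.
responses = [
    "The answer is B because the force doubles",
    "The answer is B since the force doubles",
    "I think the answer is C",
]
print(f"mean pairwise similarity: {mean_pairwise_similarity(responses):.3f}")
```

A lower mean pairwise similarity at a given temperature indicates more varied outputs.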

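Key Findings

On the 1,000-question exam, GPT-3.5's mean accuracy remains relatively stable across temperatures from 0.0 to 1.0.
The Kruskal-Wallis test shows no statistically significant difference in accuracy between temperatures for the prompts and models evaluated.
Higher temperatures increase text variability, reflected in lower text-similarity scores across prompts and domains.
Both Llama 2 models performed close to random guessing on the 100-question exam, suggesting model- or format-related limitations.
At temperatures above 1.0, accuracy declines and eventually approaches random guessing, consistent with increased randomness in sampling.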
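
The link between high temperature and randomness follows from how temperature is usually applied: logits are divided by the temperature before the softmax, so larger values flatten the token distribution. The sketch below shows that standard formulation with hypothetical logits for four answer options. Treating a temperature of 0 as greedy selection is a common convention assumed here, not a detail taken from the paper.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        # Assumed convention: temperature 0 puts all mass on the top logit.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return [1.0 if i == best else 0.0 for i in range(len(logits))]
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for answer options A-D, where A is the model's preferred answer.
option_logits = [2.0, 1.0, 0.0, 0.0]
for temperature in (0.0, 0.5, 1.0, 1.6):
    probs = softmax_with_temperature(option_logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

As the temperature rises, the probability of the preferred option falls and the other options gain mass. With four options, a fully flat distribution would pick each with probability 0.25.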

Figure 2: Accuracy by temperature and model. Performance remains stable across sampling temperatures for all four LLMs on the 100-question MCQA exam. However, both Llama 2 models performed no better than statistically random guesses.
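
One way to read "no better than statistically random guesses" is as a one-sided exact binomial test against the chance rate. The sketch below is an illustration, not the authors' procedure. The chance rate of 0.25 assumes four answer options per question, which this review does not state, and the score of 30 correct is a hypothetical input.

```python
from math import comb


def binomial_tail(correct, n_questions, p_chance):
    """P(X >= correct) when each of n_questions is guessed with success rate p_chance."""
    return sum(
        comb(n_questions, i) * p_chance ** i * (1 - p_chance) ** (n_questions - i)
        for i in range(correct, n_questions + 1)
    )


# Hypothetical score on a 100-question exam, with an assumed 4 options per question.
N_QUESTIONS = 100
P_CHANCE = 0.25
observed_correct = 30
p_value = binomial_tail(observed_correct, N_QUESTIONS, P_CHANCE)
print(f"P(score >= {observed_correct} by guessing) = {p_value:.3f}")
```

A large tail probability means the observed score is easy to reach by guessing, so the model cannot be distinguished from chance at that sample size.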

This review was generated by AI and reviewed by human editors.