QUICK REVIEW

[Paper Digest] The Good, The Bad, and The Greedy: Evaluation of LLMs Should Not Ignore Non-Determinism

Yifan Song, Guoyin Wang|arXiv (Cornell University)|Jul 15, 2024

Law, AI, and Intellectual Property|Cited by 7

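One-Sentence Summary

The paper compares greedy decoding and sampling across multiple benchmarks and models to study non-determinism, showing that greedy decoding usually wins, with notable exceptions, and highlighting the effects of alignment, model scale, and best-of-N strategies.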

ABSTRACT

Current evaluations of large language models (LLMs) often overlook non-determinism, typically focusing on a single output per example. This limits our understanding of LLM performance variability in real-world applications. Our study addresses this issue by exploring key questions about the performance differences between greedy decoding and sampling, identifying benchmarks' consistency regarding non-determinism, and examining unique model behaviors. Through extensive experiments, we observe that greedy decoding generally outperforms sampling methods for most evaluated tasks. We also observe consistent performance across different LLM sizes and alignment methods, noting that alignment can reduce sampling variance. Moreover, our best-of-N sampling approach demonstrates that smaller LLMs can match or surpass larger models such as GPT-4-Turbo, highlighting the untapped potential of smaller LLMs. This research shows the importance of considering non-determinism in LLM evaluations and provides insights for future LLM development and evaluation.

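Motivation and Objectives

Motivation: evaluate the non-determinism in LLM outputs rather than relying on a single deterministic result.
Characterize when greedy decoding outperforms sampling across diverse benchmarks, and when it does not.
Assess whether the effects of non-determinism are consistent across model sizes and alignment methods.
Examine how factors such as scale, alignment, temperature, and repetition penalty affect non-deterministic generation.
Demonstrate the potential of best-of-N sampling to unlock the capabilities of smaller LLMs.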

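Proposed Method

Compare greedy decoding and nucleus sampling on seven benchmarks: AlpacaEval 2, Arena-Hard, WildBench v2, MixEval, MMLU-Redux, GSM8K, and HumanEval.
Evaluate a range of open-weight LLMs alongside a proprietary GPT-4-Turbo baseline.
Sample 16 completions for most benchmarks, 32 for MMLU-Redux, and 128 for GSM8K and HumanEval.
Study the effects of scaling, alignment methods (such as DPO, KTO, and SimPO), temperature, and repetition penalty.
Use best-of-N sampling with a reward model to rank responses and select the best one, benchmarked against an oracle upper bound.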
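
The greedy-versus-sampling comparison described in the list above can be sketched as a small summary over one greedy run and N sampled runs per benchmark. This is a minimal illustration, not the paper's evaluation code: the function name, the dictionary keys, and the scores are all hypothetical.

```python
from statistics import mean, pstdev


def summarize_nondeterminism(greedy_score, sampled_scores):
    """Compare one greedy run with N sampled runs on a single benchmark.

    All field names are illustrative assumptions, not the paper's notation.
    """
    if not sampled_scores:
        raise ValueError("need at least one sampled run")
    avg = mean(sampled_scores)
    return {
        "greedy": greedy_score,
        "sample_mean": avg,
        # Spread of the sampled runs (population standard deviation).
        "sample_std": pstdev(sampled_scores),
        # Gap between the luckiest and unluckiest sampled run.
        "best_minus_worst": max(sampled_scores) - min(sampled_scores),
        # Positive means greedy beat the average sampled run.
        "greedy_minus_mean": greedy_score - avg,
    }


# Hypothetical scores for illustration only; these are not results from the paper.
report = summarize_nondeterminism(70.0, [66.0, 68.0, 70.0, 72.0])
print(report)
```

A single-output evaluation would report only one of these numbers; the spread and the best-to-worst gap are what a multi-sample protocol adds.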

Figure 1: Alignment effects on non-determinism.

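Experimental Results

Research Questions

RQ1: How does the performance gap between greedy decoding and sampling vary across benchmarks and models?
RQ2: When does greedy decoding outperform sampling, when does the reverse hold, and why?
RQ3: Which benchmarks are the most and least consistent with respect to non-determinism?
RQ4: Do any models show distinctive non-determinism patterns across tasks?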

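Key Findings

Greedy decoding generally outperforms sampling on most benchmarks, although the ranking varies with configuration.
AlpacaEval is an exception: sampling achieves a higher win rate there.
Benchmarks with a constrained output space (such as MixEval and MMLU) are more stable, whereas math and coding tasks (GSM8K, HumanEval) are more affected by sampling variance.
Results are consistent across model sizes and families; alignment methods can reduce sampling variance on many tasks.
Best-of-N sampling with a reward model lets smaller LLMs match or even surpass GPT-4-Turbo on several tasks, and oracle best-of-N shows the upper-bound potential.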
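
The difference between reward-model best-of-N and the oracle upper bound mentioned above can be sketched as follows. This is a toy illustration under stated assumptions: `toy_reward` is a hypothetical stand-in for a learned reward model (it simply prefers longer answers), and the candidate strings and correctness check are made up.

```python
def best_of_n(candidates, reward_fn):
    """Return the candidate that the reward function scores highest."""
    return max(candidates, key=reward_fn)


def oracle_best_of_n(candidates, is_correct):
    """Upper bound: the example counts as solved if any candidate is correct."""
    return any(is_correct(c) for c in candidates)


def toy_reward(answer):
    # Hypothetical stand-in for a reward model: longer answers score higher.
    return len(answer)


def contains_42(answer):
    # Hypothetical ground-truth check for this toy example.
    return "42" in answer


# Made-up candidate responses standing in for N sampled completions.
candidates = ["42", "The answer is 42.", "The answer is clearly 41!"]

picked = best_of_n(candidates, toy_reward)
solvable = oracle_best_of_n(candidates, contains_42)

print("reward-model pick:", picked)
print("pick is correct:", contains_42(picked))
print("oracle finds a correct sample:", solvable)
```

Here the imperfect reward picks a wrong answer even though a correct one was sampled, which is why an oracle selection marks a ceiling that a real reward model may not reach.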

Figure 2: (a) Temperature effects on non-determinism. (b) Repetition penalty effects on generation. We compare performance of Llama-3-8B-Instruct with different generation parameters.
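
To make the temperature knob in Figure 2(a) concrete, here is a minimal sketch of greedy selection next to temperature-scaled nucleus (top-p) sampling over a single step's logits. It uses only the Python standard library; the function names and the example logits are assumptions for illustration, and it does not reproduce the paper's generation setup or its repetition penalty.

```python
import math
import random


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; lower temperature sharpens the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_pick(logits):
    """Greedy decoding: always take the highest-scoring token index."""
    return max(range(len(logits)), key=lambda i: logits[i])


def nucleus_sample(logits, temperature=1.0, top_p=1.0, rng=random):
    """Sample from the smallest set of tokens whose probability mass reaches top_p."""
    probs = softmax_with_temperature(logits, temperature)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    # random.choices renormalizes the weights of the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


# Made-up logits for a three-token vocabulary.
logits = [2.0, 1.0, 0.1]
sharp = softmax_with_temperature(logits, 0.5)
flat = softmax_with_temperature(logits, 2.0)
print(greedy_pick(logits), sharp, flat)
print(nucleus_sample(logits, temperature=1.0, top_p=0.9, rng=random.Random(0)))
```

As the temperature drops, probability mass concentrates on the top token and sampling behaves more like greedy decoding; raising it flattens the distribution and increases run-to-run variation.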


This digest was generated by AI and reviewed by human editors.