QUICK REVIEW

[Paper Review] The Larger the Better? Improved LLM Code-Generation via Budget Reallocation

Michael Hassid, Tal Remez|arXiv (Cornell University)|Mar 31, 2024

Digital Rights Management and Security | Cited by 5

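One-sentence summary

Under a fixed compute budget, generating multiple outputs from a smaller LLM can outperform a single output from a larger model on code-generation tasks; without unit tests, however, ranking-based selection among the smaller model's candidates is less effective than simply using the larger model.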

ABSTRACT

It is a common belief that large language models (LLMs) are better than smaller-sized ones. However, larger models also require significantly more time and compute during inference. This begs the question: what happens when both models operate under the same budget? (e.g., compute, run-time). To address this question, we analyze code generation LLMs of various sizes and make comparisons such as running a 70B model once vs. generating five outputs from a 13B model. We consider a standard unit-test setup, which can be used to select the correct output from the smaller model. Our findings reveal that the repeated use of smaller models can yield consistent improvements, with gains of up to 15% across five tasks. On the other hand, in scenarios where unit-tests are unavailable, a ranking-based selection of candidates from the smaller model falls short of the performance of a single output from larger ones. Our results highlight the potential of using smaller models instead of larger ones, and the importance of studying approaches for ranking LLM outputs.
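The "70B once vs. five outputs from a 13B model" comparison in the abstract can be reproduced with simple arithmetic. The sketch below assumes that the cost of one generated output scales linearly with parameter count; this is a common rule of thumb and an assumption here, not the paper's exact cost model. The function name `outputs_within_budget` is hypothetical.

```python
# Assumption: generating one output costs an amount proportional to the
# model's parameter count, so sizes in billions serve as cost units.
MODEL_SIZES_B = {"7B": 7, "13B": 13, "34B": 34, "70B": 70}


def outputs_within_budget(small_b: float, large_b: float, large_outputs: int = 1) -> int:
    """Outputs the smaller model can generate for the cost of
    `large_outputs` outputs from the larger model (rounded down)."""
    budget = large_b * large_outputs
    return int(budget // small_b)


if __name__ == "__main__":
    # Budget = one output from the 70B model.
    for name, size in MODEL_SIZES_B.items():
        print(name, outputs_within_budget(size, MODEL_SIZES_B["70B"]))
```

Under this assumption the 13B model gets five outputs for the cost of one 70B output, which matches the example given in the abstract.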

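Research motivation and goals

Investigate whether, under a fixed compute budget, code generation is better served by a larger model producing a single output or by a smaller model producing multiple outputs.
Quantify performance across Code Llama model sizes, with the compute budget specified in FLOPs and in wall-time.
Evaluate how effective ranking-based selection is when unit tests are unavailable.
Release public data to support research on budget-aware use of code-generation models.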

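Proposed method

Adapt pass@k evaluation to compare models of different sizes under a fixed compute budget by letting each model generate as many outputs as the budget allows.
Define pass@flops and pass@time, which set the number of outputs k from a FLOPs or time constraint (Equations 2 and 3).
Evaluate Code Llama models (7B, 13B, 34B, 70B) on the HumanEval, MBPP, and APPS benchmarks, with specified prompting strategies and decoding settings.
Compare greedy decoding with sampling, and report pass@k for k = floor(n/2) for robustness (following Chen et al., 2021).
Examine ranking-based output selection using negative log-likelihood as the scoring criterion (Equations 4–6), and evaluate using an LLM as the ranker (Equation 7).
Release a large set of generated outputs (Code Llama 7B) to support research.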
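The budgeted pass@k idea above can be sketched as follows. `pass_at_k` is the standard unbiased estimator from Chen et al. (2021), 1 - C(n-c, k)/C(n, k). The helpers `k_for_budget` and `pass_at_budget` are hypothetical names, and deriving k as floor(budget / cost per output) is an assumption about the general shape of pass@flops and pass@time; the paper's Equations 2 and 3 are not reproduced in this review.

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator (Chen et al., 2021).

    n: samples generated per problem, c: samples that pass the tests,
    k: number of samples the user is allowed to draw.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


def k_for_budget(budget: float, cost_per_output: float) -> int:
    """Hypothetical helper: how many outputs fit within the budget."""
    return int(budget // cost_per_output)


def pass_at_budget(n: int, c: int, budget: float, cost_per_output: float) -> float:
    """pass@k with k derived from a budget (FLOPs or seconds).

    Assumption: k is capped at n because the estimator needs k <= n.
    """
    k = min(k_for_budget(budget, cost_per_output), n)
    if k == 0:
        return 0.0  # the budget does not cover even one output
    return pass_at_k(n, c, k)


if __name__ == "__main__":
    # Toy counts for illustration only, not results from the paper.
    print(pass_at_budget(n=10, c=3, budget=70, cost_per_output=13))  # small model, k=5
    print(pass_at_budget(n=10, c=5, budget=70, cost_per_output=70))  # large model, k=1
```

With these toy counts, a model that solves a problem in only 3 of 10 samples still scores higher at k = 5 than a model that solves it in 5 of 10 samples at k = 1, which is the effect the budgeted comparison is designed to expose.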

Figure 1: Different ways to improve LLM performance by increasing compute budget. Top: the standard approach of increasing model size, while generating a single output. Bottom: our approach: using a small model to generate multiple outputs, and select the best one.
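The "select the best one" step in the bottom half of Figure 1 can be illustrated for the unit-test setting. This is a minimal sketch, not the authors' evaluation harness: `select_with_unit_tests` and `run_add_tests` are hypothetical names, the toy task (a function `add`) is invented for illustration, and `exec` is used without sandboxing, which a real harness would not do.

```python
from typing import Callable, Iterable, Optional


def select_with_unit_tests(
    candidates: Iterable[str], run_tests: Callable[[str], bool]
) -> Optional[str]:
    """Return the first candidate program that passes the tests, else None."""
    for code in candidates:
        if run_tests(code):
            return code
    return None


def run_add_tests(code: str) -> bool:
    """Toy test harness (assumption): the candidate must define add(a, b)."""
    namespace: dict = {}
    try:
        # Unsafe for untrusted code; shown only to keep the sketch short.
        exec(code, namespace)
        return namespace["add"](2, 3) == 5 and namespace["add"](-1, 1) == 0
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False


if __name__ == "__main__":
    # Stand-ins for several samples drawn from a small model.
    samples = [
        "def add(a, b):\n    return a - b",
        "def add(a, b) return a + b",  # syntax error
        "def add(a, b):\n    return a + b",
    ]
    print(select_with_unit_tests(samples, run_add_tests))
```

When tests like these are available, any one passing sample is enough, which is why extra samples from a smaller model translate directly into a higher success rate.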

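Experimental results

Research questions

RQ1: Under a fixed compute budget, can smaller models outperform larger models on code-generation tasks?
RQ2: How do FLOPs-based and wall-time-based budgets affect the relative performance of different model sizes?
RQ3: Under a fixed budget and without unit tests, does ranking-based selection approach the performance of larger models?
RQ4: How effective is a larger model as a ranker for a smaller model's outputs?
RQ5: Which data and strategies best support budget-aware deployment of code-generation models?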

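Key findings

On HumanEval and MBPP, the smaller models (7B, 13B) outperform the larger ones (34B, 70B) at every compute budget, with gains of up to 15%.
On APPS, the 13B model performs best at most budgets, leading the larger models by roughly 5% on the hardest, competition-level split.
Smaller models can match or exceed the performance of larger models while using markedly less compute (for example, 7B and 13B reach a target score sooner).
Ranking-based selection improves as the budget and the size of the ranker grow, but still trails a single output from a larger model at the same budget.
Using an LLM as a ranker over a smaller model's outputs improves performance, but at a fixed budget greedy decoding with the larger model remains better.
The authors release more than 1 million Code Llama 7B outputs for HumanEval and MBPP to support research.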
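The ranking-based selection discussed in these findings can be sketched as picking the candidate with the lowest negative log-likelihood. The version below length-normalizes the NLL, which is one plausible scoring choice and an assumption here; the paper's exact variants (Equations 4–6) are not reproduced in this review. The function names and the toy log-probabilities are hypothetical.

```python
from typing import List, Sequence, Tuple

# A candidate is (generated text, per-token log-probabilities from the ranker).
Candidate = Tuple[str, Sequence[float]]


def mean_nll(token_logprobs: Sequence[float]) -> float:
    """Length-normalized negative log-likelihood (lower is better)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return -sum(token_logprobs) / len(token_logprobs)


def rank_by_nll(candidates: Sequence[Candidate]) -> List[Candidate]:
    """Sort candidates from most to least likely under the ranker."""
    return sorted(candidates, key=lambda cand: mean_nll(cand[1]))


def select_best(candidates: Sequence[Candidate]) -> str:
    """Return the text of the top-ranked candidate."""
    return rank_by_nll(candidates)[0][0]


if __name__ == "__main__":
    # Toy log-probabilities, invented for illustration.
    pool = [
        ("return a - b", [-1.0, -0.5, -0.3]),
        ("return a + b", [-0.1, -0.2]),
    ]
    print(select_best(pool))
```

Likelihood is only a proxy for correctness, which is consistent with the finding above that ranked selection still trails a single output from a larger model at the same budget.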


This review was generated by AI and reviewed by a human editor.