QUICK REVIEW

[Paper Review] Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Bradley Brown, Jordan Juravsky|arXiv (Cornell University)|Jul 31, 2024

Neural Networks and Applications | Cited by 9

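One-Sentence Summary

This paper studies repeated sampling as a way to scale inference compute for LLMs, shows coverage gains across a range of tasks and models, and analyzes cost-effectiveness and the challenges of verification.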

ABSTRACT

Scaling the amount of compute used to train language models has dramatically improved their capabilities. However, when it comes to inference, we often limit models to making only one attempt at a problem. Here, we explore inference compute as another axis for scaling, using the simple technique of repeatedly sampling candidate solutions from a model. Across multiple tasks and models, we observe that coverage -- the fraction of problems that are solved by any generated sample -- scales with the number of samples over four orders of magnitude. Interestingly, the relationship between coverage and the number of samples is often log-linear and can be modelled with an exponentiated power law, suggesting the existence of inference-time scaling laws. In domains like coding and formal proofs, where answers can be automatically verified, these increases in coverage directly translate into improved performance. When we apply repeated sampling to SWE-bench Lite, the fraction of issues solved with DeepSeek-Coder-V2-Instruct increases from 15.9% with one sample to 56% with 250 samples, outperforming the single-sample state-of-the-art of 43%. In domains without automatic verifiers, we find that common methods for picking from a sample collection (majority voting and reward models) plateau beyond several hundred samples and fail to fully scale with the sample budget.

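Motivation and Objectives

Treat repeated sampling as a scalable axis of inference-time compute, going beyond a single attempt per problem.
Quantify how coverage (the fraction of problems solved by at least one sample) improves as the sample budget grows, across multiple tasks and model families.
Assess the cost implications and model-selection trade-offs of drawing many samples rather than relying on a single larger model.
Examine the limits of precision, meaning the verification methods used to pick a correct sample, and identify directions for better verifiers.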

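Proposed Method

Generate many candidate solutions per problem by sampling from the model at a positive temperature.
Where possible, use a domain-specific verifier (unit tests, proof checkers) to select the final answer.
Define coverage (pass@k) as the fraction of problems solved by any sample, and compute it with the unbiased estimator to reduce variance.
Model log(coverage) as a power law in the number of samples k, which gives the exponentiated power law c ≈ exp(a k^{-b}).
Compare cost efficiency by converting FLOPs into inference cost, and contrast many samples from a weaker model with few samples from a stronger one.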
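
The coverage metric above can be sketched in a few lines. This summary does not spell out the estimator, so the sketch assumes the standard unbiased pass@k form, 1 - C(n-c, k) / C(n, k), where n samples are drawn for a problem and c of them are correct. The function names `pass_at_k` and `coverage` are illustrative, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: total samples drawn, c: samples that were correct, k: budget to evaluate.
    Assumes the standard estimator 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n:
        raise ValueError("c must lie between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must lie between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples are incorrect,
    # so every size-k subset contains a correct sample and the estimate is 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def coverage(correct_counts: list[int], n: int, k: int) -> float:
    """Average pass@k over problems, given the correct-sample count per problem."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


# Hypothetical counts for three problems with n = 10 samples each.
example = coverage([0, 3, 10], n=10, k=1)
```

Averaging over all size-k subsets of the n drawn samples, rather than drawing k samples once, is what lowers the variance of the estimate.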

Figure 1 : The repeated sampling procedure that we follow in this paper. 1) We generate many candidate solutions for a given problem by sampling from an LLM with a positive temperature. 2) We use a domain-specific verifier (ex. unit tests for code) to select a final answer from the generated samples
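
The two steps in Figure 1 can be sketched as a generic loop. This is a minimal illustration under stated assumptions: `sample` and `verify` are hypothetical callables standing in for an LLM call and a domain-specific verifier, and the toy integer-guessing problem exists only so that the block runs without a model.

```python
import random
from typing import Callable, Optional, TypeVar

P = TypeVar("P")  # problem type
S = TypeVar("S")  # candidate-solution type


def repeated_sampling(
    problem: P,
    sample: Callable[[P], S],
    verify: Callable[[P, S], bool],
    k: int,
) -> Optional[S]:
    """Draw k candidates, then return the first one the verifier accepts."""
    # Step 1: generate many candidate solutions (stochastic sampler assumed).
    candidates = [sample(problem) for _ in range(k)]
    # Step 2: let the verifier pick a final answer, if any candidate passes.
    for candidate in candidates:
        if verify(problem, candidate):
            return candidate
    return None


# Toy stand-ins (hypothetical): the "model" guesses an integer root at random,
# and the "verifier" checks the guess exactly, like a unit test would.
rng = random.Random(0)


def toy_sample(target: int) -> int:
    return rng.randint(1, 10)


def toy_verify(target: int, guess: int) -> bool:
    return guess * guess == target


result = repeated_sampling(49, toy_sample, toy_verify, k=200)
no_budget = repeated_sampling(49, toy_sample, toy_verify, k=0)
```

With an exact verifier, any single correct sample is enough, which is why coverage translates directly into solve rate in domains like coding and formal proofs.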

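Experimental Results

Research Questions

RQ1: Does increasing the number of samples per problem reliably improve coverage across tasks and model families?
RQ2: How does repeated sampling interact with model size and data domain (coding, proofs, math word problems) to affect coverage and cost?
RQ3: Can verifiers (majority voting, reward models) keep up with a growing sample budget, and where do they plateau?
RQ4: What are the practical limits and failure modes of verification under repeated sampling (e.g. flaky tests, false negatives)?
RQ5: Do the observed coverage curves follow scaling laws that could guide inference-time compute budgets?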

Key Findings

Model	Cost per attempt (USD)	Attempts	Total cost (USD)	Issues solved (%)	Relative total cost
DeepSeek-Coder-V2-Instruct	0.008	5	12	29.62	1x
GPT-4o	0.13	1	39	24.00	3.25x
Claude 3.5 Sonnet	0.17	1	51	26.70	4.25x
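
The dollar totals of 12, 39, and 51 in the table can be reproduced from the per-attempt costs. The sketch below assumes that totals are summed over the 300 issues of SWE-bench Lite; the issue count is not stated in the table, so treat `NUM_ISSUES` as an assumption, although it does recover the listed totals and relative costs.

```python
NUM_ISSUES = 300  # assumption: SWE-bench Lite size, not given in the table

# (model, cost per attempt in USD, attempts per issue), copied from the table.
rows = [
    ("DeepSeek-Coder-V2-Instruct", 0.008, 5),
    ("GPT-4o", 0.13, 1),
    ("Claude 3.5 Sonnet", 0.17, 1),
]


def total_cost(cost_per_attempt: float, attempts: int, num_issues: int = NUM_ISSUES) -> float:
    """Total spend when every issue gets the same number of attempts."""
    return cost_per_attempt * attempts * num_issues


totals = {name: total_cost(cost, attempts) for name, cost, attempts in rows}

# Express each total relative to the cheapest configuration (the 1x row).
baseline = totals["DeepSeek-Coder-V2-Instruct"]
relative = {name: total / baseline for name, total in totals.items()}

for name in totals:
    print(f"{name}: ${totals[name]:.2f} total, {relative[name]:.2f}x")
```

Under this assumption, five cheap attempts per issue cost less in total than one attempt from either of the more expensive models.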

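Across five tasks and multiple model families, more samples yield higher coverage; for example, DeepSeek-Coder-V2-Instruct reaches a 56% solve rate on SWE-bench Lite with 250 samples.
On CodeContests, Gemma-2B coverage rises from a pass@1 of 0.02% to 7.1% at 10k samples, an increase of more than 300-fold.
The relationship between coverage and sample budget is often log-linear and can be modeled with an exponentiated power law, which makes it possible to predict gains as sampling scales up.
Given many samples, a weaker model can beat a single attempt from a stronger one, which opens a cost-effective trade-off (e.g. five samples from DeepSeek-Coder-V2-Instruct solve more issues than a single sample from GPT-4o or Claude 3.5 Sonnet, at a lower total cost).
Sample-selection methods such as majority voting and reward-model scoring plateau at roughly 100 samples, which highlights the need for better verification where automatic verifiers are unavailable.
On math word problems, Llama-3 coverage exceeds 95% at 10k samples, but common sample-selection methods plateau, leaving a gap between coverage and final-answer accuracy.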
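
The exponentiated power law mentioned in the findings, c ≈ exp(a k^{-b}), can be fit with ordinary least squares after a change of variables. The sketch below is one simple way to do this, not necessarily the fitting procedure used in the paper. It assumes 0 < c < 1 (so a < 0), uses the identity log(-log c) = log(-a) - b log k, and runs on synthetic data generated from made-up parameters rather than on results from the paper.

```python
import math


def fit_exponentiated_power_law(ks: list[float], coverages: list[float]) -> tuple[float, float]:
    """Fit c = exp(a * k**(-b)) by linear regression of log(-log c) on log k.

    Requires every coverage value to lie strictly between 0 and 1.
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(-math.log(c)) for c in coverages]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    a = -math.exp(intercept)  # intercept equals log(-a)
    b = -slope                # slope equals -b
    return a, b


def predict_coverage(k: float, a: float, b: float) -> float:
    """Coverage predicted by the fitted law at sample budget k."""
    return math.exp(a * k ** (-b))


# Synthetic, noiseless data from hypothetical parameters (not paper results).
TRUE_A, TRUE_B = -3.0, 0.4
ks = [1, 10, 100, 1000, 10000]
coverages = [predict_coverage(k, TRUE_A, TRUE_B) for k in ks]

a_hat, b_hat = fit_exponentiated_power_law(ks, coverages)
```

Once a and b are estimated from a small budget, `predict_coverage` gives a rough forecast of coverage at a larger k, which is the practical use of an inference-time scaling law.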

Figure 2 : Across five tasks, we find that coverage (the fraction of problems solved by at least one generated sample) increases as we scale the number of samples. Notably, using repeated sampling, we are able to increase the solve rate of an open-source method from 15.9% to 56% on SWE-bench Lite.


This review was generated by AI and checked by a human editor.