QUICK REVIEW

[Paper Review] s1: Simple test-time scaling

Niklas Muennighoff, Zitong Yang|ArXiv.org|Jan 31, 2025

Fault Detection and Control Systems | Cited by 10

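One-Sentence Summary

The paper proposes a minimal approach to test-time scaling: fine-tune a model on 1,000 reasoning samples and apply budget forcing to control how long it thinks. The resulting model reaches reasoning performance comparable to OpenAI's o1-preview with high sample efficiency, and the code and data are open-source.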

ABSTRACT

Test-time scaling is a promising new approach to language modeling that uses extra test-time compute to improve performance. Recently, OpenAI's o1 model showed this capability but did not publicly share its methodology, leading to many replication efforts. We seek the simplest approach to achieve test-time scaling and strong reasoning performance. First, we curate a small dataset s1K of 1,000 questions paired with reasoning traces relying on three criteria we validate through ablations: difficulty, diversity, and quality. Second, we develop budget forcing to control test-time compute by forcefully terminating the model's thinking process or lengthening it by appending "Wait" multiple times to the model's generation when it tries to end. This can lead the model to double-check its answer, often fixing incorrect reasoning steps. After supervised finetuning the Qwen2.5-32B-Instruct language model on s1K and equipping it with budget forcing, our model s1-32B exceeds o1-preview on competition math questions by up to 27% (MATH and AIME24). Further, scaling s1-32B with budget forcing allows extrapolating beyond its performance without test-time intervention: from 50% to 57% on AIME24. Our model, data, and code are open-source at https://github.com/simplescaling/s1

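Motivation and Objectives

Explain why extra test-time compute can improve the reasoning performance of language models.
Create a small, high-quality reasoning dataset (s1K) of diverse, difficult questions paired with reasoning traces.
Demonstrate a simple test-time intervention (budget forcing) that controls thinking length and improves answers.
Show that fine-tuning on 1K samples yields strong, sample-efficient reasoning performance.
Release the data, model, and code as open source to support replication and further research.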

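Proposed Method

Curate a pool of 59K questions from diverse sources, selected for quality, difficulty, and diversity.
Filter the pool down to 1K high-quality, diverse, and difficult samples (s1K) using model-based difficulty assessment and domain diversity based on MSC (Mathematics Subject Classification) labels.
Run supervised fine-tuning (SFT) of Qwen2.5-32B-Instruct on s1K to obtain s1-32B; training takes 26 minutes on 16 H100 GPUs.
Apply budget forcing at test time to control thinking: (i) append the end-of-thinking token to stop thinking, or (ii) append "Wait" to extend thinking and encourage further exploration.
Evaluate sequential (budget forcing) and parallel (majority voting) test-time scaling and compare both against baselines.
Release the data, weights, and code as open source in the project repository.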
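
The two budget-forcing moves described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: `ThinkFn`, `toy_think`, and the `END_OF_THINKING` string are hypothetical stand-ins, and a real model would use the delimiter defined by its own chat template. The sketch assumes the `think` callable returns the text generated so far (without the end-of-thinking marker), the number of tokens it used, and whether the model tried to stop on its own.

```python
from typing import Callable, Tuple

# Hypothetical delimiter; the real marker depends on the model's chat template.
END_OF_THINKING = "<|end_of_thinking|>"

# Hypothetical signature:
# (context, max_new_tokens) -> (generated_text, tokens_used, stopped_on_its_own)
ThinkFn = Callable[[str, int], Tuple[str, int, bool]]


def budget_forced_thinking(
    think: ThinkFn,
    prompt: str,
    max_thinking_tokens: int,
    num_extensions: int = 0,
) -> str:
    """Return the prompt plus a thinking trace closed by END_OF_THINKING."""
    context = prompt
    remaining = max_thinking_tokens
    extensions_left = num_extensions

    while remaining > 0:
        text, used, stopped = think(context, remaining)
        context += text
        remaining -= max(used, 1)  # always make progress, even if used == 0
        if not stopped:
            break  # the budget ran out in the middle of the trace
        if extensions_left == 0:
            break  # the model wanted to stop and no extension is requested
        # (ii) Suppress the stop and append "Wait" to lengthen thinking.
        context += "Wait"
        extensions_left -= 1

    # (i) Close the thinking phase by appending the end-of-thinking marker.
    return context + END_OF_THINKING


def toy_think(context: str, max_new_tokens: int) -> Tuple[str, int, bool]:
    """Stand-in model: emits ' step' (2 tokens), then tries to stop."""
    if max_new_tokens < 2:
        return "", max_new_tokens, False  # cut off by the budget
    return " step", 2, True
```

With the toy model, `budget_forced_thinking(toy_think, "Q:", 10, num_extensions=2)` yields a trace containing two "Wait" insertions before the forced end marker. For simplicity the sketch does not count the appended "Wait" against the token budget.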

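Experimental Results

Research Questions

RQ1: Can a minimal, data-efficient method achieve strong test-time scaling on reasoning tasks?
RQ2: How do dataset quality, difficulty, and diversity affect the effectiveness of instruction fine-tuning for reasoning?
RQ3: Is sequential test-time scaling (budget forcing) more effective than parallel methods such as majority voting?
RQ4: How much does budget forcing improve performance on challenging reasoning benchmarks as test-time compute increases?
RQ5: How does the 1K-sample s1K dataset compare with larger data pools in data efficiency and competitive performance?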
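
For contrast with sequential scaling, the parallel baseline named in RQ3 (majority voting) can be sketched as follows. This is an illustrative sketch only: `sample_answer` is a hypothetical callable that returns one independently sampled final answer per call, and the whitespace normalization is an assumption rather than the paper's answer-matching procedure.

```python
from collections import Counter
from typing import Callable, List


def majority_vote(answers: List[str]) -> str:
    """Return the most frequent answer; ties go to the answer seen first."""
    if not answers:
        raise ValueError("need at least one answer")
    return Counter(answers).most_common(1)[0][0]


def parallel_scale(sample_answer: Callable[[str], str], question: str, k: int) -> str:
    """Draw k independent answers and aggregate them by majority vote."""
    answers = [sample_answer(question).strip() for _ in range(k)]
    return majority_vote(answers)
```

Here more compute means a larger `k`, whereas budget forcing spends the extra compute on a single, longer reasoning trace.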

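Key Findings

Supervised fine-tuning of a 32B model on 1K samples matches or exceeds o1-preview on MATH and AIME24 (the abstract reports a margin of up to 27% on competition math questions).
Budget forcing makes test-time compute controllable and improves reasoning by prompting the model to verify its work and explore longer.
s1-32B scales well at test time: performance keeps rising as more thinking tokens are allowed, until it reaches the point of diminishing returns.
The method is data-efficient: training on 1K samples outperforms most baselines while using far fewer examples than the larger data pools.
Balancing difficulty, diversity, and quality in data selection is essential; selecting samples at random or only by long reasoning traces generally underperforms the proposed three-criteria filter.
Trained on only 1K samples, s1-32B nearly matches Gemini 2.0 Thinking on AIME24 while remaining open-source.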
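
The three-criteria selection discussed above can be illustrated with a small sketch. It is not the authors' pipeline: the `Sample` fields, the boolean quality and difficulty flags, and the uniform sampling over domains are all simplifying assumptions made for illustration. The idea is to drop low-quality and easy samples first, then spread the remaining picks across domains.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Sample:
    question: str
    trace: str
    domain: str               # e.g. an MSC-style subject label (assumed)
    well_formatted: bool      # stand-in for a quality check (assumed)
    solved_by_baseline: bool  # stand-in for a model-based difficulty check (assumed)


def select_subset(pool: List[Sample], k: int, seed: int = 0) -> List[Sample]:
    """Pick up to k samples that pass quality and difficulty filters,
    drawing domains uniformly at random to encourage diversity."""
    rng = random.Random(seed)

    # Quality and difficulty: keep clean samples the baseline could not solve.
    by_domain: Dict[str, List[Sample]] = defaultdict(list)
    for s in pool:
        if s.well_formatted and not s.solved_by_baseline:
            by_domain[s.domain].append(s)
    for items in by_domain.values():
        rng.shuffle(items)

    # Diversity: pick a domain uniformly, then take one sample from it.
    selected: List[Sample] = []
    while len(selected) < k and by_domain:
        domain = rng.choice(sorted(by_domain))
        selected.append(by_domain[domain].pop())
        if not by_domain[domain]:
            del by_domain[domain]  # domain exhausted
    return selected
```

Sampling the domain first, rather than sampling from the whole pool, keeps a large domain from crowding out smaller ones.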

This review was generated by AI and checked by a human editor.