QUICK REVIEW

[Paper Review] Scaling Relationship on Learning Mathematical Reasoning with Large Language Models

Zheng Yuan, Hongyi Yuan|arXiv (Cornell University)|Aug 3, 2023

Topic Modeling | Cited by 8

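One-Sentence Summary

The paper analyzes how pre-training loss, supervised data amount, and augmented data amount affect the mathematical reasoning of supervised large language models, and proposes Rejection sampling Fine-Tuning (RFT), which augments the training data with diverse reasoning paths and yields significant gains over standard SFT.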

ABSTRACT

Mathematical reasoning is a challenging task for large language models (LLMs), while the scaling relationship of it with respect to LLM capacity is under-explored. In this paper, we investigate how the pre-training loss, supervised data amount, and augmented data amount influence the reasoning performances of a supervised LLM. We find that pre-training loss is a better indicator of the model's performance than the model's parameter count. We apply supervised fine-tuning (SFT) with different amounts of supervised data and empirically find a log-linear relation between data amount and model performance, and we find better models improve less with enlarged supervised datasets. To augment more data samples for improving model performances without any human effort, we propose to apply Rejection sampling Fine-Tuning (RFT). RFT uses supervised models to generate and collect correct reasoning paths as augmented fine-tuning datasets. We find with augmented samples containing more distinct reasoning paths, RFT improves mathematical reasoning performance more for LLMs. We also find RFT brings more improvement for less performant LLMs. Furthermore, we combine rejection samples from multiple models which push LLaMA-7B to an accuracy of 49.3% on GSM8K which outperforms the supervised fine-tuning (SFT) accuracy of 35.9% significantly.

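Research Motivation and Objectives

Understand how pre-training loss correlates with mathematical reasoning performance under supervised fine-tuning (SFT) and in-context learning (ICL).
Quantify how increasing the amount of supervised data affects reasoning accuracy across model sizes.
Study data augmentation via rejection sampling, which produces diverse reasoning paths, and its effect on performance.
Show the benefit of aggregating rejection-sampled data from multiple models, and compare against baselines on GSM8K.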

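Proposed Method

Evaluate SFT and ICL performance across multiple LLMs (LLaMA/LLaMA2 variants), using GSM8K as the mathematical reasoning benchmark.
Compare performance as a function of pre-training loss rather than model size or token count.
Analyze performance as the amount of supervised data varies to identify log-linear scaling with data.
Apply rejection sampling to generate multiple reasoning paths, keep those that reach the correct answer, and fine-tune the model on them (RFT).
Deduplicate and aggregate rejection-sampled data from multiple base models to study how diversity affects performance.
Compare against existing baselines on GSM8K (ICL, SFT, and RFT with single-model or multi-model samples).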
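
The rejection sampling and deduplication steps above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It assumes that reasoning paths use GSM8K-style annotations (`<<48/2=24>>` for calculations, `####` before the final answer) and that two paths count as duplicates when their equation lists match. The names `rejection_sample`, `equation_signature`, and `demo_sampler` are hypothetical, and the canned strings stand in for samples drawn from a fine-tuned model.

```python
import itertools
import re
from typing import Callable, Dict, List, Tuple

# Assumption: calculations are annotated GSM8K-style, e.g. <<48/2=24>>.
EQUATION_PATTERN = re.compile(r"<<(.+?)>>")


def extract_final_answer(path: str) -> str:
    """Return the text after the last '####' marker, or '' if it is missing."""
    if "####" not in path:
        return ""
    return path.rsplit("####", 1)[1].strip()


def equation_signature(path: str) -> Tuple[str, ...]:
    """Ordered equations of a path with spaces removed (assumed dedup key)."""
    return tuple(eq.replace(" ", "") for eq in EQUATION_PATTERN.findall(path))


def rejection_sample(
    question: str,
    gold_answer: str,
    sampler: Callable[[str], str],
    k: int,
) -> List[str]:
    """Draw k candidate paths, reject wrong answers, keep one path per signature."""
    kept: Dict[Tuple[str, ...], str] = {}
    for _ in range(k):
        path = sampler(question)
        if extract_final_answer(path) != gold_answer:
            continue  # reject: the final answer does not match the reference
        kept.setdefault(equation_signature(path), path)  # first path wins
    return list(kept.values())


# Hypothetical stand-in for sampling from a fine-tuned model.
canned = [
    "Half of 48 is <<48/2=24>>24. Total is <<48+24=72>>72. #### 72",
    "May: <<48/2=24>>24 clips. Sum: <<48+24=72>>72. #### 72",  # same equations
    "Total is <<48*1.5=72>>72. #### 72",                        # new path
    "Total is <<48+48=96>>96. #### 96",                         # wrong answer
]
_canned_iter = itertools.cycle(canned)


def demo_sampler(question: str) -> str:
    return next(_canned_iter)


paths = rejection_sample("How many clips in total?", "72", demo_sampler, k=4)
print(len(paths))  # number of distinct correct reasoning paths kept
```

To aggregate samples from several models under the same assumptions, the per-model outputs could be concatenated and passed through the same signature-based deduplication before fine-tuning.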

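Experimental Results

Research Questions

RQ1: How does pre-training loss correlate with an LLM's SFT and ICL performance on mathematical reasoning?
RQ2: What is the relationship between the amount of supervised data and model performance on mathematical reasoning tasks?
RQ3: Does Rejection sampling Fine-Tuning (RFT) improve mathematical reasoning, and how does its performance vary with the number of distinct reasoning paths?
RQ4: Does aggregating rejection-sampled data from multiple models bring additional gains over single-model RFT?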

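Key Findings

Pre-training loss is a better indicator of mathematical reasoning performance than parameter count; within the range studied, accuracy is approximately linear in pre-training loss, with a negative slope.
SFT performance grows log-linearly with the amount of supervised data, with diminishing returns for models that are better pre-trained.
RFT improves mathematical reasoning when the augmented data contains multiple distinct reasoning paths, and the gains are larger for weaker models.
Aggregating rejection-sampled data from multiple models yields higher accuracy than single-model RFT across several LLaMA/LLaMA2 variants (e.g., 49.3% for LLaMA-7B and 55.4% for LLaMA2-13B).
RFT costs far less than pre-training, but improving (that is, lowering) pre-training loss remains the fundamental way to scale mathematical reasoning ability.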
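
To make the log-linear finding concrete, the sketch below fits accuracy ≈ a + b · ln(N) by ordinary least squares, where N is the number of supervised examples. It is only an illustration. The data points are synthetic, generated from made-up coefficients rather than taken from the paper, and `fit_log_linear` is a hypothetical helper name.

```python
import math
from typing import List, Tuple


def fit_log_linear(sizes: List[int], accuracies: List[float]) -> Tuple[float, float]:
    """Least-squares fit of accuracy = a + b * ln(size); returns (a, b)."""
    xs = [math.log(n) for n in sizes]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(accuracies) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, accuracies))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Synthetic data only: generated from made-up coefficients, NOT paper results.
true_a, true_b = -20.0, 6.0
sizes = [500, 1000, 2000, 4000, 7000]
accuracies = [true_a + true_b * math.log(n) for n in sizes]

a, b = fit_log_linear(sizes, accuracies)

# Under a log-linear law, each doubling of the data adds a constant b * ln(2).
gain_per_doubling = b * math.log(2)
print(round(a, 3), round(b, 3), round(gain_per_doubling, 3))
```

Under this form, a smaller fitted slope b for a stronger base model would correspond to the smaller gain from enlarged supervised datasets that the findings describe.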

This review was generated by AI and reviewed by a human editor.