QUICK REVIEW

[Paper Review] Teaching Arithmetic to Small Transformers

Nayoung Lee, Kartik K. Sreenivasan|arXiv (Cornell University)|Jul 7, 2023

Topic Modeling | Cited by 8

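One-Sentence Summary

The paper shows that small decoder-only Transformers can learn arithmetic from scratch when the training data is carefully formatted: reversing the output and adding chain-of-thought scratchpads substantially improve sample efficiency, and the resulting generalization goes beyond what the standard matrix-completion intuition predicts.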

ABSTRACT

Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.

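Motivation and Objectives

Study how small Transformers trained from random initialization learn arithmetic operations.
Evaluate how data formatting and sampling affect arithmetic learning under next-token prediction.
Explore the role of chain-of-thought (CoT) data when training on arithmetic tasks.
Analyze how pretraining, model scale, and the mix of text and arithmetic data affect arithmetic learning and generalization.
Analyze length generalization and the limits of learning arithmetic as a mapping rather than as an algorithm.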

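Proposed Method

Train NanoGPT (6 layers, 384 hidden units, about 10.6 million parameters) from random initialization on arithmetic tasks.
Evaluate four data formats for addition: Plain, Reverse, Simplified Scratchpad, and Detailed Scratchpad.
Use structured sampling to balance the distribution of digits and carries across n-digit inputs.
Connect learning to low-rank matrix completion (LRMC) and analyze the phase transition in sample complexity.
Extend the experiments to chain-of-thought style data to assess its effect on learning speed and accuracy.
Compare against pretraining and fine-tuning setups with larger models to study the effects of scale and transfer.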
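
To make the Plain and Reverse formats concrete, here is a minimal sketch. The function names and the `$` delimiters around reversed samples are assumptions made for illustration, not confirmed details of the paper's data pipeline, and the two scratchpad formats are not reproduced here. The last function adds column by column from the right and emits each digit as soon as it is known, which suggests why the reversed ordering fits next-token prediction: the first output token depends only on the last input digits, and each carry is available before it is needed.

```python
def format_plain(a: int, b: int) -> str:
    """Plain format: the sum is written most significant digit first."""
    return f"{a}+{b}={a + b}"


def format_reverse(a: int, b: int) -> str:
    """Reverse format: the sum's digits are written least significant first.

    The '$' delimiters are an assumption made for this sketch.
    """
    return f"${a}+{b}={str(a + b)[::-1]}$"


def sum_digits_lsb_first(a: int, b: int) -> str:
    """Add column by column from the right, emitting each digit as it is computed."""
    out, carry = [], 0
    while a > 0 or b > 0 or carry > 0:
        # divmod splits the column total into the new carry and the output digit.
        carry, digit = divmod(a % 10 + b % 10 + carry, 10)
        out.append(str(digit))
        a //= 10
        b //= 10
    return "".join(out) or "0"


if __name__ == "__main__":
    print(format_plain(128, 367))
    print(format_reverse(128, 367))
    print(sum_digits_lsb_first(128, 367))
```

The digit order produced by `sum_digits_lsb_first` is exactly the order of the answer in the Reverse format, whereas the Plain format asks for the leading digit first, which can depend on carries from every lower column.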

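Experimental Results

Research Questions

RQ1: Can small Transformer models learn arithmetic from scratch through next-token prediction?
RQ2: How do data format and sampling affect the sample efficiency and accuracy of arithmetic learning?
RQ3: Does chain-of-thought style data further improve learning arithmetic tasks from scratch?
RQ4: What roles do model scale and pretraining play in acquiring arithmetic ability?
RQ5: How well does the learned arithmetic generalize to unseen numbers or to longer numbers?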
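
The sampling half of RQ2 can be pictured with a small sketch that balances training pairs by the number of carries in the addition. This is an illustration under assumptions, not the paper's sampler: the names `count_carries` and `sample_balanced` are hypothetical, rejection sampling is a choice made here for simplicity, and only carries are balanced (balancing over digits is left out).

```python
import random
from collections import defaultdict


def count_carries(a: int, b: int) -> int:
    """Count how many columns produce a carry when adding a and b in base 10."""
    carries, carry = 0, 0
    while a > 0 or b > 0:
        carry = 1 if a % 10 + b % 10 + carry >= 10 else 0
        carries += carry
        a //= 10
        b //= 10
    return carries


def sample_balanced(per_bucket: int, n_digits: int = 3, seed: int = 0):
    """Rejection-sample operand pairs until every carry count has per_bucket pairs.

    Hypothetical helper: operands are drawn uniformly from [0, 10**n_digits).
    """
    rng = random.Random(seed)
    buckets = defaultdict(list)
    limit = 10 ** n_digits
    while any(len(buckets[c]) < per_bucket for c in range(n_digits + 1)):
        a, b = rng.randrange(limit), rng.randrange(limit)
        c = count_carries(a, b)
        if len(buckets[c]) < per_bucket:
            buckets[c].append((a, b))
    return dict(buckets)


if __name__ == "__main__":
    for carries, pairs in sorted(sample_balanced(per_bucket=3).items()):
        print(carries, pairs)
```

Under uniform random sampling the carry counts are not equally frequent, so a sampler of this kind gives every carry pattern equal weight in the training set.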

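Key Findings

Plain addition data performs poorly; reversing the output substantially improves accuracy and reduces the amount of training data required.
As the amount of training data grows, addition learning shows a sharp phase transition, consistent with the low-rank matrix completion intuition.
The addition learned by NanoGPT generalizes to unseen numbers and to training sets from which some numbers are missing, beyond the limits of standard LRMC, indicating capabilities that go beyond simple matrix completion.
When learning addition from scratch, chain-of-thought data substantially improves sample efficiency and accuracy, and performance depends on how detailed the intermediate steps are.
For 3-digit addition, sampling that is balanced over digits and carries outperforms random sampling.
Transformer-based addition learning differs from LRMC, suggesting additional generalization mechanisms beyond matrix completion.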
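
One way to see the link to low-rank matrix completion is to write addition as a table M with M[i][j] = i + j. Every row is a combination of the all-ones vector and the vector (0, 1, ..., n-1), so the table has rank 2 however large it is, and observing a subset of its entries is a matrix-completion problem. The sketch below checks the rank with exact arithmetic; the helper names are hypothetical and the code illustrates the intuition only, not the paper's analysis.

```python
from fractions import Fraction


def addition_table(n: int) -> list[list[int]]:
    """Return the n x n table whose (i, j) entry is i + j."""
    return [[i + j for j in range(n)] for i in range(n)]


def matrix_rank(rows) -> int:
    """Rank via Gaussian elimination over exact fractions (no rounding error)."""
    m = [[Fraction(x) for x in row] for row in rows]
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate the column from every row below the pivot row.
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
    return rank


if __name__ == "__main__":
    for n in (2, 10, 100):
        print(n, matrix_rank(addition_table(n)))
```

A completion method can only fill in entries whose row and column each appear somewhere in the observed data, which is why generalization to numbers absent from training, as reported above, goes beyond this picture.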


This review was generated by AI and reviewed by a human editor.