QUICK REVIEW

[Paper Review] Teaching Arithmetic to Small Transformers

Nayoung Lee, Kartik K. Sreenivasan|arXiv (Cornell University)|Jul 7, 2023

Topic Modeling | Citations: 8

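One-Sentence Summary

Small decoder-only transformers can learn arithmetic from scratch when trained on carefully formatted data; output reversal and chain-of-thought (CoT) scratchpads markedly improve sample efficiency and generalization, beyond what the conventional matrix-completion intuition would predict.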

ABSTRACT

Large language models like GPT-4 exhibit emergent capabilities across general-purpose tasks, such as basic arithmetic, when trained on extensive text data, even though these tasks are not explicitly encoded by the unsupervised, next-token prediction objective. This study investigates how small transformers, trained from random initialization, can efficiently learn arithmetic operations such as addition, multiplication, and elementary functions like square root, using the next-token prediction objective. We first demonstrate that conventional training data is not the most effective for arithmetic learning, and simple formatting changes can significantly improve accuracy. This leads to sharp phase transitions as a function of training data scale, which, in some cases, can be explained through connections to low-rank matrix completion. Building on prior work, we then train on chain-of-thought style data that includes intermediate step results. Even in the complete absence of pretraining, this approach significantly and simultaneously improves accuracy, sample complexity, and convergence speed. We also study the interplay between arithmetic and text data during training and examine the effects of few-shot prompting, pretraining, and model scale. Additionally, we discuss length generalization challenges. Our work highlights the importance of high-quality, instructive data that considers the particular characteristics of the next-word prediction objective for rapidly eliciting arithmetic capabilities.

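Motivation and Objectives

Investigate whether small transformers trained from random initialization can learn arithmetic operations.
Evaluate how data format and sampling affect how efficiently and how accurately arithmetic is learned under next-token prediction.
Explore the role of chain-of-thought (CoT) data in training on arithmetic tasks.
Examine how pretraining, model scale, and the mixing of text and arithmetic data affect arithmetic learning and generalization.
Analyze length generalization and its limits when arithmetic is learned both as a mapping and as an algorithm.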

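Proposed Method

Train NanoGPT (6 layers, hidden size 384, about 10.6 million parameters) from random initialization on arithmetic tasks.
Evaluate four data formats for addition: Plain, Reverse, Simplified Scratchpad, and Detailed Scratchpad.
Use structured sampling that balances digits and carries across n-digit inputs.
Relate learning to low-rank matrix completion (LRMC) and analyze phase transitions in sample efficiency.
Extend the experiments to CoT-style data to assess its effect on learning speed and accuracy.
Compare against pretraining and fine-tuning setups with larger models to examine the effects of scale and transfer.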
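
As a rough illustration of the data preparation described above, the sketch below buckets random operand pairs by carry count and renders a pair in three of the four formats. The function names, the `$` delimiters, the `A->`/`C->` scratchpad layout, and the equal-size carry buckets are assumptions made for illustration; the paper's exact strings and sampling procedure may differ, and the Detailed Scratchpad format is omitted.

```python
import random


def count_carries(a: int, b: int) -> int:
    """Count the carry operations produced when adding a and b digit by digit."""
    carries, carry = 0, 0
    while a > 0 or b > 0:
        carry = (a % 10 + b % 10 + carry) // 10
        carries += carry
        a //= 10
        b //= 10
    return carries


def fmt_plain(a: int, b: int) -> str:
    """Plain format: the sum is written most-significant digit first."""
    return f"{a}+{b}={a + b}"


def fmt_reverse(a: int, b: int) -> str:
    """Reverse format: the sum is written least-significant digit first.

    The '$' delimiters are an assumed convention for marking sample boundaries.
    """
    return f"${a}+{b}={str(a + b)[::-1]}$"


def fmt_scratchpad(a: int, b: int) -> str:
    """Scratchpad-style format (assumed layout): one line per digit position,
    giving the output digit (A) and the outgoing carry (C)."""
    lines = [f"Input: {a}+{b}", "Target:"]
    x, y, carry = a, b, 0
    while x > 0 or y > 0:
        total = x % 10 + y % 10 + carry
        digit, carry = total % 10, total // 10
        lines.append(f"A->{digit}, C->{carry}")
        x //= 10
        y //= 10
    lines.append(f"Result: {a + b}")
    return "\n".join(lines)


def balanced_sample(per_bucket: int, n_digits: int = 3, seed: int = 0) -> dict:
    """Rejection-sample operand pairs until every carry count 0..n_digits
    has exactly per_bucket examples (an assumed form of balancing)."""
    rng = random.Random(seed)
    buckets = {k: [] for k in range(n_digits + 1)}
    hi = 10 ** n_digits - 1
    while any(len(v) < per_bucket for v in buckets.values()):
        a, b = rng.randint(0, hi), rng.randint(0, hi)
        k = count_carries(a, b)
        if len(buckets[k]) < per_bucket:
            buckets[k].append((a, b))
    return buckets


if __name__ == "__main__":
    print(fmt_plain(128, 367))
    print(fmt_reverse(128, 367))
    print(fmt_scratchpad(128, 367))
    for k, pairs in balanced_sample(2).items():
        print(k, pairs)
```

Here `fmt_plain(128, 367)` gives `128+367=495` and `fmt_reverse(128, 367)` gives `$128+367=594$`. In the reversed form the first digit the model must emit is the units digit, which depends only on the units digits of the operands, so each later digit can build on a carry that has already been determined.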

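Experimental Results

Research Questions

RQ1: Can small transformer models learn arithmetic from scratch using next-token prediction?
RQ2: How do data format and sampling affect the sample efficiency and accuracy of arithmetic learning?
RQ3: Does CoT-style data further improve arithmetic learning, particularly when training from scratch?
RQ4: What roles do model scale and pretraining play in acquiring arithmetic ability?
RQ5: How well do the learned arithmetic abilities generalize to unseen numbers and to longer digit lengths?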

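Key Findings

Plain addition data yields poor performance; reversing the output substantially improves accuracy and reduces the amount of training data required.
As the amount of training data grows, addition learning shows a sharp phase transition, consistent with the low-rank matrix completion (LRMC) intuition.
The addition learned by NanoGPT generalizes to unseen numbers and to cases where some digits are missing from the training set, which goes beyond the limits of standard LRMC and suggests more than mere matrix completion.
CoT data markedly improves sample efficiency and accuracy for learning addition even from scratch, and the gain depends on how fine-grained the intermediate steps are.
On the 3-digit addition task, sampling that balances digits and carries improves performance over random sampling.
Overall, addition learning with transformers differs from LRMC, pointing to generalization mechanisms that matrix completion alone does not explain.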
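
The matrix-completion connection can be checked on a small case. If addition is viewed as filling in a table whose entry (i, j) is i + j, that table has rank 2, because every row is a combination of the all-ones vector and the vector (0, 1, 2, ...). The sketch below verifies this with exact arithmetic; the helper names and the table size of 20 are illustrative choices, not taken from the paper.

```python
from fractions import Fraction


def matrix_rank(rows) -> int:
    """Rank of a matrix via Gaussian elimination over exact fractions."""
    m = [[Fraction(v) for v in row] for row in rows]
    n_rows, n_cols = len(m), len(m[0])
    rank = 0
    for col in range(n_cols):
        # Find a row at or below the current rank with a nonzero entry in this column.
        pivot = next((r for r in range(rank, n_rows) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        # Eliminate this column from all rows below the pivot row.
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            m[r] = [x - factor * y for x, y in zip(m[r], m[rank])]
        rank += 1
        if rank == n_rows:
            break
    return rank


def addition_table(n: int):
    """n x n table whose (i, j) entry is i + j."""
    return [[i + j for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    print(matrix_rank(addition_table(20)))
```

A low-rank table can in principle be recovered from a fraction of its entries, which is the intuition behind the phase transition in training-set size. Matrix completion cannot fill in a row or column with no observed entries, so generalization to numbers never seen in training is the part this view does not account for.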


This review was generated by AI and checked by a human editor.