QUICK REVIEW

[Paper Review] TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Ronen Eldan, Yuanzhi Li|arXiv (Cornell University)|May 12, 2023

Topic Modeling|Cited by: 46

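One-line summary

This paper introduces TinyStories, a dataset of short stories written in child-level vocabulary and generated by GPT-3.5/4, for training and evaluating very small language models (under 10M parameters). It also proposes a new GPT-4-based evaluation paradigm (GPT-Eval) that grades grammar, creativity, consistency, and instruction-following.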

ABSTRACT

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that typical 3- to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.

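Motivation and objectives

Introduce TinyStories, a synthetic dataset of short stories that use vocabulary a 3- to 4-year-old can understand.
Demonstrate that very small models (under 10M parameters) can generate fluent, consistent stories and show signs of reasoning.
Propose a GPT-4-based evaluation paradigm (GPT-Eval) for multidimensional model evaluation.
Show that TinyStories enables efficient training (typically under one day on a single GPU) and yields interpretable models with observable attention and activation patterns.
Provide insight into the emergence of language capabilities in LMs and into potential benefits for low-resource or specialized domains.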

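Proposed method

Create TinyStories by instructing GPT-3.5/4 to generate stories with a restricted vocabulary (about 1,500 basic words) and prompts that contain randomly chosen words and story features.
Provide TinyStories-Instruct, a variant in which each story is preceded by a set of instructions (words, a sentence, features, a summary).
Develop GPT-Eval, which uses GPT-4 to grade model completions of a given story beginning on grammar, creativity, and consistency, enabling multidimensional scoring.
Train very small models (1M to 35M parameters, 1 to 8 layers) on TinyStories using a single V100 GPU, with a window size of 256 tokens, a context length of 512, an embedding dimension reduced to 256, and a tokenizer restricted to the 10K most frequent tokens.
Analyze attention heads and MLP activations to interpret model behavior and the generation process.
Compare outputs against larger models (e.g., GPT-2 XL) to show that capabilities emerge at small scale.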
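The first step above, prompting with randomly chosen words and story features, can be sketched as follows. This is only an illustration under stated assumptions: the word pools, the feature names, and the prompt wording are hypothetical placeholders, not the lists or the prompt actually used to build TinyStories, and the call to the generating model is deliberately left out.

```python
import random

# Hypothetical word pools and feature names, for illustration only.
# The actual lists used to build TinyStories are not given in this review.
VERBS = ["jump", "find", "share", "sing"]
NOUNS = ["dog", "ball", "tree", "boat"]
ADJECTIVES = ["happy", "tiny", "brave", "red"]
FEATURES = ["dialogue", "a plot twist", "a moral value", "a bad ending"]


def build_story_prompt(rng: random.Random, max_features: int = 2) -> str:
    """Assemble one story-generation prompt from randomly drawn words and features."""
    verb = rng.choice(VERBS)
    noun = rng.choice(NOUNS)
    adjective = rng.choice(ADJECTIVES)

    # Draw between 0 and max_features distinct story features.
    features = rng.sample(FEATURES, rng.randint(0, max_features))

    # Illustrative wording (an assumption, not the paper's prompt).
    prompt = (
        "Write a short story that only uses very simple words "
        "a 3 year old child would likely understand. "
        f'The story should use the verb "{verb}", the noun "{noun}" '
        f'and the adjective "{adjective}".'
    )
    if features:
        prompt += " The story should include: " + ", ".join(features) + "."
    return prompt


if __name__ == "__main__":
    # A seeded generator makes the sampled prompt reproducible.
    print(build_story_prompt(random.Random(0)))
```

Drawing the required words at random for every prompt is one plausible way to push a generator toward varied stories instead of near-duplicates; passing in a seeded `random.Random` keeps each prompt reproducible.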

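Experimental results

Research questions

RQ1: What are the minimal model size and architecture needed to generate consistent, fluent English?
RQ2: Can very small models trained on TinyStories acquire factual knowledge and basic reasoning?
RQ3: Does the TinyStories framework reveal interpretable internal mechanisms (attention and MLP activations) in small models?
RQ4: How effective is the GPT-4-driven evaluation framework (GPT-Eval) at grading grammar, creativity, consistency, and instruction-following?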
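As a rough illustration of the multidimensional scoring that RQ4 refers to, the sketch below parses per-dimension grades out of grader replies and averages them over several completions. It covers only the three dimensions named in the abstract (grammar, creativity, consistency). The reply format (`Grammar: 8/10, ...`) and the function names are assumptions made for this example; the review does not specify how GPT-4's grades are formatted, and no API call is shown.

```python
import re
from statistics import mean

# Dimensions named in the abstract; the "name: score/10" reply format is hypothetical.
DIMENSIONS = ("grammar", "creativity", "consistency")
_SCORE_RE = re.compile(
    r"(grammar|creativity|consistency)\s*:\s*(\d+)\s*/\s*10", re.IGNORECASE
)


def parse_grades(reply: str) -> dict[str, int]:
    """Extract one integer grade per dimension from a grader reply."""
    grades = {name.lower(): int(score) for name, score in _SCORE_RE.findall(reply)}
    missing = [d for d in DIMENSIONS if d not in grades]
    if missing:
        raise ValueError(f"grader reply is missing dimensions: {missing}")
    return grades


def average_grades(replies: list[str]) -> dict[str, float]:
    """Average each dimension over the graded completions."""
    parsed = [parse_grades(reply) for reply in replies]
    return {d: mean(p[d] for p in parsed) for d in DIMENSIONS}


if __name__ == "__main__":
    example_replies = [
        "Grammar: 8/10, Creativity: 6/10, Consistency: 7/10",
        "grammar: 10/10; creativity: 4/10; consistency: 9/10",
    ]
    print(average_grades(example_replies))
```

Keeping one score per dimension, instead of collapsing everything into a single number, is what lets a model be strong on grammar while still weak on creativity or consistency.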

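Key findings

TinyStories makes it possible to train models under 10M parameters that generate fluent, diverse, and grammatically consistent stories.
Despite their limited size, small models begin to show factual knowledge and basic reasoning abilities.
Models trained on TinyStories show interpretable attention patterns and organized neuron activations that align with sentence roles.
The GPT-Eval framework provides a multidimensional evaluation of grammar, creativity, consistency, and instruction-following that addresses the limitations of traditional benchmarks.
Training on TinyStories is fast (typically under one day on a single GPU) and scales across architectures and hyperparameters.
Even shallow architectures with small embeddings can outperform some outputs of far larger models on specific story-generation tasks.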
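To make the parameter budgets mentioned in this review concrete, here is a back-of-the-envelope parameter count for a generic GPT-style decoder. The 10K vocabulary, 512-token context, and 256-dimensional embedding come from the method summary; everything else (biases on every projection, a 4x MLP expansion, two layer norms per block, tied input and output embeddings) is an assumption of this sketch, so the totals are estimates and not the parameter counts reported in the paper.

```python
def count_params(
    d_model: int,
    n_layers: int,
    vocab_size: int = 10_000,
    context_len: int = 512,
    tie_embeddings: bool = True,
) -> int:
    """Estimate the parameter count of a generic GPT-style decoder (assumed layout)."""
    token_emb = vocab_size * d_model
    pos_emb = context_len * d_model

    # Attention: Q, K, V, and output projections, each d_model x d_model plus a bias.
    attn = 4 * d_model * d_model + 4 * d_model
    # MLP: d_model -> 4*d_model -> d_model, with biases on both layers.
    mlp = 2 * d_model * (4 * d_model) + 4 * d_model + d_model
    # Two layer norms per block, each with a scale and a shift vector.
    norms = 2 * 2 * d_model
    per_layer = attn + mlp + norms

    final_norm = 2 * d_model
    # With tied embeddings the output head reuses the token embedding matrix.
    lm_head = 0 if tie_embeddings else vocab_size * d_model

    return token_emb + pos_emb + n_layers * per_layer + final_norm + lm_head


if __name__ == "__main__":
    for layers in (1, 8):
        print(f"d_model=256, layers={layers}: {count_params(256, layers):,} parameters")
```

Under these assumptions, an 8-layer model with 256-dimensional embeddings comes to 9,009,664 parameters and a 1-layer model to 3,481,344, both below the 10M mark, with the embedding tables accounting for a large share of the total in the shallow case.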


This review was generated by AI and checked by a human editor.