QUICK REVIEW

[Paper Review] TinyStories: How Small Can Language Models Be and Still Speak Coherent English?

Ronen Eldan, Yuanzhi Li | arXiv (Cornell University) | May 12, 2023

Topic Modeling | Cited by 46

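One-Sentence Summary

The paper introduces TinyStories, a synthetic dataset written in child-level vocabulary and generated by GPT-3.5/4, for training and evaluating very small language models (under 10M parameters), together with a new GPT-4-based evaluation paradigm (GPT-Eval) that assesses grammar, creativity, and instruction following.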

ABSTRACT

Language models (LMs) are powerful tools for natural language processing, but they often struggle to produce coherent and fluent text when they are small. Models with around 125M parameters such as GPT-Neo (small) or GPT-2 (small) can rarely generate coherent and consistent English text beyond a few words even after extensive training. This raises the question of whether the emergence of the ability to produce coherent English text only occurs at larger scales (with hundreds of millions of parameters or more) and complex architectures (with many layers of global attention). In this work, we introduce TinyStories, a synthetic dataset of short stories that only contain words that typical 3- to 4-year-olds usually understand, generated by GPT-3.5 and GPT-4. We show that TinyStories can be used to train and evaluate LMs that are much smaller than the state-of-the-art models (below 10 million total parameters), or have much simpler architectures (with only one transformer block), yet still produce fluent and consistent stories with several paragraphs that are diverse and have almost perfect grammar, and demonstrate reasoning capabilities. We also introduce a new paradigm for the evaluation of language models: We suggest a framework which uses GPT-4 to grade the content generated by these models as if those were stories written by students and graded by a (human) teacher. This new paradigm overcomes the flaws of standard benchmarks which often require the model's output to be very structured, and moreover provides a multidimensional score for the model, providing scores for different capabilities such as grammar, creativity and consistency. We hope that TinyStories can facilitate the development, analysis and research of LMs, especially for low-resource or specialized domains, and shed light on the emergence of language capabilities in LMs.

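Research Motivation and Objectives

Introduce TinyStories, a synthetic dataset of short stories that uses only vocabulary a 3- to 4-year-old child can understand.
Show that very small models (under 10M parameters) can generate fluent, coherent stories and exhibit reasoning ability.
Propose a GPT-4-based evaluation paradigm (GPT-Eval) for multidimensional model evaluation.
Show that TinyStories enables efficient training (typically under a day on a single GPU) and yields interpretable models with observable attention and activation patterns.
Offer insight into how language capabilities emerge in language models, and into potential benefits for low-resource or specialized domains.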

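Proposed Method

Generate stories by prompting GPT-3.5/4 with a restricted vocabulary (about 1,500 basic words) and with randomly chosen words and story features in each prompt, to maximize diversity.
Provide TinyStories-Instruct, a variant that prepends a set of instructions (words, a sentence, features, a summary) to each story.
Develop GPT-Eval: use GPT-4 to grade model completions on grammar, creativity, and consistency with the given story beginning, yielding a multidimensional score.
Train very small models (1M–35M parameters; 1–8 layers) on TinyStories using a single V100 GPU, with a window size of 256 tokens and a context length of 512 tokens, embedding dimensions as small as 256, and a tokenizer restricted to the top 10K tokens.
Analyze attention heads and MLP activations to interpret model behavior and the generation process.
Compare outputs against larger models (such as GPT-2 XL) to illustrate the emergence of capabilities at small scale.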
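
The prompt-randomization idea in the first step can be illustrated with a minimal sketch. The word pools, the feature names, and the prompt wording below are all placeholders chosen for illustration; they are not the paper's actual vocabulary or prompt, and the function `build_story_prompt` is a hypothetical helper.

```python
import random

# Hypothetical word pools. The dataset described above draws from roughly
# 1,500 basic words; these short lists are placeholders.
NOUNS = ["dog", "ball", "tree", "cake"]
VERBS = ["jump", "find", "share", "hide"]
ADJECTIVES = ["happy", "big", "shiny", "sad"]

# Hypothetical story features, for illustration only.
FEATURES = ["dialogue", "a bad ending", "a moral value", "a plot twist"]


def build_story_prompt(rng: random.Random) -> str:
    """Build one story-generation prompt from randomly sampled words and features."""
    noun = rng.choice(NOUNS)
    verb = rng.choice(VERBS)
    adjective = rng.choice(ADJECTIVES)
    # Sample one or two distinct features so prompts differ in structure too.
    features = rng.sample(FEATURES, k=rng.randint(1, 2))
    return (
        "Write a short story that only uses very simple words "
        "a 3 to 4 year old child would understand. "
        f'The story should use the verb "{verb}", the noun "{noun}" '
        f'and the adjective "{adjective}". '
        f"The story should include: {', '.join(features)}."
    )


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    for _ in range(3):
        print(build_story_prompt(rng))
```

Because every prompt combines independently sampled words and features, even small pools give many distinct prompts, which is the source of diversity the step describes. The resulting strings would then be sent to the generating model; that call is left out here because it depends on the API in use.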

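Experimental Results

Research Questions

RQ1: What is the minimum model size and architecture needed to generate coherent, fluent English?
RQ2: Can very small models trained on TinyStories acquire factual knowledge and basic reasoning ability?
RQ3: Does the TinyStories framework reveal interpretable internal mechanisms (attention and MLP activations) in small models?
RQ4: How effective is the GPT-4-based evaluation framework (GPT-Eval) at assessing grammar, creativity, and instruction following?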
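
To make RQ4 concrete, here is a rough sketch of what one GPT-Eval grading step might look like: build a grading prompt around a story beginning and a model completion, then parse per-dimension scores from the grader's reply. The prompt wording, the `***` separator, the reply format, and the helper names `build_grading_prompt` and `parse_scores` are assumptions made for illustration, not the paper's exact protocol. The call to the grading model is omitted.

```python
import re
from typing import Dict

# Dimensions named in the abstract's description of the evaluation.
DIMENSIONS = ("grammar", "creativity", "consistency")


def build_grading_prompt(story_beginning: str, completion: str) -> str:
    """Assemble a grading request (hypothetical wording)."""
    return (
        "The following is a story written by a student. The student was given "
        "the beginning of the story and asked to complete it. The symbol *** "
        "marks where the student's completion starts.\n\n"
        f"{story_beginning}***{completion}\n\n"
        "Grade the completion on grammar, creativity, and consistency with "
        "the beginning, each as an integer out of 10, in the form "
        "'Grammar: 8/10, Creativity: 7/10, Consistency: 7/10'."
    )


def parse_scores(grader_reply: str) -> Dict[str, int]:
    """Extract 'Name: N/10' scores; dimensions missing from the reply are skipped."""
    scores: Dict[str, int] = {}
    for name in DIMENSIONS:
        match = re.search(
            rf"{name}\s*:\s*(\d+)\s*/\s*10", grader_reply, flags=re.IGNORECASE
        )
        if match:
            scores[name] = int(match.group(1))
    return scores


if __name__ == "__main__":
    prompt = build_grading_prompt("Once upon a time, ", "there was a small dog.")
    print(prompt)
    print(parse_scores("Grammar: 8/10, Creativity: 6/10, Consistency: 9/10"))
```

Averaging the parsed scores over many completions per model would give the kind of multidimensional score the framework reports; how many completions to average is a design choice not specified here.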

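Key Findings

TinyStories makes it possible to train models with far fewer than 10M parameters that generate fluent, diverse, and grammatically sound stories.
Despite their limited scale, the smaller models begin to show factual knowledge and basic reasoning ability.
Models trained on TinyStories show interpretable attention patterns and structured neuron activations aligned with sentence roles.
GPT-Eval provides multidimensional evaluation of grammar, creativity, and instruction following, addressing limitations of traditional benchmarks.
Training on TinyStories is fast (typically under a day on a single GPU) and extends to different architectures and hyperparameters.
Even with small embeddings and shallow architectures, the models can outperform the outputs of some larger models on specific story-generation tasks.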
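
For a sense of how small embeddings and shallow architectures translate into parameter counts, the following back-of-the-envelope estimate may help. It assumes a generic GPT-style decoder with learned position embeddings, a tied output layer, a 4x MLP expansion, and no biases or layer-norm weights; these are simplifying assumptions, so the figures will not match the paper's reported model sizes exactly. The function `estimate_params` is a hypothetical helper.

```python
def estimate_params(vocab_size: int, context_length: int, d_model: int,
                    n_layers: int, mlp_ratio: int = 4) -> int:
    """Rough parameter count for a GPT-style decoder (simplified assumptions)."""
    # Token embeddings plus learned position embeddings.
    embeddings = vocab_size * d_model + context_length * d_model
    # Attention: query, key, value, and output projections, each d_model x d_model.
    attention = 4 * d_model * d_model
    # MLP: up-projection and down-projection with hidden size mlp_ratio * d_model.
    mlp = 2 * mlp_ratio * d_model * d_model
    return embeddings + n_layers * (attention + mlp)


if __name__ == "__main__":
    # Settings taken from the summary: 10K-token vocabulary, context length 512,
    # embedding dimension 256, and 1 or 8 layers.
    for layers in (1, 8):
        total = estimate_params(10_000, 512, 256, layers)
        print(f"{layers} layer(s): {total:,} parameters")
    # 1 layer(s): 3,477,504 parameters
    # 8 layer(s): 8,982,528 parameters
```

Under these assumptions, an 8-layer model with embedding dimension 256 stays below 10M parameters, and at a single layer most of the parameters sit in the embedding table rather than in the transformer block.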

This review was generated by AI and reviewed by human editors.