QUICK REVIEW

[Paper Review] Textbooks Are All You Need II: phi-1.5 technical report

Yuanzhi Li, Sébastien Bubeck|arXiv (Cornell University)|Sep 11, 2023

Topic Modeling | Cited by 48

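One-Sentence Summary

phi-1.5, a 1.3B-parameter Transformer trained mainly on synthetic textbook-style data, matches much larger models on common sense and language understanding and performs strongly on multi-step reasoning and coding. It is open-sourced to support research on instruction following, bias, and hallucination.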

ABSTRACT

We continue the investigation into the power of smaller Transformer-based language models as initiated by TinyStories -- a 10 million parameter model that can produce coherent English -- and the follow-up work on phi-1, a 1.3 billion parameter model with Python coding performance close to the state-of-the-art. The latter work proposed to use existing Large Language Models (LLMs) to generate "textbook quality" data as a way to enhance the learning process compared to traditional web data. We follow the "Textbooks Are All You Need" approach, focusing this time on common sense reasoning in natural language, and create a new 1.3 billion parameter model named phi-1.5, with performance on natural language tasks comparable to models 5x larger, and surpassing most non-frontier LLMs on more complex reasoning tasks such as grade-school mathematics and basic coding. More generally, phi-1.5 exhibits many of the traits of much larger LLMs, both good -- such as the ability to "think step by step" or perform some rudimentary in-context learning -- and bad, including hallucinations and the potential for toxic and biased generations -- encouragingly though, we are seeing improvement on that front thanks to the absence of web data. We open-source phi-1.5 to promote further research on these urgent topics.

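Motivation and Goals

Investigate how far a small LLM trained on synthetic textbook-style data can go in capability
Evaluate phi-1.5 on common sense reasoning, language tasks, and multi-step reasoning against larger models
Explore the role of data quality versus scale and the effect of adding filtered web data
Examine the safety, toxicity, and bias profile of phi-1.5 relative to web-trained baselines
Open-source the model to promote research on in-context learning, interpretability, and safety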

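Proposed Method

Build phi-1.5, a 1.3B-parameter Transformer with 24 layers, 32 attention heads, and a context length of 2048
Train on a dataset of about 30B tokens: 7B tokens from phi-1's training data plus roughly 20B tokens of new synthetic textbook-style data, with training tokens sampled 80% from the synthetic data and 20% from the phi-1 data
Training setup: constant learning rate 2e-4, weight decay 0.1, Adam with momentum (0.9, 0.98), fp16, DeepSpeed ZeRO Stage 2, batch size 2048
Create two variants using filtered web data (about 95B tokens): phi-1.5-web-only, trained on the filtered web data alone, and phi-1.5-web, trained on a roughly 40/20/40 mix of filtered web data, phi-1 code data, and synthetic data
Run zero-shot and few-shot evaluations on common sense benchmarks (WinoGrande, ARC-Easy, ARC-Challenge, BoolQ, SIQA), language understanding benchmarks (PIQA, Hellaswag, OpenBookQA, SQuAD, MMLU), and multi-step reasoning benchmarks (GSM8K, HumanEval/MBPP)
Compare against open-source baselines (Llama2-7B, Vicuna-13B, Falcon-7B, and others) and report how web data versus synthetic data affects performance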
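
The numbers in the list above can be sanity-checked with a little arithmetic. The following is a minimal sketch, not the authors' code: it estimates the parameter count from the stated depth and head count, and the average number of passes over each data source implied by the 80/20 mix. The head dimension (64), vocabulary size (51,200), and total training budget (150B tokens) are assumptions that do not appear in this summary, and the `Source`, `approx_params`, and `epochs_per_source` names are hypothetical helpers introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Source:
    name: str
    unique_tokens: float  # tokens available in this source
    mix_weight: float     # fraction of training tokens sampled from it


def approx_params(n_layers: int, d_model: int, vocab_size: int) -> int:
    """Rough decoder-only estimate: 12 * d_model^2 per layer
    (attention projections plus a 4x-wide MLP) plus token embeddings.
    Ignores biases, layer norms, and any untied output head."""
    return 12 * n_layers * d_model ** 2 + vocab_size * d_model


def epochs_per_source(sources, train_tokens):
    """Average number of passes over each source for a given token budget."""
    return {s.name: train_tokens * s.mix_weight / s.unique_tokens for s in sources}


# From the summary: 24 layers, 32 heads.
# Assumed (not in the summary): head dimension 64, vocabulary of 51,200 tokens.
N_LAYERS, N_HEADS, HEAD_DIM, VOCAB = 24, 32, 64, 51_200
D_MODEL = N_HEADS * HEAD_DIM  # 2048 under the assumed head dimension

# From the summary: 7B phi-1 tokens, roughly 20B synthetic tokens, 80/20 mix.
PHI_1_5_MIX = [
    Source("synthetic_textbook", 20e9, 0.80),
    Source("phi1_data", 7e9, 0.20),
]
TRAIN_TOKENS = 150e9  # assumed total training budget, not stated in the summary

if __name__ == "__main__":
    print(f"approx params: {approx_params(N_LAYERS, D_MODEL, VOCAB) / 1e9:.2f}B")
    for name, epochs in epochs_per_source(PHI_1_5_MIX, TRAIN_TOKENS).items():
        print(f"{name}: {epochs:.1f} passes")
```

Under these assumptions the estimate lands near 1.3B parameters, consistent with the stated model size, and the 80/20 sampling mix means the synthetic data is revisited more often than the phi-1 data. Changing `TRAIN_TOKENS` rescales both pass counts proportionally.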

Figure 1: Benchmark results comparing phi-1.5, its version enhanced with filtered web data phi-1.5-web, and other state-of-the-art open-source LLMs. Sizes range from phi-1.5's 1.3 billion parameters (Falcon-RW-1.3B [PMH+23]) to 10x larger models like Vicuna-13B [ZCS+23], a fine-tuned v

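Experimental Results

Research Questions

RQ1: To what extent can a 1.3B-parameter LLM trained mainly on synthetic textbook-style data match or exceed the capabilities of larger models?
RQ2: Does synthetic, textbook-quality data reduce the tendency toward harmful content and bias compared with web data?
RQ3: How does adding filtered web data affect common sense reasoning, coding, and multi-step reasoning tasks?
RQ4: Without instruction fine-tuning or RLHF, can a small model still perform well on natural language tasks and coding?
RQ5: What are the practical implications of data quality versus scale for developing scalable, open-source LLMs?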

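Key Findings

phi-1.5 matches much larger models on common sense reasoning and language tasks (the abstract cites models 5x larger) and surpasses many non-frontier LLMs on multi-step reasoning
phi-1.5-web-only, trained on filtered web data alone, already outperforms models of similar size on common sense benchmarks
When trained on synthetic data plus phi-1 data (phi-1.5), performance on reasoning tasks approaches that of models five times larger
phi-1.5 shows step-by-step reasoning and rudimentary in-context learning, but it also hallucinates and exhibits bias much as larger models do; toxicity is lower than in web-trained baselines, which the summary attributes to the focus on synthetic data
Open-sourcing phi-1.5 enables research on in-context learning, interpretability, and mitigating hallucinated and biased outputs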

Figure 2: Safety scores computed on 13 demographics from ToxiGen [HGP+22]. In accordance with [HPA23], a subset of 6541 sentences are selected and scored based on scaled perplexity and sentence toxicity. Scores range from 0 to 1, where a higher score indicates the model is less likely to pr

This review was generated by AI and reviewed by human editors.