QUICK REVIEW

[Paper Review] The Falcon Series of Open Language Models

Ebtesam Almazrouei, Hamza Alobeidli|arXiv (Cornell University)|Nov 28, 2023

Topic Modeling|Cited by 112

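One-Sentence Summary

The Falcon series provides open decoder-only LLMs at 7B, 40B, and 180B parameters, trained predominantly on RefinedWeb, a large filtered web dataset; Falcon-180B approaches PaLM-2 Large and comes close to GPT-3.5/4 on some tasks, and the models and data are released to promote open science.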

ABSTRACT

We introduce the Falcon series: 7B, 40B, and 180B parameters causal decoder-only models trained on a diverse high-quality corpora predominantly assembled from web data. The largest model, Falcon-180B, has been trained on over 3.5 trillion tokens of text--the largest openly documented pretraining run. Falcon-180B significantly outperforms models such as PaLM or Chinchilla, and improves upon concurrently developed models such as LLaMA 2 or Inflection-1. It nears the performance of PaLM-2-Large at a reduced pretraining and inference cost, making it, to our knowledge, one of the three best language models in the world along with GPT-4 and PaLM-2-Large. We report detailed evaluations, as well as a deep dive into the methods and custom tooling employed to pretrain Falcon. Notably, we report on our custom distributed training codebase, allowing us to efficiently pretrain these models on up to 4,096 A100s on cloud AWS infrastructure with limited interconnect. We release a 600B tokens extract of our web dataset, as well as the Falcon-7/40/180B models under a permissive license to foster open-science and accelerate the development of an open ecosystem of large language models.

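Research Motivation and Objectives

Demonstrate scalable pretraining of open decoder-only LLMs at 7B, 40B, and 180B parameters.
Show that large-scale filtered and deduplicated web data can match carefully curated datasets on natural-language zero-shot tasks.
Explain the design choices and the custom hardware/software stack used to pretrain efficiently on large-scale cloud infrastructure.
Provide open access to the models and to a large extract of the web data to foster open science and ecosystem development.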

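Proposed Method

Train three causal decoder-only models, Falcon-7B, Falcon-40B, and Falcon-180B, on up to 3,500B tokens (predominantly RefinedWeb).
Use a custom distributed training codebase with 3D parallelism and optimizer sharding, running on up to 4,096 A100 GPUs with limited interconnect.
Apply architectural adjustments (such as multiquery attention, a comparison of rotary and ALiBi positional embeddings, and memory-saving techniques), backed by rigorous hyperparameter validation.
Pretrain with a data pipeline built predominantly on filtered and deduplicated web data to maximize quality and coverage.
Release Falcon-7B/40B/180B and a 600B-token extract of RefinedWeb under a permissive license to promote reproducibility and open science.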
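
To make the 3D parallelism item above concrete, here is a minimal sketch of how a global GPU rank can be mapped to tensor-, pipeline-, and data-parallel coordinates. It is not taken from the Falcon codebase: the function names, the rank layout (tensor-parallel ranks adjacent, then pipeline stages, then data-parallel replicas), and the 8 x 8 x 64 split of 4,096 GPUs are all assumptions chosen for illustration.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    data: int      # index of the data-parallel replica
    pipeline: int  # index of the pipeline stage
    tensor: int    # index of the tensor-parallel shard


def rank_to_coord(rank: int, tp: int, pp: int, dp: int) -> Coord:
    """Map a global rank to (data, pipeline, tensor) coordinates.

    Assumed layout: the tensor index varies fastest, so the ranks of one
    tensor-parallel group are adjacent and can sit on the same node.
    """
    world_size = tp * pp * dp
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside world size {world_size}")
    tensor = rank % tp
    pipeline = (rank // tp) % pp
    data = rank // (tp * pp)
    return Coord(data, pipeline, tensor)


def data_parallel_group(rank: int, tp: int, pp: int, dp: int) -> List[int]:
    """Ranks that hold the same model shard, one per data-parallel replica.

    These are the peers that would exchange gradients, and across which
    optimizer state could be sharded.
    """
    c = rank_to_coord(rank, tp, pp, dp)
    return [d * tp * pp + c.pipeline * tp + c.tensor for d in range(dp)]


# Hypothetical degrees: illustrative only, not stated in this summary.
TP, PP, DP = 8, 8, 64
WORLD_SIZE = TP * PP * DP  # 4096 GPUs in total
```

With this layout, `rank_to_coord(9, TP, PP, DP)` lands on tensor shard 1 of pipeline stage 1 in replica 0. Keeping the tensor-parallel group inside a node is a common choice when inter-node bandwidth is limited, since tensor parallelism is the most communication-heavy of the three axes.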

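Experimental Results

Research Questions

RQ1: Can heavily filtered and deduplicated web data alone match or exceed the zero-shot performance of models trained on carefully curated corpora?
RQ2: How does adding curated data on top of a strong web-data baseline affect natural-language zero-shot performance?
RQ3: Does adding a limited amount of multilingual or code data significantly degrade English performance, and by how much?
RQ4: Which architecture and data-pipeline choices maximize hardware efficiency and the scalability of large-scale pretraining?
RQ5: How does openly releasing models and data affect open research and ecosystem development?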

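Key Findings

Falcon-180B significantly outperforms PaLM and Chinchilla in 1-shot evaluations across a broad set of tasks.
Falcon-180B approaches the performance of PaLM-2 Large at a lower pretraining and inference cost, placing it among the top open models and among the leading models worldwide.
In small-scale experiments on natural-language zero-shot tasks, RefinedWeb (filtered and deduplicated web data) outperforms curated datasets such as The Pile as well as other web datasets (C4, OSCAR); filtering and deduplication are both critical.
Substituting curated data for part of a strong web baseline generally does not improve zero-shot performance and can even lower it, particularly for books and technical data, whereas conversational data holds up better.
Adding a limited share (5-10%) of multilingual or code data does not significantly degrade English performance, indicating robust cross-domain transfer even with limited multilingual data.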
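
The finding that filtering and deduplication are critical can be illustrated with a toy pipeline. This is a minimal sketch only: the summary does not describe the actual RefinedWeb pipeline, so the heuristics here (a minimum word count, a cap on the share of symbol characters, and exact-match deduplication on normalized text) and all function names and thresholds are hypothetical stand-ins.

```python
import hashlib
import re
from typing import Iterable, Iterator


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.strip().lower())


def passes_filter(text: str, min_words: int = 5,
                  max_symbol_ratio: float = 0.3) -> bool:
    """Toy quality filter: reject very short or symbol-heavy documents.

    Both thresholds are arbitrary illustrative values.
    """
    if len(text.split()) < min_words:
        return False
    symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
    return symbols / len(text) <= max_symbol_ratio


def filter_and_dedup(docs: Iterable[str]) -> Iterator[str]:
    """Yield documents that pass the filter, keeping the first of each duplicate."""
    seen = set()
    for doc in docs:
        if not passes_filter(doc):
            continue
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # exact duplicate after normalization
        seen.add(digest)
        yield doc


docs = [
    "The quick brown fox jumps over the lazy dog.",
    "the quick  brown fox jumps over the lazy dog.",  # duplicate after normalization
    "too short",                                      # fails the word-count filter
    "$$$ ### @@@ !!! %%% ^^^ &&&",                    # fails the symbol-ratio filter
    "Falcon models are trained mostly on web data.",
]
kept = list(filter_and_dedup(docs))
```

Here `kept` retains only the first and last documents. Exact hashing catches only identical copies after normalization; a production pipeline would typically also need near-duplicate detection, which this sketch does not attempt.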

This review was generated by AI and reviewed by human editors.