QUICK REVIEW

[Paper Review] Sheared LLaMA: Accelerating Language Model Pre-training via Structured Pruning

Mengzhou Xia, Tianyu Gao|arXiv (Cornell University)|Oct 10, 2023

Topic Modeling | Cited by 20

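One-Sentence Summary

This paper introduces LLM-shearing, an approach that produces smaller yet competitive language models by applying targeted structured pruning to a larger pre-trained model and then continuing pre-training, with dynamic batch loading used to balance the training data across domains.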

ABSTRACT

The popularity of LLaMA (Touvron et al., 2023a;b) and other recently emerged moderate-sized large language models (LLMs) highlights the potential of building smaller yet powerful LLMs. Regardless, the cost of training such models from scratch on trillions of tokens remains high. In this work, we study structured pruning as an effective means to develop smaller LLMs from pre-trained, larger models. Our approach employs two key techniques: (1) targeted structured pruning, which prunes a larger model to a specified target shape by removing layers, heads, and intermediate and hidden dimensions in an end-to-end manner, and (2) dynamic batch loading, which dynamically updates the composition of sampled data in each training batch based on varying losses across different domains. We demonstrate the efficacy of our approach by presenting the Sheared-LLaMA series, pruning the LLaMA2-7B model down to 1.3B and 2.7B parameters. Sheared-LLaMA models outperform state-of-the-art open-source models of equivalent sizes, such as Pythia, INCITE, OpenLLaMA and the concurrent TinyLlama models, on a wide range of downstream and instruction tuning evaluations, while requiring only 3% of compute compared to training such models from scratch. This work provides compelling evidence that leveraging existing LLMs with structured pruning is a far more cost-effective approach for building competitive small-scale LLMs.

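Motivation and Objectives

Create strong moderate-sized LLMs cost-effectively from existing large pre-trained models.
Develop a pruning method that matches the target architecture to a proven pre-training configuration.
Address the imbalance in knowledge retention across data domains during continued pre-training.
Introduce dynamic batch loading, which allocates data across domains according to each domain's rate of loss reduction.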

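Proposed Method

Propose targeted structured pruning, which uses pruning masks over layers, hidden dimensions, attention heads, and intermediate dimensions to turn the source model into a specified target architecture.
Learn the pruning masks through constrained optimization with hard concrete distributions and a min-max objective, so that the pruned model satisfies the target shape.
Continue pre-training the pruned model to recover or exceed its performance.
Introduce dynamic batch loading, which adjusts the proportion of data drawn from each domain during training according to that domain's loss reduction.
Adopt a two-stage procedure: pruning followed by continued pre-training, guided by a domain-aware data strategy.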
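The mask-learning step above can be made more concrete with a small sketch. It assumes the standard hard concrete gate (a stretched, clipped sigmoid of a learnable logit) and a Lagrangian-style penalty on the gap between the expected number of kept units and the target count. The stretch interval, temperature, function names, and the exact penalty form are illustrative assumptions and are not taken from this summary; in a min-max setup the multipliers `lam` and `phi` would be maximized while the mask logits and model weights are minimized.

```python
import math
import random

# Stretch interval and temperature: values commonly used for hard concrete
# gates. They are assumptions here, not settings reported in this summary.
LEFT, RIGHT, BETA = -0.1, 1.1, 2.0 / 3.0


def sample_hard_concrete(log_alpha, rng):
    """Draw one gate value in [0, 1] for a unit with logit `log_alpha`."""
    u = min(max(rng.random(), 1e-6), 1.0 - 1e-6)  # keep the logs finite
    logit = (math.log(u) - math.log(1.0 - u) + log_alpha) / BETA
    s = 1.0 / (1.0 + math.exp(-logit))
    stretched = s * (RIGHT - LEFT) + LEFT
    return min(1.0, max(0.0, stretched))  # clipping yields exact 0s and 1s


def expected_kept(log_alphas):
    """Expected number of units whose gate is non-zero."""
    shift = BETA * math.log(-LEFT / RIGHT)
    return sum(1.0 / (1.0 + math.exp(-(a - shift))) for a in log_alphas)


def shape_penalty(log_alphas, target, lam, phi):
    """Penalty that is zero when the expected kept count equals `target`."""
    gap = expected_kept(log_alphas) - target
    return lam * gap + phi * gap * gap


# Hypothetical logits for four attention heads, pruned toward two heads.
rng = random.Random(0)
head_logits = [2.0, 1.5, -3.0, 0.5]
gates = [sample_hard_concrete(a, rng) for a in head_logits]
penalty = shape_penalty(head_logits, target=2, lam=0.1, phi=1.0)
```

One penalty of this kind would be attached to each pruned dimension (layers, heads, hidden size, intermediate size), so that the learned masks converge to the specified target shape rather than to an arbitrary sparsity pattern.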

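Experimental Results

Research Questions

RQ1: Can pruning a large pre-trained LLM to a specified target architecture yield a competitive small LLM with less compute?
RQ2: Does continued pre-training after pruning recover more performance than pruning alone?
RQ3: Can dynamic batch loading balance loss reduction across domains and thereby improve overall downstream performance?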

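Key Findings

The Sheared-LLaMA models (1.3B and 2.7B) outperform state-of-the-art open-source models of comparable size on 11 downstream tasks and on instruction-tuning benchmarks.
Pruning LLaMA2-7B to 1.3B and 2.7B uses only about 50B tokens for pruning and continued pre-training, achieving competitive results with roughly 3% of the compute needed to train such models from scratch.
Dynamic batch loading aligns loss reduction across domains and draws more data from the more challenging domains, which improves downstream performance.
In some comparisons, the pruned model is a better initialization for continued pre-training than an existing model of the same size.
Targeted structured pruning achieves higher inference throughput than non-uniform pruning methods at equivalent sparsity.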
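The dynamic batch loading behaviour described in the findings can be illustrated with a minimal sketch. It assumes that each domain has a reference loss to aim for and that sampling weights are rescaled exponentially by how far the current loss sits above that reference, then renormalized. The reference losses, the domain names, the numbers, and the function name `update_domain_weights` are hypothetical, and the update rule is one plausible choice rather than a confirmed reproduction of the paper's procedure.

```python
import math


def update_domain_weights(weights, current_loss, reference_loss):
    """Shift sampling weight toward domains whose loss is still above reference."""
    scaled = {}
    for domain, w in weights.items():
        # Domains already at or below their reference loss get no boost.
        gap = max(current_loss[domain] - reference_loss[domain], 0.0)
        scaled[domain] = w * math.exp(gap)
    total = sum(scaled.values())
    return {domain: value / total for domain, value in scaled.items()}


# Hypothetical domains and loss values, for illustration only.
weights = {"web": 0.5, "code": 0.3, "books": 0.2}
current = {"web": 2.0, "code": 1.6, "books": 2.4}
reference = {"web": 2.0, "code": 1.2, "books": 2.1}
new_weights = update_domain_weights(weights, current, reference)
```

In this toy run the "code" and "books" domains are still above their reference losses, so their share of the next batches grows, while "web" has already reached its reference and its share shrinks after renormalization.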

This review was generated by AI and reviewed by a human editor.