QUICK REVIEW

[Paper Review] Branch-Train-Merge: Embarrassingly Parallel Training of Expert Language Models

Margaret Li, Suchin Gururangan|arXiv (Cornell University)|Aug 5, 2022

Topic Modeling | Cited by 25

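One-Sentence Summary

BTM trains a set of independent expert language models in parallel on domain-specific data, then ensembles or averages them into a scalable ensemble or a single model that outperforms compute-matched baselines across model sizes and domains.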

ABSTRACT

We present Branch-Train-Merge (BTM), a communication-efficient algorithm for embarrassingly parallel training of large language models (LLMs). We show it is possible to independently train subparts of a new class of LLMs on different subsets of the data, eliminating the massive multi-node synchronization currently required to train LLMs. BTM learns a set of independent expert LMs (ELMs), each specialized to a different textual domain, such as scientific or legal text. These ELMs can be added and removed to update data coverage, ensembled to generalize to new domains, or averaged to collapse back to a single LM for efficient inference. New ELMs are learned by branching from (mixtures of) ELMs in the current set, further training the parameters on data for the new domain, and then merging the resulting model back into the set for future use. Experiments show that BTM improves in- and out-of-domain perplexities as compared to GPT-style Transformer LMs, when controlling for training cost. Through extensive analysis, we show that these results are robust to different ELM initialization schemes, but require expert domain specialization; LM ensembles with random data splits do not perform well. We also present a study of scaling BTM into a new corpus of 64 domains (192B whitespace-separated tokens in total); the resulting LM (22.4B total parameters) performs as well as a Transformer LM trained with 2.5 times more compute. These gains grow with the number of domains, suggesting more aggressive parallelism could be used to efficiently train larger models in future work.
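
The branch, train, and merge steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: parameters are modeled as plain lists of floats, and `weighted_average`, `btm_step`, and `toy_train` are hypothetical names introduced only for this example. Real training on domain data is replaced by a stub.

```python
from typing import Callable, Dict, List

# Toy stand-in for a model's parameter tensors (assumption of this sketch).
Params = Dict[str, List[float]]


def weighted_average(models: List[Params], weights: List[float]) -> Params:
    """Average parameters element-wise; weights are normalized to sum to 1."""
    total = sum(weights)
    norm = [w / total for w in weights]
    return {
        name: [
            sum(w * m[name][i] for w, m in zip(norm, models))
            for i in range(len(models[0][name]))
        ]
        for name in models[0]
    }


def btm_step(
    forest: List[Params],
    branch_weights: List[float],
    train_fn: Callable[[Params], Params],
) -> List[Params]:
    """One Branch-Train-Merge iteration (hypothetical helper).

    branch: initialize a new expert from a weighted mixture of existing experts
    train:  continue training that expert on the new domain's data only
    merge:  add the trained expert to the forest for future use
    """
    new_expert = weighted_average(forest, branch_weights)  # branch
    new_expert = train_fn(new_expert)                      # train
    return forest + [new_expert]                           # merge


def toy_train(params: Params) -> Params:
    """Stub for domain-specific training: shifts every parameter by 1.0."""
    return {name: [v + 1.0 for v in values] for name, values in params.items()}


# The forest starts from a single seed LM, then grows by one expert per step.
seed_lm: Params = {"w": [0.0, 2.0]}
forest = [seed_lm]
forest = btm_step(forest, [1.0], toy_train)         # branch from the seed
forest = btm_step(forest, [0.25, 0.75], toy_train)  # branch from a mixture
print(len(forest), forest[-1])
```

Because each call to `train_fn` touches only the new expert, several such steps on different domains could run on separate machines without exchanging gradients, which is the property the abstract calls embarrassingly parallel.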

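Motivation and Goals

Propose a scalable training paradigm for large language models that reduces cross-node synchronization by training domain experts independently.
Develop the Branch-Train-Merge (BTM) algorithm, which grows an ELMforest by branching a new expert from existing experts, training it on domain-specific data, and merging it back into the forest.
Enable flexible inference by ensembling the experts or averaging their parameters, trading off performance against inference cost.
Empirically compare against compute-matched Transformer-LMs and a prior domain-specialization baseline across multiple model sizes and data domains.
Explore scalability to many domains and analyze how factors such as seed training, data provenance, and initialization affect performance and efficiency.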

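Proposed Method

Define an ELMforest as a set of independent expert language models (ELMs), each specialized to a data domain and fully decoupled from the others during both training and inference.
Branch-Train-Merge: iteratively branch a new ELM from a weighted average of existing experts, train it on a new domain, and merge it back into the forest; the process is seeded with a pretrained seed language model.
Ensemble inference combines the ELMs' outputs with a domain posterior, computing p(X_t | x_<t) as a posterior-weighted sum over experts; the posterior can be sparsified to the top k experts to reduce cost.
Parameter averaging is an alternative inference mode: a weighted average of the ELM parameters yields a single model, with weights informed by the domain posterior or another scheme.
Comparison with compute-matched baselines (Transformer-LM and DEMix): perplexity is evaluated at 125M to 1.3B parameters on 8 training domains plus 8 evaluation domains, and the effects of seed budget and initialization are analyzed.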
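
The two inference modes above can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the domain posterior is computed here with Bayes' rule from each expert's log-likelihood of the context, the top-k weights are renormalized to sum to 1, and all function names and toy numbers are hypothetical.

```python
import math
from typing import Dict, List


def domain_posterior(context_loglik: List[float], prior: List[float]) -> List[float]:
    """p(D=j | x_<t), proportional to p(x_<t | D=j) * p(D=j), in log space."""
    logits = [ll + math.log(p) for ll, p in zip(context_loglik, prior)]
    m = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(l - m) for l in logits]
    z = sum(unnorm)
    return [u / z for u in unnorm]


def top_k_renormalize(weights: List[float], k: int) -> List[float]:
    """Keep the k largest weights, zero the rest, and renormalize (assumption)."""
    keep = sorted(range(len(weights)), key=lambda j: weights[j], reverse=True)[:k]
    z = sum(weights[j] for j in keep)
    return [weights[j] / z if j in keep else 0.0 for j in range(len(weights))]


def ensemble_next_token(
    expert_probs: List[Dict[str, float]], weights: List[float]
) -> Dict[str, float]:
    """p(X_t | x_<t) as a weighted sum of the experts' next-token distributions."""
    return {
        tok: sum(w * p[tok] for w, p in zip(weights, expert_probs))
        for tok in expert_probs[0]
    }


def average_parameters(
    expert_params: List[List[float]], weights: List[float]
) -> List[float]:
    """Collapse the experts into one model by weighted parameter averaging."""
    return [
        sum(w * p[i] for w, p in zip(weights, expert_params))
        for i in range(len(expert_params[0]))
    ]


# Toy example with three experts and a two-token vocabulary.
context_loglik = [-10.0, -12.0, -20.0]  # log p(x_<t | D=j) for each expert
prior = [1 / 3, 1 / 3, 1 / 3]           # uniform domain prior (assumption)
posterior = domain_posterior(context_loglik, prior)
sparse = top_k_renormalize(posterior, k=2)

expert_probs = [
    {"a": 0.7, "b": 0.3},
    {"a": 0.2, "b": 0.8},
    {"a": 0.5, "b": 0.5},
]
mixture = ensemble_next_token(expert_probs, sparse)
merged = average_parameters([[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]], posterior)
print(posterior, sparse, mixture, merged)
```

Ensembling runs every selected expert at each step, so its cost grows with k, while the averaged model costs the same as a single expert at inference time.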

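Experimental Results

Research Questions

RQ1: Does embarrassingly parallel training of domain-specialized experts (ELMs) outperform compute-matched Transformer-LMs and prior domain baselines in in-domain and out-of-domain perplexity?
RQ2: How well does Branch-Train-Merge scale to many domains (up to 64) and to larger parameter counts?
RQ3: How do seed-phase training, initialization, and data provenance affect the effectiveness of ELM ensembling and the performance of parameter averaging?
RQ4: How do ensembling and parameter averaging compare in inference cost and performance as the number of domains grows?
RQ5: What are the efficiency implications (updates per second, communication cost) of branched ELM training relative to fully synchronized training?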

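Key Findings

Perplexity (lower is better) by model size on the training domains (Train), the evaluation domains (Eval), and all domains (All):
Model	125M Train	125M Eval	125M All	350M Train	350M Eval	350M All	750M Train	750M Eval	750M All	1.3B Train	1.3B Eval	1.3B All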
Transformer-LM	19.9	25.2	22.5	16.3	20.8	18.5	14.7	19.3	17.0	14.2	18.4	16.3
DEMix	18.2	23.4	20.8	15.0	19.9	17.5	13.5	17.7	15.6	13.7	17.6	15.6
ELMforest	17.2	22.4	19.8	14.7	18.6	16.7	13.4	16.7	15.0	13.0	16.3	14.6
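
To put the table in relative terms, the short calculation below computes how much ELMforest lowers the "All" perplexity compared with the Transformer-LM at each model size. The numbers are copied from the table; the helper name `relative_reduction` is introduced only for this example.

```python
# "All" perplexities copied from the table above.
all_ppl = {
    "Transformer-LM": {"125M": 22.5, "350M": 18.5, "750M": 17.0, "1.3B": 16.3},
    "DEMix": {"125M": 20.8, "350M": 17.5, "750M": 15.6, "1.3B": 15.6},
    "ELMforest": {"125M": 19.8, "350M": 16.7, "750M": 15.0, "1.3B": 14.6},
}


def relative_reduction(baseline: float, model: float) -> float:
    """Percentage by which `model` perplexity is lower than `baseline`."""
    return 100.0 * (baseline - model) / baseline


reductions = {
    size: relative_reduction(all_ppl["Transformer-LM"][size], all_ppl["ELMforest"][size])
    for size in all_ppl["Transformer-LM"]
}

for size, pct in reductions.items():
    print(f"{size}: ELMforest perplexity is {pct:.1f}% lower than Transformer-LM")
```

On these numbers the reduction is roughly 10 to 12 percent at every size, so the advantage does not vanish as the models grow.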

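Across model sizes (125M, 350M, 750M, and 1.3B parameters), ELMs trained with BTM outperform compute-matched Transformer-LMs and the prior DEMix baseline.
BTM achieves more updates per second than the fully synchronized baseline; the efficiency gain from reduced cross-GPU communication grows with model size.
Domain specialization based on data provenance is essential: random data splits perform worse than domain-based splits.
Parameter averaging of the ELMforest can approach ensemble performance without additional inference cost, although the ensemble remains stronger as the number of domains grows.
Scaled to 64 domains, the ELMforest (22.4B total parameters) performs as well as a Transformer-LM trained with 2.5 times more compute; the gains grow with the number of domains.
Seed training is essential for effective averaging and robust performance; the best seed budget is typically around 40 to 60% of total compute, and results are robust over a wide range of budgets.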


This review was generated by AI and checked by a human editor.