QUICK REVIEW

[Paper Review] Multi-Head Mixture-of-Experts

Xun Wu, Shaohan Huang|arXiv (Cornell University)|Apr 23, 2024

Speech and dialogue systems|Cited by 5

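One-Sentence Summary

MH-MoE introduces a multi-head token-splitting mechanism that routes sub-tokens to multiple experts, giving denser expert activation and finer-grained understanding, and it improves on SMoE baselines across language and multimodal tasks at no additional cost.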

ABSTRACT

Sparse Mixtures of Experts (SMoE) scales model capacity without significant increases in training and inference costs, but exhibits the following two issues: (1) Low expert activation, where only a small subset of experts is activated for optimization. (2) A lack of fine-grained analytical capabilities for multiple semantic concepts within individual tokens. We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens. These sub-tokens are then assigned to and processed by a diverse set of experts in parallel, and seamlessly reintegrated into the original token form. The multi-head mechanism enables the model to collectively attend to information from various representation spaces within different experts, while significantly enhancing expert activation, thus deepening context understanding and alleviating overfitting. Moreover, our MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance. Extensive experimental results across three tasks (English-focused language modeling, multi-lingual language modeling, and masked multi-modality modeling) demonstrate the effectiveness of MH-MoE.

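Research Motivation and Goals

Address low expert activation in Sparse Mixtures of Experts (SMoE), where only a few experts are actually used and model capacity is wasted.
Achieve finer-grained semantic understanding within a token by distributing its sub-tokens across multiple experts.
Increase the number of experts actually utilized while preserving compute and parameter efficiency.
Demonstrate effectiveness on English-focused, multilingual, and masked multi-modality modeling tasks.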

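Proposed Method

Apply a multi-head layer to project the input tokens, then split each token into h sub-tokens.
Route each sub-token to its top-k experts through a gating mechanism, which yields denser expert activation.
Merge the sub-token outputs through the token-splitting-merging (TSM) operation and a merge layer, producing the final token representation without adding cost to subsequent layers.
Combine the training loss with a load-balancing term to mitigate skewed expert usage.
Stay compatible with existing SMoE frameworks, requiring only minimal changes to backbones such as X-MoE.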
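
The split, route, and merge steps listed above can be sketched for a single token as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the experts are plain linear maps standing in for feed-forward experts, gate weights come from a softmax over all experts without renormalization after top-k selection, and every name (`mh_moe_token`, `W_head`, `W_merge`, `W_gate`) is hypothetical.

```python
import math
import random

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def rand_matrix(rows, cols, rng):
    """Random matrix with small Gaussian entries (illustrative initialization)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def mh_moe_token(x, W_head, W_merge, W_gate, experts, h, k):
    """Process one token x of length d; return (output, set of expert ids used)."""
    d = len(x)
    assert d % h == 0, "token dimension must be divisible by the head count"
    sub_dim = d // h
    projected = matvec(W_head, x)  # multi-head projection layer
    # Split the projected token into h sub-tokens of length d / h.
    sub_tokens = [projected[i * sub_dim:(i + 1) * sub_dim] for i in range(h)]
    merged_input, used = [], set()
    for s in sub_tokens:
        gates = softmax(matvec(W_gate, s))  # one gate value per expert
        top = sorted(range(len(experts)), key=lambda e: gates[e], reverse=True)[:k]
        out = [0.0] * sub_dim
        for e in top:
            y = matvec(experts[e], s)  # assumption: each expert is a linear map
            out = [o + gates[e] * yi for o, yi in zip(out, y)]
        merged_input.extend(out)  # concatenate sub-token outputs in order
        used.update(top)
    return matvec(W_merge, merged_input), used  # merge layer restores length d

# Toy configuration (arbitrary sizes chosen for illustration only).
rng = random.Random(0)
d, h, k, n_experts = 8, 4, 1, 6
sub_dim = d // h
W_head = rand_matrix(d, d, rng)
W_merge = rand_matrix(d, d, rng)
W_gate = rand_matrix(n_experts, sub_dim, rng)
experts = [rand_matrix(sub_dim, sub_dim, rng) for _ in range(n_experts)]
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
out, used = mh_moe_token(x, W_head, W_merge, W_gate, experts, h, k)
print(len(out), sorted(used))
```

Because each of the h sub-tokens is routed on its own, one token can reach up to h × k distinct experts instead of k, which is the intuition behind the denser activation described above. The output keeps the original length d, so the layers that follow need no changes.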

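Experimental Results

Research Questions

RQ1: Can MH-MoE achieve denser expert activation than standard SMoE without increasing computational cost?
RQ2: Does multi-head token splitting enable finer-grained semantic understanding across languages and modalities?
RQ3: How does MH-MoE compare with Dense and X-MoE baselines on English-focused, multilingual, and masked multi-modality pre-training tasks?
RQ4: How do the number of heads and the MLP/TSM components affect performance and activation patterns?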

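Key Findings

Compared with SMoE, MH-MoE significantly increases expert activation (up to 90.71% in some settings).
By distributing sub-tokens across different experts, MH-MoE achieves finer-grained token understanding and improves representation learning.
On English-focused, multilingual, and masked multi-modality tasks, MH-MoE outperforms the Dense and X-MoE baselines, with lower perplexity and higher downstream accuracy across multiple settings.
Increasing the number of heads up to an optimal range (roughly 4–6) improves performance; beyond that range, semantic content may be diluted.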
Combining Token-Splitting-Merging (TSM) with the MLP layers is necessary for gains that clearly exceed those of TSM alone or the MLP layers alone.
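MH-MoE scales better: it achieves denser activation and raises the upper bound on the number of experts that can be used effectively (up to 256 experts on downstream tasks).
On vision-language tasks (VQA, NLVR2, COCO Captioning), MH-MoE consistently outperforms the X-MoE and Dense baselines.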
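
To make the expert activation figure above more concrete, here is one way such a ratio could be measured from a routing log. It is a sketch under an assumed definition (the fraction of experts that receive at least one token or sub-token); the paper's exact measurement may differ, and the function name and toy logs are made up for illustration.

```python
def activation_ratio(assignments, num_experts):
    """Fraction of experts that received at least one (sub-)token."""
    used = set(assignments)
    return len(used) / num_experts

# Toy routing logs (made-up numbers, not results from the paper).
sparse_log = [0, 0, 1, 0, 1, 0]        # a few experts take every token
dense_log = [0, 3, 5, 1, 7, 2, 3, 6]   # sub-tokens spread more widely
print(activation_ratio(sparse_log, 8))  # 0.25
print(activation_ratio(dense_log, 8))   # 0.875
```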

This review was generated by AI and checked by a human editor.