QUICK REVIEW

[Paper Review] Jamba: A Hybrid Transformer-Mamba Language Model

Opher Lieber, Barak Lenz|arXiv (Cornell University)|Mar 28, 2024

Language, Discourse, Communication Strategies | Cited by 40

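One-Sentence Summary

Jamba introduces a hybrid Transformer-Mamba mixture-of-experts (MoE) architecture that interleaves Transformer and Mamba layers and adds MoE to some of them, delivering strong performance and long-context capability while fitting on a single 80GB GPU.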

ABSTRACT

We present Jamba, a new base large language model based on a novel hybrid Transformer-Mamba mixture-of-experts (MoE) architecture. Specifically, Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families. MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable. This flexible architecture allows resource- and objective-specific configurations. In the particular configuration we have implemented, we end up with a powerful model that fits in a single 80GB GPU. Built at large scale, Jamba provides high throughput and small memory footprint compared to vanilla Transformers, and at the same time state-of-the-art performance on standard language model benchmarks and long-context evaluations. Remarkably, the model presents strong results for up to 256K tokens context length. We study various architectural decisions, such as how to combine Transformer and Mamba layers, and how to mix experts, and show that some of them are crucial in large scale modeling. We also describe several interesting properties of these architectures which the training and evaluation of Jamba have revealed, and plan to release checkpoints from various ablation runs, to encourage further exploration of this novel architecture. We make the weights of our implementation of Jamba publicly available under a permissive license.

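Research Motivation and Objectives

Investigate whether interleaving Transformer and Mamba layers can combine the strengths of attention models and state-space models.
Assess how integrating MoE into the hybrid architecture affects capacity, throughput, and memory.
Evaluate performance on standard benchmarks and long-context tasks, including contexts of up to 256K tokens.
Demonstrate the training stability and practicality of deploying a hybrid model at the 7B-12B parameter scale on commodity hardware.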

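Proposed Method

Define a Jamba block in which each layer contains either an attention (Transformer) or a Mamba module, followed by an MLP or MoE module.
Interleave layers at an attention-to-Mamba ratio of a:m, and apply MoE every e layers, with n total experts and top-K routing per token.
Use RMSNorm in the Mamba layers and omit explicit positional embeddings, relying on the hybrid structure to provide implicit positional information.
Train on large-scale data with a BPE tokenizer and a 64K vocabulary, optimizing throughput and memory efficiency for an 80GB GPU setup.
Evaluate on academic benchmarks, long-context question-answering datasets, and throughput measurements across different hardware scales.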
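
The layer layout described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `block_layout`, the placement offsets `attn_offset` and `moe_offset`, and the example values (8 layers per block, a:m = 1:7, e = 2) are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass
class LayerSpec:
    mixer: str  # "attention" or "mamba"
    ffn: str    # "mlp" or "moe"


def block_layout(num_layers, a, m, e, attn_offset=0, moe_offset=0):
    """Lay out one block with an a:m attention-to-Mamba ratio and MoE every e layers.

    attn_offset and moe_offset are hypothetical knobs that shift where the
    attention and MoE layers fall inside the block.
    """
    period = a + m
    layers = []
    for i in range(num_layers):
        # Within each period of a + m layers, a of them use attention.
        mixer = "attention" if (i - attn_offset) % period < a else "mamba"
        # Every e-th layer swaps its MLP for an MoE module.
        ffn = "moe" if (i - moe_offset) % e == 0 else "mlp"
        layers.append(LayerSpec(mixer, ffn))
    return layers


# Illustrative values only (assumed for this example).
layout = block_layout(num_layers=8, a=1, m=7, e=2, attn_offset=4, moe_offset=1)
for index, layer in enumerate(layout):
    print(index, layer.mixer, layer.ffn)
```

With these values the block contains one attention layer and seven Mamba layers, and every second layer uses MoE in place of the MLP.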

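Experimental Results

Research Questions

RQ1: Can a hybrid Attention-Mamba architecture match or exceed pure Transformer models of comparable size on standard benchmarks?
RQ2: Does adding MoE to the hybrid architecture increase capacity without incurring excessive compute cost?
RQ3: How does the ratio of attention to Mamba layers affect memory usage, throughput, and long-context performance?
RQ4: Can Jamba effectively handle very long contexts (up to 256K tokens) with a reasonable KV cache requirement?
RQ5: What practical training-stability considerations arise for large-scale hybrid models?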

Key Findings

Model	Available params	Active params	KV cache (256K context, 16-bit)
Llama-2	6.7B	6.7B	128GB
Mistral	7.2B	7.2B	32GB
Mixtral	46.7B	12.9B	32GB
Jamba	52B	12B	4GB
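
The KV cache column can be reproduced with simple arithmetic: keys and values are each stored per attention layer, per KV head, per token. The sketch below is a rough check rather than the paper's own calculation. The layer counts, KV head counts, and head dimensions are assumptions taken from the publicly released model configurations, and "GB" is treated as 2^30 bytes.

```python
def kv_cache_bytes(attn_layers, kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * attn_layers * kv_heads * head_dim * seq_len * bytes_per_value


SEQ_LEN = 256 * 1024  # 256K tokens

# (attention layers, KV heads, head dimension) -- assumed values, not from the table.
configs = {
    "Llama-2": (32, 32, 128),
    "Mistral": (32, 8, 128),
    "Mixtral": (32, 8, 128),
    "Jamba": (4, 8, 128),  # only 4 of 32 layers are attention at a 1:7 ratio
}

kv_cache_gib = {
    name: kv_cache_bytes(layers, heads, dim, SEQ_LEN) / 2**30
    for name, (layers, heads, dim) in configs.items()
}

for name, gib in kv_cache_gib.items():
    print(f"{name}: {gib:.0f} GiB")
```

Under these assumptions, Jamba's saving over Mistral and Mixtral comes entirely from having one-eighth as many attention layers, since Mamba layers keep no per-token KV cache.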

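Compared with open models of comparable scale such as Mixtral and Llama-2 70B, Jamba achieves competitive or better accuracy on standard benchmarks.
The hybrid Attention-Mamba architecture reduces the KV cache requirement at 256K context to 4GB, making long-context processing feasible on a single 80GB GPU.
At large scale (7B parameters trained on 50B tokens), the MoE variant outperforms the non-MoE hybrid architecture.
The Attention-Mamba hybrid outperforms pure Mamba on several tasks and supports Transformer-like in-context learning, indicating that the two layer types have complementary strengths.
Explicit positional information is not required for Jamba, because the Mamba layers placed before the attention layers provide implicit positional information.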
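
The gap between available and active parameters comes from top-K routing: each token runs through only K of the n experts. A minimal pure-Python sketch follows, assuming the router weights are a softmax over the selected logits; the helper names `top_k_route` and `moe_forward` and the toy experts are hypothetical, and real implementations differ in detail.

```python
import math


def top_k_route(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over the selected logits only (an assumption of this sketch).
    peak = max(router_logits[i] for i in ranked)
    exps = [math.exp(router_logits[i] - peak) for i in ranked]
    total = sum(exps)
    return [(i, value / total) for i, value in zip(ranked, exps)]


def moe_forward(x, experts, router_logits, k):
    """Weighted sum of the outputs of the k routed experts for one token."""
    out = [0.0] * len(x)
    for index, weight in top_k_route(router_logits, k):
        y = experts[index](x)  # only the routed experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


# Toy setup: 4 experts, each scaling its input by a different factor.
experts = [lambda v, s=i + 1: [s * vi for vi in v] for i in range(4)]
router_logits = [0.1, 2.0, -1.0, 2.0]  # made-up router scores for one token

routes = top_k_route(router_logits, k=2)
output = moe_forward([1.0, 2.0], experts, router_logits, k=2)
print(routes, output)
```

Only two of the four toy experts run for this token, which is why an MoE model can hold many more parameters than it uses per token.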


This review was generated by AI and reviewed by a human editor.