[Paper Review] An Empirical Study of Mamba-based Language Models
This study trains 8B-parameter Mamba, Mamba-2, Transformer, and Mamba-2-Hybrid models on up to 3.5T tokens and evaluates them on 35 NLP tasks, including long-context benchmarks, to examine scaling, copying, in-context learning, and hybrid architectures.
Selective state-space models (SSMs) like Mamba overcome some of the shortcomings of Transformers, such as quadratic computational complexity with sequence length and large inference-time memory requirements from the key-value cache. Moreover, recent studies have shown that SSMs can match or exceed the language modeling capabilities of Transformers, making them an attractive alternative. In a controlled setting (e.g., same data), however, studies so far have only presented small scale experiments comparing SSMs to Transformers. To understand the strengths and weaknesses of these architectures at larger scales, we present a direct comparison between 8B-parameter Mamba, Mamba-2, and Transformer models trained on the same datasets of up to 3.5T tokens. We also compare these models to a hybrid architecture consisting of 43% Mamba-2, 7% attention, and 50% MLP layers (Mamba-2-Hybrid). Using a diverse set of tasks, we answer the question of whether Mamba models can match Transformers at larger training budgets. Our results show that while pure SSMs match or exceed Transformers on many tasks, they lag behind Transformers on tasks which require strong copying or in-context learning abilities (e.g., 5-shot MMLU, Phonebook) or long-context reasoning. In contrast, we find that the 8B Mamba-2-Hybrid exceeds the 8B Transformer on all 12 standard tasks we evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time. To validate long-context capabilities, we provide additional experiments evaluating variants of the Mamba-2-Hybrid and Transformer extended to support 16K, 32K, and 128K sequences. On an additional 23 long-context tasks, the hybrid model continues to closely match or exceed the Transformer on average. To enable further study, we release the checkpoints as well as the code used to train our models as part of NVIDIA's Megatron-LM project.
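The layer mix and the key-value cache argument in the abstract can be checked with a little arithmetic. The sketch below uses the 24 Mamba-2 / 4 self-attention / 28 MLP split reported for the 8B hybrid in this review. The KV-head count, head dimension, sequence length, and the 32-layer Transformer are illustrative assumptions, not values taken from the paper, and the fixed-size Mamba-2 state is ignored.

```python
# Layer counts for the 8B Mamba-2-Hybrid as reported in this review.
HYBRID_LAYERS = {"mamba2": 24, "attention": 4, "mlp": 28}


def layer_fractions(layers):
    """Return each layer type's share of the total layer count."""
    total = sum(layers.values())
    return {name: count / total for name, count in layers.items()}


def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Size of the key-value cache for a single sequence.

    Every attention layer stores one key and one value vector per token
    and per KV head, hence the leading factor of 2.
    """
    return 2 * n_attn_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


fractions = layer_fractions(HYBRID_LAYERS)
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%}")

# ASSUMPTION: illustrative dimensions, not taken from the paper.
ASSUMED = dict(n_kv_heads=8, head_dim=128, seq_len=32_768)
# ASSUMPTION: a 32-layer Transformer with attention in every layer.
transformer_cache = kv_cache_bytes(n_attn_layers=32, **ASSUMED)
hybrid_cache = kv_cache_bytes(n_attn_layers=HYBRID_LAYERS["attention"], **ASSUMED)
print(f"cache ratio (hybrid / transformer): {hybrid_cache / transformer_cache:.3f}")
```

Under these assumptions the cache shrinks in proportion to the number of attention layers, which is one plausible reason the hybrid is expected to generate tokens faster on long sequences.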
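Motivation and Objectives
- Evaluate how Mamba-based language models perform relative to a Transformer baseline at large scale (8B parameters, up to 3.5T tokens).
- Investigate the strengths and weaknesses of pure SSMs (Mamba/Mamba-2) on standard and long-context tasks.
- Explore whether a hybrid Mamba-Transformer architecture can close the performance gap of pure SSM models while retaining their efficiency advantages.
- Release checkpoints and training code to promote reproducibility and further research.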
Proposed Method
- Direct apples-to-apples comparison: Mamba, Mamba-2, Mamba-2-Hybrid, and Transformer models are trained with the same data, hyperparameters, and evaluation setup.
- Evaluation on 12 standard short-context tasks and 23 long-context tasks using open benchmark suites (LM Evaluation Harness, LongBench, RULER).
- Analysis of MMLU across three formats (standard, choice-text-in-targets, and cloze) to probe how prompt format affects in-context learning.
- Ablation studies to design hybrid architectures that distribute Mamba-2, self-attention, and MLP layers for optimal performance.
- Investigation of long-context extensions up to 128K tokens for both pure and hybrid models.
- Release of training code and model weights via NVIDIA Megatron-LM and Hugging Face.
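The three MMLU formats named in the method list differ in what the prompt shows and what the model must produce. The following is a minimal sketch of one reading of those formats. The `Question` class, the function names, and the exact templates are hypothetical and are not taken from the paper or from LM Evaluation Harness.

```python
from __future__ import annotations

from dataclasses import dataclass

LETTERS = "ABCD"


@dataclass
class Question:
    text: str
    choices: list[str]
    answer_index: int


def standard(q: Question) -> tuple[str, str]:
    """Choices are listed in the prompt; the target is the answer letter."""
    options = "\n".join(f"{letter}. {choice}" for letter, choice in zip(LETTERS, q.choices))
    prompt = f"{q.text}\n{options}\nAnswer:"
    return prompt, f" {LETTERS[q.answer_index]}"


def choice_text_in_targets(q: Question) -> tuple[str, str]:
    """Same prompt as standard; the target is the letter plus the choice text."""
    prompt, letter = standard(q)
    return prompt, f"{letter}. {q.choices[q.answer_index]}"


def cloze(q: Question) -> tuple[str, str]:
    """No choices in the prompt; the target is the choice text alone."""
    return f"{q.text}\nAnswer:", f" {q.choices[q.answer_index]}"


# ASSUMPTION: a toy question for illustration, not an MMLU item.
q = Question("What is 2 + 2?", ["3", "4", "5", "6"], answer_index=1)
for build in (standard, choice_text_in_targets, cloze):
    prompt, target = build(q)
    print(f"--- {build.__name__} ---\n{prompt}{target}\n")
```

In the standard format the model has to map the answer back to a letter that appears earlier in the context, whereas the cloze format only asks which continuation is most likely. This is why comparing the formats helps separate knowledge from in-context lookup ability.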
Experimental Results
Research Questions
- RQ1: Can 8B-parameter Mamba and Mamba-2 match Transformer performance on standard NLP tasks when trained on large token budgets (up to 3.5T tokens) under controlled conditions?
- RQ2: What are the specific weaknesses of pure SSM models in tasks requiring in-context learning, copying, or long-context reasoning?
- RQ3: Can a hybrid Mamba-Transformer architecture close the gaps observed for pure SSM models while preserving efficiency benefits during inference?
- RQ4: How do long-context extensions (16K, 32K, 128K) affect the performance of pure SSM and hybrid models on standard and long-context benchmarks?
- RQ5: Do Mamba-2-Hybrid architectures demonstrate practical inference speedups and scalability advantages over pure Transformers?
Key Results
- Pure SSM models (Mamba/Mamba-2) match or exceed Transformers on many standard tasks, but lag on MMLU (especially at shorter training horizons, i.e., fewer training tokens) and on copying tasks like Phonebook.
- Training Mamba-2 on 3.5T tokens substantially closes the MMLU gap to the Transformer, and at that budget Mamba-2 can surpass the Transformer on average across short-context benchmarks.
- An 8B-parameter Mamba-2-Hybrid (24 Mamba-2, 4 self-attention, and 28 MLP layers) exceeds the 8B-parameter Transformer on all 12 short-context tasks evaluated (+2.65 points on average) and is predicted to be up to 8x faster when generating tokens at inference time.
- Long-context extensions (16K and 32K) for the Mamba-2-Hybrid closely match or exceed Transformer baselines on average across 23 long-context tasks.
- Phonebook-style copying tasks show that pure SSM models struggle with in-context copying beyond ~500 tokens, whereas the Transformer copies accurately up to its pretraining context length of 4096 tokens.
- Hybrid models with distributed attention/MLP layers show strong performance, with ablations suggesting around 8% self-attention layers and 30-50% MLP layers as effective configurations; RoPE position embeddings are not essential for large hybrids and may be omitted for long contexts.
- Inference speedup: Mamba-2-Hybrid can generate tokens significantly faster than Transformers on long contexts, with practical model FLOPs utilization (MFU) comparable to strong Transformer baselines.
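The Phonebook result above is easier to interpret with a concrete instance of the task. The sketch below builds a synthetic phonebook and a lookup query, with exact-match scoring. The prompt template, the `Person0000`-style names, and the function names are hypothetical, and the paper's actual setup may differ (for example, in its use of few-shot examples).

```python
import random


def make_phonebook_prompt(n_entries, seed=0):
    """Build a synthetic phonebook and a lookup query for one of its entries."""
    rng = random.Random(seed)
    # Hypothetical synthetic names; unique by construction.
    names = [f"Person{i:04d}" for i in range(n_entries)]
    # Seven-digit numbers, zero-padded so every answer has the same length.
    numbers = [f"{rng.randrange(10**7):07d}" for _ in range(n_entries)]
    book = "\n".join(f"{name}: {number}" for name, number in zip(names, numbers))
    target = rng.randrange(n_entries)
    prompt = f"{book}\n\nWhat is the phone number of {names[target]}?\nAnswer:"
    return prompt, numbers[target]


def exact_match(prediction, answer):
    """Count a copy as correct only if every digit matches."""
    return prediction.strip() == answer


prompt, answer = make_phonebook_prompt(n_entries=20)
print(prompt[-80:])
print("expected:", answer)
```

Increasing `n_entries` lengthens the context the model must copy from, so sweeping it is one way to locate the length at which copying accuracy breaks down.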
This review was generated by AI and reviewed by a human editor.