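[Paper Summary] AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model
AlexaTM 20B is a 20B-parameter multilingual seq2seq model pre-trained with denoising and CLM objectives. It excels at few-shot learning and outperforms larger decoder-only models on summarization, machine translation, and multilingual NLP tasks.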
In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show in zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.
Research Motivation and Goals
- Motivate and build the largest multilingual seq2seq model capable of few-shot in-context learning.
- Show that seq2seq models can outperform large decoder-only LLMs on long-context tasks such as summarization.
- Demonstrate strong one-shot and zero-shot performance in translation across many languages, especially low-resource ones.
- Evaluate zero-shot multilingual NLP tasks and compare against prior SOTA models.
- Assess memorization, fairness, and bias to understand risks associated with the model.
Proposed Method
- Pre-train AlexaTM 20B with a mix of denoising and causal language modeling (CLM) tasks across 12 languages.
- Use a standard Transformer architecture with Pre-LN to improve stability at scale.
- Train for a total of 1 trillion tokens on Wikipedia and mC4 data, using 1024-token sequences and a unigram SentencePiece tokenizer with a 150K vocabulary.
- Incorporate a CLM objective with a special [CLM] token to enable continuation of inputs.
- Initialize from a pre-trained 10B encoder model and use DeepSpeed ZeRO-3 for scalable distributed training.
- Perform in-context learning in either denoising mode or CLM mode, and use Fusion-in-Decoder (FiD) to encode multiple shots separately so the decoder can attend to all of them.
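The mix of denoising and CLM objectives described above can be sketched as a small data-preparation routine. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, and the mixing probability, drop rate, fixed span length, and prefix fractions are placeholder values chosen for the example rather than figures taken from this summary. In denoising mode the model sees a corrupted input and reconstructs the full sequence; in CLM mode the input starts with the `[CLM]` token and the model continues the text.

```python
import random

CLM_TOKEN = "[CLM]"  # special token named in the method list above


def make_denoising_example(tokens, rng, drop_rate=0.15, span_len=3):
    """Drop short spans from the input; the target is the original sequence.

    drop_rate and span_len are illustrative assumptions. A fixed span
    length is a simplification; a real setup may sample span lengths.
    """
    n = len(tokens)
    n_to_drop = max(1, round(n * drop_rate)) if n > 1 else 0
    dropped = set()
    attempts = 0
    while len(dropped) < n_to_drop and attempts < 10 * n:
        start = rng.randrange(n)
        for i in range(start, min(n, start + span_len)):
            if len(dropped) < n_to_drop:
                dropped.add(i)
        attempts += 1
    source = [t for i, t in enumerate(tokens) if i not in dropped]
    return source, list(tokens)


def make_clm_example(tokens, rng, min_frac=0.2, max_frac=0.8):
    """Give the encoder [CLM] plus a prefix; the target is the continuation.

    min_frac and max_frac are illustrative assumptions.
    """
    n = len(tokens)
    cut = int(n * rng.uniform(min_frac, max_frac))
    cut = min(max(cut, 1), n - 1)  # keep both sides non-empty when n >= 2
    return [CLM_TOKEN] + tokens[:cut], tokens[cut:]


def make_example(tokens, rng, clm_prob=0.2):
    """Pick one objective per document; clm_prob is an assumed mixing ratio."""
    if rng.random() < clm_prob:
        return ("clm",) + make_clm_example(tokens, rng)
    return ("denoising",) + make_denoising_example(tokens, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "the quick brown fox jumps over the lazy dog again and again".split()
    mode, source, target = make_example(doc, rng)
    print(mode, source, target, sep="\n")
```

Sharing one encoder-decoder interface across both objectives is what lets the same model be prompted in either mode at inference time: a few-shot prompt can be passed as a corrupted input to reconstruct, or prefixed with `[CLM]` to be continued.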
Experimental Results
Research Questions
- RQ1: Can a large-scale multilingual seq2seq model provide effective few-shot learning across generative NLP tasks?
- RQ2: How does a multilingual seq2seq model compare to larger decoder-only LLMs on long-context tasks like summarization and cross-lingual translation?
- RQ3: What are the zero-shot capabilities of AlexaTM 20B on standard multilingual NLP benchmarks and English tasks, relative to existing SOTA models?
- RQ4: Does seq2seq pre-training in a multilingual setting improve translation quality for low-resource languages?
- RQ5: What are the memorization, fairness, and bias characteristics of a 20B multilingual seq2seq model?
Key Findings
- AlexaTM 20B achieves SOTA results in 1-shot summarization, outperforming a 540B PaLM decoder model on XSUM and MLSum datasets.
- AlexaTM 20B achieves SOTA in 1-shot machine translation across Flores-101 language pairs, with notable gains for Marathi, Tamil, and Telugu.
- In zero-shot settings, AlexaTM 20B outperforms GPT-3 (175B) on SuperGLUE and SQuADv2 and achieves SOTA on multilingual tasks like XNLI, XCOPA, Paws-X, and XWinograd.
- Across multilingual NLP tasks, AlexaTM 20B provides strong zero-shot performance, surpassing XGLM 7.5B on several benchmarks.
- On English tasks, AlexaTM 20B surpasses GPT-3 175B and is competitive with PaLM 540B on SQuADv2 and most SuperGLUE tasks, while requiring fewer parameters than the largest decoder-only models.
- Memorization analysis suggests reduced memorization with longer contexts, and bias/toxicity analyses show state-of-the-art results on Winogender in zero-shot, with toxicity influenced by prompt content.
This summary was generated by AI and reviewed by a human editor.