QUICK REVIEW

[Paper Review] AlexaTM 20B: Few-Shot Learning Using a Large-Scale Multilingual Seq2Seq Model

Saleh Soltan, Shankar Ananthakrishnan|arXiv (Cornell University)|Aug 2, 2022
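Topic Modeling | Cited by 38
One-Sentence Summary

AlexaTM 20B is a 20-billion-parameter multilingual seq2seq model pre-trained on a mixture of denoising and CLM tasks. It performs strongly in few-shot learning and outperforms larger decoder-only models on summarization, machine translation, and multilingual NLP tasks.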

ABSTRACT

In this work, we demonstrate that multilingual large-scale sequence-to-sequence (seq2seq) models, pre-trained on a mixture of denoising and Causal Language Modeling (CLM) tasks, are more efficient few-shot learners than decoder-only models on various tasks. In particular, we train a 20 billion parameter multilingual seq2seq model called Alexa Teacher Model (AlexaTM 20B) and show that it achieves state-of-the-art (SOTA) performance on 1-shot summarization tasks, outperforming a much larger 540B PaLM decoder model. AlexaTM 20B also achieves SOTA in 1-shot machine translation, especially for low-resource languages, across almost all language pairs supported by the model (Arabic, English, French, German, Hindi, Italian, Japanese, Marathi, Portuguese, Spanish, Tamil, and Telugu) on Flores-101 dataset. We also show in zero-shot setting, AlexaTM 20B outperforms GPT3 (175B) on SuperGLUE and SQuADv2 datasets and provides SOTA performance on multilingual tasks such as XNLI, XCOPA, Paws-X, and XWinograd. Overall, our results present a compelling case for seq2seq models as a powerful alternative to decoder-only models for Large-scale Language Model (LLM) training.

Research Motivation and Objectives

  • Motivate and build the largest multilingual seq2seq model capable of few-shot in-context learning.
  • Show that seq2seq models can outperform large decoder-only LLMs on long-context tasks such as summarization.
  • Demonstrate strong one-shot and zero-shot performance in translation across many languages, especially low-resource ones.
  • Evaluate zero-shot multilingual NLP tasks and compare against prior SOTA models.
  • Assess memorization, fairness, and bias to understand risks associated with the model.

Proposed Method

  • Pre-train AlexaTM 20B with a mix of denoising and causal language modeling (CLM) tasks across 12 languages.
  • Use a standard Transformer architecture with Pre-LN to improve stability at scale.
  • Train for a total of 1 trillion tokens on Wikipedia and mC4 data, using 1024-token sequences and a SentencePiece unigram tokenizer with a 150K-token vocabulary.
  • Incorporate a CLM objective with a special [CLM] token to enable continuation of inputs.
  • Initialize from a 10B pre-trained encoder model and use DeepSpeed ZeRO-3 for scalable distributed training.
  • Employ in-context learning through denoising and CLM modes, including Fusion-in-Decoder (FiD) to encode multiple shots for decoder attention.
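The two pre-training objectives listed above can be sketched as a small data-preparation routine. This is a minimal illustration, not the authors' implementation: the function names, the 15% drop rate, the maximum span length of 3, the 20% share of CLM examples, and the 20% to 80% prefix split are all assumptions chosen for the example. Only the `[CLM]` token itself comes from the summary above.

```python
import random

CLM_TOKEN = "[CLM]"  # special token named in the summary above


def make_denoising_example(tokens, rng, drop_rate=0.15, max_span=3):
    """Drop short random spans from the input; the target is the full original.

    drop_rate and max_span are illustrative assumptions.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n = len(tokens)
    n_drop = max(1, round(n * drop_rate))
    keep = [True] * n
    dropped = 0
    while dropped < n_drop:
        # Never drop more than the remaining budget in one span.
        span = min(rng.randint(1, max_span), n_drop - dropped)
        start = rng.randrange(n)
        for i in range(start, min(n, start + span)):
            if keep[i]:
                keep[i] = False
                dropped += 1
    source = [t for t, k in zip(tokens, keep) if k]
    return source, list(tokens)


def make_clm_example(tokens, rng, min_frac=0.2, max_frac=0.8):
    """Prefix the input with [CLM]; the target is the continuation.

    The split fractions are illustrative assumptions.
    """
    n = len(tokens)
    if n < 2:
        raise ValueError("need at least two tokens to split")
    cut = int(n * rng.uniform(min_frac, max_frac))
    cut = max(1, min(n - 1, cut))  # keep both sides non-empty
    return [CLM_TOKEN] + list(tokens[:cut]), list(tokens[cut:])


def make_pretraining_example(tokens, rng, clm_prob=0.2):
    """Pick one of the two objectives at random (clm_prob is an assumption)."""
    if rng.random() < clm_prob:
        return make_clm_example(tokens, rng)
    return make_denoising_example(tokens, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    doc = "the quick brown fox jumps over the lazy dog near the river bank".split()
    for _ in range(3):
        src, tgt = make_pretraining_example(doc, rng)
        print(src, "->", tgt)
```

In this sketch the leading `[CLM]` token is the only signal that tells the model whether to reconstruct the input or continue it, which is why the same encoder-decoder can serve both modes at inference time.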

Experimental Results

Research Questions

  • RQ1: Can a large-scale multilingual seq2seq model provide effective few-shot learning across generative NLP tasks?
  • RQ2: How does a multilingual seq2seq model compare to larger decoder-only LLMs on long-context tasks like summarization and cross-lingual translation?
  • RQ3: What are the zero-shot capabilities of AlexaTM 20B on standard multilingual NLP benchmarks and English tasks, relative to existing SOTA models?
  • RQ4: Does seq2seq pre-training in a multilingual setting improve translation quality for low-resource languages?
  • RQ5: What are the memorization, fairness, and bias characteristics of a 20B multilingual seq2seq model?

Key Findings

  • AlexaTM 20B achieves SOTA results in 1-shot summarization, outperforming a 540B PaLM decoder model on XSUM and MLSum datasets.
  • AlexaTM 20B achieves SOTA in 1-shot machine translation across Flores-101 language pairs, with notable gains for Marathi, Tamil, and Telugu.
  • In zero-shot settings, AlexaTM 20B outperforms GPT-3 (175B) on SuperGLUE and SQuADv2 and achieves SOTA on multilingual tasks like XNLI, XCOPA, Paws-X, and XWinograd.
  • Across multilingual NLP tasks, AlexaTM 20B provides strong zero-shot performance, surpassing XGLM 7.5B on several benchmarks.
  • On English tasks, AlexaTM 20B surpasses GPT-3 175B and is competitive with PaLM 540B on SQuADv2 and most SuperGLUE tasks, while requiring fewer parameters than the largest decoder-only models.
  • Memorization analysis suggests reduced memorization with longer contexts. Bias and toxicity analyses show state-of-the-art zero-shot results on Winogender, with toxicity influenced by prompt content.

This review was generated by AI and reviewed by human editors.