QUICK REVIEW

[Paper Summary] Scaling Laws for Generative Mixed-Modal Language Models

Armen Aghajanyan, Lili Yu|arXiv (Cornell University)|Jan 10, 2023
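Topic Modeling | Cited by 7
One-Sentence Summary

This paper derives scaling laws for mixed-modal generative language models, which jointly model text, speech, images, code, and other modalities. The laws include an interaction term that captures competition or synergy between modalities, and they are validated across 250+ experiments and on a 30B-parameter speech-text model.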

ABSTRACT

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.

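Motivation and Objectives

  • Understand how model size, data volume, and modality interaction affect the performance of mixed-modal generative models.
  • Extend unimodal neural scaling laws to the multimodal case by adding an additive interaction term that models each modality's contribution and the interactions between modalities.
  • Provide practical guidelines for choosing hyperparameters in the multimodal setting when the unimodal optima are already known.
  • Describe the training phenomena that emerge and how they relate to modality interaction.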

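Proposed Method

  • Train a single discrete language model on tokens representing seven modalities (text, image, image-text, speech, speech-text, code, molecules).
  • Use the unified scaling-law parameterization of Hoffmann et al. (2022) and introduce an additive interaction term that models each modality's contribution and the interactions between modalities.
  • Run over 250 experiments covering seven modalities, with model sizes from 8M to 30B parameters trained on 5-100B tokens.
  • Derive empirical observations that link the scaling-law parameters to training behavior, such as stability and coordinate-ascent dynamics.
  • Validate the scaling law by training a 30B speech-text model and comparing it against unimodal baselines.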
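
The parameterization described above can be sketched in a few lines. The unimodal part follows the Hoffmann et al. (2022) form L(N, D) = E + A / N^alpha + B / D^beta. The summary does not spell out the exact shape of the interaction term, so the form used here (the average of the two unimodal losses plus a term -C + A / N^alpha + B / (D_i + D_j)^beta) is an assumption made for illustration. The class names, function names, and all coefficients are hypothetical placeholders, not fitted values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UnimodalLaw:
    """Chinchilla-style law: L(N, D) = E + A / N**alpha + B / D**beta."""
    E: float      # irreducible loss
    A: float
    alpha: float
    B: float
    beta: float

    def loss(self, n_params: float, n_tokens: float) -> float:
        return self.E + self.A / n_params**self.alpha + self.B / n_tokens**self.beta


@dataclass(frozen=True)
class Interaction:
    """Assumed additive interaction term for one modality pair (i, j)."""
    C: float      # assumed maximum synergy, subtracted from the loss
    A: float
    alpha: float
    B: float
    beta: float

    def term(self, n_params: float, tokens_i: float, tokens_j: float) -> float:
        total = tokens_i + tokens_j
        return -self.C + self.A / n_params**self.alpha + self.B / total**self.beta


def mixed_modal_loss(law_i, law_j, inter, n_params, tokens_i, tokens_j):
    """Average of the two unimodal losses plus the additive interaction term."""
    base = 0.5 * (law_i.loss(n_params, tokens_i) + law_j.loss(n_params, tokens_j))
    return base + inter.term(n_params, tokens_i, tokens_j)


if __name__ == "__main__":
    # Placeholder coefficients chosen only to make the arithmetic easy to follow.
    text = UnimodalLaw(E=2.0, A=100.0, alpha=0.5, B=100.0, beta=0.5)
    speech = UnimodalLaw(E=1.5, A=80.0, alpha=0.5, B=120.0, beta=0.5)
    pair = Interaction(C=0.5, A=100.0, alpha=0.5, B=50.0, beta=0.5)
    print(mixed_modal_loss(text, speech, pair, 1e8, 5e9, 5e9))
```

Under this assumed form, a positive interaction term raises the joint loss above the unimodal average (competition), and a negative one lowers it (synergy).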
Figure 1: Single modality training curves for 100B tokens across a wide range of model sizes. Different modalities exhibit wildly different training dynamics.

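Experimental Results

Research Questions

  • RQ1: What form does the scaling law take when multiple modalities are trained jointly?
  • RQ2: How does modality interaction (competition and synergy) affect the optimal data volume, model size, and training dynamics?
  • RQ3: Can the mixed-modal scaling law predict the specific regimes in which modalities compete or cooperate during training?
  • RQ4: Given known unimodal optima, and taking the modality interaction term into account, what practical hyperparameter guidelines can be offered?
  • RQ5: Does a mixed-modal model trained at large scale outperform the corresponding unimodal models on multimodal tasks?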

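Key Findings

  • Identify a mixed-modal scaling law with an additive interaction term that captures both competition and synergy between modalities.
  • Observe an emergent coordinate-ascent style of training in which optimization naturally alternates between modalities.
  • Provide guidelines, based on the interaction term, for selecting critical hyperparameters when the unimodal optima are known.
  • Show that the 30B speech-text model significantly outperforms the corresponding unimodal models.
  • Show that the interaction term accurately predicts the regimes in which competition between modalities weakens or disappears (for example, speech and text).
  • Report results linking the scaling-law parameters to empirical phenomena such as training stability and optimal batch size.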
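
To make the "competition weakens or disappears" finding concrete, the sketch below solves for the model size at which an interaction term stops adding loss. It assumes, purely for illustration, that the term has the form -C + A / N^alpha + B / D^beta, where D is the combined token count of the two modalities. The summary does not state this form, and the function names and coefficients are hypothetical.

```python
import math


def interaction(n_params, n_tokens, C, A, alpha, B, beta):
    """Assumed interaction term; positive means competition, negative means synergy."""
    return -C + A / n_params**alpha + B / n_tokens**beta


def break_even_params(n_tokens, C, A, alpha, B, beta):
    """Smallest model size N at which the assumed interaction term reaches zero.

    Returns math.inf when the data-dependent part alone already exceeds C,
    so that no model size removes the competition at this token budget.
    """
    margin = C - B / n_tokens**beta
    if margin <= 0:
        return math.inf
    # Solve -C + A / N**alpha + B / D**beta = 0 for N.
    return (A / margin) ** (1.0 / alpha)


if __name__ == "__main__":
    # Placeholder coefficients, not fitted values.
    coeffs = dict(C=1.0, A=100.0, alpha=0.5, B=50.0, beta=0.5)
    n_star = break_even_params(1e10, **coeffs)
    print(f"break-even model size: {n_star:.3e} parameters")
    print(f"interaction at break-even: {interaction(n_star, 1e10, **coeffs):.2e}")
```

Models larger than the break-even size would fall in the synergy regime under these assumptions, and smaller ones in the competition regime.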
Figure 2: Empirical scaling properties across both data and model size scale for the uni-modal setting.

This summary was generated by AI and reviewed by human editors.