QUICK REVIEW

[Paper Review] Scaling Laws for Generative Mixed-Modal Language Models

Armen Aghajanyan, Lili Yu|arXiv (Cornell University)|Jan 10, 2023

Topic Modeling | Citations: 7

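One-Line Summary

This paper derives scaling laws for mixed-modal generative language models that jointly model text, speech, images, code, and other modalities. The laws include an interaction term that captures competition or synergy between modalities. The authors validate them with over 250 experiments and a 30B-parameter speech-text model.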

ABSTRACT

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.

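Motivation and Objectives

Understand how model size, data volume, and interactions between modalities affect the performance of mixed-modal generative models.
Extend unimodal neural scaling laws to multiple modalities by adding an additive interaction term.
Provide practical guidance for choosing hyperparameters in the multi-modal setting when the unimodal optima are already known.
Characterize the phenomena that emerge during model training and how they relate to interactions between modalities.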

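Proposed Method

Train a single discrete language model over tokens representing seven modalities (Text, Image, Image-Text, Speech, Speech-Text, Code, Molecules).
Adopt the unified scaling-law parameterization of Hoffmann et al. (2022) and introduce an additive interaction term that models the contribution of each modality and the interactions between them.
Run over 250 experiments across the seven modalities, with model sizes from 8M to 30B parameters and 5-100B training tokens.
Empirically observe how the scaling-law parameters relate to training behavior such as training stability and coordinate-ascent dynamics.
Validate the law by training a 30B speech-text model and comparing it against unimodal baselines.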
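The additive structure described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the unimodal part follows the Hoffmann et al. (2022) form L(N, D) = E + A/N^alpha + B/D^beta, while the shape of the interaction term (a constant offset plus power laws in model size and total tokens, added to the average of the two unimodal losses) is an assumption made here because this review does not spell it out. All function names and coefficient values are hypothetical.

```python
def unimodal_loss(n_params, n_tokens, E, A, B, alpha, beta):
    """Chinchilla-style loss: irreducible term plus power laws in model size and data."""
    return E + A / n_params**alpha + B / n_tokens**beta


def mixed_modal_loss(loss_i, loss_j, n_params, n_tokens_total,
                     C, A_ij, B_ij, alpha_ij, beta_ij):
    """Mean of two unimodal losses plus an ASSUMED additive interaction term.

    A negative interaction lowers the joint loss (synergy);
    a positive one raises it (competition).
    """
    interaction = -C + A_ij / n_params**alpha_ij + B_ij / n_tokens_total**beta_ij
    return 0.5 * (loss_i + loss_j) + interaction


# Illustrative coefficients only; none of these values come from the paper.
speech = unimodal_loss(1e9, 5e10, E=1.5, A=400.0, B=400.0, alpha=0.34, beta=0.28)
text = unimodal_loss(1e9, 5e10, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28)
joint = mixed_modal_loss(speech, text, 1e9, 1e11,
                         C=0.1, A_ij=50.0, B_ij=50.0, alpha_ij=0.4, beta_ij=0.3)
print(f"speech={speech:.3f} text={text:.3f} joint={joint:.3f}")
```

Keeping the interaction as a separate additive term means the unimodal fits can be reused unchanged, and only the pairwise coefficients need to be estimated for each modality pair.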

Figure 1: Single modality training curves for 100B tokens across a wide range of model sizes. Different modalities exhibit wildly different training dynamics.

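Experimental Results

Research Questions

RQ1: What form does the scaling law take when multiple modalities are trained jointly?
RQ2: How do interactions between modalities (competition or synergy) affect the optimal data volume, model size, and training dynamics?
RQ3: Can the mixed-modal scaling law predict the regimes in which modalities compete during training and the regimes in which they become synergistic?
RQ4: When the unimodal optima are known, what practical hyperparameter guidelines follow from the interaction term?
RQ5: Does a mixed-modal model trained at large scale outperform the corresponding unimodal models on multi-modal tasks?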

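Key Findings

Identified a mixed-modal scaling law with an additive interaction term that captures competition and synergy.
Observed emergent coordinate-ascent style training, in which optimization naturally alternates between modalities.
Provided guidelines for selecting key hyperparameters from the interaction term when the unimodal optima are known.
Showed that a 30B speech-text model significantly outperforms the corresponding unimodal models.
Showed that the interaction term predicts the regimes in which modality competition is reduced or eliminated (e.g., Speech and Text).
Reported empirical phenomena linking the scaling-law parameters to training stability and optimal batch size.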
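To make the competition-versus-synergy finding concrete, the sketch below evaluates an interaction term at several model sizes and labels each one. It is a hedged illustration only: the functional form (a constant offset plus power laws in model size and total tokens) is an assumption, the coefficients are invented, and only the 8M and 30B endpoints of the size grid come from this review.

```python
def interaction_term(n_params, n_tokens_total, C, A_ij, B_ij, alpha_ij, beta_ij):
    """ASSUMED interaction term: constant synergy offset plus decaying power laws."""
    return -C + A_ij / n_params**alpha_ij + B_ij / n_tokens_total**beta_ij


def regime(value):
    """Positive interaction raises the joint loss (competition); otherwise synergy."""
    return "competition" if value > 0 else "synergy"


# Hypothetical coefficients and intermediate sizes, chosen only for illustration.
COEFFS = dict(C=0.1, A_ij=50.0, B_ij=50.0, alpha_ij=0.4, beta_ij=0.3)
MODEL_SIZES = [8e6, 125e6, 1.3e9, 6.7e9, 30e9]
TOTAL_TOKENS = 1e11

for n in MODEL_SIZES:
    value = interaction_term(n, TOTAL_TOKENS, **COEFFS)
    print(f"N={n:.1e} interaction={value:+.4f} -> {regime(value)}")
```

Under these made-up coefficients the power-law terms shrink as the model grows, so the term changes sign and the pair moves from competition to synergy. With fitted coefficients, the same check would indicate where competition is expected to fade.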

Figure 2: Empirical scaling properties across both data and model size scale for the uni-modal setting.


This review was generated by AI and checked by a human editor.