QUICK REVIEW

[Paper Review] Scaling Laws for Generative Mixed-Modal Language Models

Armen Aghajanyan, Lili Yu | arXiv (Cornell University) | 2023-01-10

Topic Modeling | Citations: 7

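One-Line Summary

This paper derives scaling laws for mixed-modal generative language models that jointly model several modalities, such as text, speech, images, and code. The laws include an interaction term that captures competition or synergy between modalities, and they are validated with more than 250 experiments and a 30B-parameter speech-text model.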

ABSTRACT

Generative language models define distributions over sequences of tokens that can represent essentially any combination of data modalities (e.g., any permutation of image tokens from VQ-VAEs, speech tokens from HuBERT, BPE tokens for language or code, and so on). To better understand the scaling properties of such mixed-modal models, we conducted over 250 experiments using seven different modalities and model sizes ranging from 8 million to 30 billion, trained on 5-100 billion tokens. We report new mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during the training, such as emergent coordinate-ascent style training that naturally alternates between modalities, guidelines for selecting critical hyper-parameters, and connections between mixed-modal competition and training stability. Finally, we test our scaling law by training a 30B speech-text model, which significantly outperforms the corresponding unimodal models. Overall, our research provides valuable insights into the design and training of mixed-modal generative models, an important new class of unified models that have unique distributional properties.

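Motivation and Objectives

Understand how model size, data, and interactions between modalities affect the performance of mixed-modal generative models.
Extend uni-modal neural scaling laws to the multi-modal setting by introducing an additional interaction term.
Provide practical guidelines for choosing hyperparameters in the multi-modal setting when the uni-modal optima are already known.
Identify and characterize emergent training phenomena tied to interactions between modalities.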

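Proposed Method

Train a single discrete language model over tokens representing Text, Image, Image-Text, Speech, Speech-Text, Code, and Molecules.
Model per-modality contributions and their interactions using the unified scaling-law parameterization of Hoffmann et al. (2022), extended with an additional interaction term.
Run more than 250 experiments spanning 7 modalities and model sizes from 8M to 30B parameters, trained on 5-100B tokens.
Derive empirical observations that link the scaling-law parameters to training behavior such as stability and coordinate-ascent dynamics.
Train a 30B speech-text model and compare it against uni-modal baselines to validate the scaling law.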
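
The second step above can be sketched in code. The uni-modal part follows the Hoffmann et al. (2022) form L(N, D) = E + A / N^alpha + B / D^beta. This review does not spell out the exact shape of the interaction term, so the sketch assumes it is added to the average of the two uni-modal losses and has the form -C + A_ij / N^alpha_ij + B_ij / (D_i + D_j)^beta_ij. All class and function names here are hypothetical, and no coefficient values from the paper are implied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerLawParams:
    """Coefficients of the uni-modal law L(N, D) = E + A / N**alpha + B / D**beta."""
    E: float      # irreducible loss
    A: float      # model-size coefficient
    alpha: float  # model-size exponent
    B: float      # data-size coefficient
    beta: float   # data-size exponent


@dataclass(frozen=True)
class InteractionParams:
    """Coefficients of the ASSUMED interaction term (not given in this review)."""
    C: float      # assumed maximal synergy, subtracted from the loss
    A: float
    alpha: float
    B: float
    beta: float


def unimodal_loss(p: PowerLawParams, n_params: float, n_tokens: float) -> float:
    """Predicted loss for one modality at model size N and dataset size D."""
    return p.E + p.A / n_params ** p.alpha + p.B / n_tokens ** p.beta


def interaction_term(q: InteractionParams, n_params: float, n_tokens_total: float) -> float:
    """Assumed additive term: positive means competition, negative means synergy."""
    return -q.C + q.A / n_params ** q.alpha + q.B / n_tokens_total ** q.beta


def mixed_modal_loss(
    p_i: PowerLawParams,
    p_j: PowerLawParams,
    q: InteractionParams,
    n_params: float,
    tokens_i: float,
    tokens_j: float,
) -> float:
    """Average of the two uni-modal predictions plus the assumed interaction term."""
    base = 0.5 * (
        unimodal_loss(p_i, n_params, tokens_i)
        + unimodal_loss(p_j, n_params, tokens_j)
    )
    return base + interaction_term(q, n_params, tokens_i + tokens_j)
```

Under this assumed form, fitting the interaction coefficients for a modality pair only requires the two uni-modal fits plus mixed-modal runs, which matches the review's point that the interaction is an additive extension of the uni-modal laws.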

Figure 1: Single modality training curves for 100B tokens across a wide range of model sizes. Different modalities exhibit wildly different training dynamics.

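Experimental Results

Research Questions

RQ1: What form does the scaling law take when multiple modalities are trained together?
RQ2: How do interactions between modalities (competition vs. synergy) affect optimal data, model size, and training dynamics?
RQ3: Can the mixed-modal scaling law predict the regimes in which competition between modalities fades and those in which modalities become synergistic?
RQ4: What practical hyperparameter guidelines follow from the combined modality scaling terms?
RQ5: Do mixed-modal models trained at large scale outperform the corresponding uni-modal models on multi-modal tasks?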

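Key Results

Identified a mixed-modal scaling law with an additional term that captures competition and synergy between modalities.
Observed emergent coordinate-ascent style training, in which optimization naturally alternates between modalities.
Presented guidelines for selecting key hyperparameters from the interaction term when the uni-modal optima are known.
Showed that the 30B speech-text model significantly outperforms the corresponding uni-modal models.
Demonstrated that the interaction term accurately predicts the regimes in which competition between modalities is reduced or eliminated (e.g., Speech and Text).
Reported empirical phenomena linking the scaling-law parameters to training stability and optimal batch size.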
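
To make the idea of a regime where competition is eliminated concrete, the sketch below solves for the model size at which an interaction term changes sign. It assumes the term has the form -C + A / N^alpha + B / D^beta, with D the combined token count. This form is an assumption, since this review does not state it. The function name is hypothetical and the code uses no fitted values from the paper.

```python
import math


def break_even_model_size(
    C: float,
    A: float,
    alpha: float,
    B: float,
    beta: float,
    n_tokens_total: float,
) -> float:
    """Model size N at which -C + A / N**alpha + B / D**beta equals zero.

    Under the assumed form, models larger than this size have a negative
    interaction term (synergy) at the given token count. Returns math.inf
    when the data term alone already exceeds C, meaning no model size
    removes the competition at this amount of data.
    """
    residual = C - B / n_tokens_total ** beta
    if residual <= 0:
        return math.inf
    return (A / residual) ** (1.0 / alpha)
```

With fitted coefficients for a modality pair, comparing a planned model size against this threshold gives a rough indication of whether joint training would be expected to help or hurt under the assumed form.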

Figure 2: Empirical scaling properties across both data and model size scale for the uni-modal setting.

This review was generated by AI and reviewed by a human editor.