QUICK REVIEW

[Paper Review] DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining

Sang Michael Xie, Hieu Pham | arXiv (Cornell University) | May 17, 2023

Topic Modeling | Cited by 14

ONE-SENTENCE SUMMARY

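DoReMi uses a small proxy model trained with Group DRO to learn domain weights for the pretraining data, then trains a large language model on the reweighted data. This significantly speeds up training and improves downstream performance without any task-specific fine-tuning.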

ABSTRACT

The mixture proportions of pretraining data domains (e.g., Wikipedia, books, web text) greatly affect language model (LM) performance. In this paper, we propose Domain Reweighting with Minimax Optimization (DoReMi), which first trains a small proxy model using group distributionally robust optimization (Group DRO) over domains to produce domain weights (mixture proportions) without knowledge of downstream tasks. We then resample a dataset with these domain weights and train a larger, full-sized model. In our experiments, we use DoReMi on a 280M-parameter proxy model to set the domain weights for training an 8B-parameter model (30x larger) more efficiently. On The Pile, DoReMi improves perplexity across all domains, even when it downweights a domain. DoReMi improves average few-shot downstream accuracy by 6.5% points over a baseline model trained using The Pile's default domain weights and reaches the baseline accuracy with 2.6x fewer training steps. On the GLaM dataset, DoReMi, which has no knowledge of downstream tasks, even matches the performance of using domain weights tuned on downstream tasks.

MOTIVATION AND OBJECTIVES

Establish that the domain composition of pretraining data strongly affects LM performance on downstream tasks.
Develop a data-driven method to automatically set domain weights without knowledge of downstream tasks.
Demonstrate that reweighting domains with a small model can transfer to a much larger model.
Show that the method improves perplexity across domains and downstream accuracy on standard tasks.

PROPOSED METHOD

Train a small reference model using initial domain weights to establish baseline difficulty per domain.
Use a proxy model trained with Group DRO to optimize domain weights by minimizing worst-case excess loss across domains (relative to the reference model).
Aggregate and average the optimized domain weights over training to obtain final domain weights.
Resample the large target model’s training data using the optimized domain weights and train the full-size model with standard training procedures.
Optionally iterate DoReMi over multiple rounds, using the tuned domain weights from one round as the reference domain weights for the next.
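
The weight-learning and resampling steps above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`update_domain_weights`, `average_domain_weights`, `sample_domains`), the default values of `eta` and `smoothing`, and the toy loss numbers are all hypothetical. Clipping the excess loss at zero and mixing with a uniform distribution are also assumptions not stated in this summary. In a real run the proxy model would be updated between weight updates; here fixed per-domain losses stand in for it.

```python
import numpy as np

def update_domain_weights(alpha, proxy_loss, ref_loss, eta=1.0, smoothing=1e-3):
    """One multiplicative (exponentiated-gradient) step on the domain weights."""
    # Excess loss per domain: how much worse the proxy is than the reference.
    # Clipping at zero is an assumption of this sketch.
    excess = np.maximum(proxy_loss - ref_loss, 0.0)
    # Upweight domains with high excess loss, then renormalize to sum to 1.
    unnormalized = alpha * np.exp(eta * excess)
    alpha_new = unnormalized / unnormalized.sum()
    # Mix with the uniform distribution so no domain weight collapses to zero.
    uniform = np.full_like(alpha_new, 1.0 / alpha_new.size)
    return (1.0 - smoothing) * alpha_new + smoothing * uniform

def average_domain_weights(history):
    """Final weights: the average of the weights seen over all training steps."""
    return np.mean(np.stack(history), axis=0)

def sample_domains(weights, n, seed=0):
    """Draw n domain indices in proportion to the given domain weights."""
    rng = np.random.default_rng(seed)
    return rng.choice(len(weights), size=n, p=weights)

# Toy usage with three hypothetical domains and made-up per-domain losses.
alpha = np.full(3, 1.0 / 3.0)              # initial (uniform) domain weights
ref_loss = np.array([2.0, 2.5, 3.0])       # reference model loss per domain
proxy_losses = [
    np.array([2.4, 2.5, 2.9]),             # proxy loss per domain at step 1
    np.array([2.3, 2.6, 3.0]),             # proxy loss per domain at step 2
]

history = []
for proxy_loss in proxy_losses:
    alpha = update_domain_weights(alpha, proxy_loss, ref_loss)
    history.append(alpha)

final_weights = average_domain_weights(history)
domain_ids = sample_domains(final_weights, n=1000)
```

In this toy run the first domain has the largest excess loss at both steps, so it ends up with the largest final weight. The sampled `domain_ids` indicate which domain each training example would be drawn from when building the resampled dataset for the large model.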

EXPERIMENTAL RESULTS

Research Questions

RQ1: Can a small proxy model, optimized with Group DRO, identify domain weights that improve a much larger LM trained later?
RQ2: Do domain weights found without downstream task knowledge generalize to downstream performance across domains?
RQ3: How does DoReMi affect perplexity on individual domains and overall downstream accuracy on standard few-shot tasks?

Key Findings

DoReMi improves average downstream accuracy by 6.5 percentage points on The Pile for an 8B model, compared to a baseline trained with default domain weights.
The optimized domain weights lead to lower perplexity across all domains on The Pile, even when some domains are downweighted.
DoReMi reaches baseline downstream accuracy 2.6x faster (in training steps) on The Pile.
On the GLaM dataset, iterated DoReMi achieves performance comparable to domain weights tuned on downstream tasks, without using downstream data in the optimization.
DoReMi’s gains are robust across a range of proxy-model sizes and scales of the main model.

This review was generated by AI and reviewed by a human editor.