[Paper Review] OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
TL;DR: The paper builds OPT-IML Bench, a benchmark of 2000 NLP tasks, to study instruction-tuning decisions, and trains OPT-IML 30B and 175B, which demonstrate three levels of generalization on four evaluation benchmarks.
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
Research Motivation and Objectives
- Characterize how instruction-tuning decisions affect downstream generalization when scaling model and benchmark sizes.
- Create OPT-IML Bench, a benchmark of 2000 NLP tasks consolidated from 8 existing benchmarks, to study generalization to held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks.
- Train instruction-tuned OPT-IML models (30B and 175B) and evaluate across diverse benchmarks to establish best practices.
Proposed Method
- Consolidate 8 instruction-tuning benchmarks into OPT-IML Bench, with 1,991 tasks (about 2,000) across 100+ categories.
- Unify instruction formats into a bipartite schema (instructions and outputs) and construct train/validation/test splits with three evaluation settings: fully held-out (tasks from unseen categories), partially held-out (unseen tasks from seen categories), and fully supervised (unseen instances of seen tasks).
- Fine-tune OPT-30B and OPT-175B with a next-token prediction objective that predicts the target (label) sequence conditioned on the source (instruction and input) sequence.
- Pack multiple examples into each training sequence and apply document-attention masking so that tokens attend only within their own example.
- Experiment with dataset mixtures, task diversity, demonstrations, and benchmark proportions to analyze their effects on each generalization level.
- Release the OPT-IML models and the OPT-IML Bench evaluation framework.
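The packing and masking steps above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names and toy token ids are hypothetical, and it assumes that the loss is restricted to target tokens and that the document-attention mask is a causal mask further limited to positions in the same example.

```python
from typing import List, Tuple

def pack_examples(examples: List[Tuple[List[int], List[int]]]):
    """Concatenate (source, target) token-id pairs into one packed sequence.

    Returns the token ids, a per-token example index, and a per-token flag
    that is True only for target tokens (assumed here to be the only tokens
    that contribute to the loss).
    """
    tokens, doc_ids, is_target = [], [], []
    for doc, (source, target) in enumerate(examples):
        tokens += source + target
        doc_ids += [doc] * (len(source) + len(target))
        is_target += [False] * len(source) + [True] * len(target)
    return tokens, doc_ids, is_target

def document_attention_mask(doc_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j:
    j is not in the future (causal) and belongs to the same example."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]

# Two toy examples with made-up token ids: (source, target)
examples = [([11, 12, 13], [21]), ([31, 32], [41, 42])]
tokens, doc_ids, is_target = pack_examples(examples)
mask = document_attention_mask(doc_ids)
```

In this toy case the first token of the second example (position 4) cannot attend to any token of the first example, even though both share one packed sequence.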
Experimental Results
Research Questions
- RQ1: How do scaling factors such as task count, benchmark diversity, and instruction formats affect generalization to held-out task categories, seen categories, and seen tasks?
- RQ2: What are the trade-offs between different instruction-tuning decisions (e.g., demonstrations, reasoning data, dialogue data) when scaling model and benchmark sizes?
- RQ3: How does varying the maximum task-mixing rate (EPS) impact zero-shot and few-shot performance across generalization levels?
- RQ4: What is the impact of benchmark proportioning (different datasets) on cross-benchmark generalization?
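The three generalization levels that these questions refer to can be sketched as a simple rule, assuming every task carries a category label. The function name and the toy task and category names below are hypothetical and are not taken from the OPT-IML Bench code.

```python
def generalization_level(task: str, category: str,
                         train_tasks: set, train_categories: set) -> str:
    """Classify an evaluation task by how much of it was seen in training."""
    if category not in train_categories:
        return "fully held-out"       # the whole category is unseen
    if task not in train_tasks:
        return "partially held-out"   # seen category, unseen task
    return "fully supervised"         # seen task, evaluated on unseen instances

# Hypothetical training inventory
train_tasks = {"sentiment_a", "qa_a"}
train_categories = {"sentiment", "qa"}

levels = {
    "summarization_a": generalization_level(
        "summarization_a", "summarization", train_tasks, train_categories),
    "sentiment_b": generalization_level(
        "sentiment_b", "sentiment", train_tasks, train_categories),
    "qa_a": generalization_level(
        "qa_a", "qa", train_tasks, train_categories),
}
```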
Key Findings
- OPT-IML improves over the base OPT model on all four evaluation benchmarks (PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG) in zero- and few-shot settings.
- Using diverse benchmarks and large task coverage yields better generalization to held-out categories and tasks.
- Raising the maximum mixing rate (EPS) helps up to a threshold, beyond which returns diminish.
- Balancing benchmark proportions can enhance performance in the fully held-out and partially held-out settings, underscoring the value of diverse training data.
- OPT-IML achieves competitive performance compared to models fine-tuned on individual benchmarks.
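To make the role of the maximum mixing rate concrete, here is a minimal sketch of examples-proportional sampling with a per-task cap. It assumes EPS acts as an upper bound on the number of examples a task contributes when sampling rates are computed; that reading, the function name, and the task sizes are assumptions for illustration rather than details confirmed by this review.

```python
def mixing_rates(task_sizes: dict, eps: int) -> dict:
    """Sampling probability per task, proportional to min(task size, eps)."""
    capped = {task: min(size, eps) for task, size in task_sizes.items()}
    total = sum(capped.values())
    return {task: count / total for task, count in capped.items()}

# Hypothetical task sizes (number of training examples)
sizes = {"big_task": 1_000_000, "mid_task": 10_000, "small_task": 1_000}

rates_low = mixing_rates(sizes, eps=1_000)     # small cap: tasks sampled uniformly
rates_high = mixing_rates(sizes, eps=100_000)  # large cap: the big task dominates
```

Under this assumption, a small cap flattens the mixture toward uniform task sampling, while a large cap lets the largest tasks dominate.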
This review was generated by AI and checked by a human editor.