[Paper Explained] OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization
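This paper builds OPT-IML Bench from 2000 NLP tasks to study instruction-tuning decisions, and trains OPT-IML 30B and 175B, which demonstrate three levels of generalization across multiple benchmarks.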
Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.
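Research Motivation and Goals
- Characterize how instruction-tuning decisions affect downstream generalization as both model size and benchmark size are scaled up.
- Create OPT-IML Bench, a large NLP benchmark of 2000 tasks consolidated from 8 existing benchmarks, to study generalization across task categories, within a category, and at the instance level.
- Train instruction-tuned OPT-IML models (30B and 175B) and evaluate them on diverse benchmarks to establish best practices.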
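Proposed Method
- Consolidate 8 instruction-tuning benchmarks into OPT-IML Bench, covering roughly 1,991 tasks across more than 100 categories.
- Unify the instruction format into a two-part schema (instruction and output), and build train/validation/test splits with fully held-out, partially held-out, and fully supervised evaluation settings.
- Fine-tune OPT-30B and OPT-175B with a next-token prediction objective, predicting the target (label) sequence conditioned on the source (instruction/input) sequence.
- Use document attention masking when packing sequences, so that each packed example attends only to its own tokens.
- Run experiments on dataset mixtures, task diversity, in-context demonstrations, and benchmark proportions to analyze their effect on each level of generalization.
- Release the OPT-IML models and the OPT-IML Bench evaluation framework.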
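The sequence packing, document attention masking, and source/target conditioning described above can be illustrated with a minimal pure-Python sketch. This is not the authors' training code: the function names `pack_examples` and `document_causal_mask`, the toy token ids, and the choice to mark only target tokens for the loss are assumptions made for illustration.

```python
from typing import List, Tuple


def pack_examples(examples: List[Tuple[List[int], List[int]]]):
    """Pack (source_ids, target_ids) pairs into one training sequence.

    Returns the packed token ids, a document id for every position, and a
    loss mask that is True only on target tokens (assumed here: source
    tokens serve as context and do not contribute to the loss).
    """
    tokens, doc_ids, loss_mask = [], [], []
    for doc, (source, target) in enumerate(examples):
        tokens += source + target
        doc_ids += [doc] * (len(source) + len(target))
        loss_mask += [False] * len(source) + [True] * len(target)
    return tokens, doc_ids, loss_mask


def document_causal_mask(doc_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j.

    Two conditions must hold: j is not in the future (causal), and both
    positions belong to the same packed example (document masking).
    """
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]


# Two toy examples packed into one sequence of six tokens.
examples = [([11, 12], [13]), ([21], [22, 23])]
tokens, doc_ids, loss_mask = pack_examples(examples)
mask = document_causal_mask(doc_ids)
```

In this sketch the first token of the second example (position 3) can attend only to itself, so the packed neighbour cannot leak into its context.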
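Experimental Results
Research Questions
- RQ1: How do scaling factors such as the number of tasks, benchmark diversity, and instruction format affect generalization to held-out categories, seen categories, and seen tasks?
- RQ2: What trade-offs exist between different instruction-tuning decisions (such as demonstrations, reasoning data, and dialogue data) when scaling model and benchmark size?
- RQ3: How does changing the maximum task mixing rate (EPS) affect zero-shot and few-shot performance across the different levels of generalization?
- RQ4: How does the allocation of benchmark proportions (across the different datasets) affect cross-benchmark generalization?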
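To make the mixing-rate question in RQ3 concrete, here is a minimal sketch of capped proportional mixing. It assumes that EPS behaves like examples-proportional sampling with a maximum size cap, where each task's sampling rate is proportional to its size after clipping at the cap; the summary does not spell out the exact formula, and the function name `capped_mixing_rates` and the task sizes are hypothetical.

```python
from typing import Dict


def capped_mixing_rates(task_sizes: Dict[str, int], eps: int) -> Dict[str, float]:
    """Return a sampling rate per task, proportional to min(size, eps).

    Assumption: `eps` is a cap on the number of examples a single task may
    contribute when computing proportions, so very large tasks cannot
    dominate the mixture.
    """
    capped = {name: min(size, eps) for name, size in task_sizes.items()}
    total = sum(capped.values())
    if total == 0:
        raise ValueError("at least one task must have examples")
    return {name: count / total for name, count in capped.items()}


# Hypothetical task sizes, for illustration only.
sizes = {"big_task": 100_000, "medium_task": 4_000, "small_task": 1_000}
rates = capped_mixing_rates(sizes, eps=4_000)
uncapped = capped_mixing_rates(sizes, eps=10**9)
```

With the cap at 4,000 the large and medium tasks are sampled at the same rate, whereas without an effective cap the large task takes most of the mixture. Lowering the cap pushes the mixture toward uniform sampling over tasks.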
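Key Findings
- In both zero-shot and few-shot settings, OPT-IML improves over the base OPT model on all four instruction-tuning benchmarks.
- Training on diverse benchmarks with broad task coverage improves generalization to held-out categories and held-out tasks.
- Varying the maximum mixing rate (EPS) helps up to a threshold, with diminishing returns beyond certain values.
- Balancing benchmark proportions improves performance in the held-out and partially supervised settings, underscoring the value of diverse training data.
- OPT-IML is competitive with models fine-tuned on each individual benchmark.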
This summary was generated by AI and reviewed by human editors.