QUICK REVIEW

[Paper Digest] OPT-IML: Scaling Language Model Instruction Meta Learning through the Lens of Generalization

Srinivasan Iyer, Xi Victoria Lin|arXiv (Cornell University)|Dec 22, 2022

Topic Modeling | Cited by 85

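One-Sentence Summary

This paper builds OPT-IML Bench from about 2000 NLP tasks to study instruction-tuning decisions, then trains OPT-IML 30B and 175B, which demonstrate three levels of generalization across multiple benchmarks.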

ABSTRACT

Recent work has shown that fine-tuning large pre-trained language models on a collection of tasks described via instructions, a.k.a. instruction-tuning, improves their zero and few-shot generalization to unseen tasks. However, there is a limited understanding of the performance trade-offs of different decisions made during the instruction-tuning process. These decisions include the scale and diversity of the instruction-tuning benchmark, different task sampling strategies, fine-tuning with and without demonstrations, training using specialized datasets for reasoning and dialogue, and finally, the fine-tuning objectives themselves. In this paper, we characterize the effect of instruction-tuning decisions on downstream task performance when scaling both model and benchmark sizes. To this end, we create OPT-IML Bench: a large benchmark for Instruction Meta-Learning (IML) of 2000 NLP tasks consolidated into task categories from 8 existing benchmarks, and prepare an evaluation framework to measure three types of model generalizations: to tasks from fully held-out categories, to held-out tasks from seen categories, and to held-out instances from seen tasks. Through the lens of this framework, we first present insights about instruction-tuning decisions as applied to OPT-30B and further exploit these insights to train OPT-IML 30B and 175B, which are instruction-tuned versions of OPT. OPT-IML demonstrates all three generalization abilities at both scales on four different evaluation benchmarks with diverse tasks and input formats -- PromptSource, FLAN, Super-NaturalInstructions, and UnifiedSKG. Not only does it significantly outperform OPT on all benchmarks but is also highly competitive with existing models fine-tuned on each specific benchmark. We release OPT-IML at both scales, together with the OPT-IML Bench evaluation framework.

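Motivation and Objectives

Characterize how instruction-tuning decisions affect downstream generalization as both model size and benchmark size scale up.
Create OPT-IML Bench, a large NLP benchmark of 2000 tasks consolidated from 8 existing benchmarks, to study generalization to held-out categories, to held-out tasks within seen categories, and to held-out instances of seen tasks.
Train instruction-tuned OPT-IML models (30B and 175B) and evaluate them on diverse benchmarks to establish best practices.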
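The three generalization levels can be made concrete with a small sketch. The task and category names below are hypothetical, and the function only illustrates the definitions given in the abstract; it is not the OPT-IML Bench code.

```python
from typing import Set


def generalization_level(task: str, category: str,
                         train_tasks: Set[str],
                         train_categories: Set[str]) -> str:
    """Label an evaluation task by how much of it was seen in training."""
    if category not in train_categories:
        # Nothing from this category appeared in training.
        return "fully held-out category"
    if task not in train_tasks:
        # The category was seen, but this particular task was not.
        return "held-out task from a seen category"
    # The task itself was trained on; only the test instances are new.
    return "held-out instances from a seen task"


# Hypothetical training mix (names are illustrative only).
train_categories = {"sentiment", "qa"}
train_tasks = {"sentiment_a", "qa_a"}

for task, category in [("entail_a", "entailment"),
                       ("sentiment_b", "sentiment"),
                       ("qa_a", "qa")]:
    print(task, "->", generalization_level(task, category,
                                           train_tasks, train_categories))
```

Keeping the category check first means a task is only counted as "seen category" when at least one sibling task from that category was in the training mix.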

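Proposed Method

Consolidate 8 instruction-tuning benchmarks into OPT-IML Bench, covering ~1991 tasks across 100+ categories.
Unify instruction formats into a two-part schema (instruction and output), and build train/validation/test splits with fully held-out, partially held-out, and fully supervised evaluation settings.
Fine-tune OPT-30B and OPT-175B with a next-token prediction objective, predicting the target (label) sequence conditioned on the source (instruction/input) sequence.
When packing multiple examples into one sequence, apply document attention masking so that each example attends only to its own tokens.
Run experiments on dataset mixing, task diversity, demonstrations, and benchmark proportions to analyze their effect on each generalization level.
Release the OPT-IML models and the OPT-IML Bench evaluation framework.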
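A minimal sketch of sequence packing with a document attention mask, in plain Python. The function names and the toy token ids are hypothetical. The `is_target` flags mark which tokens would contribute to the loss under the assumption that only target tokens are scored; the summary above does not spell this detail out, so treat it as an illustration rather than the authors' implementation.

```python
from typing import List, Tuple


def pack_examples(examples: List[Tuple[List[int], List[int]]]):
    """Concatenate (source, target) token-id pairs into one packed sequence.

    Returns the packed tokens, a per-token example index, and a per-token
    flag that is True for target tokens (assumed to be the scored tokens).
    """
    tokens: List[int] = []
    doc_ids: List[int] = []
    is_target: List[bool] = []
    for doc, (source, target) in enumerate(examples):
        tokens += source + target
        doc_ids += [doc] * (len(source) + len(target))
        is_target += [False] * len(source) + [True] * len(target)
    return tokens, doc_ids, is_target


def document_causal_mask(doc_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j:
    j is not in the future and both positions belong to the same example."""
    n = len(doc_ids)
    return [[j <= i and doc_ids[i] == doc_ids[j] for j in range(n)]
            for i in range(n)]


# Two toy examples packed into a single sequence of six tokens.
tokens, doc_ids, is_target = pack_examples([([1, 2], [3]), ([4], [5, 6])])
mask = document_causal_mask(doc_ids)
for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

The mask is block-diagonal and lower-triangular within each block, so the second example never sees tokens from the first even though they share one packed sequence.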

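Experimental Results

Research Questions

RQ1: How do scaling factors such as the number of tasks, benchmark diversity, and instruction format affect generalization to held-out categories, seen categories, and seen tasks?
RQ2: What trade-offs exist among instruction-tuning decisions (e.g., demonstrations, reasoning data, dialogue data) when scaling both model and benchmark size?
RQ3: How does changing the maximum task mixing rate (EPS) affect zero-shot and few-shot performance across the generalization levels?
RQ4: How does the allocation of benchmark proportions (across datasets) affect cross-benchmark generalization?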
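To make RQ3 concrete, here is a hedged sketch of examples-proportional sampling with a per-task cap. It assumes that EPS acts as a maximum size applied to each task before computing sampling proportions, in the style of T5-like mixing; the function name, task names, and sizes are hypothetical.

```python
from typing import Dict


def capped_proportional_weights(task_sizes: Dict[str, int],
                                cap: int) -> Dict[str, float]:
    """Sampling probability per task, proportional to min(size, cap)."""
    capped = {name: min(size, cap) for name, size in task_sizes.items()}
    total = sum(capped.values())
    return {name: count / total for name, count in capped.items()}


# Hypothetical task sizes spanning several orders of magnitude.
sizes = {"big_task": 1_000_000, "mid_task": 10_000, "small_task": 500}

for cap in (512, 4096, 10**9):
    weights = capped_proportional_weights(sizes, cap)
    print(cap, {name: round(w, 3) for name, w in weights.items()})
```

A small cap pushes the mix toward uniform sampling over tasks, while a very large cap lets the biggest tasks dominate; the research question is where between these extremes generalization is best.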

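Key Findings

In both zero-shot and few-shot settings, OPT-IML improves over the base OPT model on all four instruction-tuning benchmarks.
Training on diverse benchmarks with broad task coverage improves generalization to held-out categories and tasks.
Raising the maximum mixing rate (EPS) helps up to a threshold, with diminishing returns beyond certain values.
Balancing benchmark proportions improves performance in the held-out and partially supervised settings, underscoring the value of diverse training data.
OPT-IML is competitive with models fine-tuned on each individual benchmark.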

This digest was generated by AI and reviewed by human editors.