QUICK REVIEW

[Paper Review] Scaling Instruction-Finetuned Language Models

Hyung Won Chung, Le Hou|arXiv (Cornell University)|Oct 20, 2022

Topic Modeling | Cited by 1,182

One-Sentence Summary

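This paper shows that instruction finetuning scales with both the number of tasks and model size, and that including chain-of-thought data substantially improves reasoning, yielding state-of-the-art results (e.g., Flan-PaLM 540B) and strong open-ended generation quality.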

ABSTRACT

Finetuning language models on a collection of datasets phrased as instructions has been shown to improve model performance and generalization to unseen tasks. In this paper we explore instruction finetuning with a particular focus on (1) scaling the number of tasks, (2) scaling the model size, and (3) finetuning on chain-of-thought data. We find that instruction finetuning with the above aspects dramatically improves performance on a variety of model classes (PaLM, T5, U-PaLM), prompting setups (zero-shot, few-shot, CoT), and evaluation benchmarks (MMLU, BBH, TyDiQA, MGSM, open-ended generation). For instance, Flan-PaLM 540B instruction-finetuned on 1.8K tasks outperforms PaLM 540B by a large margin (+9.4% on average). Flan-PaLM 540B achieves state-of-the-art performance on several benchmarks, such as 75.2% on five-shot MMLU. We also publicly release Flan-T5 checkpoints, which achieve strong few-shot performance even compared to much larger models, such as PaLM 62B. Overall, instruction finetuning is a general method for improving the performance and usability of pretrained language models.

Motivation and Objectives

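Enable generalization to unseen tasks through instruction-based finetuning.
Study how the number of finetuning tasks affects performance at different model sizes.
Evaluate how including chain-of-thought data in finetuning affects reasoning tasks.
Demonstrate that instruction finetuning applies across model families: PaLM, T5, and U-PaLM.
Assess the usability and responsible-AI aspects of instruction-finetuned models.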

Proposed Method

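Finetune several model families (T5, PaLM, U-PaLM) on a large mixture of instruction-finetuning tasks, 1,836 tasks in total, drawn from Muffin, T0-SF, NIV2, and CoT data.
Use packing to combine multiple training examples into a single sequence delimited by end-of-sequence tokens.
Prepend instruction templates to the inputs and apply masking at the boundaries between packed examples; train with the Adafactor optimizer and a constant learning-rate schedule.
At evaluation time, test zero-shot, few-shot, and chain-of-thought (CoT) prompting setups.
Include a dedicated CoT finetuning mixture of nine datasets with manually written CoT annotations to study the effect on reasoning.
Evaluate on held-out benchmarks (MMLU, BBH, TyDiQA, MGSM) and with human evaluation of open-ended generation.
Compare model sizes (8B, 62B, 540B) and model families (Flan-T5, Flan-PaLM, cont-PaLM, U-PaLM).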
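
The packing and boundary-masking steps listed above can be illustrated with a minimal sketch. This is not the authors' implementation: the token id `EOS_ID`, the helper names `pack_examples` and `boundary_mask`, and the choice to end every example with an end-of-sequence token are assumptions made for illustration. The sketch covers only the boundary mask; a decoder-only model would also need a causal mask on top of it.

```python
from typing import List, Tuple

EOS_ID = 1  # hypothetical end-of-sequence token id (assumption)


def pack_examples(examples: List[List[int]], max_len: int) -> Tuple[List[int], List[int]]:
    """Greedily pack tokenized examples into one sequence.

    Returns (tokens, segment_ids). Each example is terminated by EOS_ID,
    and all of its tokens share one segment id (1, 2, ...).
    """
    tokens: List[int] = []
    segment_ids: List[int] = []
    for seg, example in enumerate(examples, start=1):
        piece = example + [EOS_ID]
        if len(tokens) + len(piece) > max_len:
            break  # a real pipeline would start a new packed sequence here
        tokens.extend(piece)
        segment_ids.extend([seg] * len(piece))
    return tokens, segment_ids


def boundary_mask(segment_ids: List[int]) -> List[List[bool]]:
    """mask[i][j] is True when position i may attend to position j,
    i.e. when both positions belong to the same packed example."""
    return [[a == b for b in segment_ids] for a in segment_ids]


if __name__ == "__main__":
    tokens, segments = pack_examples([[5, 6], [7]], max_len=8)
    print(tokens)    # [5, 6, 1, 7, 1]
    print(segments)  # [1, 1, 1, 2, 2]
    mask = boundary_mask(segments)
    print(mask[0][3])  # False: token 0 cannot attend across the boundary
```

Segment ids are one simple way to express the boundary: any pair of positions with different ids is masked out, so packed examples cannot see each other even though they share a sequence.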

Experimental Results

Research Questions

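RQ1: Does instruction finetuning yield gains as the number of tasks and the model size increase?
RQ2: How does including chain-of-thought data in finetuning affect reasoning ability on held-out tasks?
RQ3: Does combining CoT finetuning with non-CoT tasks degrade performance on non-CoT tasks?
RQ4: Does instruction finetuning generalize across architectures and pretraining objectives?
RQ5: What practical effect does instruction finetuning have on usability in open-ended generation and on responsible-AI metrics?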

Key Findings

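Instruction finetuning yields substantial gains across model sizes and prompting setups, with improvements on unseen benchmarks ranging from 9.4% to 15.5%.
Increasing the number of finetuning tasks improves performance, although at the scales shown the gains largely saturate after about 282 tasks.
Scaling the model from 8B to 540B brings substantial gains for both finetuned and non-finetuned models.
Adding nine CoT datasets to the finetuning mixture yields robust CoT reasoning and state-of-the-art results (e.g., Flan-PaLM 540B reaches 75.2% on MMLU with CoT + self-consistency).
Jointly finetuning on CoT and non-CoT data substantially improves CoT performance while preserving non-CoT performance.
Combining CoT prompting with self-consistency yields strong gains and enables zero-shot CoT reasoning on challenging tasks.
Flan models outperform their non-instruction-finetuned counterparts on many tasks, with strong zero-shot and few-shot capabilities.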
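
Self-consistency, as the technique is commonly described, samples several chain-of-thought completions for the same prompt and keeps the most frequent final answer. The sketch below shows only that voting step. The helper names `extract_answer` and `self_consistency`, and the assumption that each completion ends with a phrase like "the answer is X", are illustrative and not taken from the paper; sampling from a model is left out entirely.

```python
import re
from collections import Counter
from typing import List, Optional


def extract_answer(completion: str) -> Optional[str]:
    """Pull the final answer out of a chain-of-thought completion.

    Assumes (hypothetically) the completion contains 'answer is X' or
    'answer is (X)'. The pattern stops at a period, so decimal answers
    would need a different pattern.
    """
    match = re.search(r"answer is\s*\(?([^\s().]+)\)?", completion, flags=re.IGNORECASE)
    return match.group(1) if match else None


def self_consistency(completions: List[str]) -> Optional[str]:
    """Majority vote over the answers extracted from sampled completions."""
    answers = [a for a in map(extract_answer, completions) if a is not None]
    if not answers:
        return None
    return Counter(answers).most_common(1)[0][0]


if __name__ == "__main__":
    samples = [
        "Step 1 ... Step 2 ... The answer is (B).",
        "Reasoning differently, the answer is (A).",
        "So the answer is B.",
        "This sample never states a final answer.",
    ]
    print(self_consistency(samples))  # B
```

Completions without a parsable answer are dropped before voting, so one malformed sample does not affect the result.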


This review was generated by AI and reviewed by human editors.