QUICK REVIEW

[Paper Review] AlpaGasus: Training A Better Alpaca with Fewer Data

Lichang Chen, Shiyang Li | arXiv (Cornell University) | Jul 17, 2023

Video Analysis and Summarization | Cited by 15

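One-sentence summary

AlpaGasus uses a strong large language model as an automatic grader to select high-quality instruction data from Alpaca's 52k dataset; fine-tuning on the resulting 9k subset yields a model that outperforms Alpaca and trains faster.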

ABSTRACT

Large language models (LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches $>90\%$ performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. Our project page is available at: https://lichang-chen.github.io/AlpaGasus/

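Motivation and objectives

Emphasize data quality over data quantity in instruction fine-tuning (IFT) of large language models.
Propose an automated data-filtering method that uses a strong LLM as the evaluator to improve IFT data quality.
Show that a smaller, high-quality subset can outperform a larger, noisier dataset on instruction-following tasks.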

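Proposed method

Define a rating prompt for a strong LLM (e.g., ChatGPT) that scores each (instruction, input, response) triplet along the accuracy dimension.
Filter Alpaca's 52k data by thresholding the scores, yielding a 9,229-sample subset used to fine-tune AlpaGasus.
Train the base models (LLaMA family) on the filtered 9k data using the same IFT pipeline as Alpaca.
Compare AlpaGasus with Alpaca on multiple test sets and benchmarks, using GPT-4 as the judge.
Run a human evaluation to validate the model comparison, and evaluate at the level of task categories such as Generic, Roleplay, Knowledge, and Commonsense.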
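
The scoring-and-thresholding steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt wording, the `Triplet` type, the function names, the `toy_scorer` stand-in for the LLM grader, and the threshold value are all hypothetical, since this review does not give them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Triplet:
    """One (instruction, input, response) training example."""
    instruction: str
    input: str
    response: str


def build_rating_prompt(t: Triplet) -> str:
    # Hypothetical wording: the actual rating prompt is not quoted in this review.
    return (
        "Rate the accuracy of the response to the instruction and input below.\n"
        f"Instruction: {t.instruction}\n"
        f"Input: {t.input}\n"
        f"Response: {t.response}\n"
        "Reply with a single numeric score."
    )


def filter_by_score(
    data: Iterable[Triplet],
    scorer: Callable[[str], float],
    threshold: float,
) -> List[Triplet]:
    """Keep only the triplets whose grader score is at least `threshold`."""
    kept = []
    for t in data:
        score = scorer(build_rating_prompt(t))
        if score >= threshold:
            kept.append(t)
    return kept


def toy_scorer(prompt: str) -> float:
    # Stand-in for a call to a strong LLM grader (assumption): it gives a low
    # score to an obviously unhelpful response and a high score otherwise.
    return 2.0 if "I don't know" in prompt else 5.0


if __name__ == "__main__":
    sample = [
        Triplet("Name the capital of France.", "", "Paris"),
        Triplet("Add the numbers.", "2 and 3", "I don't know"),
    ]
    # The threshold here is chosen for illustration only.
    for t in filter_by_score(sample, toy_scorer, threshold=4.5):
        print(t.instruction, "->", t.response)
```

Passing the scorer in as a function keeps the filter independent of any particular LLM API, so the grader could be swapped (for example ChatGPT versus Claude-2) without touching the filtering logic.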

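Experimental results

Research questions

RQ1: With less data, does automatic data-quality assessment by a strong LLM improve instruction-following performance?
RQ2: When trained on the filtered high-quality data, how does AlpaGasus compare with Alpaca and other baselines across test sets and benchmarks?
RQ3: Across model sizes and base architectures, does data quality matter more than data quantity for IFT?
RQ4: Do the results hold broadly across different LLM filters, base models, and data types (machine-generated vs. human-written)?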

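Key findings

AlpaGasus, trained on 9k high-quality examples, significantly outperforms Alpaca, trained on 52k examples, on four test sets (Vicuna, Koala, Self-Instruct, WizardLM).
The 13B AlpaGasus model reaches more than 90% of the performance of its teacher, Text-Davinci-003, on the test tasks.
AlpaGasus trains 5.7x faster, cutting 7B training time from 80 minutes to 14 minutes on 4× NVIDIA A100 (80GB) GPUs.
With GPT-4 as the judge, AlpaGasus frequently outperforms Alpaca across multiple benchmarks, and the human study supports this result.
The data filtering generalizes across base models (LLaMA-1 and LLaMA-2) and across LLM filters (ChatGPT and Claude-2).
Data-quality filtering brings substantial cost savings and faster iteration without sacrificing instruction-following performance.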
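
As a quick check on the figures above, the reported speedup and the retained fraction of the data follow from simple arithmetic. Treating "52k" as exactly 52,000 is an assumption made here for illustration.

```latex
\[
\text{speedup} = \frac{80\ \text{min}}{14\ \text{min}} \approx 5.71,
\qquad
\text{retained fraction} = \frac{9{,}229}{52{,}000} \approx 0.177,
\qquad
\text{data reduction} = \frac{52{,}000}{9{,}229} \approx 5.63
\]
```

Under that assumption, roughly 17.7% of the data is kept, and the 5.7x training speedup roughly tracks the 5.6x reduction in dataset size.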


This review was generated by AI and reviewed by a human editor.