QUICK REVIEW

[Paper Review] WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qing‐Feng Sun|arXiv (Cornell University)|Apr 24, 2023

Topic Modeling|Cited by 107

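One-sentence summary

WizardLM shows that AI-generated, progressively evolved instructions (Evol-Instruct) can train LLaMA-7B to handle complex open-domain tasks, outperforming models trained on some human-created instruction sets and staying competitive with ChatGPT on high-difficulty instructions. GPT-4 evaluation shows WizardLM reaching more than 90% of ChatGPT's capacity on most skills (17 of 29), while gaps remain in code, math, and reasoning.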

ABSTRACT

Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM

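Research motivation and objectives

Demonstrate that AI-generated instruction data can scale up and diversify the training of instruction-following language models.
Show that instructions generated by Evol-Instruct can surpass human-created instruction data in quality and difficulty.
Evaluate WizardLM against baselines and ChatGPT using both human and GPT-4-based evaluation.
Analyze the difficulty, breadth, and quality of the evolved instructions and their effect on model performance.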

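Proposed method

Propose Evol-Instruct, which has two components: an Instruction Evolver (in-depth and in-breadth evolving) and an Instruction Eliminator (which filters out failed evolutions).
Iteratively evolve an initial seed instruction set over multiple generations, generating the corresponding model response in each round.
Fine-tune the open-source LLaMA-7B on the mixed evolved instructions to create WizardLM, with a dataset size comparable to Vicuna's for a fair comparison.
Evaluate the model through human evaluation on a difficulty-balanced Evol-Instruct test set and the Vicuna test set, plus automatic evaluation with GPT-4.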
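
The evolve-then-filter loop described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the prompt templates, the `is_failed` heuristics, the choice to retry a failed parent instruction in the next round, and the `toy_llm` stand-in are all hypothetical. A real run would pass a function that calls an actual LLM.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
IN_DEPTH_PROMPT = "Rewrite the instruction below into a more complex version:\n{instruction}"
IN_BREADTH_PROMPT = "Write a new, rarer instruction in the same domain as:\n{instruction}"


def evolve(instruction: str, llm: Callable[[str], str], rng: random.Random) -> str:
    """Instruction Evolver: pick in-depth or in-breadth evolving at random."""
    template = rng.choice([IN_DEPTH_PROMPT, IN_BREADTH_PROMPT])
    return llm(template.format(instruction=instruction))


def is_failed(original: str, evolved: str, response: str) -> bool:
    """Instruction Eliminator: assumed heuristics for a failed evolution."""
    if not evolved.strip() or evolved.strip() == original.strip():
        return True  # nothing new was produced
    if not response.strip():
        return True  # the model could not answer the evolved instruction
    return False


def evol_instruct(
    seeds: List[str], llm: Callable[[str], str], generations: int, seed: int = 0
) -> List[Tuple[str, str]]:
    """Evolve the seed pool for several generations and collect (instruction, response) pairs."""
    rng = random.Random(seed)
    pool = list(seeds)
    dataset: List[Tuple[str, str]] = []
    for _ in range(generations):
        next_pool = []
        for instruction in pool:
            evolved = evolve(instruction, llm, rng)
            response = llm(evolved)
            if is_failed(instruction, evolved, response):
                next_pool.append(instruction)  # assumption: retry the parent next round
            else:
                dataset.append((evolved, response))
                next_pool.append(evolved)
        pool = next_pool
    return dataset


def toy_llm(prompt: str) -> str:
    """Stand-in for a real LLM call: echoes the last prompt line with a suffix."""
    return prompt.splitlines()[-1] + " (step by step)"


demo = evol_instruct(["Add 1 and 2."], toy_llm, generations=2)
```

Mixing the pairs from every generation, as `dataset` does here, mirrors the summary's point that the fine-tuning data spans several complexity levels rather than only the final, hardest round.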

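Experimental results

Research questions

RQ1: Can AI-generated, progressively evolved instructions outperform human-created instruction datasets for open-domain instruction-following models?
RQ2: How does WizardLM perform on high-difficulty instructions compared with Alpaca, Vicuna, and ChatGPT?
RQ3: Under GPT-4 evaluation, how does WizardLM perform across different skills and difficulty levels?
RQ4: Do evolved instructions offer greater diversity and depth than human-written prompts?
RQ5: What are the limitations and practical implications of AI-evolved instruction data for future LLM fine-tuning?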

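Key findings

In human evaluation on the Evol-Instruct test set, Evol-Instruct instructions outperform ShareGPT-based human instructions.
WizardLM, trained on 70k Evol-Instruct examples, outperforms Vicuna-7B in human evaluation on both the Evol-Instruct test set and the Vicuna test set.
On high-difficulty prompts (the high-difficulty subset of the Evol-Instruct test set), human judges prefer WizardLM over ChatGPT.
GPT-4 automatic evaluation shows WizardLM reaching a substantial share of ChatGPT's capacity (for example, more than 90% on 17 of 29 skills) and outperforming Alpaca-7B and Vicuna-7B on the Evol-Instruct test set.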
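
To make the "more than 90% on 17 of 29 skills" figure concrete, the sketch below shows one way such a per-skill count could be computed. It assumes that "capacity" means the ratio of a model's GPT-4-assigned score to ChatGPT's score on the same skill. The skill names and scores are made-up toy values, not results from the paper.

```python
from typing import Dict, List


def relative_capacity(
    model_scores: Dict[str, float], reference_scores: Dict[str, float]
) -> Dict[str, float]:
    """Per-skill ratio of the model's score to the reference model's score."""
    return {
        skill: model_scores[skill] / reference_scores[skill]
        for skill in reference_scores
    }


def skills_above(ratios: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Skills on which the model exceeds the given share of the reference."""
    return sorted(skill for skill, ratio in ratios.items() if ratio > threshold)


# Toy scores for illustration only; these are NOT the paper's numbers.
toy_model = {"writing": 8.5, "math": 5.0, "code": 6.0, "roleplay": 9.0}
toy_reference = {"writing": 9.0, "math": 8.0, "code": 9.0, "roleplay": 9.0}

toy_ratios = relative_capacity(toy_model, toy_reference)
toy_strong_skills = skills_above(toy_ratios)
```

With these toy values, two of the four skills clear the 90% threshold. The paper's headline number is the same kind of count taken over 29 skills.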

This review was generated by AI and reviewed by a human editor.