QUICK REVIEW

[Paper Review] WizardLM: Empowering large pre-trained language models to follow complex instructions

Can Xu, Qing‐Feng Sun|arXiv (Cornell University)|Apr 24, 2023

Topic Modeling | Cited by 107

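One-line summary

WizardLM shows that AI-generated, iteratively evolved instructions (Evol-Instruct) can be used to train LLaMA-7B to follow complex open-domain instructions, outperforming some human-written instruction sets and approaching ChatGPT on high-difficulty scenarios. In GPT-4 evaluation, WizardLM reaches near parity with ChatGPT on many skills, while gaps remain in code, math, and reasoning.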

ABSTRACT

Training large language models (LLMs) with open-domain instruction following data brings colossal success. However, manually creating such instruction data is very time-consuming and labor-intensive. Moreover, humans may struggle to produce high-complexity instructions. In this paper, we show an avenue for creating large amounts of instruction data with varying levels of complexity using LLM instead of humans. Starting with an initial set of instructions, we use our proposed Evol-Instruct to rewrite them step by step into more complex instructions. Then, we mix all generated instruction data to fine-tune LLaMA. We call the resulting model WizardLM. Human evaluations on a complexity-balanced test bed and Vicuna's testset show that instructions from Evol-Instruct are superior to human-created ones. By analyzing the human evaluation results of the high complexity part, we demonstrate that outputs from our WizardLM are preferred to outputs from OpenAI ChatGPT. In GPT-4 automatic evaluation, WizardLM achieves more than 90% capacity of ChatGPT on 17 out of 29 skills. Even though WizardLM still lags behind ChatGPT in some aspects, our findings suggest that fine-tuning with AI-evolved instructions is a promising direction for enhancing LLMs. Our code and data are public at https://github.com/nlpxucan/WizardLM

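Motivation and objectives

Show that AI-generated instruction data can scale and diversify the training of instruction-following language models.
Show that instructions generated by Evol-Instruct surpass human-written instruction data in quality and difficulty.
Evaluate WizardLM against baseline models and ChatGPT using both human evaluation and GPT-4-based evaluation.
Analyze the difficulty, breadth, and quality of the evolved instructions and how they affect model performance.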

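Proposed method

Propose Evol-Instruct, which has two components: the Instruction Evolver (evolves instructions in both depth and breadth) and the Instruction Eliminator (filters out failed evolutions).
Iteratively evolve an initial seed instruction set over multiple generations, generating the corresponding model responses at each generation.
Fine-tune the open-source LLaMA-7B on the mix of evolved instructions to create WizardLM. For a fair comparison, use a dataset size comparable to Vicuna's.
Evaluate the models on the difficulty-balanced Evol-Instruct test set and the Vicuna test set, using human evaluation and GPT-4 automatic evaluation.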
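
The evolve-then-filter loop described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the prompt templates, the `is_failed` heuristics, the choice to carry an unevolved instruction forward after a failed evolution, and the `llm` callable (any function mapping a prompt string to a completion string) are all assumptions made for this sketch.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical prompt templates; the paper's actual prompts are not reproduced here.
DEPTH_PROMPT = "Rewrite the following instruction to make it more complex:\n{instruction}"
BREADTH_PROMPT = "Write a new, rarer instruction in the same domain as:\n{instruction}"

# Assumed interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]


def evolve(instruction: str, llm: LLM, rng: random.Random) -> str:
    """Evolver step: pick a depth or breadth prompt at random and ask the LLM to rewrite."""
    template = rng.choice([DEPTH_PROMPT, BREADTH_PROMPT])
    return llm(template.format(instruction=instruction))


def is_failed(original: str, evolved: str, response: str) -> bool:
    """Eliminator step with placeholder heuristics (assumed, not the paper's rules)."""
    if not evolved.strip() or evolved.strip() == original.strip():
        return True  # nothing new was produced
    if not response.strip():
        return True  # the model could not answer the evolved instruction
    return False


def evol_instruct(seeds: List[str], llm: LLM, generations: int,
                  seed: int = 0) -> List[Tuple[str, str]]:
    """Evolve the seed set for several generations and collect (instruction, response) pairs."""
    rng = random.Random(seed)
    current = list(seeds)
    dataset: List[Tuple[str, str]] = []
    for _ in range(generations):
        next_generation = []
        for instruction in current:
            evolved = evolve(instruction, llm, rng)
            response = llm(evolved)
            if is_failed(instruction, evolved, response):
                # Assumption: a failed evolution carries the old instruction forward.
                next_generation.append(instruction)
            else:
                next_generation.append(evolved)
                dataset.append((evolved, response))
        current = next_generation
    return dataset
```

The pairs returned from all generations would then be mixed into a single fine-tuning set. Passing a stub function as `llm` is enough to exercise the control flow without calling a real model.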

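Experimental results

Research questions

RQ1: Can AI-generated, iteratively evolved instructions outperform human-written instruction datasets for training open-domain instruction-following models?
RQ2: How does WizardLM compare with Alpaca, Vicuna, and ChatGPT on high-difficulty instructions?
RQ3: How does WizardLM perform across skills and difficulty levels when evaluated by GPT-4?
RQ4: Do evolved instructions offer greater diversity and depth than human-written prompts?
RQ5: What are the limitations and practical implications of AI-evolved instruction data for future LLM fine-tuning?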

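Key findings

Evol-Instruct instructions outperform ShareGPT-based human instructions in human evaluation on the Evol-Instruct test set.
WizardLM, trained on 70k Evol-Instruct examples, outperforms Vicuna-7B in human evaluation on both the Evol-Instruct test set and the Vicuna test set.
On high-difficulty prompts (the high-difficulty subset of the Evol-Instruct test set), human judges prefer WizardLM to ChatGPT.
In GPT-4 automatic evaluation, WizardLM reaches more than 90% of ChatGPT's capacity on 17 of 29 skills and outperforms Alpaca-7B and Vicuna-7B on the Evol-Instruct test set.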
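
To make the "more than 90% capacity on 17 of 29 skills" style of statement concrete, the sketch below computes a per-skill ratio of model score to reference score and counts the skills above a threshold. It assumes that capacity means this simple ratio of GPT-4-assigned scores. The skill names and numbers are made-up placeholders, not results from the paper.

```python
from typing import Dict, List


def relative_capacity(model_scores: Dict[str, float],
                      reference_scores: Dict[str, float]) -> Dict[str, float]:
    """Per-skill ratio of the model's score to the reference model's score."""
    return {skill: model_scores[skill] / reference_scores[skill]
            for skill in reference_scores}


def skills_above(capacity: Dict[str, float], threshold: float = 0.9) -> List[str]:
    """Names of the skills whose ratio exceeds the threshold, sorted alphabetically."""
    return sorted(skill for skill, ratio in capacity.items() if ratio > threshold)


# Placeholder scores for illustration only; these are NOT the paper's numbers.
model = {"skill_a": 8.5, "skill_b": 5.0, "skill_c": 6.0}
reference = {"skill_a": 9.0, "skill_b": 8.0, "skill_c": 8.0}

capacity = relative_capacity(model, reference)
strong = skills_above(capacity)
print(f"{len(strong)} of {len(capacity)} skills above 90%: {strong}")
# prints: 1 of 3 skills above 90%: ['skill_a']
```

A per-skill breakdown like this shows where a model lags (here `skill_b` and `skill_c`), which a single aggregate score would hide.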

This review was generated by AI and checked by a human editor.