QUICK REVIEW

[Paper Review] AlpaGasus: Training A Better Alpaca with Fewer Data

Lichang Chen, Shiyang Li|arXiv (Cornell University)|Jul 17, 2023

Video Analysis and Summarization|Cited by 15

One-line summary

AlpaGasus uses a strong LLM as an automatic grader to filter Alpaca's 52k dataset down to its high-quality instruction data. Fine-tuning on the resulting 9k subset yields a model that outperforms Alpaca and trains faster.

ABSTRACT

Large language models (LLMs) strengthen instruction-following capability through instruction-finetuning (IFT) on supervised instruction/response data. However, widely used IFT datasets (e.g., Alpaca's 52k data) surprisingly contain many low-quality instances with incorrect or irrelevant responses, which are misleading and detrimental to IFT. In this paper, we propose a simple and effective data selection strategy that automatically identifies and filters out low-quality data using a strong LLM (e.g., ChatGPT). To this end, we introduce AlpaGasus, which is finetuned on only 9k high-quality data filtered from the 52k Alpaca data. AlpaGasus significantly outperforms the original Alpaca as evaluated by GPT-4 on multiple test sets and the controlled human evaluation. Its 13B variant matches $>90\%$ performance of its teacher LLM (i.e., Text-Davinci-003 generating the 52k data) on test tasks. It also provides 5.7x faster training, reducing the training time for a 7B variant from 80 minutes (for Alpaca) to 14 minutes. Moreover, the experiments prove the efficacy of our method across diverse datasets, base models, and LLM filters. Overall, AlpaGasus demonstrates a novel data-centric IFT paradigm that can be generally applied to instruction-tuning data, leading to faster training and better instruction-following models. Our project page is available at: https://lichang-chen.github.io/AlpaGasus/

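Research motivation and objectives

Make the case for prioritizing data quality over quantity in instruction fine-tuning (IFT) of LLMs.
Propose an automatic data-filtering method that uses a strong LLM as the grader to improve IFT data quality.
Show that a small, high-quality subset outperforms a larger, noisier dataset on instruction-following tasks.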

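Proposed method

Define a grading prompt that asks a strong LLM (e.g., ChatGPT) to rate each (instruction, input, response) triplet along the accuracy dimension.
Apply a threshold to the scores to filter Alpaca's 52k data, yielding the 9,229-sample subset on which AlpaGasus is fine-tuned.
Train the base model (LLaMA series) on the filtered 9k data using the same IFT pipeline as Alpaca.
Compare AlpaGasus and Alpaca on multiple test sets and benchmarks, using GPT-4 as the judge.
Break results down by task category (Generic, Roleplay, Knowledge, Commonsense) and run a human evaluation to validate the model comparison.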
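The score-and-threshold filtering step can be sketched as below. This is a minimal illustration, not the authors' implementation: the prompt wording, the `toy_grader` stand-in for the LLM call, and the threshold value of 4.5 are all assumptions introduced here, since this review does not give them.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Example:
    """One (instruction, input, response) triplet from an IFT dataset."""
    instruction: str
    input: str
    response: str


# Hypothetical grader type: maps one example to a numeric accuracy score.
# In practice this would wrap a call to a strong LLM such as ChatGPT.
Grader = Callable[[Example], float]


def build_grading_prompt(ex: Example) -> str:
    """Assemble an illustrative grading prompt (wording is an assumption)."""
    return (
        "Rate the accuracy of the response to the instruction and input "
        "below on a numeric scale. Reply with the score only.\n"
        f"Instruction: {ex.instruction}\n"
        f"Input: {ex.input}\n"
        f"Response: {ex.response}\n"
    )


def filter_by_score(
    examples: Iterable[Example], grader: Grader, threshold: float
) -> List[Example]:
    """Keep only the examples whose score meets or exceeds the threshold."""
    return [ex for ex in examples if grader(ex) >= threshold]


def toy_grader(ex: Example) -> float:
    """Stand-in for the LLM grader: empty responses score 0, others 5."""
    return 0.0 if not ex.response.strip() else 5.0


data = [
    Example("Add 2 and 3.", "", "5"),
    Example("Name a primary color.", "", ""),
]
# The threshold of 4.5 is an assumed value chosen for this sketch.
kept = filter_by_score(data, toy_grader, threshold=4.5)
print(len(kept))  # 1
```

Passing the grader in as a callable keeps the filtering logic independent of any particular LLM API, which fits the finding that different LLM filters can be swapped in.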

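Experimental results

Research questions

RQ1: Can automatic data-quality assessment improve instruction-following performance when a model is fine-tuned on a small amount of data?
RQ2: When trained on data filtered for high quality, how does AlpaGasus compare with Alpaca and other baselines across diverse test sets and benchmarks?
RQ3: Does data quality matter more than quantity for IFT, regardless of model size and base architecture?
RQ4: Do the results generalize across different LLM filters, base models, and data types (machine-generated vs. human-written)?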

Key findings

AlpaGasus trained on 9k high-quality data significantly outperforms Alpaca trained on 52k data on four test sets (Vicuna, Koala, Self-Instruct, WizardLM).
The 13B AlpaGasus model reaches over 90% of the performance of its teacher Text-Davinci-003 on test tasks.
AlpaGasus trains 5.7x faster, cutting 7B training time from 80 minutes to 14 minutes on 4× NVIDIA A100 (80GB) GPUs.
Evaluation with GPT-4 as judge shows AlpaGasus often outperforms Alpaca across multiple benchmarks, with human studies corroborating its superiority.
The filtering approach generalizes across base models (LLaMA-1 and LLaMA-2) and different LLM filters (ChatGPT and Claude-2).
Data quality filtering yields meaningful cost savings and faster iteration without sacrificing instruction-following performance.
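The training-time figures above can be checked with a few lines of arithmetic. The sketch below uses only numbers quoted in this review. The dataset size is taken as the rounded "52k" (an approximation), and the comparison between speedup and data reduction is an observation added here, not a claim from the paper.

```python
# Training times for the 7B model, in minutes, as reported above.
alpaca_minutes = 80
alpagasus_minutes = 14

# Dataset sizes: "52k" is a rounded figure, 9,229 is the filtered subset.
full_size = 52_000
filtered_size = 9_229

speedup = alpaca_minutes / alpagasus_minutes
kept_fraction = filtered_size / full_size
size_ratio = full_size / filtered_size

print(f"speedup: {speedup:.2f}x")             # speedup: 5.71x
print(f"data kept: {kept_fraction:.1%}")      # data kept: 17.7%
print(f"data reduction: {size_ratio:.2f}x")   # data reduction: 5.63x
```

Under these numbers the speedup (about 5.7x) is close to the reduction in dataset size (about 5.6x), which is what one would expect if training time scales roughly linearly with the number of examples.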


This review was generated by AI and checked by a human editor.