QUICK REVIEW

[Paper Review] Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models

Haoran Li, Qingxiu Dong|arXiv (Cornell University)|Feb 20, 2024

Natural Language Processing Techniques | Citations: 5

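One-sentence summary

GLAN is a general, scalable method that builds large-scale synthetic instruction data from a taxonomy of human knowledge and uses it to instruction-tune LLMs, achieving strong overall performance without task-specific training data.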

ABSTRACT

We introduce Generalized Instruction Tuning (called GLAN), a general and scalable method for instruction tuning of Large Language Models (LLMs). Unlike prior work that relies on seed examples or existing datasets to construct instruction tuning data, GLAN exclusively utilizes a pre-curated taxonomy of human knowledge and capabilities as input and generates large-scale synthetic instruction data across all disciplines. Specifically, inspired by the systematic structure in human education system, we build the taxonomy by decomposing human knowledge and capabilities to various fields, sub-fields and ultimately, distinct disciplines semi-automatically, facilitated by LLMs. Subsequently, we generate a comprehensive list of subjects for every discipline and proceed to design a syllabus tailored to each subject, again utilizing LLMs. With the fine-grained key concepts detailed in every class session of the syllabus, we are able to generate diverse instructions with a broad coverage across the entire spectrum of human knowledge and skills. Extensive experiments on large language models (e.g., Mistral) demonstrate that GLAN excels in multiple dimensions from mathematical reasoning, coding, academic exams, logical reasoning to general instruction following without using task-specific training data of these tasks. In addition, GLAN allows for easy customization and new fields or skills can be added by simply incorporating a new node into our taxonomy.

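Motivation and objectives

Improve LLM instruction following without depending on seed examples or domain-specific datasets.
Propose a scalable pipeline that generates synthetic instruction data from a systematic taxonomy of human knowledge.
Show that models trained on GLAN-generated data perform well on mathematical reasoning, coding, logical reasoning, and academic exams.
Show that GLAN can be customized and extended by adding new nodes to the taxonomy.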

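Proposed method

Build a taxonomy of human knowledge and capabilities using a state-of-the-art LLM (GPT-4) together with human verification.
Decompose each discipline into subjects and design a syllabus for each subject using LLMs.
Break each subject into class sessions and extract the key concepts from the syllabus using LLMs.
Sample class sessions and key concepts, then have LLMs generate diverse homework questions; answers are generated with GPT-3.5-turbo.
Fine-tune a base model (Mistral 7B) on the synthetic instruction-response pairs under a standard fine-tuning setup.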
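
The sampling and generation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the data classes, the prompt wording, the sampling limits, and the two stub "LLM" functions are all hypothetical, and in a real pipeline the stubs would be replaced by calls to actual models (the review names GPT-3.5-turbo for answers).

```python
import random
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ClassSession:
    title: str
    key_concepts: List[str]


@dataclass
class Subject:
    discipline: str
    name: str
    sessions: List[ClassSession] = field(default_factory=list)


def build_question_prompt(subject: Subject, rng: random.Random,
                          max_sessions: int = 2, max_concepts: int = 3) -> str:
    """Sample class sessions and key concepts, then format a question prompt.

    The limits and the prompt text are illustrative assumptions.
    """
    sessions = rng.sample(subject.sessions, k=min(max_sessions, len(subject.sessions)))
    concepts = [c for s in sessions for c in s.key_concepts]
    chosen = rng.sample(concepts, k=min(max_concepts, len(concepts)))
    return (
        f"Discipline: {subject.discipline}\n"
        f"Subject: {subject.name}\n"
        f"Class sessions: {'; '.join(s.title for s in sessions)}\n"
        f"Key concepts: {', '.join(chosen)}\n"
        "Write one homework question that tests these concepts."
    )


def generate_pairs(subject: Subject,
                   question_llm: Callable[[str], str],
                   answer_llm: Callable[[str], str],
                   n: int, seed: int = 0) -> List[Dict[str, str]]:
    """Produce n instruction-response pairs for one subject."""
    rng = random.Random(seed)  # seeded so that runs are reproducible
    pairs = []
    for _ in range(n):
        question = question_llm(build_question_prompt(subject, rng))
        pairs.append({"instruction": question, "response": answer_llm(question)})
    return pairs


# Hypothetical stand-ins for real model calls, so the sketch runs offline.
def stub_question_llm(prompt: str) -> str:
    concepts = prompt.splitlines()[3].split(": ", 1)[1]
    return f"Homework question covering: {concepts}"


def stub_answer_llm(question: str) -> str:
    return f"Worked answer to: {question}"


# Made-up example subject; not taken from the paper's taxonomy.
demo_subject = Subject(
    discipline="Mathematics",
    name="Linear Algebra",
    sessions=[
        ClassSession("Vector spaces", ["basis", "dimension"]),
        ClassSession("Linear maps", ["kernel", "rank"]),
        ClassSession("Eigen-decomposition", ["eigenvalue", "diagonalization"]),
    ],
)

demo_pairs = generate_pairs(demo_subject, stub_question_llm, stub_answer_llm, n=3, seed=1)
```

Keeping the question and answer generators as separate callables mirrors the description above, where questions and answers may come from different models.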

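Experimental results

Research questions

RQ1: Can a taxonomy-based, fully automated data generation pipeline produce instruction data that is broadly useful across diverse domains?
RQ2: Does instruction tuning on GLAN-generated data improve performance on mathematical reasoning, coding, logical reasoning, and academic exams relative to baselines?
RQ3: Does GLAN remain robust when the taxonomy is extended with new domains, without regenerating all existing data?
RQ4: Can models trained on GLAN data maintain general instruction following without task-specific in-domain data?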

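Key findings

GLAN-generated data yields strong performance on mathematical reasoning, coding, logical reasoning, and academic exams without using task-specific training data.
Across multiple benchmarks (math, coding, reasoning, exams), GLAN is competitive with several baselines and in some cases achieves the best results.
The taxonomy-based data generation approach is easy to extend: adding a new node is enough, and the full pipeline does not need to be rerun.
Models trained on GLAN data adapt well to diverse fields and show notable improvements on STEM-related tasks.
The evaluation suggests that GLAN generates diverse instruction data that avoids overfitting to in-domain benchmark data.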
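
The extensibility finding above (add a node, generate only for that node) can be illustrated with a small sketch. It assumes the taxonomy is stored as a nested mapping of field, sub-field, and discipline, which is a representation chosen here for illustration; the function names and the example entries are hypothetical and not taken from the paper.

```python
from typing import Dict, List, Set, Tuple

# field -> sub-field -> list of disciplines (an assumed representation)
Taxonomy = Dict[str, Dict[str, List[str]]]
Leaf = Tuple[str, str, str]


def disciplines(taxonomy: Taxonomy) -> Set[Leaf]:
    """Return every (field, sub-field, discipline) leaf in the taxonomy."""
    return {
        (field_name, sub_field, discipline)
        for field_name, sub_fields in taxonomy.items()
        for sub_field, names in sub_fields.items()
        for discipline in names
    }


def add_discipline(taxonomy: Taxonomy, field_name: str,
                   sub_field: str, discipline: str) -> None:
    """Insert a new leaf, creating the parent nodes if they are missing."""
    names = taxonomy.setdefault(field_name, {}).setdefault(sub_field, [])
    if discipline not in names:
        names.append(discipline)


def pending(taxonomy: Taxonomy, already_generated: Set[Leaf]) -> List[Leaf]:
    """Leaves that still need data; existing leaves are left untouched."""
    return sorted(disciplines(taxonomy) - already_generated)


# Made-up example entries.
taxonomy: Taxonomy = {"Natural Sciences": {"Physics": ["Classical Mechanics"]}}
done = disciplines(taxonomy)  # data already generated for these leaves

add_discipline(taxonomy, "Engineering", "Computer Science", "Compilers")
todo = pending(taxonomy, done)
```

After the new node is added, `todo` holds only the new leaf, so the subject, syllabus, and question generation steps would run for that leaf alone.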

This review was generated by AI and checked by a human editor.