QUICK REVIEW

[Paper Review] LAB: Large-Scale Alignment for ChatBots

Shivchander Sudalairaj, Abhishek Bhandwaldar | arXiv (Cornell University) | Mar 2, 2024

Topic Modeling | Cited by 5

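One-line summary

LAB introduces taxonomy-guided synthetic data generation and a multi-phase instruction-tuning framework for scaling alignment without GPT-4, and achieves competitive benchmark results.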

ABSTRACT

This work introduces LAB (Large-scale Alignment for chatBots), a novel methodology designed to overcome the scalability challenges in the instruction-tuning phase of large language model (LLM) training. Leveraging a taxonomy-guided synthetic data generation process and a multi-phase tuning framework, LAB significantly reduces reliance on expensive human annotations and proprietary models like GPT-4. We demonstrate that LAB-trained models can achieve competitive performance across several benchmarks compared to models trained with traditional human-annotated or GPT-4 generated synthetic data. Thus offering a scalable, cost-effective solution for enhancing LLM capabilities and instruction-following behaviors without the drawbacks of catastrophic forgetting, marking a step forward in the efficient training of LLMs for a wide range of applications.

Motivation and objectives

Make instruction tuning scalable without heavy reliance on human annotations or proprietary models.
Propose a taxonomy-guided synthetic data generation process to diversify instruction data.
Develop a multi-phase training framework with replay buffers to prevent catastrophic forgetting.
Show that LAB-trained models achieve competitive performance on standard benchmarks.

Proposed method

Define a taxonomy with branches for knowledge, foundational skills, and compositional skills to curate instruction data.
Use taxonomy-guided synthetic data generators to create large-scale diverse instruction data without GPT-4 or extensive human curation.
Implement a two-phase training regime (knowledge tuning followed by skills tuning) with a replay buffer to mitigate forgetting.
Evaluate on MT-Bench and on standard benchmarks (MMLU, ARC, HellaSwag, Winogrande, GSM8K), and compare against baselines.
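The steps above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `TaxonomyNode` class, the prompt wording, the seed examples, and the `replay_fraction` parameter are all hypothetical, and the call to the teacher model is omitted. It only shows the general shape of the idea: walk a taxonomy to its leaves, build one generation prompt per seed, then train in two phases where the second phase mixes in a sample of first-phase data.

```python
import random
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """One node in a hypothetical instruction taxonomy (not the paper's schema)."""
    name: str
    children: list = field(default_factory=list)
    seed_examples: list = field(default_factory=list)  # expected on leaves only


def leaves(node, path=()):
    """Yield (path, leaf) pairs so that every leaf task is visited exactly once."""
    path = path + (node.name,)
    if not node.children:
        yield path, node
        return
    for child in node.children:
        yield from leaves(child, path)


def build_prompts(root, per_leaf=2):
    """Create teacher prompts for each leaf; the prompt wording is illustrative."""
    prompts = []
    for path, leaf in leaves(root):
        if not leaf.seed_examples:
            continue  # nothing to condition the teacher on
        task = "/".join(path)
        branch = path[1] if len(path) > 1 else path[0]
        for i in range(per_leaf):
            seed = leaf.seed_examples[i % len(leaf.seed_examples)]
            prompts.append({
                "branch": branch,
                "task": task,
                "prompt": f"Write one new instruction for the task '{task}', similar to: {seed}",
            })
    return prompts


def phase_with_replay(new_data, earlier_data, replay_fraction, rng):
    """Append a random sample of earlier-phase data (the replay buffer) to new data."""
    n_replay = min(len(earlier_data), int(len(new_data) * replay_fraction))
    return new_data + rng.sample(earlier_data, n_replay)


# Toy taxonomy with the three branches named in the review; leaves are made up.
taxonomy = TaxonomyNode("root", [
    TaxonomyNode("knowledge", [
        TaxonomyNode("finance", seed_examples=["What is compound interest?"]),
    ]),
    TaxonomyNode("foundational_skills", [
        TaxonomyNode("math", seed_examples=["Solve 3x + 2 = 11."]),
    ]),
    TaxonomyNode("compositional_skills", [
        TaxonomyNode("writing", [
            TaxonomyNode("email", seed_examples=["Draft a polite follow-up email."]),
        ]),
    ]),
])

rng = random.Random(0)
prompts = build_prompts(taxonomy, per_leaf=2)
knowledge = [p for p in prompts if p["branch"] == "knowledge"]
skills = [p for p in prompts if p["branch"] != "knowledge"]

phase1 = knowledge                                   # phase 1: knowledge tuning
phase2 = phase_with_replay(skills, knowledge, 0.5, rng)  # phase 2: skills + replay
print(len(phase1), len(phase2))
```

In a real pipeline each prompt would be sent to the teacher model and the responses filtered before training; the replay ratio used here is an arbitrary placeholder, not a value from the paper.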

Experimental results

Research questions

RQ1: Can taxonomy-guided synthetic data generation reduce reliance on proprietary models while maintaining instruction-following performance?
RQ2: Does a multi-phase training regime with replay buffers improve stability and prevent forgetting during large-scale alignment?
RQ3: How do LAB-trained models perform on a comprehensive set of alignment benchmarks compared to models trained on human-annotated or GPT-4-generated data?

Key findings

Model	Alignment	Teacher	MT-Bench	MMLU	ARC	HellaSwag	Winogrande	GSM8K
Llama-2-13b-chat	SFT + RLHF	Human annotators	6.65	54.58	59.81	82.52	75.93	34.80
Orca-2	Progressive Training	GPT-4	6.15	60.37	59.73	79.86	78.22	48.22
WizardLM-13B	Evol-Instruct	GPT-4	7.20	54.83	60.24	82.62	76.40	43.75
Labradorite-13b	LAB	Mixtral-8x7B-Instruct	7.23	58.89	61.69	83.15	79.56	40.11
Mistral-7B-Instruct	SFT	Public Datasets	6.84	60.37	63.65	84.76	76.80	41.85
Zephyr-7b-β	SFT + DPO	GPT-4	7.34	61.07	63.74	84.19	78.06	34.04
Merlinite-7B	LAB	Mixtral-8x7B-Instruct	7.66	64.88	63.99	84.37	78.24	44.58

The LAB-aligned models Labradorite-13b and Merlinite-7B achieve MT-Bench scores of 7.23 and 7.66 and MMLU scores of 58.89 and 64.88, respectively.
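On ARC, HellaSwag, Winogrande, and GSM8K, the LAB models are competitive with the baselines (values reported in Table 3). Merlinite-7B leads Mistral-7B-Instruct and Zephyr-7b-β on ARC, Winogrande, and GSM8K but trails Mistral-7B-Instruct on HellaSwag (84.37 vs. 84.76). Labradorite-13b leads Llama-2-13b-chat, Orca-2, and WizardLM-13B on ARC, HellaSwag, and Winogrande but trails Orca-2 and WizardLM-13B on GSM8K (40.11 vs. 48.22 and 43.75).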
The LAB approach uses the open-weight Mixtral-8x7B-Instruct as its teacher instead of GPT-4 and still achieves competitive results on multiple benchmarks.
Two-phase training with replay buffers yields better benchmark performance and reduces catastrophic forgetting.
LAB data generation produced 1.2 million samples split roughly between knowledge-based and skill-based data.
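As a quick check on the table above, the following Python sketch finds the top-scoring model for each benchmark. The scores are copied from the table; the `leaders` helper is a hypothetical name introduced only for this illustration.

```python
BENCHMARKS = ["MT-Bench", "MMLU", "ARC", "HellaSwag", "Winogrande", "GSM8K"]

# Scores copied from the results table, in BENCHMARKS order.
SCORES = {
    "Llama-2-13b-chat":    [6.65, 54.58, 59.81, 82.52, 75.93, 34.80],
    "Orca-2":              [6.15, 60.37, 59.73, 79.86, 78.22, 48.22],
    "WizardLM-13B":        [7.20, 54.83, 60.24, 82.62, 76.40, 43.75],
    "Labradorite-13b":     [7.23, 58.89, 61.69, 83.15, 79.56, 40.11],
    "Mistral-7B-Instruct": [6.84, 60.37, 63.65, 84.76, 76.80, 41.85],
    "Zephyr-7b-β":         [7.34, 61.07, 63.74, 84.19, 78.06, 34.04],
    "Merlinite-7B":        [7.66, 64.88, 63.99, 84.37, 78.24, 44.58],
}


def leaders(scores, benchmarks):
    """Return {benchmark: (best_model, best_score)} across all rows."""
    result = {}
    for i, name in enumerate(benchmarks):
        best = max(scores, key=lambda model: scores[model][i])
        result[name] = (best, scores[best][i])
    return result


for benchmark, (model, score) in leaders(SCORES, BENCHMARKS).items():
    print(f"{benchmark:<11} {model:<20} {score}")
```

This ranks all seven rows together, so it mixes 7B and 13B models; restricting `SCORES` to one size group gives a fairer like-for-like comparison.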

This review was generated by AI and checked by a human editor.