QUICK REVIEW

[Paper Review] LAB: Large-Scale Alignment for ChatBots

Shivchander Sudalairaj, Abhishek Bhandwaldar|arXiv (Cornell University)|Mar 2, 2024
Topic Modeling | Citations: 5
One-Line Summary

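LAB introduces taxonomy-guided synthetic data generation and a multi-phase instruction-tuning framework for scaling alignment without GPT-4, and achieves competitive benchmark results.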

ABSTRACT

This work introduces LAB (Large-scale Alignment for chatBots), a novel methodology designed to overcome the scalability challenges in the instruction-tuning phase of large language model (LLM) training. Leveraging a taxonomy-guided synthetic data generation process and a multi-phase tuning framework, LAB significantly reduces reliance on expensive human annotations and proprietary models like GPT-4. We demonstrate that LAB-trained models can achieve competitive performance across several benchmarks compared to models trained with traditional human-annotated or GPT-4 generated synthetic data. Thus offering a scalable, cost-effective solution for enhancing LLM capabilities and instruction-following behaviors without the drawbacks of catastrophic forgetting, marking a step forward in the efficient training of LLMs for a wide range of applications.

Motivation and Objectives

  • Motivate scalable instruction tuning without heavy reliance on human annotations or proprietary models.
  • Propose a taxonomy-guided synthetic data generation process to diversify instruction data.
  • Develop a multi-phase training framework with replay buffers to prevent catastrophic forgetting.
  • Show that LAB-trained models achieve competitive performance on standard benchmarks.

Proposed Method

  • Define a taxonomy with branches for knowledge, foundational skills, and compositional skills to curate instruction data.
  • Use taxonomy-guided synthetic data generators to create large-scale diverse instruction data without GPT-4 or extensive human curation.
  • Implement a two-phase training regime (knowledge tuning followed by skills tuning) with a replay buffer to mitigate forgetting.
  • Evaluate on MT-Bench (from LMSYS) and on MMLU, ARC, HellaSwag, Winogrande, and GSM8K, and compare against baselines.
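
The data flow described in the list above can be illustrated with a toy sketch. Everything here is hypothetical and is not the authors' implementation: the `TaxonomyNode` class, the `toy_teacher` stand-in for a teacher LLM, the `replay_fraction` parameter, and the choice to place both skill branches in the second phase are all assumptions made for illustration.

```python
from __future__ import annotations

import random
from dataclasses import dataclass, field


@dataclass
class TaxonomyNode:
    """Hypothetical taxonomy node; leaf nodes carry seed examples."""
    name: str
    children: list[TaxonomyNode] = field(default_factory=list)
    seeds: list[str] = field(default_factory=list)


def iter_leaves(node, path=()):
    """Yield (path, leaf) for every leaf below `node`."""
    path = path + (node.name,)
    if not node.children:
        yield path, node
    for child in node.children:
        yield from iter_leaves(child, path)


def toy_teacher(path, seeds):
    # Stand-in for a teacher LLM call (assumption): a real generator
    # would prompt the teacher with the leaf's seed examples.
    return f"New task for {'/'.join(path)} in the style of: {seeds[0]}"


def generate_samples(root, teacher, per_leaf):
    """Create `per_leaf` synthetic instructions at every taxonomy leaf."""
    samples = []
    for path, leaf in iter_leaves(root):
        branch = path[min(1, len(path) - 1)]  # top-level branch under the root
        for _ in range(per_leaf):
            samples.append({"branch": branch, "instruction": teacher(path, leaf.seeds)})
    return samples


def build_phase_data(new_data, replay_pool, replay_fraction, rng):
    """Mix a phase's new data with samples replayed from earlier phases."""
    n_replay = min(len(replay_pool), int(len(new_data) * replay_fraction))
    mixed = list(new_data) + rng.sample(replay_pool, n_replay)
    rng.shuffle(mixed)
    return mixed


root = TaxonomyNode("root", children=[
    TaxonomyNode("knowledge", children=[
        TaxonomyNode("textbooks", seeds=["Explain photosynthesis."])]),
    TaxonomyNode("foundational_skills", children=[
        TaxonomyNode("math", seeds=["Solve 3x + 2 = 11."])]),
    TaxonomyNode("compositional_skills", children=[
        TaxonomyNode("writing", seeds=["Draft a polite refund email."])]),
])

samples = generate_samples(root, toy_teacher, per_leaf=4)
knowledge = [s for s in samples if s["branch"] == "knowledge"]
skills = [s for s in samples if s["branch"] != "knowledge"]

rng = random.Random(0)
phase1 = build_phase_data(knowledge, [], 0.0, rng)        # knowledge tuning
phase2 = build_phase_data(skills, knowledge, 0.5, rng)    # skills tuning + replay
print(len(phase1), len(phase2))
```

Replaying part of the phase-one data during phase two is one common way to limit forgetting; the review does not state the replay ratio LAB uses, so the value 0.5 is only a placeholder.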
[Figure, panel (a): Input distributions]

Experimental Results

Research Questions

  • RQ1: Can taxonomy-guided synthetic data generation reduce reliance on proprietary models while maintaining instruction-following performance?
  • RQ2: Does a multi-phase training regime with replay buffers improve stability and prevent forgetting during large-scale alignment?
  • RQ3: How do LAB-trained models perform on a comprehensive set of alignment benchmarks compared with models trained on human-annotated or GPT-4-generated data?

Key Findings

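| Model | Alignment | Teacher | MT-Bench | MMLU | ARC | HellaSwag | Winogrande | GSM8K |
|---|---|---|---|---|---|---|---|---|
| Llama-2-13b-chat | SFT + RLHF | Human annotators | 6.65 | 54.58 | 59.81 | 82.52 | 75.93 | 34.80 |
| Orca-2 | Progressive Training | GPT-4 | 6.15 | 60.37 | 59.73 | 79.86 | 78.22 | 48.22 |
| WizardLM-13B | Evol-Instruct | GPT-4 | 7.20 | 54.83 | 60.24 | 82.62 | 76.40 | 43.75 |
| Labradorite-13b | LAB | Mixtral-8x7B-Instruct | 7.23 | 58.89 | 61.69 | 83.15 | 79.56 | 40.11 |
| Mistral-7B-Instruct | SFT | Public Datasets | 6.84 | 60.37 | 63.65 | 84.76 | 76.80 | 41.85 |
| Zephyr-7b-β | SFT + DPO | GPT-4 | 7.34 | 61.07 | 63.74 | 84.19 | 78.06 | 34.04 |
| Merlinite-7B | LAB | Mixtral-8x7B-Instruct | 7.66 | 64.88 | 63.99 | 84.37 | 78.24 | 44.58 |
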
  • The LAB-aligned models Labradorite-13b and Merlinite-7B achieve MT-Bench scores of 7.23 and 7.66 and MMLU scores of 58.89 and 64.88, respectively.
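  • On ARC, HellaSwag, and Winogrande, Labradorite-13b outscores Llama-2-13b-chat, Orca-2, and WizardLM-13B, but on GSM8K (40.11) it trails Orca-2 (48.22) and WizardLM-13B (43.75). Merlinite-7B outscores Mistral-7B-Instruct and Zephyr-7b-β on ARC, Winogrande, and GSM8K, and trails only Mistral-7B-Instruct on HellaSwag (84.37 vs. 84.76) (values reported in Table 3).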
  • LAB uses the open-weight Mixtral-8x7B-Instruct as its teacher model instead of GPT-4 and still achieves competitive results on multiple benchmarks.
  • Two-phase training with replay buffers yields better benchmark performance and reduces catastrophic forgetting.
  • LAB data generation produced 1.2 million samples split roughly between knowledge-based and skill-based data.
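
As a quick way to check the comparisons in these findings, the short script below finds the top-scoring model on each benchmark. The scores are copied from the results listed in this review; the helper name `leaders` is an illustrative choice, not something from the paper.

```python
# Scores copied from the results reported in this review, in benchmark order.
BENCHMARKS = ["MT-Bench", "MMLU", "ARC", "HellaSwag", "Winogrande", "GSM8K"]
SCORES = {
    "Llama-2-13b-chat":    [6.65, 54.58, 59.81, 82.52, 75.93, 34.80],
    "Orca-2":              [6.15, 60.37, 59.73, 79.86, 78.22, 48.22],
    "WizardLM-13B":        [7.20, 54.83, 60.24, 82.62, 76.40, 43.75],
    "Labradorite-13b":     [7.23, 58.89, 61.69, 83.15, 79.56, 40.11],
    "Mistral-7B-Instruct": [6.84, 60.37, 63.65, 84.76, 76.80, 41.85],
    "Zephyr-7b-β":         [7.34, 61.07, 63.74, 84.19, 78.06, 34.04],
    "Merlinite-7B":        [7.66, 64.88, 63.99, 84.37, 78.24, 44.58],
}


def leaders(scores, benchmarks):
    """Return {benchmark: (best_model, best_score)} over all listed models."""
    result = {}
    for i, bench in enumerate(benchmarks):
        best = max(scores, key=lambda model: scores[model][i])
        result[bench] = (best, scores[best][i])
    return result


best_by_benchmark = leaders(SCORES, BENCHMARKS)
for bench, (model, score) in best_by_benchmark.items():
    print(f"{bench:<12}{model:<22}{score}")
```

Across all seven models, a LAB-trained model leads on four of the six benchmarks (MT-Bench, MMLU, ARC, Winogrande), while Mistral-7B-Instruct leads on HellaSwag and Orca-2 on GSM8K.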
[Figure, panel (b): Output distributions]

This review was generated by AI and checked by a human editor.