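[Paper Summary] Self-Alignment with Instruction Backtranslation
This paper proposes instruction backtranslation, an iterative self-training method that uses a seed model to generate and filter high-quality (instruction, output) pairs from unlabelled web data, achieving strong instruction-following ability without distilling from a stronger model.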
We present a scalable method to build a high quality instruction following language model by automatically labelling human-written text with corresponding instructions. Our approach, named instruction backtranslation, starts with a language model finetuned on a small amount of seed data, and a given web corpus. The seed model is used to construct training examples by generating instruction prompts for web documents (self-augmentation), and then selecting high quality examples from among these candidates (self-curation). This data is then used to finetune a stronger model. Finetuning LLaMa on two iterations of our approach yields a model that outperforms all other LLaMa-based models on the Alpaca leaderboard not relying on distillation data, demonstrating highly effective self-alignment.
Motivation and Goals
- Motivate scalable instruction tuning without heavy reliance on human-annotated data or distillation.
- Introduce a two-step self-training pipeline (self-augmentation and self-curation) driven by the model itself.
- Demonstrate iterative improvement leading to competitive instruction-following models on benchmarks.
- Show data quality control is essential for effective scaling of instruction-following models.
Proposed Method
- Initialize with a small seed set of (instruction, output) pairs and a large unlabelled web corpus.
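- Self-augmentation: fine-tune a backward model on the seed data with the pairs reversed, so that it predicts an instruction from an output. Apply it to unlabelled web documents, treating each document as an output, to create candidate (instruction, output) pairs.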
- Self-curation: use a seed instruction model to score augmented pairs and select high-quality examples for finetuning, iterating to build a stronger model.
- Tag augmented and seed data with system prompts to guide training and inference.
- Experiment with 7B, 33B, and 65B LLaMA models, scaling the amount of augmented data and running two iterations of self-curation.
- Evaluate via AlpacaEval (GPT-4 judgments) and human preferences, plus zero-shot NLP benchmarks.
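The method steps above can be sketched as a short loop. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Example` type, the callable interfaces, the toy stand-in functions, the integer quality scale, and the default threshold of 5 are all hypothetical choices made so the sketch runs without a real language model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Example:
    instruction: str
    output: str


# Hypothetical interfaces (assumptions for this sketch, not from the paper):
#   BackwardModel: maps an output text to a guessed instruction.
#   Scorer: rates an (instruction, output) pair on an integer quality scale.
#   Trainer: "finetunes" on examples and returns the resulting scorer.
BackwardModel = Callable[[str], str]
Scorer = Callable[[Example], int]
Trainer = Callable[[List[Example]], Scorer]


def self_augment(backward_model: BackwardModel, documents: List[str]) -> List[Example]:
    """Treat each unlabelled document as an output and predict its instruction."""
    return [Example(backward_model(doc), doc) for doc in documents]


def self_curate(scorer: Scorer, candidates: List[Example], threshold: int) -> List[Example]:
    """Keep only candidates the current model rates at or above the threshold."""
    return [ex for ex in candidates if scorer(ex) >= threshold]


def instruction_backtranslation(
    seed: List[Example],
    documents: List[str],
    backward_model: BackwardModel,
    train: Trainer,
    threshold: int = 5,   # assumed cutoff on an assumed integer scale
    iterations: int = 2,  # the summary reports two curation iterations
) -> Tuple[Scorer, List[Example]]:
    candidates = self_augment(backward_model, documents)
    model = train(seed)  # the initial model sees seed data only
    curated: List[Example] = []
    for _ in range(iterations):
        # The current model re-scores all candidates, then is retrained
        # on the seed data plus whatever it selected.
        curated = self_curate(model, candidates, threshold)
        model = train(seed + curated)
    return model, curated


# Toy stand-ins so the sketch runs without any real language model.
def toy_backward_model(doc: str) -> str:
    return f"Write a text that starts with '{doc.split()[0]}'."


def toy_train(examples: List[Example]) -> Scorer:
    # Pretend quality is simply output length in words, capped at 5.
    return lambda ex: min(5, len(ex.output.split()))
```

In a real setting, `backward_model` and `train` would wrap actual finetuning and generation calls, and the scorer would prompt the finetuned model to rate each pair. The toy versions only show how data flows between the two steps.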
Experimental Results
Research Questions
- RQ1: Can a seed instruction-following model bootstrap high-quality instruction data from a large unlabelled web corpus without external supervision?
- RQ2: Does self-curation improve the quality of augmented data enough to warrant iterative retraining?
- RQ3: How does data quality vs. quantity impact instruction-following performance in self-aligned models?
- RQ4: What are the effects of data tagging and system prompts on training and inference?
- RQ5: How does the approach scale across model sizes and compare to non-distilled baselines on standard benchmarks?
Key Findings
- The self-augmentation plus self-curation pipeline (two iterations) yields a model (Humpback) that outperforms non-distilled LLaMA-based models on Alpaca leaderboard benchmarks.
- Training on high-quality augmented data significantly improves instruction-following performance compared to using all augmented data or seed data alone.
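- Scaling data helps only when the data is high quality: adding more curated data keeps improving performance, whereas adding unfiltered augmented data does not. These continued gains contrast with the superficial alignment hypothesis, under which a small set of high-quality examples would suffice.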
- Training jointly on seed and self-augmented data, each tagged with an appropriate system prompt, improves performance and helps address safety considerations.
- Scaling to larger models (e.g., 65B) with high-quality augmented data yields further improvements over smaller models.
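The joint training with system prompts mentioned in the findings can be illustrated with a small data-preparation sketch. The prompt strings, the dictionary layout, and the helper names below are assumptions for illustration; the summary does not specify the exact prompt wording or data format.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]  # (instruction, output)

# Assumed, illustrative prompt texts: one per data source.
SEED_PROMPT = "Answer in the style of an AI Assistant."
WEB_PROMPT = "Answer with knowledge from web search."


def tag(pairs: List[Pair], system_prompt: str) -> List[Dict[str, str]]:
    """Attach a source-specific system prompt to every example."""
    return [
        {"system": system_prompt, "instruction": ins, "output": out}
        for ins, out in pairs
    ]


def build_training_set(seed: List[Pair], augmented: List[Pair]) -> List[Dict[str, str]]:
    """Combine both sources into one training set, each marked by its prompt."""
    return tag(seed, SEED_PROMPT) + tag(augmented, WEB_PROMPT)


def format_prompt(system_prompt: str, instruction: str) -> str:
    """Render the text the model conditions on (hypothetical layout)."""
    return f"{system_prompt}\n\n{instruction}"
```

Tagging lets a single model see which source an example came from, so the two kinds of data can be mixed during training while still being distinguishable at inference time.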
This summary was generated by AI and reviewed by human editors.