[Paper Review] Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision
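Self-Align reduces the need for human supervision by aligning a language model from scratch through principle-driven self-alignment; Dromedary (LLaMA-65B) achieves strong results with fewer than 300 lines of annotations.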
Recent AI-assistant agents, such as ChatGPT, predominantly rely on supervised fine-tuning (SFT) with human annotations and reinforcement learning from human feedback (RLHF) to align the output of large language models (LLMs) with human intentions, ensuring they are helpful, ethical, and reliable. However, this dependence can significantly constrain the true potential of AI-assistant agents due to the high cost of obtaining human supervision and the related issues on quality, reliability, diversity, self-consistency, and undesirable biases. To address these challenges, we propose a novel approach called SELF-ALIGN, which combines principle-driven reasoning and the generative power of LLMs for the self-alignment of AI agents with minimal human supervision. Our approach encompasses four stages: first, we use an LLM to generate synthetic prompts, and a topic-guided method to augment the prompt diversity; second, we use a small set of human-written principles for AI models to follow, and guide the LLM through in-context learning from demonstrations (of principles application) to produce helpful, ethical, and reliable responses to users' queries; third, we fine-tune the original LLM with the high-quality self-aligned responses so that the resulting model can generate desirable responses for each query directly without the principle set and the demonstrations anymore; and finally, we offer a refinement step to address the issues of overly-brief or indirect responses. Applying SELF-ALIGN to the LLaMA-65b base language model, we develop an AI assistant named Dromedary. With fewer than 300 lines of human annotations (including < 200 seed prompts, 16 generic principles, and 5 exemplars for in-context learning), Dromedary significantly surpasses the performance of several state-of-the-art AI systems, including Text-Davinci-003 and Alpaca, on benchmark datasets with various settings.
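Motivation and Objectives
- Reduce the reliance on costly human supervision for LLM alignment.
- Develop a practical four-stage pipeline (Self-Instruct, Principle-Driven Self-Alignment, Principle Engraving, Verbose Cloning).
- Demonstrate that a base model (LLaMA-65B) aligned from scratch can outperform several baselines with minimal human input.
- Release open-source code, weights, and synthetic data to advance research on supervision-efficient alignment.
Proposed Method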
- Topic-Guided Red-Teaming Self-Instruct to generate diverse synthetic instructions and prompts.
- Principle-Driven Self-Alignment using 16 human-written principles and 5 exemplars for in-context learning demonstrations.
- Principle Engraving: fine-tune the base model on the self-aligned outputs, with the principles and demonstrations pruned from the prompt so that the model no longer needs them at inference time.
- Verbose Cloning: train a verbose, context-distilled model to produce more comprehensive responses.
- In-context demonstrations guide the model to comply with the principles during response generation.
- Fine-tuning is performed to embed principle-aligned behavior directly in the model parameters.
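The data flow between the Principle-Driven Self-Alignment and Principle Engraving stages can be illustrated with a minimal sketch. Everything named here is hypothetical and for illustration only: the `LLM` callable type, the prompt layout in `build_alignment_prompt`, the `echo_llm` stand-in model, and the `input`/`target` record format are assumptions, not the paper's actual templates or code. The sketch only shows the idea that principles and exemplars appear in the generation prompt but are dropped from the fine-tuning inputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Example:
    prompt: str
    response: str


def build_alignment_prompt(principles: List[str], exemplars: List[Example], query: str) -> str:
    """Assemble principles, in-context demonstrations, and the new query (assumed layout)."""
    rules = "\n".join(f"{i}. {p}" for i, p in enumerate(principles, start=1))
    demos = "\n\n".join(f"User: {e.prompt}\nAssistant: {e.response}" for e in exemplars)
    return f"Principles:\n{rules}\n\n{demos}\n\nUser: {query}\nAssistant:"


def self_align(base_llm: LLM, principles: List[str], exemplars: List[Example],
               queries: List[str]) -> List[Example]:
    """Generate one principle-guided response per synthetic query."""
    return [
        Example(q, base_llm(build_alignment_prompt(principles, exemplars, q)).strip())
        for q in queries
    ]


def engraving_dataset(aligned: List[Example]) -> List[Dict[str, str]]:
    """Build fine-tuning records: bare query in, aligned response out.

    The principles and demonstrations are deliberately absent from the input.
    """
    return [{"input": f"User: {ex.prompt}\nAssistant:", "target": ex.response} for ex in aligned]


def echo_llm(prompt: str) -> str:
    """Stand-in for a real model: answers with a stub based on the last user line."""
    last_user = [line for line in prompt.splitlines() if line.startswith("User: ")][-1]
    return " (stub answer to) " + last_user[len("User: "):]


if __name__ == "__main__":
    principles = ["Be helpful.", "Be honest."]
    exemplars = [Example("Hi", "Hello!")]
    records = engraving_dataset(self_align(echo_llm, principles, exemplars, ["What is 2+2?"]))
    print(records[0])
```

In a real setting, `echo_llm` would be replaced by a call to the base model, and the resulting records would feed a standard supervised fine-tuning loop.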
Experimental Results
Research Questions
- RQ1: Can language models be aligned from scratch with minimal human supervision using a principle-driven framework?
- RQ2: How does the inclusion of a small set of principles and exemplars affect alignment performance across benchmarks?
- RQ3: What is the impact of the Verbose Cloning step on generation quality and on various evaluation metrics?
- RQ4: How does Self-Align compare to RLHF/CAI-based approaches in terms of supervision efficiency and safety/quality trade-offs?
Key Findings
- Dromedary-65B (final) surpasses several baselines like Text-Davinci-003 and Alpaca on benchmark datasets under various settings.
- TruthfulQA MC1 accuracy reaches 69 with a modified ranking approach, outperforming GPT-4 and other open-source models in the MC task.
- In BIG-bench HHH Eval, Dromedary shows significantly improved harmlessness and overall performance relative to open-source baselines and is only marginally below ChatGPT/Vicuna distillates.
- Verbose Cloning improves generation quality on certain benchmarks (e.g., Vicuna benchmark questions) but can incur a verbose tax, reducing performance on some multiple-choice rankings.
- The approach achieves strong results while using fewer than 300 lines of human annotations, highlighting supervision efficiency.
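For readers unfamiliar with the MC1 metric mentioned above, the following is a minimal sketch of how single-true-answer multiple-choice accuracy is typically computed: the model assigns a score to each answer choice, and a question counts as correct when the top-scored choice is the one true answer. The function name `mc1_accuracy` and the toy scores are illustrative assumptions; the sketch does not reproduce the modified ranking approach used in the paper, only the accuracy computation applied once scores are available.

```python
from typing import Sequence


def mc1_accuracy(scores: Sequence[Sequence[float]], correct_idx: Sequence[int]) -> float:
    """Fraction of questions whose highest-scored choice is the single correct one.

    scores[i][j] is the model's score (e.g., a log-probability) for choice j of
    question i; correct_idx[i] is the index of the one true answer.
    """
    if len(scores) != len(correct_idx):
        raise ValueError("scores and correct_idx must have the same length")
    if not scores:
        raise ValueError("at least one question is required")
    hits = 0
    for choice_scores, gold in zip(scores, correct_idx):
        # Pick the index of the best-scored choice (first one wins on ties).
        predicted = max(range(len(choice_scores)), key=choice_scores.__getitem__)
        hits += int(predicted == gold)
    return hits / len(scores)


if __name__ == "__main__":
    # Toy, made-up scores for three questions; not data from the paper.
    toy_scores = [[-1.0, -2.0], [-3.0, -0.5, -4.0], [-0.1, -0.2]]
    toy_gold = [0, 1, 1]
    print(mc1_accuracy(toy_scores, toy_gold))
```

Because only the ranking of choices matters, a change in response style that shifts relative scores (such as the "verbose tax" noted above) can lower this metric even when free-form generation quality improves.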
This review was generated by AI and checked by a human editor.