QUICK REVIEW

[論文レビュー] LIMA: Less Is More for Alignment

Chunting Zhou, Pengfei Liu|arXiv (Cornell University)|May 18, 2023

Topic Modeling被引用数 126

ひとこと要約

65B LLaMaモデルが、慎重に選定された1,000件のプロンプト/レスポンス（RLHFなし）で強力な整合性を達成し、人間評価でベースラインと同等または上回ることが多い。事前学習が指示チューニングの少量を超えて支配的であることを示唆する。

ABSTRACT

Large language models are trained in two stages: (1) unsupervised pretraining from raw text, to learn general-purpose representations, and (2) large scale instruction tuning and reinforcement learning, to better align to end tasks and user preferences. We measure the relative importance of these two stages by training LIMA, a 65B parameter LLaMa language model fine-tuned with the standard supervised loss on only 1,000 carefully curated prompts and responses, without any reinforcement learning or human preference modeling. LIMA demonstrates remarkably strong performance, learning to follow specific response formats from only a handful of examples in the training data, including complex queries that range from planning trip itineraries to speculating about alternate history. Moreover, the model tends to generalize well to unseen tasks that did not appear in the training data. In a controlled human study, responses from LIMA are either equivalent or strictly preferred to GPT-4 in 43% of cases; this statistic is as high as 58% when compared to Bard and 65% versus DaVinci003, which was trained with human feedback. Taken together, these results strongly suggest that almost all knowledge in large language models is learned during pretraining, and only limited instruction tuning data is necessary to teach models to produce high quality output.

研究の動機と目的

RLHFや人間の好みモデル化なしで、1,000件の高品質デモンストレーションだけで効果的に整合できる強力な事前学習済み言語モデルを実証する。
命令 tuning より事前学習に整合性が依存するかを、LIMAを最新のベースラインと比較して評価する。
データの多様性と品質と量の関係を調査し、マルチターン対話機能を評価する。

提案手法

標準の監視付き損失を使用して、1,000のデモンストレーション（コミュニティソース由来750、人工作成250）で65BパラメータのLLaMaモデル（LLaMa-65B）をファインチューニングする。
ファインチューニング中にユーザー/アシスタントのターンを識別するための特別なエンドオブターントークンを導入する。
人間の好みとGPT-4+アノテータ試験で300のプロンプトを横断して、LIMAをRLHFチューニング済みおよび他のベースラインと比較する。
7Bモデルを使ってデータの多様性、品質、数量のアブレーションを行い、各要因の影響を孤立させる。
ゼロショットと拡張対話チェーン拡張を用いてマルチターン対話能力を評価する。
小さな安全関連プロンプト集合を用いて安全性の挙動を評価し、失敗モードを分析する。

Figure 1 : Human preference evaluation, comparing LIMA to 5 different baselines across 300 test prompts.

実験結果

リサーチクエスチョン

RQ1RLHFや好みモデル化なしで、1,000件のデモンストレーションだけで事前学習済みLLMを効果的に整合させることができるか。
RQ2データの多様性と品質と、単なる量の影響は整合性の性能にどう影響するか。
RQ3わずかな量のキュレーション済み対話データはマルチターン対話能力をどの程度改善するか。
RQ4LIMAは人間評価とGPT-4ベースの評価で最先端の整合モデルとどう比較されるか。

主な発見

LIMAは人間およびGPT-4アノテータによる評価でDaVinci003およびAlpacaに対して競争力のある性能を達成する。
LIMAの出力の半分が絶対品質評価で優れていると評価される。
データ量を増やしても、プロンプトの多様性とデータ品質の向上なしには収益が低下する。
30の手作り対話チェーンを追加すると、マルチターン対話の品質が大幅に改善され（優れた割合が45.2%から76.1%へ）。
LIMAの整合のあるマルチターン対話はゼロ対話データでも現れ、対象対話の拡張で品質がスケールする。
安全プロンプトにおいて、訓練時の小さな安全重視サブセットで80%のケースに対して安全に応答する。

Figure 2 : Preference evaluation using GPT-4 as the annotator, given the same instructions provided to humans.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。