QUICK REVIEW

[Paper Review] OpenChat: Advancing Open-source Language Models with Mixed-Quality Data

Guan Wang, Sijie Cheng|arXiv (Cornell University)|Sep 20, 2023

Topic Modeling | Cited by 26

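One-Sentence Summary

OpenChat introduces conditioned RLFT (C-RLFT), which fine-tunes open-source LLMs on mixed-quality data without human preference labels, and achieves the highest average performance among 13B open-source models across three standard benchmarks.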

ABSTRACT

Nowadays, open-source large language models like LLaMA have emerged. Recent developments have incorporated supervised fine-tuning (SFT) and reinforcement learning fine-tuning (RLFT) to align these models with human goals. However, SFT methods treat all training data with mixed quality equally, while RLFT methods require high-quality pairwise or ranking-based preference data. In this study, we present a novel framework, named OpenChat, to advance open-source language models with mixed-quality data. Specifically, we consider the general SFT training data, consisting of a small amount of expert data mixed with a large proportion of sub-optimal data, without any preference labels. We propose the C(onditioned)-RLFT, which regards different data sources as coarse-grained reward labels and learns a class-conditioned policy to leverage complementary data quality information. Interestingly, the optimal policy in C-RLFT can be easily solved through single-stage, RL-free supervised learning, which is lightweight and avoids costly human preference labeling. Through extensive experiments on three standard benchmarks, our openchat-13b fine-tuned with C-RLFT achieves the highest average performance among all 13b open-source language models. Moreover, we use AGIEval to validate the model generalization performance, in which only openchat-13b surpasses the base model. Finally, we conduct a series of analyses to shed light on the effectiveness and robustness of OpenChat. Our code, data, and models are publicly available at https://github.com/imoneoi/openchat and https://huggingface.co/openchat.

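Motivation and Objectives

Fine-tune open-source LLMs on mixed-quality SFT data (a small amount of expert data mixed with a large proportion of sub-optimal data) without any preference labels.
Develop a lightweight, RL-free training objective that uses coarse-grained rewards derived from the data sources.
Introduce a class-conditioned policy and a reference policy to distinguish data quality during fine-tuning.
Demonstrate that OpenChat with C-RLFT achieves strong instruction-following performance on standard benchmarks.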

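Proposed Method

Define a class-conditioned dataset by labeling each example with its data source (e.g., GPT-4 vs. GPT-3.5).
Propose C-RLFT: fine-tune a class-conditioned policy πθ(y|x,c) with KL regularization toward a class-conditioned reference policy πc.
Derive the optimal policy as a class-conditioned reward-weighted regression and implement it with supervised learning (no RL loop is required).
Use a coarse-grained reward rc(x,y) that assigns 1 to expert data and α (0<α<1) to sub-optimal data, and apply it as an exponential weight in the objective.
Train openchat-13b (based on llama-2-13b) on ShareGPT data for 5 epochs with AdamW, using the simple reward-weighted regression objective.
At inference, use the class-conditioned prompt that corresponds to the higher-quality data source to generate high-quality responses.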
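
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the class tags in `condition_prompt`, the value `ALPHA = 0.1`, the temperature `BETA`, and all function names are hypothetical choices made for the example. It shows the two ingredients the review describes: conditioning the input on its data-source class, and weighting each example's negative log-likelihood by an exponential of its coarse reward.

```python
import math

# Assumed values for illustration only; the review states just that 0 < alpha < 1.
ALPHA = 0.1  # coarse reward for sub-optimal data
BETA = 1.0   # hypothetical temperature for the exponential weighting

# Coarse-grained reward per data-source class: 1 for expert, alpha for sub-optimal.
COARSE_REWARD = {"expert": 1.0, "suboptimal": ALPHA}


def condition_prompt(prompt: str, source: str) -> str:
    """Prepend a class tag so the policy can condition on the data source.

    The "[source]" tag format is a placeholder, not the real OpenChat template.
    """
    return f"[{source}] {prompt}"


def sample_weight(source: str, beta: float = BETA) -> float:
    """Exponential weight exp(r_c / beta) for one training example."""
    return math.exp(COARSE_REWARD[source] / beta)


def weighted_nll(batch, beta: float = BETA) -> float:
    """Reward-weighted negative log-likelihood averaged over a batch.

    `batch` is a list of (source, token_log_probs) pairs, where token_log_probs
    holds log pi_theta(y_t | x, c, y_<t) for each response token.
    """
    total = 0.0
    for source, token_log_probs in batch:
        total += -sample_weight(source, beta) * sum(token_log_probs)
    return total / len(batch)


# Two examples with identical log-probabilities but different sources:
# the expert example contributes more to the loss than the sub-optimal one.
batch = [("expert", [-0.2, -0.1]), ("suboptimal", [-0.2, -0.1])]
loss = weighted_nll(batch)
```

Because the weight depends only on the data source, it can be precomputed per example, which is why no RL loop or reward model is needed in this kind of objective. At inference time, the same tagging function would be called with the expert class so that the model is asked for its higher-quality behavior.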

Experimental Results

Research Questions

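RQ1: Can mixed-quality SFT data (expert plus sub-optimal) be used effectively to fine-tune open-source LLMs without costly preference data?
RQ2: Do a class-conditioned policy and coarse-grained reward signals improve instruction following over standard SFT and RLHF approaches?
RQ3: Is RL-free reward-weighted supervised learning (C-RLFT) sufficient to outperform existing open-source models on standard benchmarks?
RQ4: How does data-source quality (GPT-4 vs. GPT-3.5) affect the generalization and robustness of the fine-tuned model?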

Key Findings

Model	Base Model	Method	AlpacaEval	MT-bench	Vicuna-bench	Average
gpt-4	-	SFT + RLFT	95.3	82.5	90.0	89.3
llama-2-70b	llama-2-70b	SFT + RLFT	92.7	60.0	87.5	80.1
claude	-	SFT + RLFT	88.4	65.0	76.3	76.6
gpt-3.5-turbo	-	SFT + RLFT	86.1	50.0	50.0	62.0
guanaco-65b	llama-65b	SFT	71.8	40.6	49.4	53.9
guanaco-33b	llama-33b	SFT	66.0	40.6	54.4	53.7
vicuna-v1.1-13b	llama-13b	SFT	70.4	29.4	45.0	48.3
wizardlm-v1.0-13b	llama-13b	SFT	75.3	33.1	44.4	50.9
vicuna-v1.5-13b	llama-2-13b	SFT	78.8	37.2	47.1	54.4
ultralm-13b	llama-13b	SFT	80.6	37.2	50.0	55.9
wizardlm-v1.2-13b	llama-2-13b	SFT	89.2	53.1	80.6	74.3
llama-2-chat-13b	llama-2-13b	SFT + RLFT	81.1	55.3	86.9	74.4
openchat-13b	llama-2-13b	C-RLFT	89.5	57.5	85.0	77.3

OpenChat with C-RLFT achieves the highest average win-rate among 13B open-source models across AlpacaEval, MT-bench, and Vicuna-bench.
OpenChat-13b surpasses several larger models (guanaco-65b and guanaco-33b) and outperforms gpt-3.5-turbo on all three benchmarks, though it trails llama-2-70b on average (77.3 vs. 80.1).
On AGIEval, openchat-13b achieves the top average accuracy among 13B open-source models and is the only one to surpass its base model, indicating good generalization.
Ablation studies show that removing either the coarse-grained rewards or the class-conditioned policy degrades performance, and that SFT-only training yields lower scores.
Visual analyses indicate the model learns to distinguish data-source quality in representations, reflecting the effectiveness of C-RLFT.
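
The ranking claim can be checked directly from the table. The sketch below assumes the Average column is the unweighted mean of the three benchmark scores rounded to one decimal place, which matches the reported values; the scores are copied from the 13B rows of the table, and the variable names are illustrative.

```python
# (AlpacaEval, MT-bench, Vicuna-bench) for the 13B models listed in the table.
scores_13b = {
    "vicuna-v1.1-13b": (70.4, 29.4, 45.0),
    "wizardlm-v1.0-13b": (75.3, 33.1, 44.4),
    "vicuna-v1.5-13b": (78.8, 37.2, 47.1),
    "ultralm-13b": (80.6, 37.2, 50.0),
    "wizardlm-v1.2-13b": (89.2, 53.1, 80.6),
    "llama-2-chat-13b": (81.1, 55.3, 86.9),
    "openchat-13b": (89.5, 57.5, 85.0),
}


def average(scores):
    """Unweighted mean of the benchmark scores, rounded to one decimal."""
    return round(sum(scores) / len(scores), 1)


averages = {model: average(s) for model, s in scores_13b.items()}

# Model with the highest average among the 13B entries.
best = max(averages, key=averages.get)
print(best, averages[best])
```

Note that openchat-13b leads on the average and on AlpacaEval and MT-bench, while llama-2-chat-13b has the higher Vicuna-bench score (86.9 vs. 85.0).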

This review was generated by AI and checked by a human editor.