QUICK REVIEW

[Paper Review] Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models

Scott Barnett, Zac Brannelly|arXiv (Cornell University)|Jun 17, 2024

Topic Modeling|Citations: 6

One-Line Summary

Fine-tuning LLMs inside a Retrieval-Augmented Generation (RAG) pipeline often degrades performance relative to the baseline models, across multiple datasets and domains.

ABSTRACT

Large Language Models (LLMs) have the unique capability to understand and generate human-like text from input queries. When fine-tuned, these models show enhanced performance on domain-specific queries. OpenAI highlights the process of fine-tuning, stating: "To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples, but the right number varies greatly based on the exact use case." This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines, which aim to improve accuracy and relevance by leveraging external corpus data for information retrieval. However, RAG's promise of delivering optimal responses often falls short in complex query scenarios. This study aims to specifically examine the effects of fine-tuning LLMs on their ability to extract and integrate contextual data to enhance the performance of RAG systems across multiple domains. We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding by comparing the accuracy and completeness of fine-tuned models against baseline performances across datasets from multiple domains. Our findings indicate that fine-tuning resulted in a decline in performance compared to the baseline models, contrary to the improvements observed in standalone LLM applications as suggested by OpenAI. This study highlights the need for vigorous investigation and validation of fine-tuned models for domain-specific tasks.

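Motivation and Objectives

Evaluate whether fine-tuning LLMs improves question answering in RAG pipelines that span multiple domains.
Investigate how the size of the fine-tuning training set affects performance.
Compare fine-tuned models against non-fine-tuned baseline models on public datasets.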

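Method

Evaluate three standard models (Mixtral, Llama2, GPT-4) in a RAG pipeline on the BioASQ, Natural Questions, and Qasper datasets.
Fine-tune Mixtral and Llama2 on 200, 500, and 1000 QA pairs per dataset and compare them with the baselines.
Vary the fine-tuning hyperparameters, including the number of epochs, the effective batch size, the choice of LoRA or QLoRA, and the LoRA hyperparameters, on hardware with up to four H100 or eight A100 GPUs.
Evaluate performance with a G-Eval-based framework that uses a CoT + form-filling approach, measuring accuracy and completeness.
Repeat the model-based judgment 10 times and average the scores to stabilize them.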
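The repeated-judgment step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `judge` callable, its `(question, answer, reference)` signature, and the metric names in the returned dictionary are all assumptions made for the example. In practice the judge would wrap an LLM prompted in a G-Eval style.

```python
from statistics import mean
from typing import Callable, Dict

# Hypothetical judge signature (an assumption for this sketch):
# (question, answer, reference) -> {"accuracy": score, "completeness": score}
Judge = Callable[[str, str, str], Dict[str, float]]


def average_judge_scores(
    judge: Judge,
    question: str,
    answer: str,
    reference: str,
    runs: int = 10,
) -> Dict[str, float]:
    """Call the judge `runs` times and return the per-metric mean score."""
    if runs < 1:
        raise ValueError("runs must be at least 1")
    # Each call may return slightly different scores because LLM judges
    # are not deterministic; averaging smooths out that variation.
    results = [judge(question, answer, reference) for _ in range(runs)]
    metrics = results[0].keys()
    return {metric: mean(result[metric] for result in results) for metric in metrics}
```

Passing the judge in as a plain callable keeps the averaging logic independent of any particular model or API client, so it can be exercised with a deterministic stub.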

Figure 1: Comparisons of accuracy for fine-tuned Llama2 models and baseline models across three datasets.

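Experimental Results

Research Questions

RQ1: Does fine-tuning improve RAG-based QA performance over the baseline models across multiple datasets?
RQ2: How does the size of the fine-tuning dataset affect the performance of RAG-augmented LLMs?
RQ3: Are some models (e.g., Mixtral vs. Llama2) more susceptible than others to degradation from fine-tuning?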

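Key Findings

The baseline models (Mixtral, Llama2, GPT-4) generally outperform the fine-tuned models on every dataset except NQ.
The GPT-4 baseline outperforms the fine-tuned variants in both accuracy and completeness.
Fine-tuned models show large declines in some cases (e.g., with 200-sample fine-tuning, Llama2 accuracy drops from 4.38 to 3.14 and completeness from 4.55 to 2.35).
On the Qasper dataset, the fine-tuned Llama2 and Mixtral models show marked drops in accuracy; more fine-tuning data sometimes makes performance worse (e.g., Mixtral accuracy falls from 4.04 to 3.28 when 1000 samples are used).
In several cases, a larger fine-tuning dataset does not improve RAG pipeline performance.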
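To put the reported score pairs on a common footing, the short sketch below computes the absolute and relative change for each baseline and fine-tuned pair quoted above. The helper name `score_change` and the labels are illustrative assumptions; the numbers are taken from the findings as stated, and nothing beyond them is implied.

```python
def score_change(baseline: float, fine_tuned: float) -> dict:
    """Return the absolute and percentage change from baseline to fine-tuned."""
    delta = fine_tuned - baseline
    return {
        "absolute": round(delta, 2),
        "percent": round(100 * delta / baseline, 1),
    }


# (label, baseline score, fine-tuned score) as quoted in the findings.
reported = [
    ("Llama2 accuracy, 200 samples", 4.38, 3.14),
    ("Llama2 completeness, 200 samples", 4.55, 2.35),
    ("Mixtral accuracy on Qasper, 1000 samples", 4.04, 3.28),
]

for label, baseline, fine_tuned in reported:
    change = score_change(baseline, fine_tuned)
    print(f"{label}: {change['absolute']:+.2f} ({change['percent']:+.1f}%)")
```

Relative change is used here only as a convenience for comparing drops of different sizes; it says nothing about statistical significance.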

Figure 2: Comparisons of accuracy for fine-tuned Mixtral models and baseline models across three datasets.


This review was generated by AI and checked by a human editor.