QUICK REVIEW

[Paper Review] Fine-Tuning or Fine-Failing? Debunking Performance Myths in Large Language Models

Scott Barnett, Zac Brannelly|arXiv (Cornell University)|2024. 06. 17.

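Topic Modeling | Citations: 6

One-Line Summary

In retrieval-augmented generation (RAG) pipelines, fine-tuning an LLM tends to degrade performance relative to the base model across multiple datasets and domains.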

ABSTRACT

Large Language Models (LLMs) have the unique capability to understand and generate human-like text from input queries. When fine-tuned, these models show enhanced performance on domain-specific queries. OpenAI highlights the process of fine-tuning, stating: "To fine-tune a model, you are required to provide at least 10 examples. We typically see clear improvements from fine-tuning on 50 to 100 training examples, but the right number varies greatly based on the exact use case." This study extends this concept to the integration of LLMs within Retrieval-Augmented Generation (RAG) pipelines, which aim to improve accuracy and relevance by leveraging external corpus data for information retrieval. However, RAG's promise of delivering optimal responses often falls short in complex query scenarios. This study aims to specifically examine the effects of fine-tuning LLMs on their ability to extract and integrate contextual data to enhance the performance of RAG systems across multiple domains. We evaluate the impact of fine-tuning on the LLMs' capacity for data extraction and contextual understanding by comparing the accuracy and completeness of fine-tuned models against baseline performances across datasets from multiple domains. Our findings indicate that fine-tuning resulted in a decline in performance compared to the baseline models, contrary to the improvements observed in standalone LLM applications as suggested by OpenAI. This study highlights the need for vigorous investigation and validation of fine-tuned models for domain-specific tasks.

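Research Motivation and Objectives

Evaluate whether fine-tuning LLMs improves question-answering performance in RAG pipelines across multiple domains.
Investigate how the size of the fine-tuning training set affects performance.
Compare models fine-tuned on public datasets against their non-fine-tuned baselines.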

Proposed Method

Three models (Mixtral, Llama2, GPT-4) are evaluated in a RAG pipeline on the BioASQ, Natural Questions, and Qasper datasets.
Mixtral and Llama2 are fine-tuned on 200, 500, and 1000 QA pairs per dataset and compared with their base models.
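Fine-tuning varies several hyperparameters, including the number of epochs, the effective batch size, the choice of LoRA or QLoRA, and the LoRA hyperparameters, on hardware with up to 4 H100 or 8 A100 GPUs.
Performance is evaluated with a G-Eval-based framework that combines chain-of-thought (CoT) prompting with a form-filling approach, measuring accuracy and completeness.
Each evaluation is repeated 10 times and the scores are averaged, which stabilizes the scores and makes the judgments more reliable.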
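
The setup described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' code: the names `experiment_grid` and `averaged_score` and the `judge` callable are hypothetical, and the actual G-Eval-style prompts, score scale, and model calls are not given in this review, so the judge is left as a caller-supplied stub.

```python
from itertools import product
from statistics import mean
from typing import Callable, Iterator

# Values taken from the review text; the identifiers themselves are illustrative.
FINE_TUNED_MODELS = ("Mixtral", "Llama2")
DATASETS = ("BioASQ", "Natural Questions", "Qasper")
TRAIN_SIZES = (200, 500, 1000)
N_EVAL_RUNS = 10


def experiment_grid() -> Iterator[tuple[str, str, int]]:
    """Yield every (model, dataset, train_size) fine-tuning configuration."""
    return product(FINE_TUNED_MODELS, DATASETS, TRAIN_SIZES)


def averaged_score(judge: Callable[[str, str], float],
                   question: str,
                   answer: str,
                   n_runs: int = N_EVAL_RUNS) -> float:
    """Call a (possibly non-deterministic) judge n_runs times and average the scores.

    `judge` is a hypothetical stand-in for an LLM-based scorer that returns
    one numeric score (e.g. accuracy or completeness) per call.
    """
    return mean(judge(question, answer) for _ in range(n_runs))
```

Under these assumptions the grid contains 2 x 3 x 3 = 18 fine-tuned configurations, each compared against the corresponding base model, and every reported score is the mean of 10 judge calls rather than a single one.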

Figure 1: Comparisons of accuracy for fine-tuned Llama2 models and baseline models across three datasets.

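Experimental Results

Research Questions

RQ1: Does fine-tuning improve RAG-based QA performance over the base model across multiple datasets?
RQ2: How does the size of the fine-tuning dataset affect the performance of RAG-augmented LLMs?
RQ3: Are some models (e.g., Mixtral vs. Llama2) more susceptible than others to performance degradation from fine-tuning?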

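Key Findings

The base models (Mixtral, Llama2, GPT-4) generally outperform the fine-tuned versions on every dataset except Natural Questions (NQ).
The GPT-4 base model also outperforms the fine-tuned variants in both accuracy and completeness.
Fine-tuned models show substantial drops in some cases; for example, with 200-sample fine-tuning, Llama2's accuracy falls from 4.38 to 3.14 and its completeness from 4.55 to 2.35.
On the Qasper dataset, the accuracy of the fine-tuned Llama2 and Mixtral models drops markedly, and increasing the amount of fine-tuning data sometimes makes performance worse (e.g., with 1000 samples, Mixtral's accuracy falls from 4.04 to 3.28).
In several cases, a larger fine-tuning dataset does not lead to better performance in the RAG pipeline.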
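
To put the reported drops in perspective, the short sketch below computes the absolute and relative decline for the score pairs quoted in the findings. The helper name `score_drop` and the dictionary labels are illustrative assumptions; the scoring scale is not stated in this review, so only differences and ratios are computed.

```python
def score_drop(base: float, fine_tuned: float) -> tuple[float, float]:
    """Return (absolute drop, relative drop in percent) from base to fine-tuned."""
    absolute = base - fine_tuned
    relative_pct = 100.0 * absolute / base
    return absolute, relative_pct


# (base score, fine-tuned score) pairs as quoted in the review text.
REPORTED = {
    "Llama2 accuracy (200 samples)": (4.38, 3.14),
    "Llama2 completeness (200 samples)": (4.55, 2.35),
    "Mixtral accuracy (Qasper, 1000 samples)": (4.04, 3.28),
}

for label, (base, tuned) in REPORTED.items():
    absolute, relative_pct = score_drop(base, tuned)
    print(f"{label}: -{absolute:.2f} ({relative_pct:.1f}% lower)")
```

With these figures the relative declines come to roughly 28%, 48%, and 19% respectively, so the completeness loss for Llama2 is the most severe of the three quoted examples.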

Figure 2: Comparisons of accuracy for fine-tuned Mixtral models and baseline models across three datasets.

This review was generated by AI and reviewed by a human editor.