QUICK REVIEW

[Paper Review] Improving Small Language Models on PubMedQA via Generative Data Augmentation

Zhen Guo, Peiqi Wang | arXiv (Cornell University) | May 12, 2023

Topic Modeling | Citations: 10

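One-line summary

This paper strengthens small medical language models on PubMedQA through LLM-based generative data augmentation; after augmentation, a model with under 1.6 billion parameters outperforms few-shot GPT-4.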

ABSTRACT

Large Language Models (LLMs) have made remarkable advancements in the field of natural language processing. However, their increasing size poses challenges in terms of computational cost. On the other hand, Small Language Models (SLMs) are known for their efficiency, but they often struggle with limited capacity and training data, especially in specific domains. In this paper, we introduce a novel method aimed at improving SLMs in the medical domain using LLM-based generative data augmentation. The objective of our approach is to develop more efficient and capable models that are specifically tailored for specialized applications. Through experiments conducted on the PubMedQA dataset, we demonstrate the effectiveness of LLMs in refining and diversifying existing question-answer pairs. This refinement process leads to improved performance in a significantly smaller model after fine-tuning. Notably, our best SLM, with under 1.6 billion parameters, outperforms the few-shot GPT-4 on the PubMedQA dataset. Our code and generated data are publicly available to facilitate further explorations.

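Research motivation and objectives

Improve small language models (SLMs) for medical question answering on PubMedQA without resorting to very large parameter counts.
Refine and diversify QA pairs using LLM-based data augmentation.
Demonstrate the robustness of the fine-tuning method and the importance of domain knowledge.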

Proposed method

Apply Low-Rank Adaptation (LoRA) as a robust fine-tuning technique for SLMs.
Compare Low-Rank Adaptation and Prefix Tuning on BioGPT-Large across hyperparameter settings.
Use LLMs (GPT-3.5, GPT-4, BioGPT, etc.) to rewrite existing QA pairs or generate new ones for augmentation.
Fine-tune BioGPT-Large, LLaMA-7b, and Alpaca-7b on the augmented PubMedQA data.
Evaluate with accuracy and macro-F1 on a 450/50/500 train/val/test split.
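
The evaluation setup in the last item can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' evaluation script: the function names, the random shuffle, and the seed are assumptions made for the example (the benchmark's actual test set is fixed rather than resampled), and the three labels are the yes/no/maybe answer classes of PubMedQA.

```python
import random

LABELS = ("yes", "no", "maybe")  # PubMedQA answer classes


def split_dataset(items, n_train=450, n_val=50, n_test=500, seed=0):
    """Cut items into train/val/test with the sizes quoted in the review.

    Assumption: a seeded random shuffle, used here only for illustration.
    """
    total = n_train + n_val + n_test
    if len(items) < total:
        raise ValueError("not enough items for the requested split")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:total]
    return train, val, test


def accuracy(gold, pred):
    """Fraction of predictions that match the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels=LABELS):
    """Unweighted mean of per-class F1, so rare classes count equally."""
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Macro-F1 is reported alongside accuracy because it averages over classes without weighting by frequency, so a model that ignores a less common class such as "maybe" is penalized even if its accuracy looks acceptable.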

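Experimental results

Research questions

RQ1: Does generative data augmentation with a domain-aware LLM improve the performance of small medical LMs on PubMedQA?
RQ2: Which fine-tuning method (LoRA vs. Prefix Tuning) is more robust for domain-specific QA?
RQ3: How does domain-specific knowledge in data augmentation affect downstream QA performance on PubMedQA?
RQ4: Can a small model fine-tuned on GPT-4-generated augmented data outperform few-shot GPT-4?
RQ5: What is the effect of using fully synthetic domain knowledge compared with refining existing QA pairs?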
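
As background for RQ2, the standard LoRA formulation can be written as below. This comes from the general LoRA literature rather than from this review, so the symbols (W_0, A, B, r, alpha, d, k) are conventional notation and should be read as assumptions with respect to the paper's own setup.

```latex
h = W_0 x + \Delta W x = W_0 x + \frac{\alpha}{r} B A x,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k},\quad r \ll \min(d, k)
```

Here W_0 is the frozen pretrained weight and only A and B are trained, so each adapted matrix adds r(d + k) trainable parameters instead of dk. Prefix Tuning also keeps the pretrained weights frozen, but learns continuous prefix vectors that are prepended to the attention keys and values instead of modifying the weight matrices.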

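Key findings

LoRA outperforms Prefix Tuning and is more robust across the tested hyperparameter range (16 to 512 tokens).
When fine-tuned on the original PubMedQA data without augmentation, BioGPT-Large, LLaMA-7b, and Alpaca-7b differ in how well they adapt to the domain.
Augmented data generated by a domain-aware LLM (e.g., QA pairs rewritten or newly created by GPT-4) substantially improves SLM performance on PubMedQA, and GPT-4 augmentation outperforms GPT-3.5-based approaches.
BioGPT fine-tuned on augmented PubMedQA data outperforms LLaMA-7B on PubMedQA, supporting the benefit of domain-specific pretraining.
Asking a domain-agnostic LLM (GPT-3.5-turbo) to generate entirely new QA pairs can degrade performance, whereas a domain-aware LLM (GPT-4) provides useful new training data.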
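
The two augmentation modes contrasted in these findings, rewriting an existing question versus generating a new QA pair, can be sketched as follows. This is a hypothetical illustration and not the authors' pipeline: the prompt wording, the `QAPair` type, the `Question: ... | Answer: ...` reply format, and the `generate` callable (a stand-in for whatever LLM client is used) are all assumptions introduced for the example.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List, Optional

VALID_ANSWERS = {"yes", "no", "maybe"}  # PubMedQA answer classes


@dataclass(frozen=True)
class QAPair:
    context: str
    question: str
    answer: str


def rewrite_prompt(pair: QAPair) -> str:
    """Hypothetical prompt: rephrase the question, keep meaning and answer."""
    return (
        "Rewrite the question below with different wording while keeping "
        "the same meaning and the same answer.\n"
        f"Context: {pair.context}\n"
        f"Question: {pair.question}\n"
        "Rewritten question:"
    )


def new_pair_prompt(pair: QAPair) -> str:
    """Hypothetical prompt: create a fresh QA pair from the same context."""
    return (
        "Using only the context below, write one new question and its answer.\n"
        f"Context: {pair.context}\n"
        "Reply as 'Question: <text> | Answer: <yes|no|maybe>'."
    )


def parse_new_pair(reply: str, context: str) -> Optional[QAPair]:
    """Parse the assumed reply format; return None if it is malformed."""
    if "|" not in reply:
        return None
    q_part, a_part = reply.split("|", 1)
    question = q_part.replace("Question:", "", 1).strip()
    answer = a_part.replace("Answer:", "", 1).strip().lower()
    if not question or answer not in VALID_ANSWERS:
        return None
    return QAPair(context, question, answer)


def augment(pairs: Iterable[QAPair],
            generate: Callable[[str], str],
            mode: str = "rewrite") -> List[QAPair]:
    """Build augmented pairs; `generate` maps a prompt to the LLM's reply."""
    out: List[QAPair] = []
    for pair in pairs:
        if mode == "rewrite":
            question = generate(rewrite_prompt(pair)).strip()
            if question:
                # The gold answer is carried over unchanged.
                out.append(replace(pair, question=question))
        elif mode == "new":
            parsed = parse_new_pair(generate(new_pair_prompt(pair)), pair.context)
            if parsed is not None:
                out.append(parsed)
        else:
            raise ValueError(f"unknown mode: {mode}")
    return out
```

In rewrite mode the original label is preserved, so the generator can only affect wording. In new mode the generator also supplies the label, which is one plausible reason a model lacking domain knowledge could inject noisy training examples, in line with the GPT-3.5-turbo finding above.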


This review was generated by AI and checked by a human editor.