QUICK REVIEW

[Paper Review] Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study

Lena Schmidt, Kaitlyn Hair | arXiv (Cornell University) | May 23, 2024

Artificial Intelligence in Healthcare | Citations: 9

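One-Sentence Summary

This paper reports a rapid feasibility study of automating data extraction for systematic reviews with GPT-4, assesses accuracy and its variability across domains, and highlights the open challenges and evaluation approaches.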

ABSTRACT

This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs there is still a lack of understanding of how to design LLM-based automation tools and how to robustly evaluate their performance. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. Firstly, to automatically extract study characteristics from human clinical, animal, and social science domain studies. We used two studies from each category for prompt-development; and ten for evaluation. Secondly, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for studies of human social sciences). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), outcomes were more challenging. Evaluation was done manually; scoring methods such as BLEU and ROUGE showed limited value. We observed variability in the LLMs predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into tools. Further research on stability and reliability in practical settings is warranted for each type of data that is processed by the LLM.

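Motivation and Objectives

Motivate and explore how LLMs can support data extraction in systematic reviews.
Develop prompt templates and an evaluation protocol for LLM-driven extraction.
Assess performance across domains (human clinical, animal, and social science).
Identify where LLMs perform well and where they are error-prone, to guide future tool design.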

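Proposed Method

Two feasibility studies were conducted during the 2023 Evidence Synthesis Hackathon.
Study 1: automatic extraction of study characteristics from human clinical, animal, and social science studies; two studies per domain were used for prompt development and ten for evaluation.
Study 2: the LLM predicted PICO elements (Participants, Interventions, Controls, Outcomes) labelled in 100 abstracts from the EBM-NLP dataset.
Evaluation was done manually rather than relying on BLEU/ROUGE scores.
Variability in the predictions and changes in response quality were observed.
A template is presented for evaluating LLMs in the context of data extraction.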
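
To illustrate why overlap-based scores may be a poor fit for this task, here is a minimal sketch of a ROUGE-1-style unigram F1. It is a simplified stand-in, not the official ROUGE implementation, and the example strings are invented for illustration; the explanation it suggests (that correct paraphrases share few tokens with the reference) is an assumption, not a finding reported in this review.

```python
from collections import Counter


def unigram_f1(reference: str, prediction: str) -> float:
    """Simplified ROUGE-1-style F1 over lowercased whitespace tokens."""
    ref = Counter(reference.lower().split())
    pred = Counter(prediction.lower().split())
    # Multiset intersection counts each shared token at most min(ref, pred) times.
    overlap = sum((ref & pred).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical extraction values for a "study design" item.
reference = "randomised controlled trial"
paraphrase = "RCT"                               # correct, but no shared tokens
wrong = "non-randomised controlled trial"        # incorrect, but high overlap

print(unigram_f1(reference, paraphrase))
print(unigram_f1(reference, wrong))
```

In this toy case the incorrect answer scores higher than the correct abbreviation, which is the kind of mismatch a human reviewer resolves immediately.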

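Experimental Results

Research Questions

RQ1: Can GPT-4 accurately extract study characteristics from clinical, animal, and social science studies?
RQ2: How well can it identify PICO elements from abstracts, and which elements are most error-prone?
RQ3: Which evaluation methods beyond BLEU/ROUGE are suitable for LLM-based data extraction?
RQ4: What stability and reliability considerations arise when integrating LLMs into data extraction workflows?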

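Key Findings

Overall accuracy on the data extraction task was around 80%, with differences between domains (82% human clinical, 80% animal, 72% social science).
Causal inference methods and study design were the data extraction items with the most errors.
In the PICO study, Participants and Interventions/Controls showed high accuracy (>80%), while Outcomes were more challenging.
Evaluation was done manually; conventional metrics such as BLEU and ROUGE showed limited value.
Variability in predictions and changes in response quality between runs were observed.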
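
As a rough sketch of how manual correct/incorrect judgments could be rolled up into per-domain and per-item accuracy like the figures above, consider the following. The records, the item names `sample size` and `species`, and the helper `accuracy_by` are all hypothetical; the numbers are invented and are not the paper's results.

```python
from collections import defaultdict

# Hypothetical manual judgments: (domain, extraction item, judged correct?)
judgments = [
    ("clinical", "study design", True),
    ("clinical", "sample size", True),
    ("animal", "species", True),
    ("animal", "study design", False),
    ("social", "causal inference method", False),
    ("social", "sample size", True),
]


def accuracy_by(rows, key_index):
    """Share of rows judged correct, grouped by the field at key_index."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for row in rows:
        key = row[key_index]
        totals[key] += 1
        correct[key] += int(row[2])
    return {key: correct[key] / totals[key] for key in totals}


by_domain = accuracy_by(judgments, 0)  # group by domain
by_item = accuracy_by(judgments, 1)    # group by extraction item

print(by_domain)
print(by_item)
```

Grouping by item as well as by domain makes it easier to see whether errors cluster in particular fields, as the review reports for study design and causal inference methods.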


This review was generated by AI and checked by a human editor.