QUICK REVIEW

[Paper Review] Exploring the use of a Large Language Model for data extraction in systematic reviews: a rapid feasibility study

Lena Schmidt, Kaitlyn Hair|arXiv (Cornell University)|May 23, 2024

Artificial Intelligence in Healthcare | Cited by 9

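One-Sentence Summary

This paper reports a rapid feasibility study that uses GPT-4 to automate data extraction for systematic reviews, evaluating accuracy and variability across domains and highlighting the challenges and evaluation methods involved.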

ABSTRACT

This paper describes a rapid feasibility study of using GPT-4, a large language model (LLM), to (semi)automate data extraction in systematic reviews. Despite the recent surge of interest in LLMs there is still a lack of understanding of how to design LLM-based automation tools and how to robustly evaluate their performance. During the 2023 Evidence Synthesis Hackathon we conducted two feasibility studies. Firstly, to automatically extract study characteristics from human clinical, animal, and social science domain studies. We used two studies from each category for prompt-development; and ten for evaluation. Secondly, we used the LLM to predict Participants, Interventions, Controls and Outcomes (PICOs) labelled within 100 abstracts in the EBM-NLP dataset. Overall, results indicated an accuracy of around 80%, with some variability between domains (82% for human clinical, 80% for animal, and 72% for studies of human social sciences). Causal inference methods and study design were the data extraction items with the most errors. In the PICO study, participants and intervention/control showed high accuracy (>80%), outcomes were more challenging. Evaluation was done manually; scoring methods such as BLEU and ROUGE showed limited value. We observed variability in the LLM's predictions and changes in response quality. This paper presents a template for future evaluations of LLMs in the context of data extraction for systematic review automation. Our results show that there might be value in using LLMs, for example as second or third reviewers. However, caution is advised when integrating models such as GPT-4 into tools. Further research on stability and reliability in practical settings is warranted for each type of data that is processed by the LLM.

Research Motivation and Objectives

Motivate and explore how LLMs can assist data extraction in systematic reviews.
Develop prompt templates and evaluation protocols for LLM-driven extraction.
Assess cross-domain performance (human clinical, animal, and human social science).
Identify where LLMs perform well and where errors are more frequent to guide future tool design.

Proposed Method

Two feasibility studies conducted during the 2023 Evidence Synthesis Hackathon.
First study: automatic extraction of study characteristics from domain studies; used two studies per domain for prompt development and ten for evaluation.
Second study: LLM predicted PICOs (Participants, Interventions, Controls, Outcomes) in 100 abstracts from the EBM-NLP dataset.
Evaluation was manual rather than relying on BLEU/ROUGE metrics.
Identified variability in predictions and changes in response quality.
Presented a template for evaluating LLMs in data extraction contexts.
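
Because the evaluation was manual, each extracted item ends up with a human judgement of correct or incorrect, and the per-domain accuracy is simply the share of correct items. The following is a minimal sketch of that tally, not the authors' tooling. The function name `accuracy_by_domain`, the record layout, and the example judgements are all hypothetical and do not come from the paper.

```python
from collections import defaultdict


def accuracy_by_domain(judgements):
    """Compute accuracy per domain from manual review results.

    judgements: iterable of (domain, is_correct) pairs, one per extracted item.
    Returns a dict mapping each domain to its fraction of correct items.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for domain, is_correct in judgements:
        totals[domain] += 1
        if is_correct:
            correct[domain] += 1
    return {domain: correct[domain] / totals[domain] for domain in totals}


# Hypothetical manual judgements for illustration only, NOT data from the paper.
example = [
    ("clinical", True), ("clinical", True), ("clinical", False), ("clinical", True),
    ("animal", True), ("animal", False),
]
scores = accuracy_by_domain(example)
print(scores)
```

Keeping the judgements as one record per extracted item, rather than one score per study, makes it straightforward to regroup the same data by extraction item (for example study design) instead of by domain.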

Experimental Results

Research Questions

RQ1: Can GPT-4 accurately extract study characteristics across clinical, animal, and social science studies?
RQ2: How well can the LLM identify PICOs from abstracts, and which components are most error-prone?
RQ3: What evaluation methods are appropriate for LLM-based data extraction beyond BLEU/ROUGE?
RQ4: What are the stability and reliability considerations when integrating LLMs into data extraction workflows?
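
For the stability question, one simple way to quantify run-to-run variability is to send the same prompt several times and measure how often the answers agree. The sketch below is an assumption about how this could be measured, not a procedure taken from the paper. The function name `modal_agreement` and the example outputs are hypothetical.

```python
from collections import Counter


def modal_agreement(run_outputs):
    """Return the fraction of runs that gave the most common answer.

    Answers are compared after lowercasing and collapsing whitespace, so
    purely cosmetic differences do not count as disagreement.
    """
    if not run_outputs:
        raise ValueError("need at least one run")
    normalised = [" ".join(output.lower().split()) for output in run_outputs]
    _, top_count = Counter(normalised).most_common(1)[0]
    return top_count / len(normalised)


# Hypothetical outputs from four repeated runs of one prompt (not from the paper).
runs = [
    "Randomised controlled trial",
    "randomised  controlled trial",
    "Cohort study",
    "Randomised controlled trial",
]
agreement = modal_agreement(runs)
print(agreement)
```

Exact-match agreement is deliberately strict: two differently worded but equivalent answers count as a disagreement, so a low score flags items for human inspection rather than proving the model is wrong.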

Key Findings

Overall accuracy around 80% for data extraction tasks, with domain variation (82% human clinical, 80% animal, 72% social sciences).
Causal inference methods and study design were the data extraction items with the most errors.
In the PICO study, participants and interventions/controls showed high accuracy (>80%), while outcomes were more challenging.
Evaluation was manual; traditional metrics like BLEU and ROUGE showed limited value.
There was variability in predictions and changes in response quality across runs.
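
One plausible reason overlap metrics add little here, offered as an assumption rather than a claim from the paper, is that a correct extraction can be worded very differently from the reference. The sketch below uses a simplified unigram-overlap F1 in the spirit of ROUGE-1. It is not the official ROUGE implementation, and the function name `unigram_f1` and the example strings are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate, reference):
    """Simplified ROUGE-1-style F1: overlap of lowercased, whitespace-split tokens."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference annotation and a paraphrase a human might accept.
reference = "adults with type 2 diabetes"
paraphrase = "grown-up patients who have T2DM"

print(unigram_f1(paraphrase, reference))  # no shared tokens, so the score is 0.0
print(unigram_f1(reference, reference))   # identical strings score 1.0
```

A human reviewer could accept the paraphrase as describing the same population, yet the token-overlap score is zero, which illustrates why manual judgement can diverge sharply from such metrics.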


This review was generated by AI and reviewed by human editors.