QUICK REVIEW

[Paper Digest] Can large language models replace humans in the systematic review process? Evaluating GPT-4's efficacy in screening and extracting data from peer-reviewed and grey literature in multiple languages

Qusai Khraisha, S. Van Put|arXiv (Cornell University)|Oct 26, 2023

Artificial Intelligence in Healthcare and Education | References: 37 | Cited by: 11

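One-Sentence Summary

This pre-registered study tests GPT-4's autonomous performance in title/abstract screening, full-text screening, and data extraction across peer-reviewed, grey, and non-English literature. It finds that, once chance agreement and dataset imbalance are accounted for, GPT-4 often falls short of human reviewers, but that with highly reliable prompts, especially in full-text screening, it can reach near-perfect agreement.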

ABSTRACT

Systematic reviews are vital for guiding practice, research, and policy, yet they are often slow and labour-intensive. Large language models (LLMs) could offer a way to speed up and automate systematic reviews, but their performance in such tasks has not been comprehensively evaluated against humans, and no study has tested GPT-4, the biggest LLM so far. This pre-registered study evaluates GPT-4's capability in title/abstract screening, full-text review, and data extraction across various literature types and languages using a 'human-out-of-the-loop' approach. Although GPT-4 had accuracy on par with human performance in most tasks, results were skewed by chance agreement and dataset imbalance. After adjusting for these, there was a moderate level of performance for data extraction, and - barring studies that used highly reliable prompts - screening performance levelled at none to moderate for different stages and languages. When screening full-text literature using highly reliable prompts, GPT-4's performance was 'almost perfect.' Penalising GPT-4 for missing key studies using highly reliable prompts improved its performance even more. Our findings indicate that, currently, substantial caution should be used if LLMs are being used to conduct systematic reviews, but suggest that, for certain systematic review tasks delivered under reliable prompts, LLMs can rival human performance.

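Motivation and Objectives

Evaluate GPT-4's autonomous performance in title/abstract screening, full-text screening, and data extraction for a systematic review.
Assess GPT-4's performance across peer-reviewed, grey, and non-English literature.
Pre-register and document the prompt engineering and analyses in order to understand the reliability and bias of LLM-assisted screening.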

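Proposed Method

Use GPT-4 through the ChatGPT interface (May–September 2023) to screen 300 titles/abstracts and 150 full texts, and to extract data from 30 documents.
Test four inclusion/exclusion prompts for title/abstract screening; adapt the prompts to manage data volume and context; assess test–retest reliability on 10 studies per criterion.
Measure performance using true positives, true negatives, false positives, and false negatives; report sensitivity, specificity, and accuracy.
Correct for chance agreement and dataset imbalance using Cohen's kappa, the prevalence-adjusted bias-adjusted kappa (PABAK), and weighted kappa to assess the quality of agreement.
Balance the datasets and report performance by literature type and language, including a high-reliability prompt subgroup and non-English/grey literature.
Report and interpret the baseline agreement between human reviewers (Cohen's kappa ~0.77) for context.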
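
The metrics listed above can be sketched in a few lines of Python. This is a minimal illustration using the standard textbook definitions, not the authors' analysis code; the `Confusion` class, the `screening_metrics` function, and the example counts are hypothetical. Weighted kappa is omitted because the digest does not state which weighting scheme the study used.

```python
from dataclasses import dataclass


@dataclass
class Confusion:
    """Counts of GPT-4 decisions against the human reference (hypothetical helper)."""
    tp: int  # included by both GPT-4 and the human reviewers
    tn: int  # excluded by both
    fp: int  # included by GPT-4, excluded by the human reviewers
    fn: int  # excluded by GPT-4, included by the human reviewers


def screening_metrics(c: Confusion) -> dict[str, float]:
    """Return sensitivity, specificity, accuracy, Cohen's kappa and PABAK.

    Assumes both classes occur in the reference decisions and that the
    expected chance agreement is below 1 (otherwise kappa is undefined).
    """
    n = c.tp + c.tn + c.fp + c.fn
    sensitivity = c.tp / (c.tp + c.fn)
    specificity = c.tn / (c.tn + c.fp)
    accuracy = (c.tp + c.tn) / n  # observed agreement p_o

    # Agreement expected by chance, from the marginal include/exclude rates.
    p_include = ((c.tp + c.fp) / n) * ((c.tp + c.fn) / n)
    p_exclude = ((c.tn + c.fn) / n) * ((c.tn + c.fp) / n)
    p_e = p_include + p_exclude

    kappa = (accuracy - p_e) / (1 - p_e)
    pabak = 2 * accuracy - 1  # binary PABAK depends only on observed agreement
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
        "cohen_kappa": kappa,
        "pabak": pabak,
    }


# Made-up counts for a balanced set of 100 records (not data from the paper).
example = screening_metrics(Confusion(tp=40, tn=45, fp=5, fn=10))
for name, value in example.items():
    print(f"{name}: {value:.2f}")
```

With these made-up counts the data are balanced, so Cohen's kappa and PABAK coincide at about 0.70; the two diverge once included and excluded records are unevenly represented.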

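Experimental Results

Research Questions

RQ1: Can GPT-4 autonomously screen titles/abstracts and full texts with accuracy comparable to human reviewers across literature types and languages?
RQ2: How well does GPT-4 extract data from peer-reviewed, grey, and non-English studies?
RQ3: How do prompt reliability and prompt design affect GPT-4's screening and extraction performance?
RQ4: To what extent do chance agreement and dataset balance affect GPT-4's measured performance in systematic reviews?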

Key Findings

Stage	Literature type	Balance	Sensitivity	Specificity	Accuracy	Cohen's kappa *	Weighted kappa	Adjusted kappa **
Title and abstract screening	English peer-reviewed	1	0.42	0.92	0.67	0.34	0.23	0.34
Title and abstract screening	English grey	1	0.48	0.84	0.66	0.32	0.24	0.32
Title and abstract screening	Other languages	0.05	0.50	0.89	0.88	0.21	0.40	0.75
Full text screening	English peer-reviewed	0.92	0.38	0.69	0.54	0.07	0.05	0.08
Full text screening	English grey	0.11	0.60	0.80	0.78	0.24	0.44	0.55
Full text screening	Other languages	0.09	1	0.95	0.96	-0.10	-0.11	0.64
Data extraction	High-reliability prompt group	0.05	0.36	0.94	0.85	0.65	0.97	0.91
Data extraction	English peer-reviewed	0.03	0.75	0.84	0.82	0.54	0.63	0.63
Data extraction	English grey	0.24	0.65	0.85	0.81	0.45	0.53	0.62
Data extraction	Other languages	0.20	0.36	0.94	0.85	0.35	0.29	0.69

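GPT-4 was highly reliable on some tasks (such as identifying empirical data and refugee-related content) but less reliable on other concepts (such as parenting behaviour and protracted refugee situations).
Sensitivity and specificity varied across stages and languages: specificity was typically high (0.8 or above in all but one row of the table), while sensitivity depended on literature type and stage, ranging from 0.36 to 0.75 in most rows and reaching 1 for non-English full-text screening.
In English peer-reviewed full-text screening, specificity (0.69) and accuracy (0.54) were comparatively low, whereas accuracy was higher for non-English full texts (0.96) and for data extraction from English peer-reviewed literature (0.82).
The high-reliability prompt subgroup reached almost perfect agreement after weighting (weighted kappa 0.97, adjusted kappa 0.91), indicating that prompt quality is critical to performance.
Overall, after accounting for imbalance and chance agreement, GPT-4's performance often trailed that of humans, except in full-text screening with highly reliable prompts, where near-perfect performance was observed.
The study urges caution before using LLMs widely in systematic reviews, while noting that human-comparable performance may be achievable for specific tasks when prompts are reliable.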
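
Why raw accuracy can flatter a screener on an imbalanced dataset is easiest to see with a small worked example. The sketch below uses made-up counts (100 records, only 5 of them relevant), not figures from the paper, and the function name `accuracy_and_kappa` is hypothetical.

```python
def accuracy_and_kappa(tp: int, tn: int, fp: int, fn: int) -> tuple[float, float]:
    """Return (accuracy, Cohen's kappa) for a binary include/exclude decision."""
    n = tp + tn + fp + fn
    p_o = (tp + tn) / n  # observed agreement, i.e. accuracy
    # Chance agreement from the marginal totals of both raters.
    p_e = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return p_o, (p_o - p_e) / (1 - p_e)


# Hypothetical imbalanced screen: 5 relevant records out of 100.
# The screener finds 2 of the 5 and wrongly includes 3 of the 95 irrelevant ones.
accuracy, kappa = accuracy_and_kappa(tp=2, tn=92, fp=3, fn=3)
print(f"accuracy={accuracy:.2f}, kappa={kappa:.2f}")  # accuracy=0.94, kappa=0.37
```

The screener agrees with the reference on 94% of records, yet it misses three of the five relevant studies. Because excluding everything would already agree 95% of the time, kappa credits only the agreement above that chance baseline, which is why chance-corrected measures give a less favourable picture than accuracy.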


This digest was generated by AI and reviewed by a human editor.