QUICK REVIEW

[论文解读] Is ChatGPT a Biomedical Expert? -- Exploring the Zero-Shot Performance of Current GPT Models in Biomedical Tasks

Samy Ateia, Udo Kruschwitz|arXiv (Cornell University)|Jun 28, 2023

Artificial Intelligence in Healthcare and Education被引用 12

一句话总结

本论文在 BioASQ 2023 任务上评估 GPT-3.5-Turbo 和 GPT-4，在带片段的生物医学问答中展示出强劲的零-shot 表现，并分析查询扩展、 grounding 与 prompting 对检索和 NER 任务的影响。

ABSTRACT

We assessed the performance of commercial Large Language Models (LLMs) GPT-3.5-Turbo and GPT-4 on tasks from the 2023 BioASQ challenge. In Task 11b Phase B, which is focused on answer generation, both models demonstrated competitive abilities with leading systems. Remarkably, they achieved this with simple zero-shot learning, grounded with relevant snippets. Even without relevant snippets, their performance was decent, though not on par with the best systems. Interestingly, the older and cheaper GPT-3.5-Turbo system was able to compete with GPT-4 in the grounded Q&A setting on factoid and list answers. In Task 11b Phase A, focusing on retrieval, query expansion through zero-shot learning improved performance, but the models fell short compared to other systems. The code needed to rerun these experiments is available through GitHub.

研究动机与目标

评估 GPT-3.5-Turbo 和 GPT-4 在 BioASQ Task 11b Phase A（检索与片段提取）和 Phase B（答案生成）中的零-shot 与带定位的问答性能。
在西班牙语及 SNOMED CT 映射中，使用零-shot 和少量-shot 提示评估 MedProcNER 的性能。
研究查询扩展、带片段的 grounding，以及提示策略对生物医学问答系统有效性的影响。
提供开源代码并讨论局限性，包括提示设计、非确定性及成本考量。

提出的方法

通过 OpenAI API 使用 GPT-3.5-Turbo 和 GPT-4，并采用 BioASQ-system 提示。
对检索（Phase A）应用零-shot 提示，包括查询扩展、改写和 PubMed 结果的再排序。
在 Phase B 使用 gold snippets 对 GPT 输出进行 grounding，并测试多种答案格式（Ideal, Yes/No, List, Factoid）。
将 MedProcNER 任务的提示翻译并改写为西班牙语，并比较少-shot 与零-shot 设置。
使用 BioASQ 指标（MAP, GMAP, accuracy, F1, MRR）衡量性能，并按批次报告分割结果。
提供用于复现的公开 GitHub 仓库代码。

实验结果

研究问题

RQ1GPT-3.5-Turbo 和 GPT-4 是否能在 BioASQ Phase B（答案生成）中与顶尖系统竞争，利用零-shot 提示并以相关片段进行 grounding？
RQ2查询扩展如何影响 Phase A 的检索性能， grounding 与重新排序如何影响结果？
RQ3在 Yes/No、Factoid、List、Ideal 答案格式的 grounding 与非 grounding 设置下，GPT-3.5-Turbo 与 GPT-4 的比较性能如何？
RQ4GPT-4 在西班牙语的 MedProcNER 任务（NER、Entity Linking、Indexing）中以零-shot 和少-shot 提示的表现如何？
RQ5在研究中使用这些模型进行生物医学问答的实际考虑（成本、确定性、可靠性）是什么？

主要发现

GPT-3.5-Turbo 与 GPT-4 在 Task 11b Phase B 展现出有竞争力的零-shot 性能，在与片段 grounding 时经常达到领先系统的水平。
查询扩展在各模型上均提升检索性能，但增益因批次和模型而异。
GPT-4 一般在 Yes/No grounding 上优于 GPT-3.5-Turbo，但在 Factoid 与 List 格式上存在变异性，grounded 的 GPT-4 与 GPT-3.5-Turbo之间没有明确的总体胜者。
在 Phase A，带片段的 grounding 可以提升性能；若不 grounding，结果尚可但通常落后于顶尖系统。
MedProcNER 结果显示 GPT-4 优于 GPT-3.5-Turbo，但在 NER、Entity Linking、Indexing 上仍落后于顶尖系统；少-shot 的 NER 有帮助但得分较低。
研究强调提示工程是一个主要挑战，并指出非确定性和成本是现实应用中的重要考量。

更好的研究，从现在开始

从论文设计到论文写作，大幅缩短您的研究时间。

无需绑定信用卡

本解读由 AI 生成，并经人工编辑审核。