QUICK REVIEW

[論文レビュー] Is ChatGPT a Biomedical Expert? -- Exploring the Zero-Shot Performance of Current GPT Models in Biomedical Tasks

Samy Ateia, Udo Kruschwitz|arXiv (Cornell University)|Jun 28, 2023

Artificial Intelligence in Healthcare and Education被引用数 12

ひとこと要約

本論文はBioASQ 2023タスクにおけるGPT-3.5-TurboとGPT-4を評価し、スニペットを用いた生物医科学QAでのゼロショット性能が高いことを示し、検索とNERタスクにおけるクエリ拡張・グラウンディング・プロンプティングの影響を分析する。

ABSTRACT

We assessed the performance of commercial Large Language Models (LLMs) GPT-3.5-Turbo and GPT-4 on tasks from the 2023 BioASQ challenge. In Task 11b Phase B, which is focused on answer generation, both models demonstrated competitive abilities with leading systems. Remarkably, they achieved this with simple zero-shot learning, grounded with relevant snippets. Even without relevant snippets, their performance was decent, though not on par with the best systems. Interestingly, the older and cheaper GPT-3.5-Turbo system was able to compete with GPT-4 in the grounded Q&A setting on factoid and list answers. In Task 11b Phase A, focusing on retrieval, query expansion through zero-shot learning improved performance, but the models fell short compared to other systems. The code needed to rerun these experiments is available through GitHub.

研究の動機と目的

Assess zero-shot and grounded QA performance of GPT-3.5-Turbo and GPT-4 on BioASQ Task 11b Phase A (retrieval and snippet extraction) and Phase B (answer generation).
Evaluate MedProcNER performance using zero-shot and few-shot prompting in Spanish and SNOMED CT mapping.
Investigate the effects of query expansion, grounding with snippets, and prompting strategies on system effectiveness in biomedical QA.
Provide open-source code and discussion on limitations, including prompt design, non-determinism, and cost considerations.

提案手法

Use GPT-3.5-Turbo and GPT-4 via OpenAI API with a BioASQ-system prompt.
Apply zero-shot prompts for retrieval (Phase A) including query expansion, reformulation, and reranking of PubMed results.
Ground GPT outputs with gold snippets in Phase B and test various answer formats (Ideal, Yes/No, List, Factoid).
Translate and adapt prompts for MedProcNER tasks to Spanish and compare few-shot vs zero-shot settings.
Measure performance using BioASQ metrics (MAP, GMAP, accuracy, F1, MRR) and report split results by batch.
Provide public GitHub repository with code for replication.]

実験結果

リサーチクエスチョン

RQ1Can GPT-3.5-Turbo and GPT-4 compete with top systems in BioASQ Phase B (answer generation) using zero-shot prompts grounded in relevant snippets?
RQ2How does query expansion affect retrieval performance in Phase A, and how do grounding and reranking influence results?
RQ3What is the comparative performance of GPT-3.5-Turbo vs GPT-4 in grounded vs ungrounded settings across Yes/No, Factoid, List, and Ideal answer formats?
RQ4How well does GPT-4 perform on MedProcNER tasks (NER, Entity Linking, Indexing) in Spanish with zero-shot and few-shot prompting?
RQ5What are the practical considerations (cost, determinism, reliability) of using these models for biomedical QA in research?

主な発見

GPT-3.5-Turbo and GPT-4 show competitive zero-shot performance in Task 11b Phase B, often matching leading systems when grounded with snippets.
Query expansion improves retrieval performance across models, though gains vary by batch and model.
GPT-4 generally outperforms GPT-3.5-Turbo in Yes/No grounding, but in Factoid and List formats there is variability with no clear overall winner between grounded GPT-4 and GPT-3.5-Turbo.
In Phase A, grounding with snippets enhances performance; without grounding, results are decent but typically behind top systems.
MedProcNER results show GPT-4 outperforms GPT-3.5-Turbo but remains behind the top system in NER, Entity Linking, and Indexing; few-shot NER helps but with lower scores.
The study highlights prompt engineering as a major challenge and notes non-determinism and cost as important practical considerations for real-world use.

より良い研究を、今すぐ始めましょう

論文設計から論文執筆まで、研究時間を劇的に削減しましょう。

クレジットカード登録不要

このレビューはAIが作成し、人間の編集者が確認しました。