QUICK REVIEW

[Paper Review] Improving Large Language Models for Clinical Named Entity Recognition via Prompt Engineering

Yan Hu, Chen, Qingyu|arXiv (Cornell University)|Mar 29, 2023
Topic Modeling | Citations: 67
One-Sentence Summary

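This paper evaluates GPT-3.5 and GPT-4 on clinical NER tasks and introduces a task-specific prompt framework (baseline prompts, annotation guidelines, error-analysis instructions, and few-shot samples) that improves their performance, although BioClinicalBERT remains the strongest baseline. The approach is promising for settings with very little training data.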

ABSTRACT

Objective: This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance. Materials and Methods: We evaluated these models on two clinical NER tasks: (1) to extract medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) identifying nervous system disorder-related adverse events from safety reports in the vaccine adverse event reporting system (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT. Results: Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634, 0.804 for MTSamples, and 0.301, 0.593 for VAERS. Additional prompt components consistently improved model performance. When all four components were used, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794, 0.861 for MTSamples and 0.676, 0.736 for VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 for the MTSamples dataset and 0.802 for the VAERS), it is very promising considering few training samples are needed. Conclusion: While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances GPT models' feasibility for potential clinical applications.
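
The four prompt components listed in the abstract can be pictured as text blocks that are concatenated in order, with the later components being optional. The following is a minimal sketch of that idea only: the function name `build_prompt`, its parameters, and the section labels are hypothetical, and the wording of the paper's actual prompts is not reproduced here.

```python
from typing import List, Optional, Tuple


def build_prompt(
    note: str,
    task_description: str,
    format_spec: str,
    guidelines: Optional[str] = None,
    error_instructions: Optional[str] = None,
    examples: Optional[List[Tuple[str, str]]] = None,
) -> str:
    """Assemble a prompt from the four components (hypothetical layout).

    Component 1 (baseline) is always present; components 2-4 are added
    only when supplied, so the same helper covers zero-shot and few-shot.
    """
    # (1) Baseline: task description plus output format specification.
    parts = [task_description, format_spec]
    # (2) Annotation guideline-based text.
    if guidelines:
        parts.append("Annotation guidelines:\n" + guidelines)
    # (3) Instructions derived from error analysis.
    if error_instructions:
        parts.append("Additional instructions:\n" + error_instructions)
    # (4) Annotated samples for few-shot learning.
    for text, annotated in examples or []:
        parts.append("Example input:\n" + text + "\nExample output:\n" + annotated)
    # The note to annotate always comes last.
    parts.append("Input:\n" + note + "\nOutput:")
    return "\n\n".join(parts)
```

Calling `build_prompt(note, task, fmt)` with no optional arguments gives the baseline (zero-shot) prompt, and passing one or five `(text, annotation)` pairs as `examples` gives a 1-shot or 5-shot variant.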

Motivation and Objectives

  • Assess zero-shot and few-shot capabilities of GPT-3.5/GPT-4 on clinical NER tasks (i2b2-inspired and VAERS).
  • Develop a task-specific prompt framework to incorporate medical knowledge and guidelines.
  • Compare GPT models to BioClinicalBERT and traditional methods (CRF).
  • Provide publicly available code and datasets for reproducibility.

Proposed Method

  • Evaluate GPT-3.5-turbo-0301 and GPT-4-0314 on two clinical NER tasks (MTSamples/VAERS).
  • Fine-tune BioClinicalBERT and implement CRF as baselines for supervised learning.
  • Develop a four-component prompt framework: baseline task description, annotation guideline prompts, error-analysis instructions, and annotated few-shot samples.
  • Measure precision, recall, and F1 under exact-match and relaxed-match criteria.
  • Analyze errors to understand boundary and entity-type challenges.
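
The difference between exact-match and relaxed-match scoring can be illustrated with a short sketch. It assumes entities are represented as `(start, end, type)` character spans and that "relaxed" means any overlap between spans of the same type; the paper's precise matching rules may differ, and the names `Span`, `spans_match`, and `prf` are hypothetical.

```python
from typing import List, Tuple

# (start offset, end offset exclusive, entity type): an assumed representation
Span = Tuple[int, int, str]


def spans_match(gold: Span, pred: Span, relaxed: bool) -> bool:
    """Return True if pred counts as a hit for gold."""
    if gold[2] != pred[2]:
        return False  # entity types must agree in both modes
    if relaxed:
        # Assumed relaxed criterion: the two spans overlap at all.
        return gold[0] < pred[1] and pred[0] < gold[1]
    # Exact criterion: identical boundaries.
    return gold[0] == pred[0] and gold[1] == pred[1]


def prf(gold: List[Span], pred: List[Span], relaxed: bool = False):
    """Compute (precision, recall, F1) for one document."""
    hit_pred = sum(any(spans_match(g, p, relaxed) for g in gold) for p in pred)
    hit_gold = sum(any(spans_match(g, p, relaxed) for p in pred) for g in gold)
    precision = hit_pred / len(pred) if pred else 0.0
    recall = hit_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 10, "problem"), (20, 30, "test")]
pred = [(0, 8, "problem"), (20, 30, "test"), (40, 45, "treatment")]
exact_scores = prf(gold, pred, relaxed=False)
relaxed_scores = prf(gold, pred, relaxed=True)
```

In this toy example the first prediction has the right type but a shortened boundary, so it counts only under the relaxed criterion. This is the kind of boundary error that makes relaxed scores higher than exact scores.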

Experimental Results

Research Questions

  • RQ1: How do GPT-3.5 and GPT-4 perform on clinical NER tasks in zero-shot and few-shot settings?
  • RQ2: Does a task-specific prompting framework improve clinical NER performance for LLMs?
  • RQ3: How do GPT models compare to BioClinicalBERT and CRF on MTSamples and VAERS datasets?
  • RQ4: What is the impact of annotated exemplars (1-shot vs 5-shot) on NER performance?

Key Findings

  • BioClinicalBERT remains the strongest method, with F1 of 0.901 (relaxed) on MTSamples and 0.802 (relaxed) on VAERS.
  • GPT-3.5 and GPT-4 show notable gains when using the four-component prompt framework, with GPT-4 achieving 0.861 (relaxed) on MTSamples and 0.736 (relaxed) on VAERS using 5-shot examples.
  • GPT-4 with five-shot prompts reaches 0.593 (exact) and 0.861 (relaxed) on MTSamples, and 0.542 (exact) and 0.736 (relaxed) on VAERS.
  • GPT-3.5 with the full prompt framework achieves 0.794 (relaxed) on MTSamples and 0.676 (relaxed) on VAERS (exact-match numbers are reported in the study).
  • GPT-3.5 and GPT-4 show larger absolute gains on VAERS than on MTSamples when adding guideline-, error-analysis-, and samples-based prompts.
  • The proposed prompting approach demonstrates the feasibility of using LLMs for clinical NER with minimal annotated data, though it still trails BioClinicalBERT on both datasets.
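
The gains and remaining gaps can be checked directly from the relaxed F1 scores quoted in the abstract. The short calculation below uses only those reported numbers; the variable names are illustrative.

```python
# Relaxed F1 scores as quoted in the abstract.
baseline = {
    ("GPT-3.5", "MTSamples"): 0.634,
    ("GPT-4", "MTSamples"): 0.804,
    ("GPT-3.5", "VAERS"): 0.301,
    ("GPT-4", "VAERS"): 0.593,
}
full_framework = {
    ("GPT-3.5", "MTSamples"): 0.794,
    ("GPT-4", "MTSamples"): 0.861,
    ("GPT-3.5", "VAERS"): 0.676,
    ("GPT-4", "VAERS"): 0.736,
}
bioclinicalbert = {"MTSamples": 0.901, "VAERS": 0.802}

# Absolute gain from adding all prompt components to the baseline prompt.
gains = {k: round(full_framework[k] - baseline[k], 3) for k in baseline}
# Remaining gap between the best prompted result and BioClinicalBERT.
gaps = {k: round(bioclinicalbert[k[1]] - full_framework[k], 3) for k in baseline}

for (model, dataset), gain in gains.items():
    print(f"{model} on {dataset}: gain {gain:+.3f}, gap to BioClinicalBERT {gaps[(model, dataset)]:.3f}")
```

For both models the absolute gain is larger on VAERS (0.375 and 0.143) than on MTSamples (0.160 and 0.057), and a gap to BioClinicalBERT remains in every case (between 0.040 and 0.126).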


This review was generated by AI and checked by a human editor.