QUICK REVIEW

[Paper Review] Improving Large Language Models for Clinical Named Entity Recognition via Prompt Engineering

Yan Hu, Chen, Qingyu|arXiv (Cornell University)|2023. 03. 29.

Topic Modeling | Citations: 67

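One-line summary

This paper evaluates GPT-3.5 and GPT-4 on clinical NER tasks and introduces a task-specific prompt framework (baseline prompts, annotation guidelines, error-analysis instructions, and few-shot samples) that improves their performance, although BioClinicalBERT remains the strongest baseline. The approach is promising because it needs very little training data.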

ABSTRACT

Objective: This study quantifies the capabilities of GPT-3.5 and GPT-4 for clinical named entity recognition (NER) tasks and proposes task-specific prompts to improve their performance. Materials and Methods: We evaluated these models on two clinical NER tasks: (1) to extract medical problems, treatments, and tests from clinical notes in the MTSamples corpus, following the 2010 i2b2 concept extraction shared task, and (2) identifying nervous system disorder-related adverse events from safety reports in the vaccine adverse event reporting system (VAERS). To improve the GPT models' performance, we developed a clinical task-specific prompt framework that includes (1) baseline prompts with task description and format specification, (2) annotation guideline-based prompts, (3) error analysis-based instructions, and (4) annotated samples for few-shot learning. We assessed each prompt's effectiveness and compared the models to BioClinicalBERT. Results: Using baseline prompts, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.634, 0.804 for MTSamples, and 0.301, 0.593 for VAERS. Additional prompt components consistently improved model performance. When all four components were used, GPT-3.5 and GPT-4 achieved relaxed F1 scores of 0.794, 0.861 for MTSamples and 0.676, 0.736 for VAERS, demonstrating the effectiveness of our prompt framework. Although these results trail BioClinicalBERT (F1 of 0.901 for the MTSamples dataset and 0.802 for the VAERS), it is very promising considering few training samples are needed. Conclusion: While direct application of GPT models to clinical NER tasks falls short of optimal performance, our task-specific prompt framework, incorporating medical knowledge and training samples, significantly enhances GPT models' feasibility for potential clinical applications.

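Motivation and objectives

Evaluate the zero-shot and few-shot capabilities of GPT-3.5 and GPT-4 on two clinical NER tasks (one on MTSamples following the 2010 i2b2 concept extraction task, and one on VAERS).
Develop a task-specific prompt framework that incorporates medical knowledge and annotation guidelines.
Compare the GPT models with BioClinicalBERT and a traditional method (CRF).
Provide public code and datasets for reproducibility.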

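Proposed method

Evaluate GPT-3.5-turbo-0301 and GPT-4-0314 on two clinical NER tasks (MTSamples and VAERS).
Fine-tune BioClinicalBERT and implement a CRF as supervised-learning baselines.
Develop a four-component prompt framework: baseline task description, annotation guideline-based prompts, error analysis-based instructions, and annotated few-shot samples.
Measure precision, recall, and F1 under exact-match and relaxed-match criteria.
Conduct an error analysis to understand boundary and entity-type errors.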
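The four-component framework above can be pictured as incremental prompt assembly. The following is only a minimal sketch, not the authors' implementation: `PromptComponents`, `build_prompt`, the section labels, and the placeholder strings are all hypothetical names introduced for illustration, and the actual prompt wording used in the paper is not reproduced here.

```python
from dataclasses import dataclass, field


@dataclass
class PromptComponents:
    """Hypothetical container for the four prompt components named in the review."""
    baseline: str                     # task description and output format specification
    guidelines: str = ""              # annotation guideline-based instructions
    error_instructions: str = ""      # instructions derived from error analysis
    examples: list[tuple[str, str]] = field(default_factory=list)  # (note, annotated output)


def build_prompt(parts: PromptComponents, note: str, n_shots: int = 0) -> str:
    """Join whichever components are present, then append the note to annotate."""
    sections = [parts.baseline]
    if parts.guidelines:
        sections.append("Annotation guidelines:\n" + parts.guidelines)
    if parts.error_instructions:
        sections.append("Additional instructions:\n" + parts.error_instructions)
    # Use at most n_shots annotated samples (0 gives a zero-shot prompt).
    for text, annotated in parts.examples[:n_shots]:
        sections.append(f"Example input:\n{text}\nExample output:\n{annotated}")
    sections.append(f"Input:\n{note}\nOutput:")
    return "\n\n".join(sections)
```

Calling `build_prompt(parts, note, n_shots=0)` with only `baseline` set corresponds to the baseline condition, while filling every field and passing `n_shots=1` or `n_shots=5` corresponds to the one-shot and five-shot conditions compared in the review.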

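Experimental results

Research questions

RQ1: How do GPT-3.5 and GPT-4 perform on clinical NER tasks in zero-shot and few-shot settings?
RQ2: Does the task-specific prompt framework improve LLM performance on clinical NER?
RQ3: How do the GPT models compare with BioClinicalBERT and CRF on the MTSamples and VAERS datasets?
RQ4: How do annotated examples (1-shot vs. 5-shot) affect NER performance?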

Key results

BioClinicalBERT remains the strongest method, with relaxed F1 of 0.901 on MTSamples and 0.802 on VAERS.
With the four-component prompt framework, both GPT-3.5 and GPT-4 show substantial gains; GPT-4 with five-shot examples reaches relaxed F1 of 0.861 on MTSamples and 0.736 on VAERS.
Under exact match, this review lists GPT-4 five-shot F1 of 0.593 on MTSamples and 0.542 on VAERS; the abstract reports only the relaxed scores.
GPT-3.5 with all four components reaches relaxed F1 of 0.794 on MTSamples and 0.676 on VAERS, up from 0.634 and 0.301 with baseline prompts (figures as reported in the abstract).
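Adding guideline-, error analysis-, and sample-based prompts yields larger absolute gains on VAERS than on MTSamples for both models: by the relaxed F1 scores in the abstract, GPT-3.5 gains 0.375 on VAERS versus 0.160 on MTSamples, and GPT-4 gains 0.143 versus 0.057.
The proposed prompting approach shows that LLMs can be applied to clinical NER with minimal annotated data, but it does not surpass BioClinicalBERT on either dataset.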
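The results above are quoted under both exact and relaxed matching, which the review does not define. The sketch below shows one common way to score the two criteria, assuming that an exact match requires identical boundaries and entity type, and that a relaxed match requires overlapping boundaries and the same type (the convention usually associated with i2b2-style evaluation). The `Span` layout and the `prf` function are illustrative assumptions, not the paper's evaluation script.

```python
Span = tuple[int, int, str]  # (start offset, end offset exclusive, entity type)


def _matches(pred: Span, gold: Span, relaxed: bool) -> bool:
    """Return True if a predicted span counts as matching a gold span."""
    if pred[2] != gold[2]:          # entity types must agree in both modes
        return False
    if relaxed:                     # relaxed: any character overlap is enough
        return pred[0] < gold[1] and gold[0] < pred[1]
    return pred[0] == gold[0] and pred[1] == gold[1]  # exact: same boundaries


def prf(pred: list[Span], gold: list[Span], relaxed: bool = False) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for one document's spans."""
    # Precision counts predictions that match some gold span;
    # recall counts gold spans matched by some prediction.
    tp_pred = sum(any(_matches(p, g, relaxed) for g in gold) for p in pred)
    tp_gold = sum(any(_matches(p, g, relaxed) for p in pred) for g in gold)
    precision = tp_pred / len(pred) if pred else 0.0
    recall = tp_gold / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Toy example (made-up spans, not data from the paper).
gold = [(0, 10, "problem"), (20, 30, "test")]
pred = [(0, 8, "problem"), (20, 30, "test"), (40, 45, "treatment")]
exact_f1 = prf(pred, gold, relaxed=False)[2]    # boundary error on the first span is penalized
relaxed_f1 = prf(pred, gold, relaxed=True)[2]   # the overlapping first span now counts
```

In this toy case the truncated "problem" span fails under exact matching but counts under relaxed matching, which is the kind of boundary error that separates the two scores.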


This review was generated by AI and reviewed by a human editor.