QUICK REVIEW

[Paper Review] Distilling Large Language Models for Biomedical Knowledge Extraction: A Case Study on Adverse Drug Events

裕二池谷, Sheng Zhang|arXiv (Cornell University)|2023. 07. 12.

Topic Modeling | Citations: 15

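One-Line Summary

This paper shows that distilling LLM knowledge into a task-specific PubMedBERT student model achieves competitive ADE extraction performance without any labeled data, outperforms its teacher and even GPT-4, and yields a model that is over 1,000 times smaller and offers white-box access.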

ABSTRACT

Large language models (LLMs), such as GPT-4, have demonstrated remarkable capabilities across a wide range of tasks, including health applications. In this paper, we study how LLMs can be used to scale biomedical knowledge curation. We find that while LLMs already possess decent competency in structuring biomedical text, by distillation into a task-specific student model through self-supervised learning, substantial gains can be attained over out-of-box LLMs, with additional advantages such as cost, efficiency, and white-box model access. We conduct a case study on adverse drug event (ADE) extraction, which is an important area for improving care. On standard ADE extraction evaluation, a GPT-3.5 distilled PubMedBERT model attained comparable accuracy as supervised state-of-the-art models without using any labeled data. Despite being over 1,000 times smaller, the distilled model outperformed its teacher GPT-3.5 by over 6 absolute points in F1 and GPT-4 by over 5 absolute points. Ablation studies on distillation model choice (e.g., PubMedBERT vs BioGPT) and ADE extraction architecture shed light on best practice for biomedical knowledge extraction. Similar gains were attained by distillation for other standard biomedical knowledge extraction tasks such as gene-disease associations and protected health information, further illustrating the promise of this approach.

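Motivation and Objectives

Motivate scalable biomedical knowledge curation using large language models (LLMs).
Demonstrate that distilling LLMs into task-specific student models improves both efficiency and accuracy.
Develop an end-to-end ADE extraction architecture that is effective for large-scale processing.
Show that the benefits of distillation extend beyond ADE extraction to other biomedical NLP tasks.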

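Proposed Method

Proposes a unified drug-centric architecture for end-to-end ADE extraction that combines NER and relation extraction in a single pass.
Mean-pools the tokens of each drug mention and concatenates the resulting drug representation to every token's hidden state, enabling per-drug ADE token classification.
Applies a single linear classifier with sigmoid activation to the concatenated representation to predict ADE spans.
Curates a drug-centric unlabeled corpus from PubMed abstracts and generates ADE annotations with a GPT-3.5 teacher for self-supervision.
Distills 40,000 teacher-generated pseudo-labeled pairs into student models (PubMedBERT and BioGPT) and compares them against zero-shot and few-shot prompting.
Evaluates with lenient F1 on the ADE corpus (Gurulingappa et al., 2012) and runs ablation studies on model choice and architecture.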
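
The drug-centric classification head described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the function names (`mean_pool`, `ade_token_probs`), the toy dimensions, the random weights, and the 0.5 decision threshold are all assumptions, and the random `hidden` vectors stand in for the encoder outputs a student model such as PubMedBERT would produce.

```python
import math
import random


def mean_pool(hidden, span):
    """Average the hidden vectors of tokens in span = (start, end), end exclusive."""
    start, end = span
    dim = len(hidden[0])
    count = end - start
    return [sum(hidden[t][d] for t in range(start, end)) / count for d in range(dim)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def ade_token_probs(hidden, drug_span, weights, bias):
    """Return one ADE probability per token, conditioned on a single drug mention.

    hidden:  seq_len x dim list of token vectors (stand-in for encoder output)
    weights: 2 * dim weights of the single linear layer (hypothetical values)
    """
    drug_vec = mean_pool(hidden, drug_span)
    probs = []
    for token_vec in hidden:
        features = token_vec + drug_vec  # list concatenation: [token ; drug]
        logit = sum(w * f for w, f in zip(weights, features)) + bias
        probs.append(sigmoid(logit))
    return probs


# Toy input: 6 tokens with 4-dimensional hidden states, drug mention at tokens 1-2.
random.seed(0)
seq_len, dim = 6, 4
hidden = [[random.uniform(-1, 1) for _ in range(dim)] for _ in range(seq_len)]
weights = [random.uniform(-1, 1) for _ in range(2 * dim)]

probs = ade_token_probs(hidden, drug_span=(1, 3), weights=weights, bias=0.0)
ade_mask = [p > 0.5 for p in probs]  # assumed threshold for marking ADE tokens
```

Because the drug representation is concatenated to every token, running the same head once per drug mention yields a separate ADE mask for each drug, which is how NER and relation extraction collapse into one token-classification pass.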

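Experimental Results

Research Questions

RQ1: How effective is LLM distillation for end-to-end ADE extraction compared with zero-shot/few-shot LLMs and supervised baselines?
RQ2: How do the choices of distillation architecture and student model affect biomedical knowledge extraction tasks?
RQ3: Does distillation from LLMs generalize to other biomedical NLP tasks such as gene-disease associations and PHI?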

Key Results

Method	Teacher LLM	Model	Training Instances	Test F1
Out-of-box LLM	-	GPT-3.5 (zero-shot)	-	78.22
Out-of-box LLM	-	GPT-4 (zero-shot)	-	84.92
Out-of-box LLM	-	GPT-3.5 (5-shot)	-	85.21
Out-of-box LLM	-	GPT-4 (5-shot)	-	86.45
Distillation	GPT-3.5 (5-shot)	BioGPT	40,000	84.21
Distillation	GPT-3.5 (5-shot)	PubMedBERT	40,000	91.99
Supervised learning	-	BioGPT	3,417	88.08
Supervised learning	-	PubMedBERT	3,417	93.36
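
The Test F1 values are lenient F1 scores. The review does not spell out the matching rule, so the sketch below assumes a common convention: a predicted ADE span counts as correct when it overlaps a gold span for the same drug. The function names and the toy spans are hypothetical.

```python
def spans_overlap(a, b):
    """True if half-open spans a = (start, end) and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def lenient_f1(gold, pred):
    """Lenient F1 over (drug, (start, end)) pairs; overlap is enough to match."""
    def has_match(item, pool):
        drug, span = item
        return any(d == drug and spans_overlap(span, s) for d, s in pool)

    matched_pred = sum(has_match(p, gold) for p in pred)
    matched_gold = sum(has_match(g, pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example: one of two predictions overlaps a gold span for the same drug.
gold = [("aspirin", (10, 25)), ("aspirin", (40, 48)), ("warfarin", (60, 68))]
pred = [("aspirin", (12, 20)), ("warfarin", (30, 35))]

score = lenient_f1(gold, pred)  # precision 1/2, recall 1/3
```

Under strict matching the first prediction would be wrong because its boundaries differ from the gold span; lenient matching credits it, so lenient scores are never lower than strict ones on the same predictions.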

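PubMedBERT distilled from GPT-3.5 reaches accuracy comparable to the supervised SOTA on ADE extraction without any labeled data.
The distilled PubMedBERT (over 1,000 times smaller) scores over 6 F1 points higher than its teacher GPT-3.5 and over 5 points higher than GPT-4.
Out-of-box GPT-3.5 and GPT-4 are competitive but still trail the supervised models; distillation closes most of that gap.
Distilled BioGPT falls short of PubMedBERT on the ADE task, consistent with the tendency of GPT-style models to excel at generation while struggling more with knowledge extraction.
Distillation also helps on other biomedical tasks such as gene-disease associations and PHI, while MedNLI, a pure entailment task, shows only limited benefit.
Ablation studies underline the importance of distillation design choices (architecture and model) for biomedical knowledge extraction.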
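
The point gaps quoted above can be recomputed from the results table. The snippet below is only a worked check; the dictionary keys are labels chosen here, and the comparison uses the 5-shot rows as the strongest out-of-box settings.

```python
# Test F1 values copied from the results table.
test_f1 = {
    "GPT-3.5 (5-shot)": 85.21,
    "GPT-4 (5-shot)": 86.45,
    "Distilled PubMedBERT": 91.99,
    "Supervised PubMedBERT": 93.36,
}

distilled = test_f1["Distilled PubMedBERT"]

gap_over_teacher = round(distilled - test_f1["GPT-3.5 (5-shot)"], 2)
gap_over_gpt4 = round(distilled - test_f1["GPT-4 (5-shot)"], 2)
gap_to_supervised = round(test_f1["Supervised PubMedBERT"] - distilled, 2)

print(f"vs. teacher GPT-3.5: +{gap_over_teacher:.2f}")   # +6.78
print(f"vs. GPT-4:           +{gap_over_gpt4:.2f}")      # +5.54
print(f"behind supervised:   -{gap_to_supervised:.2f}")  # -1.37
```

The student leads its teacher by 6.78 points and GPT-4 by 5.54, while trailing the supervised PubMedBERT by 1.37 points despite using no labeled data.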


This review was generated by AI and checked by a human editor.