QUICK REVIEW

[Paper Review] MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations

Congbo Ma, Yichun Zhang | arXiv (Cornell University) | 2026. 02. 05.

Artificial Intelligence in Healthcare and Education | Citations: 0

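One-Line Summary

MedErrBench introduces a multilingual benchmark for medical error detection, localization, and correction in English, Arabic, and Chinese, with clinician-annotated errors organized into ten categories. Evaluating a range of LLMs on it exposes cross-lingual performance gaps and the need for clinically grounded, language-aware models.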

ABSTRACT

Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially if it is a misdiagnosis or incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly-available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: https://github.com/congboma/MedErrBench.

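Motivation and Objectives

Develop a clinician-guided taxonomy of ten clinical error types, extending MEDEC, for multilingual medical NLP evaluation.
Build and validate a multilingual benchmark (English, Arabic, Chinese) for medical error detection, localization, and correction.
Evaluate a broad set of general-purpose, language-specific, and medical-domain LLMs on the benchmark across all three languages.
Provide insight into model limitations, cross-lingual generalization, and the effect of prompting and few-shot learning on performance.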

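Proposed Method

Assemble multilingual clinical data in English, Chinese, and Arabic through non-translated, multi-source collection.
Extend MEDEC into a ten-type error taxonomy by adding five new categories (laboratory/serum value interpretation, physiology, histology, anatomy, and epidemiology), each with a definition and examples.
Insert clinician-grounded errors into notes to create error-containing pairs for the detection, localization, and correction tasks.
Annotate each case with the importance of its clinical terms, a difficulty level (easy/medium/hard), and a reasoning type (fact recall, single-hop, multi-hop).
Conduct two rounds of clinician review covering content validity and annotation quality, resolving disagreements and verifying correctness.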
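
The annotation fields listed above can be pictured as a small record type. The following is a minimal sketch, not the dataset's actual schema: the class and field names (`AnnotatedCase`, `error_sentence_index`, and so on) are hypothetical, the clinical-term importance annotation is omitted because its format is not described here, and only the five newly added error categories are enumerated because this review does not list the five inherited from MEDEC. The example note is invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional, Tuple


class ErrorType(Enum):
    # Only the five categories named in this review as new additions.
    # The five categories inherited from MEDEC are not listed in the
    # review, so they are left out here rather than guessed.
    LAB_SERUM_VALUE_INTERPRETATION = "lab_serum_value_interpretation"
    PHYSIOLOGY = "physiology"
    HISTOLOGY = "histology"
    ANATOMY = "anatomy"
    EPIDEMIOLOGY = "epidemiology"


class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


class ReasoningType(Enum):
    FACT_RECALL = "fact_recall"
    SINGLE_HOP = "single_hop"
    MULTI_HOP = "multi_hop"


@dataclass(frozen=True)
class AnnotatedCase:
    """Hypothetical record for one clinical note in the benchmark."""

    language: str                         # assumed codes: "en", "ar", "zh"
    sentences: Tuple[str, ...]            # the note, split into sentences
    error_sentence_index: Optional[int]   # None when the note is error-free
    corrected_sentence: Optional[str]
    error_type: Optional[ErrorType]
    difficulty: Difficulty
    reasoning_type: ReasoningType

    def __post_init__(self) -> None:
        # The three error fields must be all present or all absent.
        fields_set = [
            self.error_sentence_index is not None,
            self.corrected_sentence is not None,
            self.error_type is not None,
        ]
        if any(fields_set) and not all(fields_set):
            raise ValueError("error index, correction, and type must be set together")
        if self.error_sentence_index is not None and not (
            0 <= self.error_sentence_index < len(self.sentences)
        ):
            raise ValueError("error_sentence_index is out of range")

    @property
    def has_error(self) -> bool:
        return self.error_sentence_index is not None


# Invented example: a serum potassium of 6.8 mEq/L is high, not low.
example = AnnotatedCase(
    language="en",
    sentences=(
        "A 58-year-old man presents with muscle weakness.",
        "Serum potassium of 6.8 mEq/L is interpreted as hypokalemia.",
    ),
    error_sentence_index=1,
    corrected_sentence="Serum potassium of 6.8 mEq/L is interpreted as hyperkalemia.",
    error_type=ErrorType.LAB_SERUM_VALUE_INTERPRETATION,
    difficulty=Difficulty.EASY,
    reasoning_type=ReasoningType.FACT_RECALL,
)
```

Keeping the error fields optional lets the same record represent both the error-containing note and its clean counterpart, which matches the paired construction described above.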

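Experimental Results

Research Questions

RQ1: How do a broad range of LLMs perform on multilingual medical error detection, localization, and correction across English, Arabic, and Chinese?
RQ2: How do error type definitions, example prompts, and few-shot examples affect model performance on clinical error tasks?
RQ3: How do model capabilities differ between knowledge-based and scenario-based clinical notes in multilingual settings?
RQ4: What are the cross-lingual generalization abilities and language-specific challenges in medical error detection and correction?
RQ5: What are the limitations of current models, and what directions are needed to improve clinically grounded, language-aware systems?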

Key Results

Model	Detection Accuracy	Localization Accuracy	ROUGE-1	BERTScore	BLEURT
GPT-4o	0.596	0.346	0.415	0.428	0.407
GPT-4o-mini	0.664	0.524	0.487	0.498	0.472
Gemini 2.5 Flash Lite	0.567	0.264	0.349	0.362	0.346
Gemini 2.0 Flash	0.514	0.168	0.281	0.294	0.288
Llama3-8b	0.519	0.361	0.266	0.261	0.282
Llama-3.3-70B-Instruct	0.582	0.255	0.369	0.369	0.385
Qwen2.5-7B-Instruct	0.563	0.490	0.372	0.450	0.371
Deepseek-R1	0.582	0.577	0.700	0.716	0.681
Deepseek-V3	0.587	0.582	0.703	0.732	0.693
Doubao-1.5	0.779	0.774	0.766	0.783	0.773
ALLAM-7B	0.029	0.014	0.015	0.020	0.014
MedGemma-4b	0.505	0.438	0.511	0.518	0.513
MedGemma-27b	0.543	0.245	0.377	0.390	0.349
HuatuoGPT-o1-7b	0.574	0.530	0.486	0.475	0.475
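
To make the first three columns of the table concrete, here is a minimal sketch of how such scores could be computed. It rests on several assumptions that this review does not confirm: detection is treated as a per-note yes/no decision, localization as an exact match on the flagged sentence index over all notes, and ROUGE-1 as a unigram-overlap F1 on whitespace tokens. The function names are hypothetical, and whitespace tokenization would not be suitable for Chinese text without a proper segmenter.

```python
from collections import Counter
from typing import Optional, Sequence


def detection_accuracy(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Share of notes where the predicted error flag matches the gold flag."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def localization_accuracy(
    gold: Sequence[Optional[int]], pred: Sequence[Optional[int]]
) -> float:
    """Share of notes where the predicted error sentence index matches gold.

    None stands for "no error in this note" (an assumption of this sketch).
    """
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and the same length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a reference and a candidate correction."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Invented toy inputs, not taken from the benchmark.
det = detection_accuracy([True, False, True, True], [True, True, True, False])
loc = localization_accuracy([2, None, 0], [2, None, 1])
r1 = rouge1_f1("the patient has hyperkalemia", "the patient has hypokalemia")
```

On the toy inputs, three of four unigrams overlap in the correction pair, so a single wrong clinical term still scores fairly high on ROUGE-1. That is one reason embedding-based metrics such as BERTScore and BLEURT are usually reported alongside it.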

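Doubao-1.5-thinking-pro, Deepseek-R1, and Deepseek-V3 outperform the other models on multiple tasks across multiple languages.
Medical-domain LLMs do not consistently outperform general-purpose models on error detection and correction.
Arabic performance is markedly weaker for some models, underscoring the domain adaptation gap in low-resource language settings.
Providing error type definitions and few-shot examples generally improves performance; definitions are especially helpful in the zero-shot setting.
Localization and correction are harder than detection, and the effect of prompt design varies by model.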
In human evaluation on expert-annotated Chinese samples, some models (e.g., Gemini 2.0 Flash) tended to be preferred over others (e.g., GPT-4o-mini).
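
The prompt variants compared in the findings above (with or without error type definitions, zero-shot or few-shot) can be sketched as a single prompt builder. This is an illustrative assumption, not the paper's actual prompt: the instruction wording, the `build_prompt` function, its parameters, and the placeholder definition text are all hypothetical.

```python
from typing import Mapping, Optional, Sequence, Tuple


def build_prompt(
    note_sentences: Sequence[str],
    definitions: Optional[Mapping[str, str]] = None,
    few_shot: Sequence[Tuple[str, str]] = (),
) -> str:
    """Assemble a prompt; definitions and few-shot examples are optional."""
    parts = [
        "Review the clinical note below. If one sentence contains a medical "
        "error, reply with its sentence number and a corrected sentence. "
        "If the note has no error, reply CORRECT."
    ]
    if definitions:
        lines = "\n".join(f"- {name}: {text}" for name, text in definitions.items())
        parts.append("Error type definitions:\n" + lines)
    for i, (example_note, example_answer) in enumerate(few_shot, start=1):
        parts.append(f"Example {i}:\n{example_note}\nAnswer: {example_answer}")
    # Number the sentences so the model can point at one for localization.
    numbered = "\n".join(f"{i} {s}" for i, s in enumerate(note_sentences))
    parts.append(f"Note:\n{numbered}\nAnswer:")
    return "\n\n".join(parts)


note = [
    "A 58-year-old man presents with muscle weakness.",
    "Serum potassium of 6.8 mEq/L is interpreted as hypokalemia.",
]

# Zero-shot, no definitions.
zero_shot = build_prompt(note)

# Zero-shot with definitions; the definition text is a placeholder.
with_definitions = build_prompt(
    note,
    definitions={"physiology": "placeholder definition supplied by the benchmark"},
)

# Few-shot with one invented example.
one_shot = build_prompt(
    note,
    few_shot=[("0 Example sentence with an error.", "0 Corrected example sentence.")],
)
```

Building every variant from one function keeps the instruction and note formatting identical, so any difference in scores can be attributed to the added definitions or examples rather than to incidental wording changes.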

Figure 2: Distribution of difficulty level and reasoning type.


This review was generated by AI and reviewed by a human editor.