QUICK REVIEW

[Paper Review] MedErrBench: A Fine-Grained Multilingual Benchmark for Medical Error Detection and Correction with Clinical Expert Annotations

Congbo Ma, Yichun Zhang|arXiv (Cornell University)|Feb 5, 2026

Artificial Intelligence in Healthcare and Education|Cited by 0

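One-Sentence Summary

MedErrBench is a clinician-annotated multilingual benchmark for medical error detection, localization, and correction that covers ten categories of clinical errors. Evaluating a broad range of LLMs on it reveals cross-lingual performance gaps and the need for clinically grounded, language-aware models.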

ABSTRACT

Inaccuracies in existing or generated clinical text may lead to serious adverse consequences, especially if it is a misdiagnosis or incorrect treatment suggestion. With Large Language Models (LLMs) increasingly being used across diverse healthcare applications, comprehensive evaluation through dedicated benchmarks is crucial. However, such datasets remain scarce, especially across diverse languages and contexts. In this paper, we introduce MedErrBench, the first multilingual benchmark for error detection, localization, and correction, developed under the guidance of experienced clinicians. Based on an expanded taxonomy of ten common error types, MedErrBench covers English, Arabic and Chinese, with natural clinical cases annotated and reviewed by domain experts. We assessed the performance of a range of general-purpose, language-specific, and medical-domain language models across all three tasks. Our results reveal notable performance gaps, particularly in non-English settings, highlighting the need for clinically grounded, language-aware systems. By making MedErrBench and our evaluation protocols publicly-available, we aim to advance multilingual clinical NLP to promote safer and more equitable AI-based healthcare globally. The dataset is available in the supplementary material. An anonymized version of the dataset is available at: https://github.com/congboma/MedErrBench.

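Motivation and Objectives

Develop a clinician-informed taxonomy of ten clinical error types for evaluating multilingual medical NLP.
Create and validate a multilingual benchmark covering English, Arabic, and Chinese for medical error detection, localization, and correction.
Evaluate a broad range of general-purpose, language-specific, and medical-domain LLMs on the benchmark across all three languages.
Provide insight into model limitations, cross-lingual generalization, and the effect of prompt design and few-shot learning on performance.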

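Proposed Method

Organize the clinical data into English, Chinese, and Arabic subsets, collected from multiple sources rather than translated from one language.
Define a ten-type error taxonomy that extends MEDEC with five new categories (lab/serum value interpretation, physiology, histology, anatomy, and epidemiology), each with a definition and examples.
Inject clinically grounded errors into notes to produce erroneous/corrected pairs for the detection, localization, and correction tasks.
Annotate each instance with the importance of its clinical terms, a difficulty level (easy/medium/hard), and a reasoning type (factual recall, single-hop reasoning, or multi-hop reasoning).
Conduct a two-stage clinician review to ensure content validity and annotation quality, resolving disagreements along the way.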
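
The steps above can be sketched as a small data model. This is a minimal illustration only, not the authors' released format: the class names, field names, and the `inject_error` helper are hypothetical, and the taxonomy tuple lists only the five new categories named in this review (the five inherited from MEDEC are not enumerated here, so they are left out).

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Difficulty(Enum):
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


class ReasoningType(Enum):
    FACTUAL_RECALL = "factual_recall"
    SINGLE_HOP = "single_hop"
    MULTI_HOP = "multi_hop"


# Only the five categories this review names explicitly; the remaining
# five come from MEDEC and are not listed in the summary.
NEW_ERROR_TYPES = (
    "lab/serum value interpretation",
    "physiology",
    "histology",
    "anatomy",
    "epidemiology",
)


@dataclass
class Instance:
    """Hypothetical record for one annotated note (field names are assumptions)."""
    language: str                       # e.g. "en", "ar", "zh"
    sentences: List[str]                # the note, split into sentences
    error_sentence_index: Optional[int]  # None would mean an error-free note
    corrected_sentence: Optional[str]   # gold correction for the bad sentence
    error_type: Optional[str]
    difficulty: Difficulty
    reasoning_type: ReasoningType

    @property
    def has_error(self) -> bool:
        return self.error_sentence_index is not None


def inject_error(clean_sentences: List[str], index: int, wrong_sentence: str,
                 error_type: str, difficulty: Difficulty,
                 reasoning_type: ReasoningType, language: str) -> Instance:
    """Replace one sentence with an erroneous version and keep the original
    sentence as the gold correction."""
    sentences = list(clean_sentences)   # copy so the clean note is untouched
    sentences[index] = wrong_sentence
    return Instance(language, sentences, index, clean_sentences[index],
                    error_type, difficulty, reasoning_type)
```

Keeping the original sentence as the gold correction means a single record can serve all three tasks: `has_error` for detection, `error_sentence_index` for localization, and `corrected_sentence` for correction.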

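Experimental Results

Research Questions

RQ1: How does a broad range of LLMs perform on multilingual medical error detection, localization, and correction in English, Arabic, and Chinese?
RQ2: How do error-type definitions, example prompts, and few-shot examples affect model performance on clinical error tasks?
RQ3: How do knowledge-based notes and scenario-based notes affect model capability in multilingual settings?
RQ4: How do cross-lingual generalization and language-specific challenges manifest in medical error detection and correction?
RQ5: What are the limitations of current models, and which directions could improve clinically grounded, language-aware systems?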

Key Findings

Model	Detection Accuracy	Localization Accuracy	ROUGE-1	BERTScore	BLEURT
GPT-4o	0.596	0.346	0.415	0.428	0.407
GPT-4o-mini	0.664	0.524	0.487	0.498	0.472
Gemini 2.5 Flash Lite	0.567	0.264	0.349	0.362	0.346
Gemini 2.0 Flash	0.514	0.168	0.281	0.294	0.288
Llama3-8b	0.519	0.361	0.266	0.261	0.282
Llama-3.3-70B-Instruct	0.582	0.255	0.369	0.369	0.385
Qwen2.5-7B-Instruct	0.563	0.490	0.372	0.450	0.371
Deepseek-R1	0.582	0.577	0.700	0.716	0.681
Deepseek-V3	0.587	0.582	0.703	0.732	0.693
Doubao-1.5	0.779	0.774	0.766	0.783	0.773
ALLAM-7B	0.029	0.014	0.015	0.020	0.014
MedGemma-4b	0.505	0.438	0.511	0.518	0.513
MedGemma-27b	0.543	0.245	0.377	0.390	0.349
HuatuoGPT-o1-7b	0.574	0.530	0.486	0.475	0.475
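
To make the table's columns concrete, the sketch below shows one plausible way to compute the first three metrics. The exact definitions used in the paper are not given in this review, so these are assumptions: detection accuracy as the fraction of notes whose error/no-error flag is predicted correctly, localization accuracy as exact match on the erroneous sentence index, and ROUGE-1 as unigram-overlap F1 with simple whitespace tokenization. BERTScore and BLEURT rely on pretrained models and are omitted.

```python
from collections import Counter
from typing import Optional, Sequence


def detection_accuracy(gold: Sequence[bool], pred: Sequence[bool]) -> float:
    """Fraction of notes where the predicted error flag matches the gold flag."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def localization_accuracy(gold: Sequence[Optional[int]],
                          pred: Sequence[Optional[int]]) -> float:
    """Fraction of notes where the predicted sentence index matches exactly.
    None stands for 'no error' (an assumed convention)."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def rouge1_f1(reference: str, candidate: str) -> float:
    """Unigram-overlap F1 between a gold correction and a model correction.
    Whitespace tokenization is an assumption and would not suit Chinese text."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped count of shared unigrams
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

Under these assumed definitions, a model can detect that a note contains an error and still point at the wrong sentence, which would explain why several rows show localization accuracy well below detection accuracy.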

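Doubao-1.5-thinking-pro, Deepseek-R1, and Deepseek-V3 generally outperform the other models across languages and tasks.
Medical-domain LLMs do not consistently outperform general-purpose models on error detection and correction.
Some models perform markedly worse on Arabic, reflecting a domain-adaptation gap in low-resource language settings.
Providing error-type definitions and few-shot examples generally improves performance; definitions are especially helpful in the zero-shot setting.
Across languages, localization and correction are more challenging than detection, and the effect of prompt design varies by model.
In human evaluation on clinician-annotated Chinese samples, some models (e.g., Gemini 2.0 Flash) outperformed others (e.g., GPT-4o-mini).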
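
The prompt-design finding above (error-type definitions and few-shot examples as optional additions) can be illustrated with a small prompt builder. The instruction wording, the output convention, and the `build_prompt` function are all hypothetical; the prompts actually used in the paper are not reproduced in this review.

```python
from typing import Dict, List, Optional, Tuple


def build_prompt(note_sentences: List[str],
                 definitions: Optional[Dict[str, str]] = None,
                 examples: Optional[List[Tuple[str, str]]] = None) -> str:
    """Assemble a prompt for the three tasks.

    definitions: error-type name -> short definition (omit for a bare prompt)
    examples:    (example note, example answer) pairs (omit for zero-shot)
    """
    parts = [
        "Review the clinical note below for a medical error.",
        "Answer with the index of the erroneous sentence and a corrected "
        "sentence, or with CORRECT if the note contains no error.",
    ]
    if definitions:
        lines = ["Error types:"]
        lines += [f"- {name}: {text}" for name, text in definitions.items()]
        parts.append("\n".join(lines))
    for example_note, example_answer in examples or []:
        parts.append(f"Example note:\n{example_note}\nExample answer: {example_answer}")
    numbered = "\n".join(f"{i} {s}" for i, s in enumerate(note_sentences))
    parts.append(f"Note:\n{numbered}")
    return "\n\n".join(parts)
```

Calling `build_prompt(sentences)` gives the zero-shot prompt without definitions, while passing `definitions` and `examples` adds the two components separately, so each one's contribution can be compared.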

Figure 2: Distribution of difficulty level and reasoning type.

This review was generated by AI and reviewed by a human editor.