QUICK REVIEW

[Paper Review] CBLUE: A Chinese Biomedical Language Understanding Evaluation Benchmark

Ningyu Zhang, Mosha Chen|arXiv (Cornell University)|Jun 15, 2021

Topic Modeling | Cited by 24

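One-Sentence Summary

CBLUE introduces the first Chinese biomedical language understanding benchmark, covering eight tasks, and evaluates 11 Chinese pre-trained models against human performance, revealing a large gap between models and humans.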

ABSTRACT

Artificial Intelligence (AI), along with the recent progress in biomedical language understanding, is gradually changing medical practice. With the development of biomedical language understanding benchmarks, AI applications are widely used in the medical field. However, most benchmarks are limited to English, which makes it challenging to replicate many of the successes in English for other languages. To facilitate research in this direction, we collect real-world biomedical data and present the first Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark: a collection of natural language understanding tasks including named entity recognition, information extraction, clinical diagnosis normalization, single-sentence/sentence-pair classification, and an associated online platform for model evaluation, comparison, and analysis. To establish evaluation on these tasks, we report empirical results with the current 11 pre-trained Chinese models, and experimental results show that state-of-the-art neural models perform by far worse than the human ceiling. Our benchmark is released at https://tianchi.aliyun.com/dataset/dataDetail?dataId=95414&lang=en-us.

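Motivation and Objectives

Introduce a Chinese Biomedical Language Understanding Evaluation (CBLUE) benchmark that covers a diverse set of biomedical tasks.
Collect real-world, anonymized Chinese biomedical data from multiple sources so that the benchmark reflects the distribution seen in industry.
Provide an online platform and baselines for evaluating, comparing, and analyzing model performance on the CBLUE tasks.
Analyze the linguistic and domain-specific challenges of Chinese biomedical NLP and offer guidance for future model development.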

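Proposed Method

Assemble eight biomedical NLU tasks spanning token-level, sequence-level, and sentence-pair classification.
Collect data from clinical trials, electronic health records, medical forums, textbooks, and search engine logs, and anonymize it to protect privacy.
Have domain experts annotate the data, with quality control that includes an inter-annotator agreement assessment.
Release an open platform with a leaderboard, and offer 60 hours of free GPU time to encourage community participation.
Fine-tune 11 publicly available Chinese pre-trained language models with a standard procedure to provide reproducible baselines.
Provide PyTorch code for reproducing the baselines and results.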
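
The inter-annotator agreement check mentioned above can be illustrated with a small sketch. This summary does not say which agreement statistic the authors used, so the example assumes Cohen's kappa for two annotators, one common choice. The function name `cohens_kappa` and the toy labels are hypothetical and are not taken from the CBLUE code or data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items.

    Assumption: the benchmark's actual agreement statistic is not stated
    in this summary; kappa is used here purely for illustration.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    if expected == 1.0:
        # Both annotators used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy labels (hypothetical, not from CBLUE).
annotator_1 = ["dis", "dis", "sym", "sym"]
annotator_2 = ["dis", "sym", "sym", "sym"]
print(cohens_kappa(annotator_1, annotator_2))
```

Kappa corrects raw agreement for agreement expected by chance, which matters when one label dominates, as is typical of long-tailed medical data.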

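Experimental Results

Research Questions

RQ1: How do current Chinese pre-trained language models perform across diverse Chinese biomedical tasks?
RQ2: How do data sources and distributions, including long-tailed distributions and non-i.i.d. transfer settings, affect model generalization in Chinese biomedical NLP?
RQ3: How close is model performance to human performance on the CBLUE tasks, and where is the gap largest?
RQ4: What are the main error types and linguistic challenges that models face on Chinese biomedical tasks?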

Key Findings

Model	CMeEE	CMeIE	CDN	CTC	STS	QIC	QTR	QQR	Average
BERT-base	69.1	-	-	-	-	-	-	-	69.1
BERT-wwm-ext-base	69.4	-	-	-	-	-	-	-	69.4
RoBERTa-large	69.6	-	-	-	-	-	-	-	69.6
RoBERTa-wwm-ext-base	69.3	-	-	-	-	-	-	-	69.3
RoBERTa-wwm-ext-large	70.0	-	-	-	-	-	-	-	70.0
ALBERT-tiny	61.1	-	-	-	-	-	-	-	61.1
ALBERT-xxlarge	66.1	-	-	-	-	-	-	-	66.1
ZEN	68.4	-	-	-	-	-	-	-	68.4
MacBERT-base	69.0	-	-	-	-	-	-	-	69.0
MacBERT-large	69.6	-	-	-	-	-	-	-	69.6
PCL-MedBERT	67.9	-	-	-	-	-	-	-	67.9
Human	77.1	-	-	-	-	-	-	-	77.1
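
To make the size of the human-model gap concrete, the following sketch computes each model's distance from the human score using only the Average column of the table above. The variable and function names are illustrative assumptions; the numbers are copied from the table.

```python
# Average scores copied from the table above.
model_scores = {
    "BERT-base": 69.1,
    "BERT-wwm-ext-base": 69.4,
    "RoBERTa-large": 69.6,
    "RoBERTa-wwm-ext-base": 69.3,
    "RoBERTa-wwm-ext-large": 70.0,
    "ALBERT-tiny": 61.1,
    "ALBERT-xxlarge": 66.1,
    "ZEN": 68.4,
    "MacBERT-base": 69.0,
    "MacBERT-large": 69.6,
    "PCL-MedBERT": 67.9,
}
HUMAN_SCORE = 77.1


def gap_to_human(scores, human):
    """Return {model: human - score}, rounded to one decimal place."""
    return {model: round(human - score, 1) for model, score in scores.items()}


gaps = gap_to_human(model_scores, HUMAN_SCORE)
best_model = min(gaps, key=gaps.get)  # smallest gap = strongest model

# Print models from smallest to largest gap.
for model, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{model:24s} {gap:5.1f}")
```

Under these numbers, even the strongest model in the table, RoBERTa-wwm-ext-large, trails the human score by 7.1 points.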

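The strongest current Chinese models remain well behind humans on the CBLUE tasks: the human average is 77.1, while most model averages fall between about 66 and 70 across tasks (ALBERT-tiny is lower, at 61.1).
Larger models generally perform better, but the gain depends on the task and does not hold for every task.
Neither whole-word masking nor medical-domain pre-training improves performance uniformly across tasks, which points to task-specific challenges in Chinese biomedical NLP.
Transfer-learning settings (non-i.i.d., CHIP-STS style) reveal a notable generalization gap between the training and test distributions.
Case studies show errors caused by ambiguity, the need for domain knowledge, overlapping entities, colloquial language, and annotation issues, underscoring the linguistic and domain-specific complexity of Chinese biomedical text.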
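
The overlapping-entity error type can be illustrated with a short sketch: when one entity is nested inside another, a flat one-label-per-character tagging scheme cannot represent both. The example string, the span labels, and the function name `overlapping_pairs` are hypothetical and chosen only for illustration; they are not drawn from the CBLUE data or code.

```python
from itertools import combinations


def overlapping_pairs(spans):
    """Return pairs of entity spans that share at least one character.

    Each span is (start, end, label) with an exclusive end index.
    """
    pairs = []
    for a, b in combinations(sorted(spans), 2):
        # Two half-open intervals overlap when each starts before the other ends.
        if a[0] < b[1] and b[0] < a[1]:
            pairs.append((a, b))
    return pairs


# Hypothetical example: a body-part mention nested inside a disease mention.
text = "急性阑尾炎"
spans = [(0, 5, "dis"), (2, 4, "bod")]

for a, b in overlapping_pairs(spans):
    print(text[a[0]:a[1]], a[2], "overlaps", text[b[0]:b[1]], b[2])
```

A check like this is one simple way to count how many annotated entities a flat tagger would be unable to recover, before deciding whether a span-based or layered model is needed.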

This review was generated by AI and reviewed by human editors.