QUICK REVIEW

[Paper Review] A pre-training technique to localize medical BERT and enhance BioBERT.

Shoya Wada, Toshihiro Takeda|arXiv (Cornell University)|May 14, 2020

Biomedical Text Mining and Ontologies | Cited by 9

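ONE-SENTENCE SUMMARY

This paper proposes a pre-training technique that trains BERT on small medical corpora in English and Japanese, with the aim of enhancing BioBERT and supporting biomedical text mining in languages where medical text is scarce. Using a limited amount of high-quality medical text, the method yields ouBioBERT, whose total score on the 10-dataset BLUE benchmark is 1.0 point above that of BioBERT.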

ABSTRACT

Bidirectional Encoder Representations from Transformers (BERT) models for biomedical specialties such as BioBERT and clinicalBERT have significantly improved in biomedical text-mining tasks and enabled us to extract valuable information from biomedical literature. However, we benefitted only in English because of the significant scarcity of high-quality medical documents, such as PubMed, in each language. Therefore, we propose a method that realizes a high-performance BERT model by using a small corpus. We introduce the method to train a BERT model on a small medical corpus both in English and Japanese, respectively, and then we evaluate each of them in terms of the biomedical language understanding evaluation (BLUE) benchmark and the medical-document-classification task in Japanese, respectively. After confirming their satisfactory performances, we apply our method to develop a model that outperforms the pre-existing models. Bidirectional Encoder Representations from Transformers for Biomedical Text Mining by Osaka University (ouBioBERT) achieves the best scores on 7 of the 10 datasets in terms of the BLUE benchmark. The total score is 1.0 points above that of BioBERT.

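MOTIVATION AND OBJECTIVES

Address the scarcity of high-quality biomedical text in languages other than English for pre-training BERT models.
Develop a high-performance BERT model using only a small medical corpus in a low-resource language such as Japanese.
Adapt BERT pre-training techniques to improve biomedical language understanding in low-resource settings.
Outperform existing models such as BioBERT and clinicalBERT on benchmark evaluations.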

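PROPOSED METHOD

Pre-train BERT models on small but high-quality biomedical corpora in English and in Japanese.
Adapt the standard BERT architecture with domain-specific text so that it focuses on biomedical terminology and context.
Fine-tune the model on biomedical language understanding tasks to strengthen its domain-specific representations.
Evaluate the models on the BLUE benchmark and on a Japanese medical document classification task to validate their effectiveness.
Use bidirectional attention to capture contextual dependencies in medical text.
Optimize the model with masked language modeling and next sentence prediction on the limited medical corpus.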
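
The masked language modeling objective mentioned above can be illustrated with a minimal sketch. It follows the standard BERT masking recipe (select about 15% of positions; replace 80% of those with [MASK], 10% with a random token, and leave 10% unchanged). This summary does not state the authors' exact masking settings, so the probabilities, the function name mask_tokens, the example sentence, and the toy vocabulary are all assumptions made for illustration.

```python
import random

MASK_TOKEN = "[MASK]"


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=None):
    """Return (masked_tokens, labels) using the standard BERT masking recipe.

    labels[i] holds the original token at positions the model must predict
    and None everywhere else. All names and defaults here are illustrative.
    """
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, token in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue  # position not selected for prediction
        labels[i] = token  # the model must recover the original token here
        roll = rng.random()
        if roll < 0.8:
            masked[i] = MASK_TOKEN          # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)   # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


# Hypothetical example sentence and toy vocabulary (not from the paper).
tokens = "the patient was treated with aspirin for chest pain".split()
vocab = sorted(set(tokens))
masked, labels = mask_tokens(tokens, vocab, seed=0)
print(masked)
print(labels)
```

With a small corpus, each sentence can be masked differently on every pass, so the same text yields many distinct training examples; whether the authors rely on this is not stated in the summary.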

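EXPERIMENTAL RESULTS

RESEARCH QUESTIONS

RQ1: Can a BERT model achieve high performance when pre-trained on a small biomedical corpus in a low-resource language?
RQ2: How does a model pre-trained on limited medical text compare with the existing BioBERT and clinicalBERT models?
RQ3: To what extent does domain-specific pre-training improve biomedical language understanding in low-resource settings?
RQ4: How does using only a small, high-quality medical corpus affect downstream task performance?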

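KEY FINDINGS

ouBioBERT achieves the best score on 7 of the 10 datasets in the BLUE benchmark, outperforming BioBERT.
ouBioBERT's total score on the BLUE benchmark is 1.0 point above that of BioBERT.
The method also performs well on the Japanese medical document classification task, supporting its effectiveness in low-resource settings.
Pre-training on a small, high-quality medical corpus gives results that are competitive with models trained on larger general-domain corpora.
The method improves biomedical language understanding even when training data is limited.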
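The method is effective in both English and Japanese, which suggests that it carries over across languages in biomedical natural language processing.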

This review was generated by AI and checked by human editors.