QUICK REVIEW

[Paper Review] A Comparison of Word Embeddings for the Biomedical Natural Language Processing

Yanshan Wang, Sijia Liu | arXiv (Cornell University) | Feb 1, 2018

Text Readability and Simplification | Cited by 28

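One-Sentence Summary

This study evaluates word embeddings trained on four different corpora (electronic health records (EHR), biomedical literature (MedLit), Wikipedia, and news) for biomedical natural language processing. Using intrinsic evaluation together with extrinsic evaluation on clinical information extraction, biomedical information retrieval, and relation extraction tasks, it finds that embeddings trained on EHR and MedLit capture medical semantics better and outperform general-domain embeddings (such as GloVe and Google News) in clinical settings, although no single embedding is consistently best across all tasks.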

ABSTRACT

Word embeddings have been widely used in biomedical Natural Language Processing (NLP) applications as they provide vector representations of words capturing the semantic properties of words and the linguistic relationship between words. Many biomedical applications use different textual resources (e.g., Wikipedia and biomedical articles) to train word embeddings and apply these word embeddings to downstream biomedical applications. However, there has been little work on evaluating the word embeddings trained from these resources. In this study, we provide an empirical evaluation of word embeddings trained from four different resources, namely clinical notes, biomedical publications, Wikipedia, and news. We performed the evaluation qualitatively and quantitatively. For the qualitative evaluation, we manually inspected five most similar medical words to a given set of target medical words, and then analyzed word embeddings through the visualization of those word embeddings. For the quantitative evaluation, we conducted both intrinsic and extrinsic evaluation. Based on the evaluation results, we can draw the following conclusions. First, the word embeddings trained on clinical notes and biomedical publications can capture the semantics of medical terms better, and find more relevant similar medical terms, and are closer to human experts' judgments, compared to those trained on Wikipedia and news. Second, there does not exist a consistent global ranking of word embedding quality for downstream biomedical NLP applications. However, adding word embeddings as extra features will improve results on most downstream tasks. Finally, the word embeddings trained on biomedical domain corpora do not necessarily have better performance than those trained on other general domain corpora for any downstream biomedical NLP tasks.

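Motivation and Objectives

Evaluate how word embeddings trained on diverse corpora (EHR, biomedical literature, Wikipedia, and news) perform in biomedical NLP applications.
Determine whether embeddings trained on biomedical-specific corpora outperform those from general-domain sources such as Wikipedia and news.
Assess the impact of word embeddings when used as features in downstream biomedical NLP tasks such as information extraction, information retrieval, and relation extraction.
Investigate whether local, institution-specific EHR data yields embeddings that beat publicly available pretrained embeddings on clinical NLP tasks.
Examine how well word embeddings generalize and transfer across different biomedical NLP applications and institutions.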

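Proposed Method

Train word embeddings with the skip-gram model with negative sampling on unstructured EHR data from Mayo Clinic and on PubMed Central (MedLit) articles.
Use the publicly available pretrained GloVe and Google News embeddings as baselines for comparison.
Perform a qualitative evaluation by manually inspecting the five most similar words for selected medical terms (disorders, symptoms, drugs) and by visualizing 377 medical terms in two-dimensional space.
Perform an intrinsic evaluation on four benchmark datasets (Pedersen, Hliaoutakis, MayoSRS, UMNSRS) to measure semantic similarity between medical terms.
Perform an extrinsic evaluation on three downstream tasks: clinical information extraction (from the BioCreative V IE challenge), biomedical information retrieval (the BioASQ challenge), and relation extraction (the BioCreative V RE challenge).
Report F1 scores on each task to compare embeddings trained on different corpora, with the embeddings used as additional features in machine learning models.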
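The qualitative and intrinsic steps above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function names, the three-dimensional toy vectors, and the rated pairs are all hypothetical, and Pearson correlation is an assumption here, since this summary does not say which correlation measure the paper used against the human ratings.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(embeddings, target, k=5):
    """Return the k words closest to `target` by cosine similarity."""
    target_vec = embeddings[target]
    scored = [
        (word, cosine(target_vec, vec))
        for word, vec in embeddings.items()
        if word != target
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (std_x * std_y)


def intrinsic_score(embeddings, rated_pairs):
    """Correlate model similarity with human ratings.

    Pairs containing an out-of-vocabulary word are skipped.
    """
    model_scores, human_scores = [], []
    for w1, w2, rating in rated_pairs:
        if w1 in embeddings and w2 in embeddings:
            model_scores.append(cosine(embeddings[w1], embeddings[w2]))
            human_scores.append(rating)
    return pearson(model_scores, human_scores)


# Made-up toy vectors and ratings, for illustration only
# (not taken from the paper or from any benchmark dataset).
TOY_EMBEDDINGS = {
    "diabetes": [0.9, 0.1, 0.0],
    "mellitus": [0.8, 0.2, 0.1],
    "insulin": [0.7, 0.3, 0.0],
    "fracture": [0.0, 0.9, 0.2],
    "aspirin": [0.1, 0.2, 0.9],
}
TOY_PAIRS = [
    ("diabetes", "insulin", 3.5),
    ("diabetes", "fracture", 1.0),
    ("aspirin", "fracture", 1.5),
]

if __name__ == "__main__":
    print(most_similar(TOY_EMBEDDINGS, "diabetes", k=2))
    print(intrinsic_score(TOY_EMBEDDINGS, TOY_PAIRS))
```

With real embeddings, the same two functions would be run once per corpus (EHR, MedLit, GloVe, Google News) so that the neighbor lists and the correlation scores can be compared side by side.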

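Experimental Results

Research Questions

RQ1: Do word embeddings trained on clinical notes (EHR) and biomedical literature (MedLit) capture medical semantics more accurately than embeddings trained on general-domain corpora such as Wikipedia and news?
RQ2: Is there a consistent ranking of word embeddings across different downstream biomedical NLP tasks, or does performance vary by task?
RQ3: Can embeddings trained on non-biomedical, general-domain corpora (e.g., news, Wikipedia) match or even exceed the performance of embeddings trained on biomedical-specific corpora?
RQ4: How much do institution-specific EHR embeddings improve performance on local clinical NLP tasks compared with publicly available pretrained embeddings?
RQ5: Does adding word embeddings as extra features consistently improve performance across diverse biomedical NLP applications?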

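Key Findings

Word embeddings trained on EHR achieved the highest F1 score (0.900) on the clinical information extraction task, outperforming all other embeddings.
MedLit-trained embeddings also captured medical semantics well, with an F1 score of 0.889 on the IE task and 0.788 on the RE task.
The intrinsic evaluation confirmed that EHR-trained embeddings produced semantic similarity scores closest to human experts' judgments on all four datasets (Pedersen, Hliaoutakis, MayoSRS, UMNSRS).
On the biomedical information retrieval task, no word embedding improved on the baseline, indicating that embeddings offer limited benefit in that particular setting.
Google News embeddings achieved the best F1 score (0.790) on the relation extraction task, ahead of both the EHR and MedLit embeddings.
Despite the difference in corpus domain, embeddings from general-domain sources (GloVe and Google News) performed on par with or better than biomedical-corpus embeddings on some tasks, indicating that domain-specific embeddings hold no consistent advantage.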
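For readers less familiar with the metric behind the scores above, here is a minimal sketch of how an F1 score is derived from a set of gold annotations and a set of system predictions. The function name and the example item IDs are hypothetical; the paper's actual scoring scripts are not described in this summary.

```python
def precision_recall_f1(gold, predicted):
    """Compute precision, recall, and F1 over two collections of item IDs."""
    gold_set, pred_set = set(gold), set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical IDs: 4 gold items, 5 predictions, 3 of them correct.
    p, r, f1 = precision_recall_f1([1, 2, 3, 4], [2, 3, 4, 5, 6])
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is a harmonic mean, a gap as small as 0.788 versus 0.790 can come from a handful of differing predictions, which is one reason the ranking between embeddings shifts from task to task.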

This review was generated by AI and reviewed by a human editor.