QUICK REVIEW

[Paper Review] Unsupervised Pre-training for Biomedical Question Answering

Vaishnavi Kommaraju, Karthick Prasad Gunasekaran|arXiv (Cornell University)|Sep 27, 2020

Topic Modeling|References: 29|Cited by: 37

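One-Sentence Summary

This paper evaluates BioBERT and SciBERT on biomedical question answering and introduces a self-supervised denoising pre-training task that corrupts biomedical entity mentions in order to improve QA performance on the BioASQ tasks.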

ABSTRACT

We explore the suitability of unsupervised representation learning methods on biomedical text -- BioBERT, SciBERT, and BioSentVec -- for biomedical question answering. To further improve unsupervised representations for biomedical QA, we introduce a new pre-training task from unlabeled data designed to reason about biomedical entities in the context. Our pre-training method consists of corrupting a given context by randomly replacing some mention of a biomedical entity with a random entity mention and then querying the model with the correct entity mention in order to locate the corrupted part of the context. This de-noising task enables the model to learn good representations from abundant, unlabeled biomedical text that helps QA tasks and minimizes the train-test mismatch between the pre-training task and the downstream QA tasks by requiring the model to predict spans. Our experiments show that pre-training BioBERT on the proposed pre-training task significantly boosts performance and outperforms the previous best model from the 7th BioASQ Task 7b-Phase B challenge.
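The corruption step described in the abstract can be illustrated with a minimal sketch. It assumes entity mentions are already available as character offsets (in practice they would come from an entity tagger, which is not shown), and the names `make_denoising_example`, `mentions`, and `entity_pool` are hypothetical, not taken from the paper.

```python
import random


def make_denoising_example(context, mentions, entity_pool, rng):
    """Corrupt one entity mention and return a span-prediction example.

    context:     the original passage (str)
    mentions:    list of (start, end) character offsets of entity mentions
    entity_pool: candidate replacement mentions (list of str), an assumption
    rng:         a random.Random instance, passed in for reproducibility
    """
    start, end = rng.choice(mentions)
    original = context[start:end]
    # The replacement must differ from the mention it replaces.
    candidates = [e for e in entity_pool if e != original]
    if not candidates:
        raise ValueError("entity_pool has no replacement different from the mention")
    replacement = rng.choice(candidates)
    corrupted = context[:start] + replacement + context[end:]
    return {
        "question": original,      # the correct mention is used as the query
        "context": corrupted,      # passage containing the wrong mention
        "answer_start": start,     # span to predict: the corrupted part
        "answer_end": start + len(replacement),
        "answer_text": replacement,
    }


context = "Aspirin inhibits cyclooxygenase and reduces fever."
mention_texts = ["Aspirin", "cyclooxygenase"]
mentions = [(context.index(m), context.index(m) + len(m)) for m in mention_texts]
entity_pool = ["insulin", "BRCA1", "metformin"]  # toy pool for illustration

example = make_denoising_example(context, mentions, entity_pool, random.Random(0))
print(example)
```

Because the target is a span inside the context, an example built this way has the same shape as an extractive QA instance, which is the train-test alignment the abstract refers to.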

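Motivation and Objectives

Evaluate the effectiveness of BioBERT and SciBERT on the BioASQ factoid, list, and yes/no question answering tasks.
Study transfer learning from general-domain QA datasets (e.g., SQuAD) to biomedical QA.
Propose a self-supervised denoising pre-training task that uses unlabeled biomedical text to improve QA representations.
Assess whether unsupervised pre-training outperforms existing baselines on the BioASQ 7b/8b datasets.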

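Proposed Method

Fine-tune BioBERT and SciBERT on BioASQ data to handle yes/no, factoid, and list questions.
Add extra fine-tuning data from SQuAD and PubMedQA, along with denoising (unsupervised) data.
Develop a self-supervised denoising pre-training task in which a biomedical entity in the context is corrupted and the model must locate the corrupted span, using the correct entity as the query.
Optionally use BioSentVec embeddings to compute similarity, and combine it with BioBERT/SciBERT scores to strengthen predictions.
Train task-specific layers (CLS-based for yes/no; start/end span prediction for factoid/list) and fine-tune all weights end to end.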
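To make the task-specific outputs more concrete, here is a rough sketch of how per-token start/end scores and a CLS score might be turned into answers. The decoding rules (additive span score, a length cap, top-k spans for list questions, a sigmoid threshold for yes/no) are common practice and are assumptions here, not details confirmed by the paper; all function names are hypothetical.

```python
import math


def top_spans(start_scores, end_scores, k=1, max_len=10):
    """Return the k highest-scoring (start, end, score) spans.

    A span (i, j) with i <= j < i + max_len is scored as
    start_scores[i] + end_scores[j].
    """
    spans = []
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            spans.append((i, j, s + end_scores[j]))
    spans.sort(key=lambda span: span[2], reverse=True)
    return spans[:k]


def yes_no(cls_logit, threshold=0.5):
    """Map a single CLS-based logit to 'yes' or 'no' via a sigmoid."""
    prob_yes = 1.0 / (1.0 + math.exp(-cls_logit))
    return "yes" if prob_yes >= threshold else "no"


# Toy scores for a four-token context (made-up numbers).
start_scores = [0.1, 2.0, 0.3, 0.0]
end_scores = [0.0, 0.5, 1.5, 0.2]

factoid_answer = top_spans(start_scores, end_scores, k=1)[0]  # single best span
list_answers = top_spans(start_scores, end_scores, k=3)       # several spans
print(factoid_answer, list_answers, yes_no(2.0))
```

In this sketch factoid and list questions share the same span scorer and differ only in how many spans are kept, which mirrors the shared start/end head mentioned above.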

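Experimental Results

Research Questions

RQ1: How do BioBERT and SciBERT perform on yes/no, factoid, and list questions in the BioASQ 7b/8b biomedical QA tasks?
RQ2: Does pre-training on unlabeled biomedical data with a denoising objective improve QA performance over standard fine-tuning?
RQ3: Does transfer from external QA datasets (general-domain SQuAD and biomedical PubMedQA) improve biomedical QA performance?
RQ4: How much do BioSentVec embeddings contribute to QA performance?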

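Key Findings

Compared with the baselines, self-supervised denoising improves performance on yes/no, factoid, and list questions.
BioBERT and SciBERT perform comparably on biomedical QA across multiple data configurations.
Fine-tuning with additional QA data (SQuAD, PubMedQA) improves biomedical QA results.
BioSentVec can complement BioBERT/SciBERT but is not strong on its own.
Denoising pre-training yields gains even with noisy unsupervised data, and it requires fewer training epochs.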
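One way a sentence-embedding similarity could complement a BioBERT/SciBERT score is a simple weighted sum, sketched below. The summary does not say how the two signals are actually combined, so the linear interpolation, the weight `alpha`, and the toy vectors are all assumptions for illustration; real BioSentVec vectors would come from the pretrained model, which is not loaded here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def combined_score(model_score, question_vec, candidate_vec, alpha=0.7):
    """Hypothetical blend: alpha * model score + (1 - alpha) * similarity."""
    return alpha * model_score + (1.0 - alpha) * cosine(question_vec, candidate_vec)


# Toy 2-d vectors standing in for sentence embeddings.
question_vec = [1.0, 0.0]
candidates = {
    "candidate A": (0.6, [1.0, 0.0]),  # (model score, embedding)
    "candidate B": (0.7, [0.0, 1.0]),
}

ranked = sorted(
    candidates,
    key=lambda name: combined_score(candidates[name][0], question_vec, candidates[name][1]),
    reverse=True,
)
print(ranked)
```

With `alpha` close to 1 the ranking follows the model score alone, which is consistent with treating the embedding similarity as a weak, complementary signal.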

This review was generated by AI and reviewed by human editors.