QUICK REVIEW

[Paper Review] Unsupervised Pre-training for Biomedical Question Answering

Vaishnavi Kommaraju, Karthick Prasad Gunasekaran|arXiv (Cornell University)|Sep 27, 2020

Topic Modeling|References: 29|Citations: 37

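One-Sentence Summary

This paper evaluates BioBERT and SciBERT on biomedical QA and introduces a self-supervised denoising pre-training task that corrupts biomedical entity mentions, improving QA performance on the BioASQ tasks.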

ABSTRACT

We explore the suitability of unsupervised representation learning methods on biomedical text -- BioBERT, SciBERT, and BioSentVec -- for biomedical question answering. To further improve unsupervised representations for biomedical QA, we introduce a new pre-training task from unlabeled data designed to reason about biomedical entities in the context. Our pre-training method consists of corrupting a given context by randomly replacing some mention of a biomedical entity with a random entity mention and then querying the model with the correct entity mention in order to locate the corrupted part of the context. This de-noising task enables the model to learn good representations from abundant, unlabeled biomedical text that helps QA tasks and minimizes the train-test mismatch between the pre-training task and the downstream QA tasks by requiring the model to predict spans. Our experiments show that pre-training BioBERT on the proposed pre-training task significantly boosts performance and outperforms the previous best model from the 7th BioASQ Task 7b-Phase B challenge.
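The corruption step described in the abstract can be illustrated with a minimal sketch. It assumes entity mentions are already located as character offsets and that a pool of replacement mentions is available; the function name `make_denoising_example`, the field names, and the toy sentence are hypothetical and are not taken from the paper.

```python
import random

def make_denoising_example(context, mentions, entity_vocab, rng):
    """Corrupt one entity mention and build a span-prediction example.

    context:      original passage (str)
    mentions:     list of (start, end) character offsets of entity mentions
    entity_vocab: pool of entity mention strings to sample a replacement from
    rng:          random.Random instance, passed in for reproducibility
    """
    start, end = rng.choice(mentions)
    original = context[start:end]
    # Sample a replacement that differs from the original mention.
    replacement = rng.choice([e for e in entity_vocab if e != original])
    corrupted = context[:start] + replacement + context[end:]
    return {
        "question": original,                    # the correct mention is the query
        "context": corrupted,                    # passage with one corrupted mention
        "answer_start": start,                   # span of the corrupted part
        "answer_end": start + len(replacement),
        "answer_text": replacement,
    }

# Toy data (hypothetical); a real pipeline would take mentions from an entity tagger.
context = "Mutations in BRCA1 increase the risk of breast cancer."
mention_texts = ["BRCA1", "breast cancer"]
mentions = [(context.index(m), context.index(m) + len(m)) for m in mention_texts]
entity_vocab = ["TP53", "lung cancer", "insulin", "BRCA1"]

example = make_denoising_example(context, mentions, entity_vocab, random.Random(0))
print(example)
```

Because the target is a span inside the corrupted context, the example has the same shape as an extractive QA instance, which is the train-test alignment the abstract points to.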

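Motivation and Objectives

Evaluate how effective BioBERT and SciBERT are on the BioASQ factoid, list, and yes/no QA tasks.
Investigate transfer learning from general-domain QA datasets (e.g., SQuAD) to biomedical QA.
Propose a self-supervised denoising pre-training task on unlabeled biomedical text to improve QA representations.
Assess whether unsupervised pre-training yields gains over existing baselines on the BioASQ 7b/8b datasets.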

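Proposed Method

Fine-tune BioBERT and SciBERT on BioASQ data for yes/no, factoid, and list questions.
Incorporate additional fine-tuning data from SQuAD, PubMedQA, and the denoising (unsupervised) data.
Develop a self-supervised denoising pre-training task in which a biomedical entity mention in the context is corrupted and the correct entity is used as the query to locate the corrupted span.
Compute similarity with BioSentVec embeddings and combine it with the BioBERT/SciBERT scores to reinforce predictions.
Train task-specific layers (a [CLS]-based classifier for yes/no; start/end span prediction for factoid/list) and fine-tune all weights end to end.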
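The start/end span prediction used for factoid and list questions can be sketched as follows. This is a generic decoding routine for extractive QA, not the authors' code; the per-token scores are made-up numbers standing in for model outputs, and `max_len` is an assumed cap on answer length.

```python
def best_span(start_scores, end_scores, max_len=10):
    """Return (start, end, score) maximizing start_scores[i] + end_scores[j]
    subject to i <= j < i + max_len."""
    best = (0, 0, float("-inf"))
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            score = s + end_scores[j]
            if score > best[2]:
                best = (i, j, score)
    return best

# Hypothetical per-token scores, as a span head might emit for this context.
tokens = ["BRCA1", "is", "linked", "to", "breast", "cancer", "."]
start_scores = [0.1, -1.0, -0.5, -1.2, 2.3, 0.4, -2.0]
end_scores = [0.0, -1.1, -0.7, -1.0, 0.2, 2.9, -1.5]

i, j, score = best_span(start_scores, end_scores)
answer = " ".join(tokens[i:j + 1])
print(answer)  # breast cancer
```

For list questions, one common option is to keep the top-k spans rather than only the best one; whether the paper does this is not stated in this review.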

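Experimental Results

Research Questions

RQ1: How do BioBERT and SciBERT perform on the BioASQ 7b/8b biomedical QA tasks (yes/no, factoid, and list questions)?
RQ2: Does pre-training on unlabeled biomedical data with a denoising objective improve QA performance compared with standard fine-tuning?
RQ3: Does transfer from other QA datasets (general-domain SQuAD and biomedical PubMedQA) improve biomedical QA performance?
RQ4: What is the relative contribution of BioSentVec embeddings to QA performance?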

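Key Findings

Self-supervised denoising improves performance over the baselines on yes/no, factoid, and list questions.
BioBERT and SciBERT perform comparably on biomedical QA across multiple data configurations.
Fine-tuning on additional QA data (general-domain SQuAD and biomedical PubMedQA) improves biomedical QA results.
BioSentVec can complement BioBERT/SciBERT but is not strong on its own.
Denoising pre-training yields gains even with noisy unsupervised data and reduces the number of epochs needed.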
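One way the complementary role of sentence embeddings could work is to interpolate a model score with a cosine similarity between question and candidate vectors. The sketch below is an assumption for illustration only: this review does not describe the paper's actual combination rule, the weight `alpha` is hypothetical, and the two-dimensional vectors are toy stand-ins for real BioSentVec embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rerank(question_vec, candidates, alpha=0.7):
    """Rank candidates by alpha * model_score + (1 - alpha) * cosine similarity.

    candidates: list of (text, model_score, sentence_vec); alpha is an assumed weight.
    """
    scored = [
        (alpha * model_score + (1 - alpha) * cosine(question_vec, vec), text)
        for text, model_score, vec in candidates
    ]
    return sorted(scored, reverse=True)

# Toy vectors (hypothetical); real sentence embeddings have hundreds of dimensions.
question_vec = [1.0, 0.0]
candidates = [
    ("candidate A", 0.60, [0.0, 1.0]),  # higher model score, orthogonal to the question
    ("candidate B", 0.55, [1.0, 0.0]),  # lower model score, aligned with the question
]
ranked = rerank(question_vec, candidates)
print(ranked[0][1])  # candidate B
```

Here the similarity term overturns a small gap in model score, which is the kind of complementary signal the finding describes; on its own, similarity carries no span information.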

This review was generated by AI and checked by a human editor.