QUICK REVIEW

[Paper Review] A Comparative Study of Transformer-Based Language Models on Extractive Question Answering

Kate Pearce, Tiffany Zhan|arXiv (Cornell University)|Oct 7, 2021

Topic Modeling | References: 20 | Cited by: 22

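ONE-SENTENCE SUMMARY

This study evaluates pre-trained transformer-based language models on extractive question answering across multiple datasets, comparing RoBERTa, BART, BERT, ALBERT, XLNet, and ConvBERT. It also proposes a BERT-BiLSTM ensemble model to improve generalization: RoBERTa and BART achieve the highest F1 scores on all datasets, while the BERT-BiLSTM model scores at least 1% higher F1 than BERT on every dataset.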

ABSTRACT

Question Answering (QA) is a task in natural language processing that has seen considerable growth after the advent of transformers. There has been a surge in QA datasets that have been proposed to challenge natural language processing models to improve human and existing model performance. Many pre-trained language models have proven to be incredibly effective at the task of extractive question answering. However, generalizability remains as a challenge for the majority of these models. That is, some datasets require models to reason more than others. In this paper, we train various pre-trained language models and fine-tune them on multiple question answering datasets of varying levels of difficulty to determine which of the models are capable of generalizing the most comprehensively across different datasets. Further, we propose a new architecture, BERT-BiLSTM, and compare it with other language models to determine if adding more bidirectionality can improve model performance. Using the F1-score as our metric, we find that the RoBERTa and BART pre-trained models perform the best across all datasets and that our BERT-BiLSTM model outperforms the baseline BERT model.

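MOTIVATION AND OBJECTIVES

Evaluate how well several pre-trained transformer models generalize on extractive question answering across datasets of varying complexity.
Investigate whether adding a bidirectional long short-term memory (BiLSTM) layer to the BERT architecture improves extractive QA performance.
Assess how dataset difficulty, ranging from SQuAD (answers directly extractable) to QuAC, NewsQA, and CovidQA (more complex reasoning required), affects model generalization.
Identify the most effective model architecture for extractive QA by comparing F1 scores across multiple datasets and model variants.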
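
One way to read "adding a BiLSTM layer to BERT" is sketched below in equations. This summary does not state the layer sizes or the output head, so the LSTM hidden size k, the linear start/end scorers (w_s, b_s, w_e, b_e), and the softmax over positions are assumptions borrowed from the usual extractive QA setup, not details confirmed by the paper. Here x_1, ..., x_T are the input tokens and d is the BERT hidden size (768 for BERT-base).

```latex
\begin{aligned}
H &= \mathrm{BERT}(x_1,\dots,x_T) \in \mathbb{R}^{T \times d} \\
\overrightarrow{h}_t &= \overrightarrow{\mathrm{LSTM}}\big(H_t,\ \overrightarrow{h}_{t-1}\big), \qquad
\overleftarrow{h}_t = \overleftarrow{\mathrm{LSTM}}\big(H_t,\ \overleftarrow{h}_{t+1}\big) \\
z_t &= \big[\overrightarrow{h}_t \,;\, \overleftarrow{h}_t\big] \in \mathbb{R}^{2k} \\
s_t &= w_s^{\top} z_t + b_s, \qquad e_t = w_e^{\top} z_t + b_e \\
p^{\text{start}} &= \mathrm{softmax}(s_1,\dots,s_T), \qquad
p^{\text{end}} = \mathrm{softmax}(e_1,\dots,e_T)
\end{aligned}
```

Under this reading, the predicted answer is the span (i, j) with i ≤ j that maximizes s_i + e_j. The BiLSTM adds a recurrent forward and backward pass over representations that BERT's self-attention has already contextualized in both directions.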

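PROPOSED METHOD

Fine-tuned the base versions of RoBERTa, BART, BERT, ALBERT, XLNet, and ConvBERT on four extractive QA datasets: SQuAD 2.0, QuAC, NewsQA, and CovidQA.
Built input sequences by concatenating the context and the question, tokenized them with WordPiece and SentencePiece tokenizers, and truncated sequences to a maximum of 512 tokens.
Constructed a novel BERT-BiLSTM ensemble model by stacking BiLSTM layers on top of the contextual representations of the BERT base model to strengthen sequence modeling.
Trained for 3 epochs with the Adam optimizer, a fixed initial learning rate of 5e-5, and a batch size of 8 on two NVIDIA Quadro RTX 8000 GPUs.
Evaluated model performance with the F1 score, computed as the harmonic mean of precision and recall over the predicted start-to-end token span.
Normalized all inputs to lowercase and applied consistent tokenization to keep results comparable across datasets and models.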
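
The span-level F1 metric described above can be illustrated with a minimal sketch. It follows the token-overlap convention popularized by the SQuAD evaluation script; whether the paper's evaluation code matches it exactly is an assumption. The function names `token_f1` and `span_f1` are hypothetical, and whitespace tokens stand in for real subword tokens.

```python
from collections import Counter


def token_f1(predicted_tokens, gold_tokens):
    """Harmonic mean of token-level precision and recall between two answers."""
    # Unanswerable-question convention (assumed, as in SQuAD 2.0):
    # two empty answers match perfectly, one empty answer scores zero.
    if not predicted_tokens or not gold_tokens:
        return float(predicted_tokens == gold_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(predicted_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(predicted_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def span_f1(context_tokens, pred_start, pred_end, gold_start, gold_end):
    """F1 between a predicted and a gold span, both given as inclusive indices."""
    predicted = context_tokens[pred_start:pred_end + 1]
    gold = context_tokens[gold_start:gold_end + 1]
    return token_f1(predicted, gold)


if __name__ == "__main__":
    context = "the bert model was released by google in 2018".split()
    # Prediction "by google in 2018" against gold "google".
    print(span_f1(context, 5, 8, 6, 6))
```

In the usage example the prediction covers four tokens and shares one with the single-token gold answer, so precision is 0.25, recall is 1.0, and the F1 is about 0.4. An over-long span is penalized but still earns partial credit, which is why F1 is more forgiving than exact match.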

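EXPERIMENTAL RESULTS

Research Questions

RQ1: Which pre-trained transformer language model generalizes best across extractive QA datasets of varying difficulty?
RQ2: How does adding a BiLSTM layer to the BERT architecture affect extractive QA performance?
RQ3: To what extent do RoBERTa and BART outperform the other models across QA benchmarks, including those that require reasoning?
RQ4: Why do models perform worse on long-context datasets such as CovidQA, and how does context length affect model performance?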

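Key Findings

RoBERTa and BART achieved the highest F1 scores on all four datasets, indicating stronger generalization and robustness on extractive QA.
The BERT-BiLSTM model scored at least 1% higher F1 than the base BERT model on every dataset, indicating that additional bidirectional modeling improves performance.
Models performed best on SQuAD 2.0, where answers are directly extractable and contexts are short, and dropped markedly on QuAC, where questions are open-ended and require complex reasoning.
Performance on NewsQA was strong, second only to SQuAD, indicating that RoBERTa and BART can handle complex reasoning tasks effectively.
Models performed poorly on CovidQA because of its longer contexts and limited training data; the effect is especially pronounced for models with a fixed maximum sequence length of 512 tokens.
RoBERTa's advantage is attributed to its omission of the next-sentence prediction task, which leaves its pre-training better aligned with the span-prediction objective of extractive QA, an objective closely related to masked language modeling.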
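
The CovidQA finding above can be made concrete with a small sketch of how a fixed 512-token limit interacts with long contexts. The layout `[CLS] question [SEP] context [SEP]` is the common BERT-style convention and is an assumption here, since this summary only says that context and question are concatenated. The helper names `build_input` and `answer_survives` are hypothetical, and plain list items stand in for subword tokens.

```python
def build_input(question_tokens, context_tokens, max_len=512):
    """Concatenate question and context, truncating the context to fit max_len.

    Assumes the question alone is shorter than max_len - 3.
    Returns the final token list and the number of context tokens kept.
    """
    # Three positions are reserved for [CLS] and the two [SEP] markers.
    budget = max_len - len(question_tokens) - 3
    kept_context = context_tokens[:max(budget, 0)]
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + kept_context + ["[SEP]"]
    return tokens, len(kept_context)


def answer_survives(answer_start, answer_end, kept_context_len):
    """True if the gold span (inclusive context indices) was not cut off."""
    return 0 <= answer_start <= answer_end < kept_context_len


if __name__ == "__main__":
    question = ["q"] * 10
    context = [f"w{i}" for i in range(2000)]  # a long, CovidQA-like context
    tokens, kept = build_input(question, context)
    print(len(tokens), kept)
    print(answer_survives(100, 105, kept))    # answer near the start
    print(answer_survives(1200, 1203, kept))  # answer deep in the document
```

With a 10-token question only 499 context tokens fit, so any answer located past that point is invisible to the model, whatever its capacity. Sliding-window chunking with overlap is a common workaround, but this summary does not say whether the paper used one.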


This review was generated by AI and reviewed by human editors.