QUICK REVIEW

[Paper Review] Natural Language Understanding with the Quora Question Pairs Dataset

Lakshay Sharma, Laura Graesser | arXiv (Cornell University) | Jul 1, 2019

Topic Modeling | References: 20 | Cited by: 57

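One-Sentence Summary

This paper studies natural language understanding through duplicate question detection on the Quora dataset, finds that a simple Continuous Bag of Words model outperforms more complex recurrent and attention-based models, and notes subjectivity in the labels.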

ABSTRACT

This paper explores the task Natural Language Understanding (NLU) by looking at duplicate question detection in the Quora dataset. We conducted extensive exploration of the dataset and used various machine learning models, including linear and tree-based models. Our final finding was that a simple Continuous Bag of Words neural network model had the best performance, outdoing more complicated recurrent and attention based models. We also conducted error analysis and found some subjectivity in the labeling of the dataset.

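Research Motivation and Objectives

Study natural language understanding through duplicate question detection on the Quora dataset.
Evaluate a range of machine learning models, from linear and tree-based models to neural network architectures.
Determine which modeling approach performs best on this NLU task.
Conduct error analysis to understand labeling subjectivity and the dataset's limitations.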

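Proposed Method

Experiment with linear, tree-based, and neural network models on the Quora duplicate question task.
Use a CBOW neural network as the baseline and compare it against recurrent and attention-based models.
Evaluate each model's performance on the dataset empirically.
Conduct error analysis to examine labeling subjectivity in the Quora dataset.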
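
To make the CBOW idea concrete, here is a minimal sketch of a bag-of-words sentence encoder: each question is represented as the mean of its word vectors, so word order is discarded. This is not the authors' implementation. The function names, the 16-dimensional random embeddings (a stand-in for learned or pretrained vectors), and the cosine comparison are all assumptions made for illustration, and the example questions are invented.

```python
import math
import random


def tokenize(text):
    """Lowercase the text, replace punctuation with spaces, and split on whitespace."""
    cleaned = "".join(ch if ch.isalnum() or ch.isspace() else " " for ch in text.lower())
    return cleaned.split()


def build_embeddings(vocab, dim=16, seed=0):
    """Give each word a random vector (assumption: stands in for trained embeddings)."""
    rng = random.Random(seed)
    return {word: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for word in sorted(vocab)}


def cbow_encode(text, embeddings):
    """Encode a question as the mean of its known word vectors; order is ignored."""
    dim = len(next(iter(embeddings.values())))
    vectors = [embeddings[tok] for tok in tokenize(text) if tok in embeddings]
    if not vectors:
        return [0.0] * dim
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity between two vectors; returns 0.0 if either is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


# Invented example questions, not taken from the Quora dataset.
q1 = "How do I learn Python quickly?"
q2 = "Quickly, how do I learn Python?"
q3 = "What is the capital of France?"

vocab = set(tokenize(q1)) | set(tokenize(q2)) | set(tokenize(q3))
emb = build_embeddings(vocab)

same_words = cosine(cbow_encode(q1, emb), cbow_encode(q2, emb))
different_words = cosine(cbow_encode(q1, emb), cbow_encode(q3, emb))
```

Because q1 and q2 contain the same words in a different order, their encodings coincide up to floating-point rounding, which illustrates the order-insensitivity that distinguishes CBOW from recurrent models. A full model would presumably feed the two encodings into a trained classifier rather than compare them by cosine similarity.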

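Experimental Results

Research Questions

RQ1: Which family of machine learning models (linear, tree-based, or neural network) performs best on Quora duplicate question detection?
RQ2: Does a simple CBOW model outperform more complex recurrent and attention-based models on this task?
RQ3: Which labeling issues or sources of subjectivity affect the Quora dataset and model evaluation?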
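
For a question like RQ3, one plausible starting point is to separate a model's mistakes into false positives and false negatives, then read the pairs by hand to judge whether the gold label itself is debatable. The sketch below shows one way that could look; the helper name, the 0/1 label convention (1 meaning duplicate), and the toy pairs are hypothetical, and this summary does not describe the paper's actual error-analysis procedure in this detail.

```python
def collect_errors(pairs, labels, predictions):
    """Split misclassified question pairs into false positives and false negatives.

    Assumes labels and predictions use 1 for "duplicate" and 0 for "not duplicate".
    """
    false_positives, false_negatives = [], []
    for pair, gold, predicted in zip(pairs, labels, predictions):
        if predicted == 1 and gold == 0:
            false_positives.append(pair)
        elif predicted == 0 and gold == 1:
            false_negatives.append(pair)
    return false_positives, false_negatives


# Invented toy data for illustration only.
toy_pairs = [
    ("How can I lose weight?", "What is the best way to lose weight?"),
    ("What is machine learning?", "What is deep learning?"),
    ("How do I reset my password?", "How can I reset my password?"),
]
toy_labels = [0, 0, 1]        # hypothetical gold labels
toy_predictions = [1, 0, 0]   # hypothetical model outputs

fp, fn = collect_errors(toy_pairs, toy_labels, toy_predictions)
```

Here the first pair is a false positive whose gold label a reader might reasonably dispute, which is the kind of case a manual review of labeling subjectivity would look for.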

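Key Findings

Among the models explored, a simple Continuous Bag of Words neural network achieved the best performance.
More complex recurrent and attention-based models failed to outperform CBOW on this task.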

This review was generated by AI and reviewed by a human editor.