QUICK REVIEW

[Paper Review] DeepER -- Deep Entity Resolution

Muhammad Ebraheem, Saravanan Thirumuruganathan|arXiv (Cornell University)|Oct 2, 2017

Data Quality and Management | References: 50 | Cited by: 56

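One-Sentence Summary

DeepER introduces distributed representations (DRs) of tuples for entity resolution, built from word embeddings and LSTM-based composition, to reduce the need for labeled data, and uses LSH-based blocking for efficiency. It demonstrates competitive accuracy on benchmark and multilingual data.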

ABSTRACT

Entity resolution (ER) is a key data integration problem. Despite the efforts in 70+ years in all aspects of ER, there is still a high demand for democratizing ER - humans are heavily involved in labeling data, performing feature engineering, tuning parameters, and defining blocking functions. With the recent advances in deep learning, in particular distributed representation of words (a.k.a. word embeddings), we present a novel ER system, called DeepER, that achieves good accuracy, high efficiency, as well as ease-of-use (i.e., much less human effort). For accuracy, we use sophisticated composition methods, namely uni- and bi-directional recurrent neural networks (RNNs) with long short-term memory (LSTM) hidden units, to convert each tuple to a distributed representation (i.e., a vector), which can in turn be used to effectively capture similarities between tuples. We consider both the case where pre-trained word embeddings are available as well as the case where they are not; we present ways to learn and tune the distributed representations. For efficiency, we propose a locality sensitive hashing (LSH) based blocking approach that uses distributed representations of tuples; it takes all attributes of a tuple into consideration and produces much smaller blocks, compared with traditional methods that consider only a few attributes. For ease-of-use, DeepER requires much less human-labeled data and does not need feature engineering, compared with traditional machine learning based approaches which require handcrafted features and similarity functions along with their associated thresholds. We evaluate our algorithms on multiple datasets (including benchmarks, biomedical data, as well as multi-lingual data) and the extensive experimental results show that DeepER outperforms existing solutions.

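Motivation and Objectives

Reduce manual labeling and feature engineering in entity resolution while maintaining high accuracy.
Capture syntactic and semantic similarities between tuples without extensive feature engineering.
Provide a holistic, DR-based blocking method that efficiently limits comparisons while considering all attributes.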

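Proposed Method

Compute tuple DRs either by averaging word embeddings or by composing them with uni- or bi-directional RNNs with LSTM hidden units.
Train an end-to-end model that fine-tunes the DRs for a specific ER task to improve accuracy.
Feed the similarity vector between two tuples' DRs into a classifier that makes the match/non-match decision.
Introduce LSH-based blocking that uses tuple DRs to form blocks covering all attributes.
Explain how to handle out-of-vocabulary words and cases where the embedding dictionary offers only partial or minimal coverage.
Discuss tuning word embeddings by fine-tuning them or retrofitting them to domain resources.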
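
The averaging composition and the similarity vector described above can be illustrated with a minimal sketch. It assumes a toy in-memory embedding table (EMBEDDINGS, with made-up values) and one cosine similarity per attribute; both are illustrative assumptions, as are all function names, and none of this is the authors' implementation. The LSTM-based composition is not shown.

```python
import math

# Hypothetical toy embedding table. A real system would load pre-trained
# word vectors; these 3-dimensional values are invented for illustration.
EMBEDDINGS = {
    "apple": [0.9, 0.1, 0.0],
    "iphone": [0.8, 0.2, 0.1],
    "samsung": [0.1, 0.9, 0.0],
    "galaxy": [0.2, 0.8, 0.1],
    "phone": [0.5, 0.5, 0.2],
}
DIM = 3


def attribute_vector(text):
    """Average the embeddings of the in-vocabulary tokens of one attribute value.

    Unknown tokens are skipped; if no token is known, a zero vector is returned.
    """
    vectors = [EMBEDDINGS[tok] for tok in text.lower().split() if tok in EMBEDDINGS]
    if not vectors:
        return [0.0] * DIM
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cosine(u, v):
    """Cosine similarity; defined as 0.0 when either vector is all zeros."""
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(x * x for x in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(u, v)) / (norm_u * norm_v)


def similarity_vector(t1, t2, attributes):
    """One cosine similarity per attribute; this vector is the classifier input."""
    return [
        cosine(attribute_vector(t1[a]), attribute_vector(t2[a]))
        for a in attributes
    ]


if __name__ == "__main__":
    a = {"title": "apple iphone", "brand": "apple"}
    b = {"title": "iphone phone", "brand": "apple"}
    c = {"title": "samsung galaxy", "brand": "samsung"}
    print(similarity_vector(a, b, ["title", "brand"]))
    print(similarity_vector(a, c, ["title", "brand"]))
```

Any standard binary classifier could consume the resulting similarity vectors; skipping unknown tokens is only one simple way to treat out-of-vocabulary words.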

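Experimental Results

Research Questions

RQ1: Can tuple DRs capture the syntactic and semantic similarities needed for effective entity resolution without extensive feature engineering?
RQ2: How can ER blocking be implemented efficiently with DRs and LSH so that it scales to large datasets?
RQ3: Which composition strategy (averaging vs. LSTM-based composition) yields better entity resolution performance across different datasets?
RQ4: How can word embeddings be adapted or tuned for domain-specific entity resolution tasks (full, partial, or minimal coverage)?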
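
To make the coverage cases in RQ4 concrete, one simple measure is the fraction of distinct dataset tokens that appear in the embedding vocabulary. The sketch below computes that fraction. The function names and the two thresholds are hypothetical choices for illustration; the summary does not state how the paper delimits full, partial, and minimal coverage.

```python
def vocabulary_coverage(records, vocabulary):
    """Fraction of distinct dataset tokens that have an embedding."""
    tokens = {
        tok
        for record in records
        for value in record.values()
        for tok in str(value).lower().split()
    }
    if not tokens:
        return 1.0  # nothing to cover
    return len(tokens & set(vocabulary)) / len(tokens)


def coverage_regime(coverage, full_threshold=0.95, minimal_threshold=0.2):
    """Map a coverage fraction to a label. Both thresholds are assumptions."""
    if coverage >= full_threshold:
        return "full"
    if coverage <= minimal_threshold:
        return "minimal"
    return "partial"


if __name__ == "__main__":
    records = [{"name": "apple iphone"}, {"name": "zyxq phone"}]
    vocab = {"apple", "iphone", "phone"}
    cov = vocabulary_coverage(records, vocab)
    print(cov, coverage_regime(cov))
```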

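Key Findings

DR-based tuple representations make similarity measurement for entity resolution effective without extensive hand-crafted feature engineering.
LSTM-based compositional DRs offer an advantage on datasets where word order and attribute interactions matter.
DR-based LSH blocking significantly reduces the number of comparisons while exploiting semantic similarity across all attributes.
End-to-end tuning of DRs through supervised learning improves entity resolution accuracy on task-specific data.
Vocabulary retrofitting and domain-specific embedding strategies help address the full, partial, and minimal coverage cases.
Experiments show that DeepER outperforms existing methods on benchmark, biomedical, and multilingual datasets.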
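
How LSH over tuple DRs cuts the number of comparisons can be sketched with random-hyperplane hashing, a standard LSH family for cosine similarity. This is a minimal single-table sketch under the assumption that each tuple already has a DR vector; the function names, the number of bits, and the use of a single hash table are illustrative choices, not details confirmed by the summary.

```python
import random
from collections import defaultdict


def make_hyperplanes(num_bits, dim, seed=0):
    """Draw num_bits random hyperplanes (Gaussian normal vectors) in dim dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: which side of the plane the vector falls on."""
    return tuple(
        1 if sum(h * x for h, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def block(tuple_vectors, hyperplanes):
    """Group tuple ids by signature; each bucket is one block."""
    buckets = defaultdict(list)
    for tuple_id, vector in tuple_vectors.items():
        buckets[signature(vector, hyperplanes)].append(tuple_id)
    return buckets


def candidate_pairs(buckets):
    """Only tuples that share a block are compared by the matcher."""
    pairs = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


if __name__ == "__main__":
    planes = make_hyperplanes(num_bits=8, dim=3, seed=42)
    vectors = {
        "t1": [1.0, 2.0, 3.0],
        "t2": [2.0, 4.0, 6.0],     # same direction as t1
        "t3": [-1.0, -2.0, -3.0],  # opposite direction
    }
    print(candidate_pairs(block(vectors, planes)))
```

Vectors pointing in similar directions tend to share a signature and land in the same block, while dissimilar ones are separated, so the matcher sees far fewer than all n(n-1)/2 pairs. Using several independent hash tables and taking the union of their candidate pairs is a common way to trade more comparisons for higher recall.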


This review was generated by AI and reviewed by a human editor.