QUICK REVIEW

[Paper Summary] Detecting Quality Problems in Research Data: A Model-Driven Approach

Arno Kesper, Viola Wenz | arXiv (Cornell University) | Jan 1, 2020

Data Quality and Management | References: 21 | Cited by: 1

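Summary in Brief

This paper proposes a model-driven approach that detects quality problems in research data by defining reusable generic analysis patterns, which can be concretised for specific database technologies and data formats. Domain experts adapt the abstract patterns to their own data, and data analysts use the resulting concrete patterns to locate problems such as duplicates, inconsistencies, and structural defects. In the evaluation on cultural heritage data, the patterns covered 85% of the identified quality problem variants, and 80% of the 43 patterns ran in under 20 seconds on a large database.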

ABSTRACT

As scientific progress highly depends on the quality of research data, there are strict requirements for data quality coming from the scientific community. A major challenge in data quality assurance is to localise quality problems that are inherent to data. Due to the dynamic digitalisation in specific scientific fields, especially the humanities, different database technologies and data formats may be used in rather short terms to gain experiences. We present a model-driven approach to analyse the quality of research data. It allows abstracting from the underlying database technology. Based on the observation that many quality problems show anti-patterns, a data engineer formulates analysis patterns that are generic concerning the database format and technology. A domain expert chooses a pattern that has been adapted to a specific database technology and concretises it for a domain-specific database format. The resulting concrete patterns are used by data analysts to locate quality problems in their databases. As proof of concept, we implemented tool support that realises this approach for XML databases. We evaluated our approach concerning expressiveness and performance in the domain of cultural heritage based on a qualitative study on quality problems occurring in cultural heritage data.

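Research Motivation and Objectives

Address the challenge of detecting quality problems inherent to research data in rapidly evolving fields such as the digital humanities.
Develop an approach that abstracts from the underlying database technology and data format, so that analyses can be reused across systems.
Support data analysts in systematically locating quality problems such as duplicates, inconsistencies, and structural defects through pattern-based analysis.
Evaluate the approach on real cultural heritage data with respect to expressiveness and performance.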

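Proposed Method

A data engineer formulates generic analysis patterns for common data quality problems, which often show up as anti-patterns; these patterns are independent of database technology and data format.
A generic pattern is first adapted to a specific database technology (such as XML); a domain expert then concretises the adapted pattern for a domain-specific data format.
Concrete patterns are translated into XQuery and executed on XML databases; the tool support is built on the Eclipse Modeling Framework.
The patterns target defects including exact and near duplicates, redundant data, semantic errors, and structural inconsistencies.
Tool support automates applying patterns and executing them on real datasets.
The approach is evaluated on cultural heritage databases through a qualitative study and performance benchmarks.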
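
The three pattern levels described above (generic, adapted to XML, concretised for a data format) can be sketched roughly as follows. This is a minimal illustration in Python, not the paper's EMF-based tool: the class names, the XQuery template, and the paths `//record` and `title` are all hypothetical, chosen only to show how parameters might be bound step by step before a query is generated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GenericPattern:
    """Technology-independent anti-pattern (hypothetical representation)."""
    name: str
    description: str


@dataclass(frozen=True)
class AbstractXmlPattern:
    """Generic pattern adapted to XML; format-specific paths are still open."""
    generic: GenericPattern
    template: str  # XQuery text with {record} and {key} placeholders

    def concretise(self, record_path: str, key_path: str) -> "ConcretePattern":
        # The domain expert binds the placeholders to a concrete data format.
        return ConcretePattern(self, record_path, key_path)


@dataclass(frozen=True)
class ConcretePattern:
    """Pattern with every parameter bound; ready to be turned into a query."""
    abstract: AbstractXmlPattern
    record_path: str
    key_path: str

    def to_xquery(self) -> str:
        return self.abstract.template.format(
            record=self.record_path, key=self.key_path
        )


# Level 1: the data engineer states the problem without naming a technology.
duplicate = GenericPattern(
    name="exact-duplicate",
    description="Two records share the same value for an identifying property.",
)

# Level 2: the pattern is adapted to XML as an XQuery template (assumed shape).
xml_duplicate = AbstractXmlPattern(
    generic=duplicate,
    template=(
        "for $a in {record}\n"
        "where count({record}[{key} = $a/{key}]) > 1\n"
        "return $a"
    ),
)

# Level 3: the domain expert supplies paths for one (hypothetical) format.
concrete = xml_duplicate.concretise(record_path="//record", key_path="title")
query = concrete.to_xquery()
print(query)
```

Keeping the template at the middle level means the same generic pattern could in principle be concretised for several formats by changing only the two paths, which is the reuse the approach aims for.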

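Experimental Results

Research Questions

RQ1: How can quality problems in research data be modelled independently of a specific database technology and data format?
RQ2: To what extent can generic analysis patterns detect real-world data quality problems in cultural heritage databases?
RQ3: How efficient is pattern-based detection on large-scale research data?
RQ4: Can the model-driven approach be adapted and reused effectively across data formats and database systems?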

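Key Findings

The patterns cover 85% of the quality problem variants identified in cultural heritage data, which indicates strong expressiveness.
Of the 43 patterns, 80% executed in under 20 seconds on a large database, which indicates practical performance.
The approach detected exact and near duplicates, redundant data, and structural inconsistencies in real XML databases.
Tool support based on the Eclipse Modeling Framework translates patterns into executable XQuery.
The approach adapts to more than one data format: patterns were concretised for both MIDAS and LIDO.
Misspellings and wrong semantic values are not covered because of limits in the patterns' expressiveness, which marks an open research gap.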
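
To make the duplicate findings more concrete, the sketch below separates exact from near duplicates in a tiny XML sample. It is an illustration only: Python's standard library stands in for the XQuery that the paper's tool generates, and the element names, the sample records, and the 0.9 similarity threshold are assumptions rather than values taken from the paper.

```python
import xml.etree.ElementTree as ET
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical sample data; not drawn from MIDAS or LIDO.
SAMPLE = """
<records>
  <record id="r1"><title>Portrait of a Lady</title></record>
  <record id="r2"><title>Portrait of a Lady</title></record>
  <record id="r3"><title>Portrait of a lady.</title></record>
  <record id="r4"><title>Landscape with Mill</title></record>
</records>
"""


def find_duplicates(xml_text, threshold=0.9):
    """Return (exact, near) lists of record-id pairs, compared by title."""
    root = ET.fromstring(xml_text)
    titles = [
        (rec.get("id"), rec.findtext("title", default="").strip())
        for rec in root.iter("record")
    ]
    exact, near = [], []
    for (id_a, title_a), (id_b, title_b) in combinations(titles, 2):
        if title_a == title_b:
            exact.append((id_a, id_b))
        elif SequenceMatcher(None, title_a, title_b).ratio() >= threshold:
            # Similar but not identical strings count as near duplicates.
            near.append((id_a, id_b))
    return exact, near


exact, near = find_duplicates(SAMPLE)
print("exact:", exact)
print("near:", near)
```

The pairwise comparison is quadratic in the number of records, so it only suits small samples; a query engine running on a large database would need indexing or blocking to stay within time budgets like those reported above.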

This summary was generated by AI and reviewed by human editors.