QUICK REVIEW

[Paper Review] A Survey on Data Quality Dimensions and Tools for Machine Learning

Yuhan Zhou, Fengjiao Tu | arXiv (Cornell University) | Jun 28, 2024

Data Quality and Management | Citations: 5

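One-Line Summary

This paper reviews 17 data quality evaluation and improvement tools from the past five years, defines four ML-focused data quality dimensions and 12 metrics, and presents a roadmap for open-source tool development along with future trends such as LLMs.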

ABSTRACT

Machine learning (ML) technologies have become substantial in practically all aspects of our society, and data quality (DQ) is critical for the performance, fairness, robustness, safety, and scalability of ML models. With the large and complex data in data-centric AI, traditional methods like exploratory data analysis (EDA) and cross-validation (CV) face challenges, highlighting the importance of mastering DQ tools. In this survey, we review 17 DQ evaluation and improvement tools in the last 5 years. By introducing the DQ dimensions, metrics, and main functions embedded in these tools, we compare their strengths and limitations and propose a roadmap for developing open-source DQ tools for ML. Based on the discussions on the challenges and emerging trends, we further highlight the potential applications of large language models (LLMs) and generative AI in DQ evaluation and improvement for ML. We believe this comprehensive survey can enhance understanding of DQ in ML and could drive progress in data-centric AI. A complete list of the literature investigated in this survey is available on GitHub at: https://github.com/haihua0913/awesome-dq4ml.

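Research Motivation and Objectives

Define and consolidate four data quality dimensions for ML and map them to practical metrics.
Survey 17 open-source data quality tools developed in the past five years and compare their functions.
Analyze the challenges of evaluating and improving data quality for ML and propose a roadmap for tool development.
Discuss emerging trends, such as applications of large language models and generative AI to data quality for ML.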

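Proposed Method

Identify and consolidate DQ dimensions and metrics applicable to ML from the existing literature.
Organize and categorize 17 open-source DQ evaluation and improvement tools, and extract their core functions and metrics.
Conduct a comparative analysis of the tools across dimensions, metrics, and update dates.
Propose a roadmap for designing open-source DQ tools tailored to ML and data-centric AI.
Discuss the role of LLMs and generative AI in DQ evaluation and improvement for ML.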

Figure 1: Evolution of DQ evaluation/improvement tools across functions over time. The 6 core functions are data loading, data profiling, data integration, data transformation, automation and monitoring, and output and reports. Every tool supports the loading and output functions so the middle four

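Experimental Results

Research Questions

RQ1: Which data quality dimensions and metrics are most relevant to machine learning workflows?
RQ2: How do current open-source DQ tools compare in terms of functionality, metrics, and ML-centric applicability?
RQ3: What roadmap and design principles should guide future open-source DQ tools for ML and data-centric AI?
RQ4: How do emerging AI technologies (e.g., LLMs) affect data quality evaluation and improvement for ML?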

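Key Findings

Identified four DQ dimensions (intrinsic, contextual, representational, and accessibility) and 12 metrics relevant to ML.
Reviewed 17 data quality evaluation and improvement tools from the past five years and summarized their functions, metrics, and update history.
Compared the tools' strengths in profiling, monitoring, and ML-focused evaluation, and noted a trend toward automation and monitoring in the 2024 updates.
Presented a development roadmap for ML-oriented DQ tools, covering framework design, functions, and potential integration with data-centric AI practices.
Discussed new opportunities for large language models and generative AI to strengthen DQ evaluation and improvement in ML contexts.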
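
To make the idea of mapping a DQ dimension to a measurable metric more concrete, here is a minimal Python sketch of two widely used metrics, completeness and uniqueness. This summary does not name the paper's 12 metrics, so the choice of these two, the function names, and the toy records are all illustrative assumptions rather than the authors' definitions or the API of any surveyed tool.

```python
from typing import Any, Dict, List, Optional

# A record is a plain dict; None marks a missing value (an assumption of this sketch).
Record = Dict[str, Optional[Any]]


def completeness(records: List[Record], fields: List[str]) -> float:
    """Share of (record, field) cells that hold a non-missing value."""
    total = len(records) * len(fields)
    if total == 0:
        return 1.0  # nothing to check, so nothing is missing
    filled = sum(1 for r in records for f in fields if r.get(f) is not None)
    return filled / total


def uniqueness(records: List[Record], fields: List[str]) -> float:
    """Share of records that are distinct on the given fields."""
    if not records:
        return 1.0
    distinct = {tuple(r.get(f) for f in fields) for r in records}
    return len(distinct) / len(records)


# Hypothetical toy dataset: two missing cells and one exact duplicate row.
rows: List[Record] = [
    {"age": 34, "label": "cat"},
    {"age": None, "label": "dog"},
    {"age": 34, "label": "cat"},
    {"age": 51, "label": None},
]
fields = ["age", "label"]

completeness_score = completeness(rows, fields)
uniqueness_score = uniqueness(rows, fields)
print(f"completeness={completeness_score:.2f}, uniqueness={uniqueness_score:.2f}")
```

Scores like these are typically tracked per dataset version, so a drop between versions can flag a regression before model training.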

Figure 2: DQ dimensions, metrics, and corresponding tools. It showcases 4 dimensions and 12 DQ metrics in the first and second rows. Beneath each one, corresponding tools are listed, indicating their evaluation focus on the specific metrics and dimensions. The color of each tool represents the last


This review was generated by AI and checked by a human editor.