QUICK REVIEW

[Paper Review] Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Steven Euijong Whang, Yuji Roh|arXiv (Cornell University)|Dec 13, 2021

Data Quality and Management|Cited by 40

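One-Line Summary

This survey takes a data-centric AI perspective, treating data as a first-class citizen, and reviews techniques for data collection, validation, cleaning, data sanitization, robust training, and fairness, emphasizing the impact of data quality on model performance.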

ABSTRACT

Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because there is lesser need for feature engineering for recent deep learning approaches, but instead more need for large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.

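Research Motivation and Objectives

Encourage the shift to data-centric AI, where data quality decisively affects model performance.
Survey data collection techniques, including data acquisition, labeling, and improving existing data for deep learning.
Examine data validation, cleaning, and integration methods and their effect on robustness and accuracy.
Discuss robust training techniques for coping with noisy, poisoned, or biased data.
Highlight fairness measures and mitigation techniques that can be applied before, during, or after training.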

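Proposed Approach

Categorize data-centric AI topics and map techniques to data types and workflows (data collection; validation, cleaning, and integration; robust training; fairness).
Summarize representative and influential techniques for data acquisition, labeling, validation, cleaning, sanitization, and mitigation.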
Present a taxonomy (Table 1) and a workflow decision tree linking techniques across the data-centric AI lifecycle.
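Discuss practical systems and frameworks (e.g., TF Data Validation, SeeDB, ActiveClean) and notable methods (GANs, Mixup, data programming).
Draw connections between the data management and machine learning communities, and address bias, robustness, and ethics in AI.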
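
Mixup, one of the methods named above, builds synthetic training examples by blending pairs of existing ones. The sketch below is a minimal illustration and is not taken from the survey: it assumes plain Python lists for features and one-hot labels, and the function name `mixup_pair` and the default `alpha=0.2` are hypothetical choices made for this example.

```python
import random


def mixup_pair(x_i, y_i, x_j, y_j, alpha=0.2, rng=random):
    """Blend two examples and their one-hot labels.

    The mixing weight lam is drawn from Beta(alpha, alpha), and the same
    weight is applied to both the features and the labels.
    """
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x, y, lam


if __name__ == "__main__":
    # Two toy examples from different classes (illustrative values only).
    mixed_x, mixed_y, lam = mixup_pair(
        [1.0, 0.0], [1.0, 0.0],
        [0.0, 1.0], [0.0, 1.0],
        rng=random.Random(0),
    )
    print(lam, mixed_x, mixed_y)
```

The blended label is a soft label whose entries still sum to one, so it can be used with a cross-entropy loss that accepts probability targets.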

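Experimental Results

Research Questions

RQ1: Which data collection strategies most effectively support deep learning when feature engineering is not central?
RQ2: How can data validation, cleaning, and integration improve the accuracy and robustness of downstream deep learning models?
RQ3: Which robust and fair training techniques are effective for supervised learning on noisy, poisoned, or biased data?
RQ4: How should data-centric AI teams organize data quality practices across acquisition, labeling, cleaning, sanitization, and mitigation?
RQ5: What are the main open challenges and future directions at the intersection of data management and AI ethics in deep learning?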

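Key Findings

Data collection is the foundation of deep learning performance, with data discovery, data augmentation, and data generation as the core approaches.
Data labeling can draw on existing labels, manual labeling, semi-supervised learning, crowdsourcing, and weak supervision such as data programming.
Data validation and visualization support human-in-the-loop checks, and schema-based validation and automated anomaly detection are widely used.
Data cleaning and data sanitization can improve model accuracy but can also be counterproductive, so it is important to choose an appropriate cleaning strategy and take robustness into account.
Measuring fairness and bias is essential and can be integrated before, during, or after training; mitigation techniques span pre-processing, in-processing, and post-processing.
Addressing robustness, fairness, and data quality requires comprehensive integration of data-centric AI techniques across the entire ML lifecycle.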
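
The findings above refer to fairness measures without naming one. As a hedged illustration, the sketch below computes the demographic parity gap, a commonly used group fairness measure chosen here as an assumption and not confirmed as the survey's focus. The function names and the toy data are hypothetical.

```python
def positive_rate(predictions, groups, group):
    """Fraction of positive (1) predictions among members of one group."""
    selected = [p for p, g in zip(predictions, groups) if g == group]
    if not selected:
        raise ValueError(f"no examples for group {group!r}")
    return sum(selected) / len(selected)


def demographic_parity_gap(predictions, groups):
    """Largest difference in positive prediction rate between any two groups.

    A gap of 0 means every group receives positive predictions at the same
    rate; larger values indicate a larger disparity.
    """
    rates = {g: positive_rate(predictions, groups, g) for g in set(groups)}
    return max(rates.values()) - min(rates.values()), rates


if __name__ == "__main__":
    # Toy binary predictions for two groups (illustrative values only).
    preds = [1, 1, 0, 1, 0, 0, 0, 1]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    gap, rates = demographic_parity_gap(preds, groups)
    print(gap, rates)
```

Because this measure only needs predictions and group membership, it can be computed after training; the same quantity could also be monitored on labels before training to check the data itself.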

This review was generated by AI and checked by a human editor.