QUICK REVIEW

[Paper Review] Data Collection and Quality Challenges in Deep Learning: A Data-Centric AI Perspective

Steven Euijong Whang, Yuji Roh|arXiv (Cornell University)|Dec 13, 2021

Data Quality and Management | Cited by 40

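One-Sentence Summary

This survey reviews data collection, validation, cleaning, sanitization, robust training, and fairness techniques for deep learning from a data-centric AI perspective, emphasizing data as a first-class citizen and the role of data quality in model performance.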

ABSTRACT

Data-centric AI is at the center of a fundamental shift in software engineering where machine learning becomes the new software, powered by big data and computing infrastructure. Here software engineering needs to be re-thought where data becomes a first-class citizen on par with code. One striking observation is that a significant portion of the machine learning process is spent on data preparation. Without good data, even the best machine learning algorithms cannot perform well. As a result, data-centric AI practices are now becoming mainstream. Unfortunately, many datasets in the real world are small, dirty, biased, and even poisoned. In this survey, we study the research landscape for data collection and data quality primarily for deep learning applications. Data collection is important because there is lesser need for feature engineering for recent deep learning approaches, but instead more need for large amounts of data. For data quality, we study data validation, cleaning, and integration techniques. Even if the data cannot be fully cleaned, we can still cope with imperfect data during model training using robust model training techniques. In addition, while bias and fairness have been less studied in traditional data management research, these issues become essential topics in modern machine learning applications. We thus study fairness measures and unfairness mitigation techniques that can be applied before, during, or after model training. We believe that the data management community is well poised to solve these problems.

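Motivation and Objectives

Advance the shift toward data-centric AI, in which data quality has a decisive effect on model performance.
Survey data collection techniques for deep learning, including acquisition, labeling, and improvement of existing data.
Survey data validation, cleaning, and integration methods and their effect on robustness and accuracy.
Discuss robust training techniques for coping with noisy, poisoned, or biased data.
Highlight fairness measures and mitigation techniques that can be applied before, during, or after model training.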

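Approach

Categorizes data-centric AI topics and maps techniques to data types and workflows (data collection, validation/cleaning/integration, robust training, fairness).
Summarizes representative and influential techniques for data acquisition, labeling, validation, cleaning, sanitization, and mitigation.
Presents a taxonomy (Table 1) and a workflow decision tree that connect techniques across the data-centric AI lifecycle.
Discusses practical systems and frameworks (such as TF Data Validation, SeeDB, and ActiveClean) and well-known methods (GANs, Mixup, data programming).
Connects the data management and ML communities to address bias, robustness, and ethical issues in AI.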
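As an illustration of one of the well-known methods named above, the following is a minimal sketch of Mixup in its commonly described form: two training examples and their one-hot labels are blended with a weight drawn from a Beta(alpha, alpha) distribution. The function name `mixup_pair`, the default `alpha=0.2`, and the plain-list data representation are assumptions made for this example and are not taken from the survey.

```python
import random


def mixup_pair(x1, y1, x2, y2, alpha=0.2, rng=None):
    """Blend two examples (feature lists and one-hot label lists).

    Hypothetical helper for illustration only; alpha=0.2 is an assumed default.
    """
    rng = rng or random.Random()
    # Mixing weight in [0, 1], sampled from Beta(alpha, alpha).
    lam = rng.betavariate(alpha, alpha)
    # Convex combination of the features and of the labels, using the same weight.
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y1, y2)]
    return x, y, lam


if __name__ == "__main__":
    x, y, lam = mixup_pair([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0],
                           rng=random.Random(0))
    print(lam, x, y)
```

Because the labels are mixed with the same weight as the features, the resulting soft label still sums to 1 when both inputs are one-hot.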

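Results

Research Questions

RQ1: When feature engineering is less central, which data collection strategies best support deep learning?
RQ2: How do data validation, cleaning, and integration improve the accuracy and robustness of downstream deep learning models?
RQ3: In supervised learning, which robust and fair training techniques are effective for handling noisy, poisoned, or biased data?
RQ4: How should data-centric AI teams coordinate data quality practices across acquisition, labeling, cleaning, sanitization, and mitigation?
RQ5: What are the key challenges and future directions at the intersection of data management and AI ethics in deep learning?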

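Key Findings

Data collection is the foundation of deep learning performance; data discovery, data augmentation, and data generation are the core approaches.
Data labeling can draw on existing labels, manual labeling, semi-supervised learning, crowdsourcing, and weak supervision, including data programming.
Data validation and visualization support human-in-the-loop inspection; schema-based validation and automated anomaly detection are gaining traction.
Data cleaning and sanitization can improve model accuracy but can sometimes harm it; choosing an appropriate cleaning strategy and accounting for robustness is essential.
Measuring fairness and bias is essential and can be integrated before, during, or after training; mitigation techniques span pre-processing, in-processing, and post-processing.
Data-centric AI techniques need to be coordinated across the entire ML lifecycle to address robustness, fairness, and data quality together.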
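To make the idea of a fairness measure concrete, here is a minimal sketch of one commonly used measure, the demographic parity difference: the gap between the highest and lowest positive-prediction rates across groups. This summary does not state which measures the survey defines, so the choice of metric, the function name `demographic_parity_difference`, and the toy data are assumptions for illustration.

```python
def demographic_parity_difference(predictions, groups):
    """Return (gap, per-group positive rates) for binary predictions.

    Hypothetical helper for illustration; predictions are 0/1 and
    groups holds one sensitive-attribute value per prediction.
    """
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (1 if pred == 1 else 0)
    # Fraction of positive predictions within each group.
    rates = {g: positives[g] / totals[g] for g in totals}
    # A gap of 0 means every group receives positives at the same rate.
    return max(rates.values()) - min(rates.values()), rates


if __name__ == "__main__":
    preds = [1, 1, 0, 1, 0, 0, 0, 1]
    groups = ["a", "a", "a", "a", "b", "b", "b", "b"]
    gap, rates = demographic_parity_difference(preds, groups)
    print(gap, rates)
```

A measure like this can be computed on the training labels before training or on model outputs after training, which matches the before/during/after framing in the findings.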

This review was generated by AI and reviewed by human editors.