QUICK REVIEW

[Paper Review] A Survey on Data Collection for Machine Learning: a Big Data -- AI Integration Perspective

Yuji Roh, Geon Heo|arXiv (Cornell University)|Nov 8, 2018

Data Stream Mining Techniques | References: 164 | Citations: 150

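One-line summary

This survey reviews data collection for machine learning from a data management perspective, covering data acquisition, data labeling, improvement of existing data or models, open challenges, and practical guidelines.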

ABSTRACT

Data collection is a major bottleneck in machine learning and an active research topic in multiple communities. There are largely two reasons data collection has recently become a critical issue. First, as machine learning is becoming more widely-used, we are seeing new applications that do not necessarily have enough labeled data. Second, unlike traditional machine learning, deep learning techniques automatically generate features, which saves feature engineering costs, but in return may require larger amounts of labeled data. Interestingly, recent research in data collection comes not only from the machine learning, natural language, and computer vision communities, but also from the data management community due to the importance of handling large amounts of data. In this survey, we perform a comprehensive study of data collection from a data management point of view. Data collection largely consists of data acquisition, data labeling, and improvement of existing data or models. We provide a research landscape of these operations, provide guidelines on which technique to use when, and identify interesting research challenges. The integration of machine learning and data management for data collection is part of a larger trend of Big data and Artificial Intelligence (AI) integration and opens many opportunities for new research.

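Motivation and objectives

Motivate the work: highlight that data collection is a bottleneck in ML and that its importance is growing with deep learning and new applications.
Bridge the ML and data management literature and provide a broad overview of data collection techniques.
Classify and summarize the data acquisition, labeling, and data improvement techniques relevant to ML.
Provide guidelines on when to apply specific data collection techniques and identify open research challenges.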

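Proposed approach

Classify data acquisition techniques into data discovery, data augmentation, and data generation.
Summarize data labeling approaches, including the use of existing labels, crowdsourcing, and weak supervision.
Review data quality improvement and cleaning techniques aimed at better data or better model performance.
Present a decision flowchart that guides the choice of data collection techniques for an ML task.
Discuss data generation via crowdsourcing, GANs, and policy-driven transformations.
Summarize data integration and entity augmentation approaches relevant to ML pipelines.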
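
To make the weak supervision item above concrete, here is a minimal sketch in Python. It assumes a toy spam/ham task and three hypothetical labeling functions (`lf_contains_free`, `lf_has_url`, `lf_short_greeting`) that are not taken from the survey. It combines their votes by simple majority, which is only one basic way to aggregate noisy labels; weak supervision systems typically use more elaborate models.

```python
from collections import Counter

ABSTAIN = None  # returned by a labeling function that has no opinion


# Hypothetical labeling functions for a toy spam task (illustration only).
def lf_contains_free(text):
    return "spam" if "free" in text.lower() else ABSTAIN


def lf_has_url(text):
    return "spam" if "http://" in text or "https://" in text else ABSTAIN


def lf_short_greeting(text):
    is_greeting = text.lower().startswith(("hi", "hello"))
    return "ham" if is_greeting and len(text) < 40 else ABSTAIN


LABELING_FUNCTIONS = [lf_contains_free, lf_has_url, lf_short_greeting]


def weak_label(text, lfs=LABELING_FUNCTIONS):
    """Majority vote over non-abstaining labeling functions.

    Returns ABSTAIN when no function votes or when the top two labels tie.
    """
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # tie between the two most common labels
    return ranked[0][0]
```

Examples that receive `ABSTAIN` stay unlabeled and could be routed to manual labeling or crowdsourcing instead.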

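Results

Research questions

RQ1: Which data acquisition, labeling, and data improvement techniques are most relevant to machine learning across ML subfields such as NLP and CV?
RQ2: How can data management tools and paradigms be used to scale data collection for ML applications?
RQ3: Given different data types and application needs, what guidelines should practitioners follow when choosing data collection techniques?
RQ4: From the perspective of Big Data and AI integration, what are the main open challenges in data collection for ML?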

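Key findings

Data acquisition techniques span data discovery, data augmentation, and data generation, each addressing a different stage of dataset availability and quality.
Data labeling has moved beyond manual labeling toward crowdsourcing and weak supervision, which scale the labeling effort.
Improving data quality and integrating data can have a large effect on model performance and training efficiency.
Synthetic data generation and policy-driven transformations offer flexible, scalable options when real data is scarce or expensive to obtain.
A unified decision flow helps practitioners first assess data availability and then choose among the acquisition, labeling, and improvement paths.
The survey emphasizes the integration of data management practice with ML needs as part of the broader trend of Big Data and AI integration.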
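
The decision flow described in the findings above can be sketched roughly as follows. This is a simplified, hypothetical encoding: the `DataSituation` fields, the branch conditions, and the rule that external data favors discovery or augmentation over generation are assumptions made for illustration, not a reproduction of the survey's flowchart.

```python
from dataclasses import dataclass


@dataclass
class DataSituation:
    # All fields are hypothetical inputs chosen for this illustration.
    has_enough_data: bool
    has_enough_labels: bool
    external_data_available: bool = False


def suggest_path(s: DataSituation) -> str:
    """Assess data availability first, then pick one of the three paths."""
    if not s.has_enough_data:
        # Acquisition path. Assumption: reuse external data when it exists,
        # otherwise generate new data.
        if s.external_data_available:
            return "acquisition: data discovery or augmentation"
        return "acquisition: data generation"
    if not s.has_enough_labels:
        return "labeling: existing labels, crowdsourcing, or weak supervision"
    return "improvement: improve existing data or models"
```

For example, `suggest_path(DataSituation(has_enough_data=True, has_enough_labels=False))` returns the labeling path.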

This review was generated by AI and checked by a human editor.