QUICK REVIEW

[Paper Review] Data Representativity for Machine Learning and AI Systems

Line Katrine Harder Clemmensen, Rune D. Kjærsgaard|arXiv (Cornell University)|Mar 9, 2022

Bayesian Modeling and Causal Inference | Cited by 21

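One-Sentence Summary

This paper surveys the notion of data representativity in machine learning and AI, introduces three measurable concepts (reflection, coverage, and subgroup representation), and proposes a framework for documenting data representativity.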

ABSTRACT

Data representativity is crucial when drawing inference from data through machine learning models. Scholars have increased focus on unraveling the bias and fairness in models, also in relation to inherent biases in the input data. However, limited work exists on the representativity of samples (datasets) for appropriate inference in AI systems. This paper reviews definitions and notions of a representative sample and surveys their use in scientific AI literature. We introduce three measurable concepts to help focus the notions and evaluate different data samples. Furthermore, we demonstrate that the contrast between a representative sample in the sense of coverage of the input space, versus a representative sample mimicking the distribution of the target population is of particular relevance when building AI systems. Through empirical demonstrations on US Census data, we evaluate the opposing inherent qualities of these concepts. Finally, we propose a framework of questions for creating and documenting data with data representativity in mind, as an addition to existing dataset documentation templates.

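Motivation and Objectives

Examine the multiple definitions of a representative sample found across disciplines and how they relate to inference in ML/AI.
Tie the notions of representativity to mathematical measures and propose three measurable concepts for evaluation.
Show on real data how the coverage (diversity) and population-mimicking notions behave in practice.
Propose a framework of questions for keeping representativity in mind when creating and documenting data, as a complement to existing dataset documentation.
Outline future research directions for data representativity in ML/AI.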

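Proposed Approach

Survey the literature on representative sampling and connect its notions to ML/AI practice.
Introduce three measurable concepts of representativity and map existing notions onto them.
Compare the opposing notions of representativity empirically on US Census data.
Examine how datasets from NeurIPS 2021 Datasets/Benchmarks and ICCV 2021 reflect the notions of representativity.
Propose a datasheet-oriented framework of questions for documenting data representativity.
Propose new research directions for data representativity in ML/AI.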
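The contrast between a sample that mimics the population (reflection) and one that covers the input space (coverage) can be illustrated with a small sketch. The two metrics below are assumptions chosen for simplicity, not the measures used in the paper: reflection is scored as the total variation distance between category proportions, and coverage as the fraction of population categories present in the sample. The function names and the toy data are hypothetical.

```python
from collections import Counter


def proportions(values):
    """Return the relative frequency of each category in `values`."""
    counts = Counter(values)
    total = sum(counts.values())
    return {category: n / total for category, n in counts.items()}


def reflection_gap(sample, population):
    """Total variation distance between sample and population proportions.

    0.0 means the sample mimics the population exactly; 1.0 means no overlap.
    (Assumed stand-in metric, not taken from the paper.)
    """
    p = proportions(population)
    q = proportions(sample)
    categories = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories)


def coverage(sample, population):
    """Fraction of the population's categories that appear in the sample.

    (Assumed stand-in metric, not taken from the paper.)
    """
    population_categories = set(population)
    return len(set(sample) & population_categories) / len(population_categories)


# Made-up toy population: 90% group "a", 9% group "b", 1% group "c".
population = ["a"] * 90 + ["b"] * 9 + ["c"] * 1

# A proportional sample mimics the population but misses the rare group.
mimicking_sample = ["a"] * 9 + ["b"] * 1
# A balanced sample covers every group but distorts the proportions.
covering_sample = ["a"] * 4 + ["b"] * 3 + ["c"] * 3

for name, sample in [("mimicking", mimicking_sample), ("covering", covering_sample)]:
    print(name, reflection_gap(sample, population), coverage(sample, population))
```

On this toy data the proportional sample has a small reflection gap but incomplete coverage, while the balanced sample has full coverage and a large reflection gap, which is the kind of opposition the review describes.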

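Experimental Results

Research Questions

RQ1: What are the different notions of a representative sample found in the AI/ML literature, and how do they relate to inference?
RQ2: How can data representativity be measured using concrete, implementable concepts?
RQ3: What trade-offs exist in ML/AI datasets between coverage (diversity) and distribution-mimicking representation?
RQ4: How can a framework for documenting data representativity improve transparency and reproducibility?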

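Key Findings

Identified multiple notions of a representative sample and pointed out the ambiguity of the term.
Proposed three measurable concepts: reflection (population-mimicking), coverage (diversity-based), and subgroup representation (cluster-based).
Demonstrated the contrasting properties of reflection and coverage through an empirical analysis on census-type data.
Examined datasets from NeurIPS 2021 and ICCV 2021 to show how the notions of representativity appear in practice.
Provided a framework of questions for datasheets, and discussed how to document data representativity along with future research directions.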
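The third concept, subgroup representation, can be sketched in the same spirit. The check below is a minimal illustration under stated assumptions: subgroup labels are taken as given (in practice they might come from clustering), and a subgroup counts as represented when it reaches a hypothetical `min_count` threshold. Neither the function nor the threshold rule comes from the paper.

```python
from collections import Counter


def subgroup_report(labels, expected_groups, min_count):
    """Count samples per expected subgroup and flag under-represented ones.

    `labels` holds one subgroup label per sample. `expected_groups` lists the
    subgroups that should be represented. `min_count` is a hypothetical
    threshold chosen by the dataset author.
    """
    counts = Counter(labels)
    report = {}
    for group in expected_groups:
        n = counts.get(group, 0)
        report[group] = {"count": n, "represented": n >= min_count}
    return report


# Made-up labels: one large subgroup, one small subgroup, one missing subgroup.
labels = ["north"] * 50 + ["south"] * 5
report = subgroup_report(labels, ["north", "south", "east"], min_count=10)

for group, row in report.items():
    print(group, row["count"], row["represented"])
```

A report of this kind could accompany a datasheet entry so that readers see which subgroups are thinly represented or absent, rather than only an aggregate sample size.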

This review was generated by AI and checked by a human editor.