QUICK REVIEW

[Paper Review] Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?

R. Stuart Geiger, Kevin Yu|arXiv (Cornell University)|Dec 17, 2019

Topic Modeling|References: 42|Citations: 58

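ONE-LINE SUMMARY

This paper audits ML classification papers that use Twitter data, checking whether they report how their human-labeled training data was created. It finds that details about annotators, training, and data provenance vary widely and are often missing.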

ABSTRACT

Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing --- specifically papers from ArXiv and traditional publications performing an ML classification task on Twitter data --- give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and if the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally-important aspect of whether such data is reliable in the first place.

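MOTIVATION AND OBJECTIVES

Assess how ML application papers in social computing report the provenance of their human-labeled training data.
Evaluate transparency about annotator sources, qualifications, training, and compensation.
Examine how inter-annotator reliability and data availability are reported.
Highlight the implications for data reliability and research integrity in supervised ML applications.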

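PROPOSED METHOD

Build a corpus of ML papers performing Twitter classification, drawn from ArXiv and Scopus (approximately 494 ArXiv papers; 29 Scopus papers).
Use a team of six labelers to conduct structured content analysis of each paper's data labeling practices.
Apply two rounds of labeling followed by reconciliation to determine what each paper reports about annotators, training, definitions, and data availability.
Develop raw and normalized information scores to quantify how much labeling detail each paper reports.
Compute an inter-rater agreement coefficient as the mean agreement rate for each labeling round (Round 1: 66.67%; Round 2: 84.80%).
Release the dataset and code on GitHub and Zenodo for reproducibility.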
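The agreement figures above are summarized only as a mean agreement rate. The following is a minimal sketch of one way such a rate could be computed, assuming that "agreement" means every labeler gave the same answer to a question for a given paper, and that per-question rates are averaged with equal weight. The function names and the toy labels are hypothetical and are not taken from the authors' released code or data.

```python
from typing import Dict, List


def percent_agreement(item_labels: List[List[str]]) -> float:
    """Percentage of items on which all labelers gave the same label."""
    if not item_labels:
        raise ValueError("need at least one labeled item")
    unanimous = sum(1 for labels in item_labels if len(set(labels)) == 1)
    return 100.0 * unanimous / len(item_labels)


def mean_agreement(questions: Dict[str, List[List[str]]]) -> float:
    """Average the per-question agreement rates with equal weight."""
    rates = [percent_agreement(items) for items in questions.values()]
    return sum(rates) / len(rates)


# Toy labels, invented for illustration (two labelers, four papers per question).
round_example = {
    "human_labeled": [["yes", "yes"], ["yes", "no"], ["no", "no"], ["yes", "yes"]],
    "labeler_source": [
        ["authors", "authors"],
        ["crowd", "crowd"],
        ["crowd", "none"],
        ["none", "authors"],
    ],
}

print(mean_agreement(round_example))
```

Plain percent agreement does not correct for agreement expected by chance, so a chance-corrected statistic may be preferable when label categories are heavily imbalanced.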

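RESULTS

Research Questions

RQ1: Do ML papers performing Twitter classification disclose whether their training data was labeled by humans?
RQ2: Who were the annotators (authors, crowdworkers, experts, etc.), and how were they recruited?
RQ3: How much training, instruction, and inter-annotator reliability reporting is provided?
RQ4: Is crowdworker compensation disclosed, and is the training data publicly available?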

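Key Findings

Most papers involved an original classification task (142 Yes, 17 No, 5 Unsure).
Regarding human annotation, 93 papers reported it (Yes), 46 did not (No), and 4 were Unsure.
Among papers using original human annotation, 72 reported original annotation (Yes), 21 did not (No), and 3 were Unsure.
Annotator sources varied: the authors themselves were the labelers in 22 papers (29.73%), "no information" was common at 24.32%, experts/professionals accounted for 21.62%, Amazon Mechanical Turk for 4.05%, other crowdwork for 10.81%, and other sources for 9.46%.
About half of the papers using original human annotation specified the number of annotators (Yes: 41; No: 44.60% did not specify).
Formal instructions or definitions were reported in 32 papers (43.24%), while 35 papers (47.30%) gave no information about instructions. Seven papers (9.46%) stated that no instructions were given beyond the text of the question.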
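The findings above mix raw counts with percentages and never state the denominator. As a quick arithmetic check, the sketch below back-computes the denominator implied by each count and percentage pair quoted above. The resulting total of 74 papers is an inference made here, not a figure stated in this review, and the dictionary keys are shorthand labels introduced for illustration.

```python
# Count and percentage pairs as quoted in the findings above.
reported = {
    "authors as labelers": (22, 29.73),
    "formal instructions reported": (32, 43.24),
    "no information on instructions": (35, 47.30),
    "no instructions beyond question text": (7, 9.46),
}


def implied_denominator(count: int, percent: float) -> int:
    """Nearest whole number of papers for which count/total matches percent."""
    return round(count / (percent / 100.0))


denominators = {
    name: implied_denominator(count, percent)
    for name, (count, percent) in reported.items()
}

assumed_total = 74  # inferred from the pairs above, not stated in the review

for name, (count, percent) in reported.items():
    recomputed = round(100.0 * count / assumed_total, 2)
    print(f"{name}: {count}/{assumed_total} = {recomputed:.2f}% (reported {percent:.2f}%)")
```

Under that assumed total, the six annotator-source shares quoted above sum to 99.99%, which is consistent with rounding each share to two decimal places.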

This review was generated by AI and checked by a human editor.