QUICK REVIEW

[Paper Review] Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?

R. Stuart Geiger, Kevin Yu | arXiv (Cornell University) | Dec 17, 2019

Topic Modeling | References: 42 | Cited by: 58

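ONE-SENTENCE SUMMARY

The paper reviews machine learning classification papers that use Twitter data to check whether they report how their human-labeled training data was created. It finds wide variation in reporting, with details about the labelers, their training, and the provenance of the data often missing.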

ABSTRACT

Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing --- specifically papers from ArXiv and traditional publications performing an ML classification task on Twitter data --- give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and if the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally-important aspect of whether such data is reliable in the first place.

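MOTIVATION AND OBJECTIVES

Assess how ML application papers in social computing report where their human-labeled training data comes from.
Assess how transparently papers describe labeler sources, qualifications, training, and compensation.
Examine the reporting of inter-annotator reliability and data availability.
Highlight the implications for data reliability and research integrity in supervised machine learning applications.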

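PROPOSED METHOD

Assemble a corpus of machine learning papers on Twitter classification from ArXiv and Scopus (about 494 ArXiv papers; 29 Scopus papers).
Have a six-person labeling team conduct a structured content analysis of each paper's data labeling practices.
Use a two-round labeling process with reconciliation to determine what each paper reports about labelers, training, definitions, and data availability.
Develop raw and normalized information scores to quantify how much labeling detail each paper reports.
Compute inter-rater reliability (IRR) as the mean percentage agreement in each round (round 1: 66.67%; round 2: 84.80%).
Release the dataset and code on GitHub and Zenodo for reproducibility.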
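
The mean percentage agreement mentioned above can be illustrated with a minimal sketch. The summary does not say how agreement is computed when more than two people label the same item, so this sketch assumes one common variant: the share of labeler pairs that agree on each item, averaged over all items. The function names and the toy labels are hypothetical and are not taken from the paper's released code.

```python
from collections.abc import Sequence
from itertools import combinations


def pairwise_agreement(labels: Sequence[str]) -> float:
    """Fraction of labeler pairs that gave the same label to one item."""
    pairs = list(combinations(labels, 2))
    if not pairs:
        raise ValueError("need at least two labels per item")
    return sum(a == b for a, b in pairs) / len(pairs)


def mean_percent_agreement(items: Sequence[Sequence[str]]) -> float:
    """Average the per-item pairwise agreement and express it as a percentage.

    ASSUMPTION: pairwise agreement per item; the paper may define it differently.
    """
    if not items:
        raise ValueError("need at least one item")
    return 100 * sum(pairwise_agreement(labels) for labels in items) / len(items)


# Toy data (hypothetical): three items, each labeled by three labelers.
toy_round = [
    ["yes", "yes", "yes"],    # all 3 pairs agree
    ["yes", "no", "yes"],     # 1 of 3 pairs agrees
    ["no", "no", "unsure"],   # 1 of 3 pairs agrees
]

if __name__ == "__main__":
    print(f"{mean_percent_agreement(toy_round):.2f}%")
```

Percentage agreement is easy to read but does not correct for agreement expected by chance, which is one reason chance-corrected metrics are often reported alongside it.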

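EXPERIMENTAL RESULTS

Research Questions

RQ1: Do ML papers doing Twitter classification disclose whether their training data was labeled by humans?
RQ2: Who were the labelers (authors, crowdworkers, experts, etc.), and how were they recruited?
RQ3: What level of training, instructions, and inter-annotator reliability metrics is reported?
RQ4: Is compensation for crowdworkers disclosed, and is the training data publicly available?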

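Key Findings

Most papers involved an original classification task (142 yes, 17 no, 5 unsure).
On whether the labels came from human annotation, 93 papers reported that they did (yes), 46 did not (no), and 4 were unclear.
On whether the human labeling was original to the paper, 72 papers reported original labeling (yes), 21 did not (no), and 3 were unclear.
Labeler sources varied: the authors themselves were the source in 22 papers (29.73%), and "no information" was also common (24.32%); experts/professionals accounted for 21.62%, Amazon Mechanical Turk for 4.05%, other crowdwork for 10.81%, and other sources for 9.46%.
Among papers using original human labels, about half stated the number of labelers explicitly (41 papers); 44.60% did not specify it.
Formal instructions or definitions were reported in 32 papers (43.24%), while 35 papers (47.30%) gave no information about instructions; 7 papers (9.46%) indicated that no instructions were given beyond the question text.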
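
As a quick consistency check on the labeler-source percentages above, the sketch below recovers the implied paper counts. The summary states only one count directly (22 papers labeled by the authors), so the denominator is inferred from that figure; treating all shares as using the same denominator is an assumption, and the variable names are illustrative.

```python
# Labeler-source shares as reported in the findings (percent of papers).
reported_shares = {
    "authors": 29.73,
    "no information": 24.32,
    "experts/professionals": 21.62,
    "Amazon Mechanical Turk": 4.05,
    "other crowdwork": 10.81,
    "other": 9.46,
}

# The only stated count is 22 papers for "authors"; infer the denominator from it.
# ASSUMPTION: every share above uses this same denominator.
denominator = round(22 / (reported_shares["authors"] / 100))

# Convert each share back into a whole number of papers.
implied_counts = {
    source: round(share / 100 * denominator)
    for source, share in reported_shares.items()
}

if __name__ == "__main__":
    print("inferred denominator:", denominator)
    for source, count in implied_counts.items():
        print(f"{source}: {count}")
    print("sum of implied counts:", sum(implied_counts.values()))
```

Under this assumption the implied counts add up to the inferred denominator, and the instruction figures (32, 35, and 7 papers) match their stated percentages against the same base.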

This review was generated by AI and reviewed by a human editor.