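[Paper Summary] Garbage In, Garbage Out? Do Machine Learning Application Papers in Social Computing Report Where Human-Labeled Training Data Comes From?
The paper reviews machine learning classification papers that use Twitter data to see whether they report how their human-labeled training data was created. It finds wide variation in reporting, with details about labelers, training, and data provenance often missing.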
Many machine learning projects for new application areas involve teams of humans who label data for a particular purpose, from hiring crowdworkers to the paper's authors labeling the data themselves. Such a task is quite similar to (or a form of) structured content analysis, which is a longstanding methodology in the social sciences and humanities, with many established best practices. In this paper, we investigate to what extent a sample of machine learning application papers in social computing --- specifically papers from ArXiv and traditional publications performing an ML classification task on Twitter data --- give specific details about whether such best practices were followed. Our team conducted multiple rounds of structured content analysis of each paper, making determinations such as: Does the paper report who the labelers were, what their qualifications were, whether they independently labeled the same items, whether inter-rater reliability metrics were disclosed, what level of training and/or instructions were given to labelers, whether compensation for crowdworkers is disclosed, and if the training data is publicly available. We find a wide divergence in whether such practices were followed and documented. Much of machine learning research and education focuses on what is done once a "gold standard" of training data is available, but we discuss issues around the equally-important aspect of whether such data is reliable in the first place.
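Motivation and Objectives
- Assess how ML application papers in social computing report where their human-labeled training data comes from.
- Assess transparency about labeler sources, qualifications, training, and compensation.
- Examine reporting of inter-annotator reliability and data availability.
- Highlight the implications for data reliability and research integrity in supervised machine learning applications.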
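Proposed Method
- Assemble a corpus of machine learning papers on Twitter classification from ArXiv and Scopus (approximately 494 ArXiv papers; 29 Scopus papers).
- Have a six-person labeling team perform a structured content analysis of each paper's data labeling practices.
- Use a two-round labeling process with reconciliation to determine what each paper reports about labelers, training, definitions, and data availability.
- Develop raw and normalized information scores to quantify how much labeling detail each paper reports.
- Compute inter-rater reliability (IRR) as the mean percent agreement for each round (round one 66.67%; round two 84.80%).
- Release the dataset and code on GitHub and Zenodo for reproducibility.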
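The mean percent agreement mentioned in the method list can be illustrated with a minimal Python sketch. The summary does not say how agreement was computed for each item, so the sketch assumes pairwise agreement between raters, averaged over items. The function names and the toy labels are hypothetical and are not taken from the paper's released code.

```python
from itertools import combinations


def pairwise_percent_agreement(labels):
    """Percent of rater pairs that gave the same label to one item.

    Assumption: agreement is measured pairwise; the summary does not
    specify the exact per-item definition used by the authors.
    """
    pairs = list(combinations(labels, 2))
    agreeing = sum(1 for a, b in pairs if a == b)
    return 100.0 * agreeing / len(pairs)


def mean_percent_agreement(items):
    """Average the per-item pairwise agreement over all items."""
    scores = [pairwise_percent_agreement(labels) for labels in items]
    return sum(scores) / len(scores)


# Hypothetical toy data: three raters answer one yes/no question for four papers.
toy_round = [
    ["yes", "yes", "yes"],  # full agreement
    ["yes", "no", "yes"],   # one of three pairs agrees
    ["no", "no", "no"],     # full agreement
    ["no", "no", "no"],     # full agreement
]

toy_agreement = mean_percent_agreement(toy_round)
print(f"{toy_agreement:.2f}%")
```

Percent agreement is easy to read but does not correct for agreement expected by chance, which is why chance-corrected metrics are often reported alongside it.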
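Experimental Results
Research Questions
- RQ1: Do ML papers doing Twitter classification disclose whether their training data was labeled by humans?
- RQ2: Who are the labelers (authors, crowdworkers, experts, etc.), and how were they recruited?
- RQ3: What level of training, instructions, and inter-annotator reliability metrics is reported?
- RQ4: Is crowdworker compensation disclosed, and is the training data publicly available?
Key Findings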
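- Most papers involved an original classification task (142 yes, 17 no, 5 unsure).
- Among the papers assessed for human labeling, 93 reported human-labeled data (yes), 46 did not (no), and 4 were unsure.
- On whether the human labels were original to the paper, 72 papers reported original labeling (yes), 21 did not (no), and 3 were unsure.
- Labeler sources varied: the authors themselves were the source in 22 papers (29.73%), and "no information" was also common (24.32%); experts/professionals accounted for 21.62%, Amazon Mechanical Turk for 4.05%, other crowdwork for 10.81%, and other sources for 9.46%.
- Among papers using original human labeling, about half stated the number of labelers (41 yes; 44.60% did not specify).
- Formal instructions or definitions were reported in 32 papers (43.24%), while 35 (47.30%) gave no information about instructions; 7 (9.46%) indicated that no instructions were given beyond the question text.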
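The percentages in the findings above can be reproduced from the raw counts. The short Python sketch below assumes a denominator of 74 papers; that figure is not stated in the summary and is inferred here from the instruction counts (32 + 35 + 7). The helper name `share` and the dictionary keys are hypothetical.

```python
def share(count, total):
    """Return count as a percentage of total, rounded to two decimals."""
    return round(100.0 * count / total, 2)


# Assumption: 74 papers, inferred from 32 + 35 + 7; not stated in the summary.
TOTAL_PAPERS = 74

instruction_counts = {
    "formal instructions": 32,
    "no information": 35,
    "question text only": 7,
}

instruction_shares = {
    name: share(count, TOTAL_PAPERS) for name, count in instruction_counts.items()
}
author_share = share(22, TOTAL_PAPERS)  # papers whose authors did the labeling

for name, pct in instruction_shares.items():
    print(f"{name}: {pct:.2f}%")
print(f"authors as labelers: {author_share:.2f}%")
```

Under that assumed denominator, the computed shares match the percentages reported for instructions and for author-labeled papers.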
This summary was generated by AI and reviewed by a human editor.