QUICK REVIEW

[Paper Review] Multimodal datasets: misogyny, pornography, and malignant stereotypes

Abeba Birhane, Vinay Uday Prabhu|arXiv (Cornell University)|Oct 5, 2021

Gender, Feminism, and Media|References: 56|Cited by: 150

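One-Sentence Summary

This paper audits the LAION-400M multimodal dataset, documents explicit misogynistic, pornographic, and stereotyping content in it, and discusses the broader harms along with open questions for stakeholders.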

ABSTRACT

We have now entered the era of trillion parameter machine learning models trained on billion-sized datasets scraped from the internet. The rise of these gargantuan datasets has given rise to formidable bodies of critical work that has called for caution while generating these large datasets. These address concerns surrounding the dubious curation practices used to generate these datasets, the sordid quality of alt-text data available on the world wide web, the problematic content of the CommonCrawl dataset often used as a source for training large language models, and the entrenched biases in large-scale visio-linguistic models (such as OpenAI's CLIP model) trained on opaque datasets (WebImageText). In the backdrop of these specific calls of caution, we examine the recently released LAION-400M dataset, which is a CLIP-filtered dataset of Image-Alt-text pairs parsed from the Common-Crawl dataset. We found that the dataset contains, troublesome and explicit images and text pairs of rape, pornography, malign stereotypes, racist and ethnic slurs, and other extremely problematic content. We outline numerous implications, concerns and downstream harms regarding the current state of large scale datasets while raising open questions for various stakeholders including the AI community, regulators, policy makers and data subjects.

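Research Motivation and Goals

Assess the content and biases of LAION-400M, a large multimodal dataset built from Common Crawl through a CLIP-filtered pipeline.
Highlight the risk of misogynistic, pornographic, racist, and other harmful content in image-text pairs.
Critique the curation, filtering, and detoxification practices currently applied to large datasets for vision-language models.
Discuss the ethical, regulatory, and practical implications for data subjects, AI developers, and policy makers.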

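Proposed Method

Examine LAION-400M content qualitatively and quantitatively, looking at both its CLIP-based filtering and its alt-text.
Describe the dataset construction pipeline: crawl a massive web corpus, filter by CLIP similarity, and select image-text pairs.
Empirically measure NSFW prevalence in the retrieval results for several queries, under text and image filtering.
Discuss CLIP's known biases and potential spurious associations in the filtered data.
Reflect on the asymmetry between data collection (scraping) and downstream detoxification efforts.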
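
The similarity-filtering step in the pipeline can be sketched as follows. This is a minimal illustration, not the LAION or CLIP implementation: the embeddings are toy three-dimensional vectors rather than real CLIP outputs, the names `cosine_similarity`, `filter_pairs`, `image_emb`, and `text_emb` are hypothetical, and the 0.3 threshold is the example value this review cites for the filter.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as matching nothing
    return dot / (norm_a * norm_b)


def filter_pairs(pairs, threshold=0.3):
    """Keep pairs whose image and text embeddings agree at or above the threshold."""
    return [
        p for p in pairs
        if cosine_similarity(p["image_emb"], p["text_emb"]) >= threshold
    ]


# Toy stand-ins for CLIP embeddings (assumption: real embeddings are much larger).
pairs = [
    {"caption": "a", "image_emb": [1.0, 0.0, 0.0], "text_emb": [0.9, 0.1, 0.0]},
    {"caption": "b", "image_emb": [1.0, 0.0, 0.0], "text_emb": [0.0, 1.0, 0.0]},
    {"caption": "c", "image_emb": [1.0, 1.0, 0.0], "text_emb": [1.0, -0.5, 1.0]},
]

kept = filter_pairs(pairs)
print([p["caption"] for p in kept])
```

In this sketch only pair "a" clears the threshold. The filter scores how well a caption matches its image and nothing else, so an explicit image with an accurately explicit caption would pass just as easily as a benign pair.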

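Experimental Results

Research Questions

RQ1: What is the prevalence and nature of explicit and harmful content (misogyny, pornography, stereotypes) in LAION-400M?
RQ2: How do filtering and curation pipelines (such as a CLIP similarity threshold) affect downstream harms and biases?
RQ3: What are the ethical, regulatory, and practical implications of releasing and using vision-language datasets at this scale?
RQ4: What asymmetries exist between data collection and curation on one side and detoxification efforts on the other, and how do they affect model harms?
RQ5: Which open questions about dataset composition and use should stakeholders (researchers, policy makers, data subjects) address?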

Key Findings

Search string	N_match	(N_nsfw, %nsfw)	NSFW-flag-values
Desi	34516	(11782, 34.1%)	{'UNLIKELY': 9327, 'UNSURE': 2291, 'NSFW': 164}
Nun	16766	(2761, 16.4%)	{'UNLIKELY': 1623, 'UNSURE': 863, 'NSFW': 273}
Latina	37769	(10658, 28.21%)	{'UNSURE': 5724, 'UNLIKELY': 4013, 'NSFW': 918}
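
As a quick arithmetic check, the percentage column can be recomputed from the two count columns as N_nsfw / N_match. The sketch below uses only the counts printed in the table; the names `rows` and `nsfw_share` are illustrative.

```python
# (N_match, N_nsfw) per search string, copied from the table above.
rows = {
    "Desi": (34516, 11782),
    "Nun": (16766, 2761),
    "Latina": (37769, 10658),
}


def nsfw_share(n_match, n_nsfw):
    """Percentage of matched samples that carry an NSFW-related flag."""
    return 100.0 * n_nsfw / n_match


shares = {term: nsfw_share(*counts) for term, counts in rows.items()}
for term, share in shares.items():
    print(f"{term}: {share:.2f}%")
```

Rounded to two decimals, the recomputed shares are about 34.13%, 16.47%, and 28.22%. The table's 16.4% and 28.21% differ in the last digit, which looks like truncation rather than rounding, though that is only a guess.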

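A search audit of LAION-400M surfaces NSFW and explicit images for seemingly innocuous queries (such as Desi, Nun, and Latina).
A substantial share of the matches for these terms carry an NSFW-related flag (34.1% for Desi, 16.4% for Nun, 28.21% for Latina), which points to a risk of biased and harmful associations in retrieval results.
A CLIP-based filtering threshold (e.g., a cosine similarity of 0.3) may fail to keep harmful content out, because of model biases and edge cases.
There is a marked asymmetry between how easy it is to scrape or assemble a massive dataset and how much effort downstream detoxification and harm reduction require.
Dataset curation processes often lack robust, joint image-text filtering, which makes it easy for biases and stereotypes to propagate.
The emotional strain and potential trauma for researchers who curate sensitive data are not trivial and are often underestimated.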
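
The kind of search audit behind these findings can be approximated with a short tally. This is a hedged sketch rather than the authors' procedure: the review does not say how captions were matched, so case-insensitive substring matching is an assumption, and the record fields `caption` and `nsfw`, the function `audit_query`, and the sample captions are all hypothetical. Only the flag values (UNLIKELY, UNSURE, NSFW) come from the table in this section.

```python
from collections import Counter


def audit_query(records, query):
    """Count caption matches for a query and tally their NSFW flag values."""
    needle = query.lower()
    matches = [r for r in records if needle in r["caption"].lower()]
    flags = Counter(r["nsfw"] for r in matches)
    return len(matches), flags


# Made-up metadata rows standing in for dataset records.
records = [
    {"caption": "Nun praying in a chapel", "nsfw": "UNLIKELY"},
    {"caption": "Costume party: nun outfit", "nsfw": "UNSURE"},
    {"caption": "Mountain landscape at dawn", "nsfw": "UNLIKELY"},
    {"caption": "nun halloween costume", "nsfw": "NSFW"},
]

n_match, flags = audit_query(records, "nun")
print(n_match, dict(flags))
```

Substring matching is crude: it would also count captions where the query appears inside a longer word, so a real audit would likely need word-boundary matching and manual review of samples.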


This review was generated by AI and checked by a human editor.