QUICK REVIEW

[Paper Review] Dataset for Identification of Homophobia and Transophobia in Multilingual YouTube Comments

Bharathi Raja Chakravarthi, Ruba Priyadharshini|arXiv (Cornell University)|Sep 1, 2021

Hate Speech and Cyberbullying Detection | References: 64 | Cited by: 56

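ONE-SENTENCE SUMMARY

This paper proposes a hierarchical taxonomy and an expert-labelled dataset of multilingual YouTube comments for identifying homophobia and transphobia, and provides baseline models.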

ABSTRACT

The increased proliferation of abusive content on social media platforms has a negative impact on online users. The dread, dislike, discomfort, or mistrust of lesbian, gay, transgender or bisexual persons is defined as homophobia/transphobia. Homophobic/transphobic speech is a type of offensive language that may be summarized as hate speech directed toward LGBT+ people, and it has been a growing concern in recent years. Online homophobia/transphobia is a severe societal problem that can make online platforms poisonous and unwelcome to LGBT+ people while also attempting to eliminate equality, diversity, and inclusion. We provide a new hierarchical taxonomy for online homophobia and transphobia, as well as an expert-labelled dataset that will allow homophobic/transphobic content to be automatically identified. We educated annotators and supplied them with comprehensive annotation rules because this is a sensitive issue, and we previously discovered that untrained crowdsourcing annotators struggle with diagnosing homophobia due to cultural and other prejudices. The dataset comprises 15,141 annotated multilingual comments. This paper describes the process of building the dataset, qualitative analysis of data, and inter-annotator agreement. In addition, we create baseline models for the dataset. To the best of our knowledge, our dataset is the first such dataset created. Warning: This paper contains explicit statements of homophobia, transphobia, stereotypes which may be distressing to some readers.

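RESEARCH MOTIVATION AND OBJECTIVES

Propose a hierarchical taxonomy of online homophobia and transphobia.
Create and share an expert-labelled dataset of multilingual YouTube comments.
Because the topic is culturally sensitive, ensure annotation quality by training annotators and giving them comprehensive annotation rules.
Evaluate inter-annotator agreement in the annotation process.
Provide baseline models for identifying homophobic/transphobic content.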

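PROPOSED METHOD

Develop a new hierarchical taxonomy of homophobia and transphobia in online comments.
Collect and annotate a multilingual dataset using expert annotators and comprehensive rules.
Train the annotators and use structured annotation guidelines to reduce bias.
Analyse the data qualitatively and measure inter-annotator agreement.
Build baseline models for automatically identifying the target content.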
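
Inter-annotator agreement, mentioned in the steps above, is usually reported as a chance-corrected statistic. This review does not say which statistic the paper uses, so the sketch below uses Cohen's kappa for two annotators purely as an illustration. The label names and the two annotation lists are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)
    # Observed agreement: share of items on which the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each annotator labelled
    # at random according to their own label distribution.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical labels (assumption, not the paper's label set).
annotator_1 = ["flagged", "other", "other", "flagged", "other", "other"]
annotator_2 = ["flagged", "other", "flagged", "flagged", "other", "other"]
kappa = cohen_kappa(annotator_1, annotator_2)
print(round(kappa, 3))
```

With more than two annotators, or with missing labels, a statistic such as Krippendorff's alpha or Fleiss' kappa would be the more usual choice.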

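EXPERIMENTAL RESULTS

Research Questions

RQ1: Under the hierarchical taxonomy, what is the composition of homophobia and transphobia in multilingual YouTube comments?
RQ2: How do expert annotation and explicit rules improve the reliability of labels for sensitive content?
RQ3: What are the size and the multilingual composition of the annotated dataset?
RQ4: How well do baseline models identify homophobic/transphobic content in multilingual YouTube comments?
RQ5: What is the inter-annotator agreement in the annotation process?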

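Key Findings

The dataset contains 15,141 annotated multilingual comments.
Annotation was carried out by trained experts following comprehensive rules, to improve reliability.
The paper presents a qualitative analysis of the data and reports inter-annotator agreement.
Baseline models were built to establish initial performance on the dataset.
To the best of the authors' knowledge, this is the first dataset of its kind.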
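
This review gives no scores or metric for the baseline models. As a hedged illustration of how such a baseline might be scored, the sketch below computes macro-averaged F1, a common choice when the target class is rare. The metric, the label names and the toy predictions are all assumptions and are not results from the paper.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    pairs = list(zip(gold, predicted))
    scores = []
    for label in sorted(set(gold) | set(predicted)):
        tp = sum(g == label and p == label for g, p in pairs)
        fp = sum(g != label and p == label for g, p in pairs)
        fn = sum(g == label and p != label for g, p in pairs)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a class that is never matched scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: a rare "flagged" class and a model that never predicts it.
gold = ["other"] * 8 + ["flagged"] * 2
predicted = ["other"] * 10
accuracy = sum(g == p for g, p in zip(gold, predicted)) / len(gold)
score = macro_f1(gold, predicted)
print(accuracy, round(score, 3))
```

In this toy case the accuracy looks high while macro F1 is low, because the rare class is missed entirely. That gap is why a per-class average can be more informative than accuracy for this kind of data.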

This review was generated by AI and checked by a human editor.