QUICK REVIEW

[Paper Review] Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation

Benet Oriol Sàbat, Cristian Canton-Ferrer|arXiv (Cornell University)|Oct 5, 2019

Hate Speech and Cyberbullying Detection | References: 14 | Cited by: 66

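One-Sentence Summary

This paper proposes a multimodal method that fuses visual (VGG-16) and textual (OCR + BERT) representations to detect hate speech in memes. The multimodal model outperforms either modality alone, but the task remains challenging.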

ABSTRACT

This work addresses the challenge of hate speech detection in Internet memes, and attempts using visual information to automatically detect hate speech, unlike any previous work of our knowledge. Memes are pixel-based multimedia documents that contain photos or illustrations together with phrases which, when combined, usually adopt a funny meaning. However, hate memes are also used to spread hate through social networks, so their automatic detection would help reduce their harmful societal impact. Our results indicate that the model can learn to detect some of the memes, but that the task is far from being solved with this simple architecture. While previous work focuses on linguistic hate speech, our experiments indicate how the visual modality can be much more informative for hate speech detection than the linguistic one in memes. In our experiments, we built a dataset of 5,020 memes to train and evaluate a multi-layer perceptron over the visual and language representations, whether independently or fused. The source code and models are available at https://github.com/imatge-upc/hate-speech-detection .

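Motivation and Objectives

Motivate automatic moderation of hate memes on social media.
Investigate whether combining visual and textual information improves hate speech detection in memes.
Assess the relative informativeness of the visual and language modalities in memes.
Provide a reproducible baseline that uses state-of-the-art encoders for both modalities.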

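Proposed Method

Extract the text from each meme with OCR (Tesseract 4.0.0).
Encode the text with BERT (bert-base-multilingual-cased) and average the word vectors to obtain a sentence representation.
Encode the image with a VGG-16 pretrained on ImageNet, using the last hidden layer (4,096 dimensions) as the image feature.
Concatenate the text and image features into a 4,864-dimensional multimodal representation (768 text dimensions plus 4,096 image dimensions).
Train a multi-layer perceptron with two hidden layers (100 neurons each, ReLU) whose single output neuron gives the hate score.
Train with the Adam optimizer (learning rate 0.1, betas 0.9/0.999, eps 1e-8), batch size 25, dropout 0.2, and MSE loss; evaluate with binary accuracy.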
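The fusion and classification steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation (which is in the linked repository): the random vectors stand in for real BERT and VGG-16 features, the weights are untrained, the output neuron is assumed to be linear, and the 0.5 decision threshold for binary accuracy is an assumption. All function names are hypothetical.

```python
import random

TEXT_DIM = 768    # hidden size of a BERT-base model
IMAGE_DIM = 4096  # last hidden layer of VGG-16
HIDDEN = 100      # neurons per hidden layer


def mean_pool(token_vectors):
    """Average per-token vectors into a single sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def init_layer(n_in, n_out, rng):
    """Random weights and zero biases (illustrative initialisation only)."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu):
    """Fully connected layer, optionally followed by ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def hate_score(text_vec, image_vec, layers):
    """Concatenate both modalities and run the two-hidden-layer MLP."""
    x = text_vec + image_vec  # list concatenation: 768 + 4096 = 4864 values
    x = dense(x, layers[0], relu=True)
    x = dense(x, layers[1], relu=True)
    return dense(x, layers[2], relu=False)[0]  # assumed linear output neuron


def mse(scores, labels):
    """Mean squared error between scores and 0/1 labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)


def binary_accuracy(scores, labels, threshold=0.5):
    """Fraction of correct predictions; the 0.5 threshold is an assumption."""
    hits = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return hits / len(labels)


rng = random.Random(0)
# Stand-ins for real encoder outputs: 5 token vectors and 1 image vector.
tokens = [[rng.random() for _ in range(TEXT_DIM)] for _ in range(5)]
text_vec = mean_pool(tokens)
image_vec = [rng.random() for _ in range(IMAGE_DIM)]

layers = [
    init_layer(TEXT_DIM + IMAGE_DIM, HIDDEN, rng),
    init_layer(HIDDEN, HIDDEN, rng),
    init_layer(HIDDEN, 1, rng),
]
score = hate_score(text_vec, image_vec, layers)
n_params = sum(len(w) * len(w[0]) + len(b) for w, b in layers)
print(len(text_vec + image_vec), n_params)  # prints: 4864 496701
```

Almost all of the roughly 497,000 trainable parameters sit in the first layer, which maps the 4,864-dimensional fused vector to 100 units. Dropout and the Adam update are omitted here to keep the sketch dependency-free.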

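Experimental Results

Research Questions

RQ1: Can hate speech in memes be detected with a multimodal approach that fuses text and image information?
RQ2: Does the multimodal model outperform vision-only or text-only models on this task?
RQ3: How do OCR quality and language encoding affect hate speech detection in memes?
RQ4: What is the practical advantage of multimodal fusion over a single modality?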

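Key Findings

Multimodal fusion performed best of the three configurations.
Best maximum accuracy achieved: 0.833; smoothed maximum accuracy: 0.823.
Vision-only accuracy: 0.830 (0.804 smoothed).
Text-only accuracy: 0.761 (0.750 smoothed).
Average precision of the best multimodal model: 0.81 (precision-recall).
OCR and text-encoding quality affect the language-based results, because meme text is often distorted and OCR has limitations.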
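To put the reported accuracies side by side, the short calculation below computes the margin of the multimodal model over each single-modality model. It uses only the figures listed above and reads each pair as (maximum, smoothed maximum) accuracy, which is an assumption for the vision-only and text-only lines. The helper name `gaps_vs` is hypothetical.

```python
# Accuracies as reported in the findings: (maximum, smoothed maximum).
reported = {
    "multimodal": (0.833, 0.823),
    "vision only": (0.830, 0.804),
    "text only": (0.761, 0.750),
}


def gaps_vs(baseline, results):
    """Accuracy margin of `baseline` over every other configuration."""
    base_max, base_smooth = results[baseline]
    return {
        name: (round(base_max - acc_max, 3), round(base_smooth - acc_smooth, 3))
        for name, (acc_max, acc_smooth) in results.items()
        if name != baseline
    }


gaps = gaps_vs("multimodal", reported)
for name, (gap_max, gap_smooth) in gaps.items():
    print(f"multimodal vs {name}: +{gap_max:.3f} max, +{gap_smooth:.3f} smoothed")
# multimodal vs vision only: +0.003 max, +0.019 smoothed
# multimodal vs text only: +0.072 max, +0.073 smoothed
```

The margin over the vision-only model is small (0.003 on maximum accuracy), while the margin over the text-only model is about 0.07, consistent with the abstract's remark that the visual modality is the more informative one for memes.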

This review was generated by AI and reviewed by human editors.