QUICK REVIEW

[Paper Review] Hate Speech in Pixels: Detection of Offensive Memes towards Automatic Moderation

Benet Oriol Sàbat, Cristian Canton-Ferrer|arXiv (Cornell University)|Oct 5, 2019

Hate Speech and Cyberbullying Detection | References: 14 | Citations: 66

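One-line summary

This paper presents a multimodal approach that fuses visual (VGG-16) and textual (OCR + BERT) representations to detect hate speech in memes. The fused model outperforms either modality alone, but the task remains far from solved.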

ABSTRACT

This work addresses the challenge of hate speech detection in Internet memes, and attempts using visual information to automatically detect hate speech, unlike any previous work of our knowledge. Memes are pixel-based multimedia documents that contain photos or illustrations together with phrases which, when combined, usually adopt a funny meaning. However, hate memes are also used to spread hate through social networks, so their automatic detection would help reduce their harmful societal impact. Our results indicate that the model can learn to detect some of the memes, but that the task is far from being solved with this simple architecture. While previous work focuses on linguistic hate speech, our experiments indicate how the visual modality can be much more informative for hate speech detection than the linguistic one in memes. In our experiments, we built a dataset of 5,020 memes to train and evaluate a multi-layer perceptron over the visual and language representations, whether independently or fused. The source code and models are available at https://github.com/imatge-upc/hate-speech-detection .

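Motivation and objectives

Motivate the automatic moderation of hate memes on social media.
Investigate whether combining visual and textual information improves hate speech detection in memes.
Assess the relative informativeness of the visual and linguistic modalities in memes.
Provide a reproducible baseline built on state-of-the-art encoders for both modalities.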

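Proposed method

Extract the text from each meme with OCR (Tesseract 4.0.0).
Encode the text with BERT (bert-base-multilingual-cased) and average the word embeddings to obtain a sentence representation.
Encode the image with a VGG-16 pretrained on ImageNet, taking the last hidden layer (4,096 dimensions) as the image feature.
Concatenate the text and image features into a 4,864-dimensional multimodal representation.
Train an MLP with two hidden layers (100 neurons each, ReLU) that ends in a single output neuron producing the hate score.
Train with the Adam optimizer (lr 0.1, betas 0.9/0.999, eps 1e-8), batch size 25, and dropout 0.2, minimizing an MSE loss; performance is reported as binary accuracy.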
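
The fusion and classification steps above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the authors' implementation: the token and image vectors are random stand-ins for real BERT and VGG-16 outputs, the weight initialization is arbitrary, the output neuron is left linear (the review does not state an output activation), and dropout and the Adam training loop are omitted. All function names are hypothetical.

```python
import math
import random

TEXT_DIM = 768                      # hidden size of a BERT-base model
IMAGE_DIM = 4096                    # last hidden layer of VGG-16
FUSED_DIM = TEXT_DIM + IMAGE_DIM    # 4,864-dimensional multimodal vector


def mean_pool(token_vectors):
    """Average per-token embeddings into a single sentence vector."""
    n = len(token_vectors)
    return [sum(column) / n for column in zip(*token_vectors)]


def init_layer(n_in, n_out, rng):
    """Dense layer with small random weights and zero biases (arbitrary init)."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer):
    """Affine map: one dot product plus bias per output neuron."""
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


def hate_score(text_vec, image_vec, layers):
    """Concatenate both modalities, then 100 ReLU -> 100 ReLU -> 1 output."""
    h = list(text_vec) + list(image_vec)
    for layer in layers[:-1]:
        h = relu(dense(h, layer))
    return dense(h, layers[-1])[0]   # linear output (assumption)


def mse_loss(scores, labels):
    """Mean squared error between predicted scores and 0/1 labels."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


rng = random.Random(0)
layers = [init_layer(FUSED_DIM, 100, rng),
          init_layer(100, 100, rng),
          init_layer(100, 1, rng)]

# Random stand-ins for real encoder outputs (assumption, for illustration only).
tokens = [[rng.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(5)]
image_vec = [rng.gauss(0, 1) for _ in range(IMAGE_DIM)]

text_vec = mean_pool(tokens)
score = hate_score(text_vec, image_vec, layers)
print(len(text_vec) + len(image_vec), score)
```

Passing only `text_vec` or only `image_vec` to a first layer of matching width would give the single-modality variants that the fused model is compared against.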

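Experimental results

Research questions

RQ1: Can a multimodal approach that fuses text and image information detect hate speech in memes?
RQ2: Does the multimodal model outperform vision-only or text-only models on this task?
RQ3: How do OCR quality and language encoding affect hate speech detection in memes?
RQ4: What is the practical benefit of multimodal fusion compared with using a single modality?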

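Key findings

Multimodal fusion performs best of the three configurations.
Multimodal accuracy: 0.833 maximum (0.823 after smoothing).
Vision-only accuracy: 0.830 (0.804 smoothed).
Text-only accuracy: 0.761 (0.750 smoothed).
Average precision (from the precision-recall curve) of the best multimodal model: 0.81.
OCR and text-encoding quality can affect the language-based results, because meme text is often distorted and OCR has its limits.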
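
To make the reported metrics concrete, the sketch below computes binary accuracy, average precision, and a smoothed accuracy curve on toy data. The review does not say how the accuracy curve was smoothed or which threshold was used, so the trailing moving average, the window of 3, and the 0.5 threshold are assumptions chosen for illustration. The scores, labels, and validation accuracies are made-up values, not results from the paper.

```python
def binary_accuracy(scores, labels, threshold=0.5):
    """Fraction of examples whose thresholded score matches the 0/1 label."""
    preds = [1 if s >= threshold else 0 for s in scores]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def average_precision(scores, labels):
    """Mean of the precision values at each rank where a positive appears."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precisions = []
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0


def moving_average(values, window=3):
    """Trailing moving average (assumed smoothing; window is hypothetical)."""
    smoothed = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Toy data for illustration only.
scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 0, 1, 0]
val_acc = [0.70, 0.80, 0.75, 0.83, 0.78]   # hypothetical per-epoch accuracy

acc = binary_accuracy(scores, labels)
ap = average_precision(scores, labels)
smoothed = moving_average(val_acc)
print(acc, ap, max(val_acc), max(smoothed))
```

Smoothing damps single-epoch spikes, which is why a smoothed maximum sits below the raw maximum, as in the figures listed above.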

This review was generated by AI and checked by a human editor.