QUICK REVIEW

[Paper Review] Detecting Hate Speech in Multi-modal Memes

Abhishek Das, Japsimar Singh Wahi | arXiv (Cornell University) | Dec 29, 2020

Hate Speech and Cyberbullying Detection | References: 30 | Citations: 38

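One-sentence summary

This paper tackles multi-modal hate speech detection on the Facebook Hateful Memes Challenge dataset, accounting for its benign confounders, by incorporating image captioning and sentiment analysis to better align visual and textual information.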

ABSTRACT

In the past few years, there has been a surge of interest in multi-modal problems, from image captioning to visual question answering and beyond. In this paper, we focus on hate speech detection in multi-modal memes wherein memes pose an interesting multi-modal fusion problem. We aim to solve the Facebook Meme Challenge (Kiela et al., 2020) which aims to solve a binary classification problem of predicting whether a meme is hateful or not. A crucial characteristic of the challenge is that it includes "benign confounders" to counter the possibility of models exploiting unimodal priors. The challenge states that the state-of-the-art models perform poorly compared to humans. During the analysis of the dataset, we realized that the majority of the data points which are originally hateful are turned into benign just by describing the image of the meme. Also, the majority of the multi-modal baselines give more preference to the hate speech (language modality). To tackle these problems, we explore the visual modality using object detection and image captioning models to fetch the "actual caption" and then combine it with the multi-modal representation to perform binary classification. This approach tackles the benign text confounders present in the dataset to improve the performance. Another approach we experiment with is to improve the prediction with sentiment analysis. Instead of only using multi-modal representations obtained from pre-trained neural networks, we also include the unimodal sentiment to enrich the features. We perform a detailed analysis of the above two approaches, providing compelling reasons in favor of the methodologies used.

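Motivation and Objectives

Motivate robust multi-modal detection that goes beyond unimodal baselines on memes containing adversarial benign confounders.
Investigate image captioning as a way to extract the actual caption of the image and fuse it with the multi-modal representation.
Evaluate sentiment analysis as a way to enrich multi-modal features and improve classification.
Analyze how object-detection-driven captions and sentiment indicators affect predictive performance on the Facebook Hateful Memes Challenge dataset.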

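Proposed Method

Use VisualBERT as the multi-modal baseline and add an image captioning module (Show, Attend and Tell; Bottom-Up Top-Down) to generate image captions, which are encoded with BERT and fused with the VisualBERT representation.
Extract the actual image caption through object detection and captioning, compare it with the pre-extracted caption, and fuse the two via concatenation or a bilinear transformation before classification.
Apply unimodal sentiment analysis to text (RoBERTa) and image (VGG-based visual sentiment) features, fuse the results with the multi-modal representation, and train an MLP classifier.
Combine image captions and sentiment indicators by concatenating their respective representations with the VisualBERT features, and evaluate the effect on performance.
Evaluate on the Facebook Hateful Memes Challenge dataset using AUC-ROC and accuracy.
Report that image captioning improves AUC-ROC by 3.6 points and accuracy by 6.7 points over the baseline, and that sentiment analysis improves accuracy by about 4%. Combining captioning and sentiment yields further improvement, although the gains are not always monotonic.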
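
The two fusion options named above, concatenation and a bilinear transformation, can be sketched in a few lines. This is a minimal illustration in plain Python, not the authors' implementation: the function names (`concat_fusion`, `bilinear_fusion`, `init_bilinear`), the toy vectors, and the dimensions are all assumptions made for the example, and real VisualBERT and BERT features would be much larger tensors handled by a deep learning framework.

```python
import random


def concat_fusion(multimodal, caption):
    """Concatenation: place the two feature vectors end to end."""
    return list(multimodal) + list(caption)


def bilinear_fusion(multimodal, caption, weights, bias):
    """Bilinear map: out[k] = sum_ij multimodal[i] * weights[k][i][j] * caption[j] + bias[k]."""
    fused = []
    for w_k, b_k in zip(weights, bias):
        total = b_k
        for i, x_i in enumerate(multimodal):
            for j, y_j in enumerate(caption):
                total += x_i * w_k[i][j] * y_j
        fused.append(total)
    return fused


def init_bilinear(dim_x, dim_y, dim_out, seed=0):
    """Hypothetical helper: small random weights of shape (dim_out, dim_x, dim_y), zero bias."""
    rng = random.Random(seed)
    weights = [
        [[rng.uniform(-0.1, 0.1) for _ in range(dim_y)] for _ in range(dim_x)]
        for _ in range(dim_out)
    ]
    bias = [0.0] * dim_out
    return weights, bias


# Toy stand-ins (assumed values) for a VisualBERT vector and a BERT caption vector.
visualbert_vec = [0.2, -0.5, 0.1, 0.7]
caption_vec = [0.4, 0.3, -0.6]

concat_out = concat_fusion(visualbert_vec, caption_vec)
weights, bias = init_bilinear(len(visualbert_vec), len(caption_vec), dim_out=5)
bilinear_out = bilinear_fusion(visualbert_vec, caption_vec, weights, bias)

print(len(concat_out))    # 7: the sum of the two input dimensions
print(len(bilinear_out))  # 5: the chosen output dimension
```

Concatenation adds no parameters at the fusion step, whereas the bilinear map needs `dim_out * dim_x * dim_y` weights, so its cost grows with the product of the two feature dimensions. In either case the fused vector would then be passed to the classifier.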

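Experimental Results

Research Questions

RQ1: Can image captions generated from memes mitigate the benign text confounders that mislead multi-modal hate speech detectors?
RQ2: Does incorporating sentiment information from the text and image modalities improve hate speech classification beyond existing multi-modal representations?
RQ3: How does fusing caption-derived features with VisualBERT affect detection performance?
RQ4: Does the combination of image captions and sentiment indicators consistently outperform the baseline across meme variants that contain benign confounders?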

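Key Findings

Image-captioning-based representations improve substantially over the VisualBERT baseline, by about 3.6 points in AUC-ROC and about 6.7 points in accuracy.
Incorporating unimodal sentiment analysis improves accuracy by about 4%, particularly when the text and image sentiments agree or conflict.
Object-detection-driven captions help identify benign confounders and improve hateful meme detection.
In this setting, bilinear fusion does not outperform concatenation and is slower, so simple concatenation is preferred.
Combining image captioning and sentiment features with VisualBERT yields further improvement, although accuracy can drop in some cases because of feature conflicts or redundancy.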
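
For readers unfamiliar with the two metrics these findings are reported in, the sketch below shows one way to compute accuracy and AUC-ROC for a binary hateful/benign classifier. It is an illustration only: the labels and scores are made-up toy values, the 0.5 decision threshold is an assumption, and the pairwise AUC formulation is a simple O(n²) version chosen for readability rather than the routine used in the paper.

```python
def accuracy(labels, scores, threshold=0.5):
    """Fraction of examples whose thresholded score matches the label."""
    predictions = [1 if s >= threshold else 0 for s in scores]
    correct = sum(int(p == y) for p, y in zip(predictions, labels))
    return correct / len(labels)


def auc_roc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one example of each class")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (assumed): 1 = hateful, 0 = benign; scores are predicted probabilities.
labels = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]

print(round(auc_roc(labels, scores), 3))   # 0.778
print(round(accuracy(labels, scores), 3))  # 0.667
```

AUC-ROC depends only on how the scores rank hateful examples against benign ones, while accuracy depends on the chosen threshold, which is one reason the two metrics can move by different amounts for the same model change.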

This review was generated by AI and checked by a human editor.