QUICK REVIEW

[Paper Review] Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor

Ahmed Sharshar, Hosam Elgendy|arXiv (Cornell University)|Mar 18, 2026

Humor Studies and Applications | Citations: 0

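One-Line Summary

The paper proposes a multimodal, multilingual benchmark for detecting harmful humor in text, images, and video in English and Arabic, covering both explicit and implicit harm, and uses it to evaluate SOTA open and closed models.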

ABSTRACT

Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing Safe jokes from Harmful ones, with the latter further classified into Explicit (overt) and Implicit (Covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.

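Motivation and Objectives

Close the gap in safety evaluation for humor that carries implicit harm.
Build a manually curated dataset covering text, images, and video across English and Arabic (including universal video contexts).
Evaluate open-source and closed-source LLMs/VLMs and video LLMs on a unified harm-detection task.
Examine language-specific weaknesses and the need for culturally grounded safety alignment.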

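Proposed Method

Curate 3,000 text jokes, 6,005 memes/images, and 1,202 short videos with harm labels, spanning English, Arabic, and universal content.
Annotate each item by majority vote as Safe or Harmful (with an Explicit or Implicit sub-label).
Evaluate a mix of closed-source (GPT-5.2/4o, Gemini) and open-source (DeepSeek-Reasoner, Qwen, LLaMA-based) models across modalities, scoring Harmful vs. Safe as a binary task and computing recall separately for the Explicit and Implicit subsets.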
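The annotation and scoring steps above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function names `majority_label` and `evaluate`, the tie policy (return `None` for adjudication), and the choice of macro-averaged F1 are all assumptions, since the review does not state how ties are resolved or which F1 averaging the paper uses.

```python
from collections import Counter

SAFE, EXPLICIT, IMPLICIT = "Safe", "Explicit", "Implicit"


def majority_label(votes):
    """Return the most common annotator label, or None on a tie.

    The tie policy is hypothetical; the review only says "majority vote".
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


def evaluate(gold, pred_harmful):
    """Score binary Harmful-vs-Safe predictions.

    gold: list of "Safe" / "Explicit" / "Implicit" labels.
    pred_harmful: list of bools, True when the model says Harmful.
    """
    gold_harmful = [g != SAFE for g in gold]
    pairs = list(zip(gold_harmful, pred_harmful))
    tp = sum(g and p for g, p in pairs)
    tn = sum(not g and not p for g, p in pairs)
    fp = sum(not g and p for g, p in pairs)
    fn = sum(g and not p for g, p in pairs)

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    def subset_recall(label):
        # Fraction of gold items with this sub-label that were flagged Harmful.
        hits = [p for g, p in zip(gold, pred_harmful) if g == label]
        return sum(hits) / len(hits) if hits else float("nan")

    return {
        "acc": (tp + tn) / len(gold),
        # Assumed macro average over the Harmful and Safe classes.
        "macro_f1": (f1(tp, fp, fn) + f1(tn, fn, fp)) / 2,
        "implicit_recall": subset_recall(IMPLICIT),
        "explicit_recall": subset_recall(EXPLICIT),
    }
```

Because the model only outputs Harmful or Safe, the Explicit and Implicit sub-labels are used purely to slice recall: a model that flags overt jokes but misses covert ones shows high explicit recall and low implicit recall even when its overall accuracy looks acceptable.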

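Experimental Results

Research Questions

RQ1: How well can current models detect harmful humor in text, images, and video in English and Arabic?
RQ2: Do models show a detection gap between implicit (context-dependent) and explicit harm, and is that gap language-dependent?
RQ3: How do open-source and closed-source models compare on multilingual, multimodal harmful-humor detection?
RQ4: To what extent does language (English vs. Arabic) affect multimodal safety alignment?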

Key Findings

Model	Language	Accuracy	F1	Implicit recall	Explicit recall
GPT-5.2	English	74.7	72.0	49.7	88.5
GPT-4o	English	74.3	70.8	45.1	80.5
Gemini 3 Pro	English	68.1	55.7	10.5	61.3
Gemini 2.5 Pro	English	73.2	67.9	33.7	81.2
DeepSeek-Reasoner	English	85.2	85.2	75.1	83.3
Qwen2.5-14B	English	84.0	83.8	82.8	95.7
GPT-5.2	Arabic	60.6	60.6	47.4	42.0
GPT-4o	Arabic	61.8	61.8	42.0	46.8
Gemini 2.5 Pro	Arabic	70.2	70.1	41.9	68.7
Qwen2.5-14B	Arabic	73.4	72.8	55.1	77.2
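To make the comparisons in the table above easier to read, the following sketch computes two derived quantities from its rows: the gap between explicit and implicit recall for each model and language, and the English-to-Arabic change in any column for models that appear in both languages. The numbers are copied from the table; the helper names are hypothetical, and the review does not state which modality these rows cover.

```python
# Rows copied from the table: (model, language, acc, f1, implicit, explicit)
ROWS = [
    ("GPT-5.2", "English", 74.7, 72.0, 49.7, 88.5),
    ("GPT-4o", "English", 74.3, 70.8, 45.1, 80.5),
    ("Gemini 3 Pro", "English", 68.1, 55.7, 10.5, 61.3),
    ("Gemini 2.5 Pro", "English", 73.2, 67.9, 33.7, 81.2),
    ("DeepSeek-Reasoner", "English", 85.2, 85.2, 75.1, 83.3),
    ("Qwen2.5-14B", "English", 84.0, 83.8, 82.8, 95.7),
    ("GPT-5.2", "Arabic", 60.6, 60.6, 47.4, 42.0),
    ("GPT-4o", "Arabic", 61.8, 61.8, 42.0, 46.8),
    ("Gemini 2.5 Pro", "Arabic", 70.2, 70.1, 41.9, 68.7),
    ("Qwen2.5-14B", "Arabic", 73.4, 72.8, 55.1, 77.2),
]

COLUMNS = {"acc": 2, "f1": 3, "implicit": 4, "explicit": 5}


def explicit_implicit_gap(rows):
    """Explicit recall minus implicit recall, per (model, language)."""
    return {(m, lang): round(exp - imp, 1) for m, lang, _, _, imp, exp in rows}


def english_to_arabic_delta(rows, column):
    """Arabic value minus English value for models listed in both languages."""
    idx = COLUMNS[column]
    en = {r[0]: r[idx] for r in rows if r[1] == "English"}
    ar = {r[0]: r[idx] for r in rows if r[1] == "Arabic"}
    return {m: round(ar[m] - en[m], 1) for m in en if m in ar}


if __name__ == "__main__":
    for key, gap in explicit_implicit_gap(ROWS).items():
        print(key, gap)
    print(english_to_arabic_delta(ROWS, "acc"))
```

A positive gap means explicit harm is recalled more often than implicit harm; a negative English-to-Arabic delta means the model does worse on Arabic for that column.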

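Closed-source models are reported to generally outperform open-source ones across modalities and languages, although in the table above the open-source DeepSeek-Reasoner (English) and Qwen2.5-14B (Arabic) post the highest accuracy and F1.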
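Performance drops noticeably when moving from English to Arabic, particularly for Implicit harm detection.
Explicit harm is detected more reliably than Implicit harm, and for many models the gap is larger in Arabic.
The video and image modalities show a strong English bias; universal content generally scores lower than English, though in some cases it scores higher than Arabic.
Gemini 2.5 Pro tends to be the most balanced across modalities, languages, and Implicit/Explicit harm detection.
Open-source models either show a bias toward the Safe label or struggle with multimodal cues, both of which reduce their recall on genuinely harmful content.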

This review was generated by AI and checked by a human editor.