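[Paper Digest] Harm or Humor: A Multimodal, Multilingual Benchmark for Overt and Covert Harmful Humor
TL;DR: The paper introduces a multimodal, multilingual benchmark for detecting harmful humor (both overt and covert) in English and Arabic text, images, and videos, and evaluates state-of-the-art open- and closed-source models on it.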
Dark humor often relies on subtle cultural nuances and implicit cues that require contextual reasoning to interpret, posing safety challenges that current static benchmarks fail to capture. To address this, we introduce a novel multimodal, multilingual benchmark for detecting and understanding harmful and offensive humor. Our manually curated dataset comprises 3,000 texts and 6,000 images in English and Arabic, alongside 1,200 videos that span English, Arabic, and language-independent (universal) contexts. Unlike standard toxicity datasets, we enforce a strict annotation guideline: distinguishing Safe jokes from Harmful ones, with the latter further classified into Explicit (overt) and Implicit (Covert) categories to probe deep reasoning. We systematically evaluate state-of-the-art (SOTA) open and closed-source models across all modalities. Our findings reveal that closed-source models significantly outperform open-source ones, with a notable difference in performance between the English and Arabic languages in both, underscoring the critical need for culturally grounded, reasoning-aware safety alignment. Warning: this paper contains example data that may be offensive, harmful, or biased.
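Motivation and Objectives
- Fill the gap in safety evaluation for implicit harmful humor, which requires cultural and contextual reasoning to detect
- Build a manually curated multimodal dataset covering English and Arabic text, images, and videos (plus language-independent "universal" video contexts)
- Evaluate open- and closed-source large language models (LLMs), vision-language models (VLMs), and video LLMs on a unified harmful-content detection task
- Examine language-specific weaknesses and the need for culturally grounded safety alignment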
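Proposed Method
- Curate 3,000 text jokes, 6,005 memes/images, and 1,202 short videos covering English, Arabic, and universal content, each with a harmfulness label
- Label each item as Safe or Harmful, and further classify Harmful items as Explicit (overt) or Implicit (covert), with final labels decided by majority vote
- Evaluate closed-source models (GPT-5.2, GPT-4o, Gemini) and open-source models (DeepSeek-Reasoner, Qwen, LLaMA-based models) across modalities on binary classification (harmful vs. safe), and report recall separately for the Explicit and Implicit categories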
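The labeling and scoring steps above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the lowercase label strings, and the tie-handling rule (return `None`) are hypothetical, and the F1 here is computed on the harmful class only, which may differ from the averaging the paper uses.

```python
from collections import Counter


def majority_label(votes):
    """Aggregate annotator votes by majority; a tie for first place returns None.

    Returning None on ties is an assumption made for this sketch.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None
    return ranked[0][0]


def evaluate(gold, pred):
    """Score binary predictions against three-way gold labels.

    gold: list of 'safe', 'explicit', or 'implicit'
    pred: list of 'safe' or 'harmful'
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")

    # Collapse the gold labels to the binary task: harmful vs. safe.
    gold_bin = ["safe" if g == "safe" else "harmful" for g in gold]

    correct = sum(g == p for g, p in zip(gold_bin, pred))
    tp = sum(g == "harmful" and p == "harmful" for g, p in zip(gold_bin, pred))
    fp = sum(g == "safe" and p == "harmful" for g, p in zip(gold_bin, pred))
    fn = sum(g == "harmful" and p == "safe" for g, p in zip(gold_bin, pred))

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    def category_recall(category):
        # Share of items in one harmful sub-category that were flagged as harmful.
        flags = [p == "harmful" for g, p in zip(gold, pred) if g == category]
        return sum(flags) / len(flags) if flags else None

    return {
        "accuracy": correct / len(gold) if gold else 0.0,
        "f1_harmful": f1,
        "implicit_recall": category_recall("implicit"),
        "explicit_recall": category_recall("explicit"),
    }


if __name__ == "__main__":
    gold = ["safe", "safe", "explicit", "explicit", "implicit", "implicit"]
    pred = ["safe", "harmful", "harmful", "harmful", "harmful", "safe"]
    print(majority_label(["safe", "harmful", "harmful"]))
    print(evaluate(gold, pred))
```

Separating the per-category recalls from the overall accuracy makes it visible when a model catches overt harm but misses covert harm, which a single aggregate score would hide.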
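Experimental Results
Research Questions
- RQ1: How well do current models detect harmful humor across text, images, and videos in English and Arabic?
- RQ2: Is there a gap between detecting implicit (context-dependent) harm and explicit harm, and does that gap depend on language?
- RQ3: How do open-source and closed-source models compare on multilingual, multimodal harmful humor detection?
- RQ4: How much does language (English vs. Arabic) affect safety alignment in multimodal humor understanding?
Key Findings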
| Model | Language | Accuracy (%) | F1 (%) | Implicit recall (%) | Explicit recall (%) |
|---|---|---|---|---|---|
| GPT-5.2 | English | 74.7 | 72.0 | 49.7 | 88.5 |
| GPT-4o | English | 74.3 | 70.8 | 45.1 | 80.5 |
| Gemini 3 Pro | English | 68.1 | 55.7 | 10.5 | 61.3 |
| Gemini 2.5 Pro | English | 73.2 | 67.9 | 33.7 | 81.2 |
| DeepSeek-Reasoner | English | 85.2 | 85.2 | 75.1 | 83.3 |
| Qwen2.5-14B | English | 84.0 | 83.8 | 82.8 | 95.7 |
| GPT-5.2 | Arabic | 60.6 | 60.6 | 47.4 | 42.0 |
| GPT-4o | Arabic | 61.8 | 61.8 | 42.0 | 46.8 |
| Gemini 2.5 Pro | Arabic | 70.2 | 70.1 | 41.9 | 68.7 |
| Qwen2.5-14B | Arabic | 73.4 | 72.8 | 55.1 | 77.2 |
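- Closed-source models generally outperform open-source models across modalities and languages (the table above is an exception: DeepSeek-Reasoner and Qwen2.5-14B post higher accuracy and F1 there than the closed-source models listed)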
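- Performance drops markedly from English to Arabic, especially for implicit harm detection
- Explicit harm is easier to detect than implicit harm, and for many models the gap is larger in Arabic
- The video and image modalities show a strong bias toward English; universal content generally scores below English but in some cases above Arabic
- Gemini 2.5 Pro tends to deliver the most balanced performance across modalities, languages, and implicit/explicit harm detection
- Open-source models tend to exhibit a safety bias or struggle to process multimodal cues, which lowers their recall on genuinely harmful content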
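To make the language and implicit/explicit comparisons concrete, the short Python sketch below recomputes the gaps from the table rows for the four models reported in both languages. It assumes the table columns are accuracy, F1, implicit recall, and explicit recall (consistent with the method description); the function names and the `SCORES` layout are hypothetical.

```python
# Scores copied from the table above, in column order:
# (accuracy, F1, implicit recall, explicit recall).
# Only models reported in both English and Arabic are included.
SCORES = {
    ("GPT-5.2", "English"): (74.7, 72.0, 49.7, 88.5),
    ("GPT-5.2", "Arabic"): (60.6, 60.6, 47.4, 42.0),
    ("GPT-4o", "English"): (74.3, 70.8, 45.1, 80.5),
    ("GPT-4o", "Arabic"): (61.8, 61.8, 42.0, 46.8),
    ("Gemini 2.5 Pro", "English"): (73.2, 67.9, 33.7, 81.2),
    ("Gemini 2.5 Pro", "Arabic"): (70.2, 70.1, 41.9, 68.7),
    ("Qwen2.5-14B", "English"): (84.0, 83.8, 82.8, 95.7),
    ("Qwen2.5-14B", "Arabic"): (73.4, 72.8, 55.1, 77.2),
}
METRICS = ("acc", "f1", "imp", "exp")


def language_gap(model):
    """English score minus Arabic score for each metric, in percentage points."""
    english = SCORES[(model, "English")]
    arabic = SCORES[(model, "Arabic")]
    return {m: round(e - a, 1) for m, e, a in zip(METRICS, english, arabic)}


def explicit_minus_implicit(model, language):
    """Explicit recall minus implicit recall for one table row."""
    _, _, implicit, explicit = SCORES[(model, language)]
    return round(explicit - implicit, 1)


if __name__ == "__main__":
    for model in sorted({name for name, _ in SCORES}):
        print(model, "English - Arabic:", language_gap(model))
        for language in ("English", "Arabic"):
            print("  ", language, "explicit - implicit:",
                  explicit_minus_implicit(model, language))
```

A positive value in `language_gap` means the model scores higher on English; a positive value from `explicit_minus_implicit` means explicit harm is recalled more often than implicit harm for that row.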
This digest was generated by AI and reviewed by a human editor.