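[Paper Review] SalamahBench: Toward Standardized Safety Evaluation for Arabic Language Models
SalamahBench standardizes safety evaluation for Arabic language models with 8,170 prompts spanning 12 MLCommons hazard categories. It also compares dedicated safety guard models with ALMs acting as their own guards, in order to study category-aware safety and guard effectiveness.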
Safety alignment in Language Models (LMs) is fundamental for trustworthy AI. However, while different stakeholders are trying to leverage Arabic Language Models (ALMs), systematic safety evaluation of ALMs remains largely underexplored, limiting their mainstream uptake. Existing safety benchmarks and safeguard models are predominantly English-centric, limiting their applicability to Arabic Natural Language Processing (NLP) systems and obscuring fine-grained, category-level safety vulnerabilities. This paper introduces SalamahBench, a unified benchmark for evaluating the safety of ALMs, comprising 8,170 prompts across 12 different categories aligned with the MLCommons Safety Hazard Taxonomy. Constructed by harmonizing heterogeneous datasets through a rigorous pipeline involving AI filtering and multi-stage human verification, SalamahBench enables standardized, category-aware safety evaluation. Using this benchmark, we evaluate five state-of-the-art ALMs, including Fanar 1 and 2, ALLaM 2, Falcon H1R, and Jais 2, under multiple safeguard configurations, including individual guard models, majority-vote aggregation, and validation against human-annotated gold labels. Our results reveal substantial variation in safety alignment: while Fanar 2 achieves the lowest aggregate attack success rates, its robustness is uneven across specific harm domains. In contrast, Jais 2 consistently exhibits elevated vulnerability, indicating weaker intrinsic safety alignment. We further demonstrate that native ALMs perform substantially worse than dedicated safeguard models when acting as safety judges. Overall, our findings highlight the necessity of category-aware evaluation and specialized safeguard mechanisms for robust harm mitigation in ALMs.
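Research Motivation and Objectives
- Address the need for language-specific safety evaluation of Arabic Language Models (ALMs).
- Create SalamahBench, a unified, category-aware Arabic safety benchmark mapped to the MLCommons taxonomy.
- Evaluate multiple ALMs under multiple guard configurations to characterize safety alignment and guard effectiveness.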
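Proposed Method
- Build SalamahBench by aggregating heterogeneous Arabic safety datasets and mapping each prompt to one of the twelve MLCommons hazard categories.
- Apply AI filtering and multi-stage human verification to ensure data quality and alignment with Arabic language and culture.
- Evaluate five ALMs (Fanar 1, Fanar 2, ALLaM 2, Falcon H1R, Jais 2) under multiple guard configurations: individual guard models, majority-vote aggregation, and validation against gold labels.
- Use native Arabic safety judgments rather than translation, to avoid translation distortion.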
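The majority-vote configuration and a per-category attack success rate (ASR) can be illustrated with a minimal sketch. It assumes that each guard model returns a binary verdict ("safe" or "unsafe") for a model response, and that ASR is the fraction of prompts in a category whose response is judged unsafe. The function names, the tie-breaking rule, and the toy data are hypothetical and are not taken from the paper.

```python
from collections import Counter, defaultdict


def majority_vote(verdicts):
    """Aggregate guard verdicts ("safe"/"unsafe") into a single label.

    Assumption: a tie counts as "unsafe" (a conservative choice made for
    this sketch, not a rule stated in the paper).
    """
    counts = Counter(verdicts)
    return "unsafe" if counts["unsafe"] >= counts["safe"] else "safe"


def attack_success_rate(records):
    """Return {category: fraction of responses judged unsafe}.

    `records` is an iterable of (category, [verdict, ...]) pairs, one pair
    per prompt, with one verdict per guard model.
    """
    totals = defaultdict(int)
    unsafe = defaultdict(int)
    for category, verdicts in records:
        totals[category] += 1
        if majority_vote(verdicts) == "unsafe":
            unsafe[category] += 1
    return {category: unsafe[category] / totals[category] for category in totals}


# Hypothetical toy data: two categories, three guard verdicts per prompt.
records = [
    ("hate", ["unsafe", "unsafe", "safe"]),
    ("hate", ["safe", "safe", "safe"]),
    ("privacy", ["safe", "unsafe", "safe"]),
    ("privacy", ["safe", "safe", "safe"]),
]
asr = attack_success_rate(records)
```

Reporting ASR per category rather than as a single aggregate is what makes uneven robustness visible: a model can have a low overall rate while one category remains weak.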
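Experimental Results
Research Questions
- RQ1: How do ALMs perform under category-aware safety evaluation across the MLCommons hazard categories?
- RQ2: How effective are native-language guard configurations (including majority voting) at reducing attack success rates and agreeing with human judgments?
- RQ3: How do native ALMs acting as self-guards compare with dedicated guard models?
- RQ4: Across models, which hazard categories show the strongest or weakest safety alignment?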
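Key Findings
- Fanar 2 achieves the lowest aggregate attack success rate among the evaluated ALMs, but its robustness is uneven across harm domains.
- Jais 2 is markedly more vulnerable across multiple harm categories, indicating weaker intrinsic safety alignment.
- Native ALMs acting as self-guards perform far worse than dedicated guard models on safety judgment tasks.
- Majority-vote aggregation over multiple guard models achieves the best agreement with human gold labels and yields the lowest attack success rates across categories.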
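Agreement between a guard configuration and the human gold labels can be quantified in several ways. The sketch below computes raw accuracy and Cohen's kappa as one possible pair of measures; the summary does not state which agreement metric the paper reports, so both the choice of metric and the function name `agreement` are assumptions, and the labels are made-up examples.

```python
def agreement(pred, gold):
    """Return (accuracy, Cohen's kappa) for two equal-length label lists.

    Assumption: kappa is used here only as an illustrative chance-corrected
    agreement measure; the paper's actual metric is not given in this summary.
    """
    if len(pred) != len(gold) or not gold:
        raise ValueError("pred and gold must be non-empty and of equal length")
    n = len(gold)
    # Observed agreement: share of items where both labels match.
    observed = sum(p == g for p, g in zip(pred, gold)) / n
    # Expected agreement by chance, from each side's label frequencies.
    labels = set(pred) | set(gold)
    expected = sum((pred.count(l) / n) * (gold.count(l) / n) for l in labels)
    if expected == 1.0:
        # Both sides use one identical label throughout; kappa is defined as 1 here.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Hypothetical example: guard verdicts versus human gold labels.
gold = ["unsafe", "unsafe", "safe", "safe"]
pred = ["unsafe", "safe", "safe", "safe"]
accuracy, kappa = agreement(pred, gold)
```

With these toy labels the guard agrees with the annotators on three of four items; kappa discounts the portion of that agreement expected by chance.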
This review was generated by AI and checked by a human editor.