QUICK REVIEW

[Paper Review] Layer-wise Swapping for Generalizable Multilingual Safety

Hyunseo Shin, Wonseok Hwang | arXiv (Cornell University) | Jan 30, 2026

Natural Language Processing Techniques | Cited by 0

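One-Sentence Summary

This paper proposes a training-free, safety-aware layer/module swapping method that transfers safety alignment from an English safety expert to low-resource multilingual models, improving multilingual safety while preserving general language understanding.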

ABSTRACT

Despite the rapid advancements of Large Language Models (LLMs), safety risks remain a critical challenge for low-resource languages. Existing safety datasets are predominantly English-centric, limiting progress in multilingual safety alignment. As a result, low-resource expert models, finetuned on their respective instruction datasets, tend to exhibit higher unsafety rates compared to their high-resource counterparts. In this work, we propose a safety-aware layer swapping method that transfers safety alignment from an English safety expert to low-resource language experts without additional training. To further enhance transfer ability, our method adaptively selects or blends modules based on their degree of specialization. Our approach preserves performance on general language understanding tasks while enhancing safety in the target languages. Experimental results show that the proposed method achieves comparable performance to the language expert on general benchmarks such as MMMLU, BELEBELE, and MGSM, while producing more aligned and less harmful responses on the MultiJail safety benchmark.

Research Motivation and Objectives

Address safety gaps in multilingual LLMs due to English-centric safety datasets.
Propose a training-free layer-wise and module-wise swapping method to transfer safety alignment from English to low-resource languages.
Develop a task-vector-based framework to compose multilinguality and safety specialization within a unified representation space.
Automatically select or blend modules (attention and MLP) based on their degree of specialization to optimize safety transfer.
Demonstrate improved multilingual safety on low-resource languages while maintaining performance on general benchmarks.

Proposed Method

Formulate layer swapping as composing multilingual and safety task vectors (theta differences from base model).
Extend to module-wise swapping by decomposing into self-attention and MLP modules and computing their task vectors.
Compute module-wise importance using relative update magnitudes and normalize to produce layer/module importance scores.
Automatically select or blend modules with a threshold tau and an interpolation weight alpha (default tau=0.001, alpha=0.5).
Provide an efficient, training-free procedure (Algorithm 1) to construct hybrid models by merging safety and multilingual updates.
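
The steps listed above can be illustrated with a minimal sketch. It is not the paper's Algorithm 1: the summary does not state the exact importance formula or the select-versus-blend rule, so the rule used here (blend when the two normalized importance scores differ by less than tau, otherwise keep the expert with the larger score) is an assumption made for illustration. The function names, the module names such as `layer0.attn`, and the flat-list toy weights are likewise hypothetical.

```python
import math


def l2_norm(vec):
    """Euclidean norm of a flat list of weights."""
    return math.sqrt(sum(x * x for x in vec))


def task_vector(expert, base):
    """Per-module update of an expert relative to the base model."""
    return {m: [e - b for e, b in zip(expert[m], base[m])] for m in base}


def module_importance(delta, base, eps=1e-12):
    """Relative update magnitude per module, normalized to sum to 1."""
    raw = {m: l2_norm(delta[m]) / (l2_norm(base[m]) + eps) for m in delta}
    total = sum(raw.values()) + eps
    return {m: v / total for m, v in raw.items()}


def swap_modules(base, safety, lang, tau=0.001, alpha=0.5):
    """Build a hybrid model module by module (ASSUMED decision rule)."""
    d_safe = task_vector(safety, base)
    d_lang = task_vector(lang, base)
    s_safe = module_importance(d_safe, base)
    s_lang = module_importance(d_lang, base)
    merged, decisions = {}, {}
    for m in base:
        gap = s_safe[m] - s_lang[m]
        if abs(gap) < tau:
            # Neither expert is clearly more specialized: interpolate updates.
            w_safe, w_lang, decisions[m] = alpha, 1.0 - alpha, "blend"
        elif gap > 0:
            # Safety expert changed this module more: take its update only.
            w_safe, w_lang, decisions[m] = 1.0, 0.0, "safety"
        else:
            # Language expert changed this module more: take its update only.
            w_safe, w_lang, decisions[m] = 0.0, 1.0, "language"
        merged[m] = [
            b + w_safe * ds + w_lang * dl
            for b, ds, dl in zip(base[m], d_safe[m], d_lang[m])
        ]
    return merged, decisions


# Toy models: four modules with four weights each (hypothetical values).
modules = ["layer0.attn", "layer0.mlp", "layer1.attn", "layer1.mlp"]
base = {m: [1.0] * 4 for m in modules}
safety_shift = dict(zip(modules, [0.5, 0.1, 0.2, 0.2]))
lang_shift = dict(zip(modules, [0.1, 0.5, 0.2, 0.2]))
safety = {m: [w + safety_shift[m] for w in base[m]] for m in modules}
lang = {m: [w + lang_shift[m] for w in base[m]] for m in modules}

merged, decisions = swap_modules(base, safety, lang)
print(decisions)
```

In this toy setup the safety expert dominates `layer0.attn`, the language expert dominates `layer0.mlp`, and the two `layer1` modules are blended because both experts updated them equally. With real checkpoints the same loop would run over the attention and MLP parameter groups of each transformer layer.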

Figure 1: Comparison between prior layer swapping (Bandarkar et al., 2025), which relies on static, manual layer replacement (left), and our proposed safety-aware swapping method that automatically identifies and merges optimal attention and MLP modules for safety transfer (right).

Experimental Results

Research Questions

RQ1: Can safety alignment from an English safety expert be transferred to low-resource languages without additional training?
RQ2: Does dynamic, module-wise swapping preserve general language understanding while improving safety across languages?
RQ3: How should attention vs. MLP modules be blended or selected to optimize multilingual safety transfer?
RQ4: What is the impact of automatic layer/module selection on safety and general performance across multiple low-resource languages?
RQ5: Is the proposed approach robust across different base models and multilingual benchmarks?

Key Findings

Layer-wise swapping reduces multilingual unsafety on several languages while maintaining general performance on benchmarks like MMMLU, BELEBELE, and MGSM.
Module-wise swapping further improves cross-lingual robustness and safety transfer compared to layer-wise swapping and fixed baselines.
The adaptive, training-free merging strategy achieves safer outputs with competitive or improved performance relative to language-only or safety-only baselines.
Safety judges integrated into evaluation show high agreement with human judgments for harmful prompts (Qwen Guard ~85.5% avg accuracy).
Ablation studies indicate that tau=0.001 and alpha=0.5 provide the best trade-off between safety transfer and language understanding.

Figure 2: Workflow of our method. We begin with a pretrained base model and its safety-tuned and multilingual-tuned models. For each layer, we compute parameter updates $W$ relative to the base from safety-tuned and multilingual-tuned experts, measure module-wise importance (Attention and FFN), and

This review was generated by AI and reviewed by human editors.