QUICK REVIEW

[Paper Review] To Mix or To Merge: Toward Multi-Domain Reinforcement Learning for Large Language Models

Haoqing Wang, Xiang Long|arXiv (Cornell University)|Feb 13, 2026

Topic Modeling | Cited by: 0

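One-Sentence Summary

The paper compares mixed multi-task RLVR against separate per-domain RLVR followed by model merging for multi-domain large language models. It finds that mixed training performs on par with merging while showing cross-domain synergy, and it provides an extensive analysis of weight changes, policy neighborhoods, and verification modes.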

ABSTRACT

Reinforcement Learning with Verifiable Rewards (RLVR) plays a key role in stimulating the explicit reasoning capability of Large Language Models (LLMs). We can achieve expert-level performance in some specific domains via RLVR, such as coding or math. When a general multi-domain expert-level model is required, we need to carefully consider the collaboration of RLVR across different domains. The current state-of-the-art models mainly employ two different training paradigms for multi-domain RLVR: mixed multi-task RLVR and separate RLVR followed by model merging. However, most of the works did not provide a detailed comparison and analysis about these paradigms. To this end, we choose multiple commonly used high-level tasks (e.g., math, coding, science, instruction following, and agent) as our target domains and design extensive qualitative and quantitative experiments using open-source datasets. We find the RLVR across domains exhibits few mutual interferences, and reasoning-intensive domains demonstrate mutually synergistic effects. Furthermore, we analyze the internal mechanisms of mutual gains from the perspectives of weight space geometry, information constraints, model prediction behavior and self-verification. This project is named as M2RL that means Mixed multi-task training or separate training followed by model Merging for Reinforcement Learning, and the homepage is at https://github.com/Mosi-AI/M2RL.

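Research Motivation and Objectives

Motivate the need for a general-purpose large language model with expert-level performance across multiple domains (math, coding, science, instruction following).
Evaluate two popular multi-domain RLVR paradigms: mixed multi-task RLVR, and separate per-domain RLVR followed by model merging.
Analyze the internal mechanisms that drive cross-domain gains: weight space geometry, prediction behavior, and information constraints.
Quantify the trade-off between training efficiency (GPU hours) and per-domain benchmark accuracy.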

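Proposed Method

Use Reinforcement Learning with Verifiable Rewards (RLVR), with Group Relative Policy Optimization (GRPO) as the learning algorithm.
Construct four target domains (math, coding, science, instruction following), using Nemotron-based SFT and RLVR datasets for each domain.
Compare mixed multi-task RLVR against separate per-domain RLVR combined with weight merging (averaging, task arithmetic, TIES merging, SCE) and against multi-teacher on-policy distillation (MT-OPD).
Use open-source datasets for supervised fine-tuning (SFT) and RLVR, start from Qwen3-4B-Base, and report Avg@K on 9 benchmarks.
Evaluate the merging methods and multi-task RLVR on AIME’24/’25, LiveCodeBench v5/v6, HLE, GPQA-Diamond, IFEval, IFBench, and MMLU-Redux.
Study the internal mechanisms: overlap of weight shifts, cosine similarity of the projected weights, KL divergence behavior, and policy neighborhood effects.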
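
The method list names GRPO as the learning algorithm. As a rough illustration of its core idea, the sketch below computes group-relative advantages: several responses are sampled for one prompt, each gets a verifiable reward, and the rewards are normalized within the group. The function name, the `eps` constant, and the 0/1 reward values are assumptions made for this example; the summary does not state the exact normalization the paper uses.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Hypothetical helper: subtract the group mean and divide by the group
    standard deviation, as in the commonly described GRPO formulation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled responses to one prompt; an assumed verifier returns
# 1.0 for a correct answer and 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every response gets the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses receive a positive advantage and incorrect ones a negative advantage, so no separate value model is needed to provide a baseline. A group with identical rewards yields all-zero advantages.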

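Experimental Results

Research Questions

RQ1: Can mixed multi-task RLVR reach performance comparable to separate per-domain RLVR followed by merging across multiple domains?
RQ2: How much cross-domain interference exists, and do reasoning-intensive domains show synergistic gains?
RQ3: How do weight-space changes and policy distributions mediate the cross-domain gains of multi-domain RLVR?
RQ4: How do different model merging techniques (averaging, task arithmetic, TIES, SCE, MT-OPD) affect cross-domain performance?
RQ5: How do verification modes (outcome-based vs. process-based) interact with multi-domain RLVR strategies and domain characteristics?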
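
To make the merging techniques named in RQ4 concrete, here is a minimal sketch of the two simplest ones, weight averaging and task arithmetic, on toy parameters. The dictionaries, the parameter name `"w"`, and the `scale` argument are hypothetical and chosen only for illustration; TIES, SCE, and MT-OPD involve further steps that are not shown, and the paper's actual implementation may differ.

```python
def average_merge(experts):
    """Element-wise mean of the parameters of several expert models."""
    n = len(experts)
    return {
        name: [sum(vals) / n for vals in zip(*(e[name] for e in experts))]
        for name in experts[0]
    }


def task_arithmetic_merge(base, experts, scale=1.0):
    """Add the scaled sum of task vectors (expert - base) to the base model."""
    merged = {}
    for name, base_w in base.items():
        merged[name] = [
            b + scale * sum(e[name][i] - b for e in experts)
            for i, b in enumerate(base_w)
        ]
    return merged


# Toy "models": one parameter tensor flattened to a list of floats.
base = {"w": [0.0, 1.0]}
math_expert = {"w": [0.5, 1.0]}   # changed only the first weight
code_expert = {"w": [0.0, 2.0]}   # changed only the second weight

avg = average_merge([math_expert, code_expert])                # {"w": [0.25, 1.5]}
ta = task_arithmetic_merge(base, [math_expert, code_expert])   # {"w": [0.5, 2.0]}
```

In this toy case averaging halves each expert's change, while task arithmetic with `scale=1.0` keeps both changes in full because the two experts modified different weights.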

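Key Findings

Mixed multi-task RLVR matches the performance of merged, separately trained RLVR models while using only about 33.2% of the GPU hours.
Cross-domain RLVR shows very little cross-task interference, and reasoning-intensive domains show synergistic gains.
The weight-shift footprints of different domains overlap heavily, and their cosine similarity after projection is positive, which suggests a shared adaptation region.
The KL divergence between the merged or multi-domain policy and the domain experts does not strictly predict performance degradation; policy transfer within a neighborhood shapes the domain policies toward the optimal policy.
Model merging tends to inherit the capabilities of the single-task models, whereas multi-task training yields broader, emergent capabilities that deviate from single-task training.
RLVR can introduce self-discrimination ability and cross-domain synergy, and multi-task RLVR improves both outcome-based and process-based judgment.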
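
The weight-space finding above can be illustrated with a small sketch that compares the weight updates of two domain experts relative to a shared base model. Everything here is an assumption made for the example: the toy weight vectors, the helper names, and the top-k overlap metric. The projection step mentioned in the finding is omitted because the summary does not describe it.

```python
import math


def delta(base, tuned):
    """Weight update of a fine-tuned model relative to the base model."""
    return [t - b for b, t in zip(base, tuned)]


def cosine_similarity(u, v):
    """Cosine of the angle between two update vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_overlap(u, v, k):
    """Fraction of shared indices among the k largest-magnitude entries."""
    def top(x):
        order = sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)
        return set(order[:k])
    return len(top(u) & top(v)) / k


# Toy flattened weights for a base model and two hypothetical domain experts.
base_w = [0.0, 0.0, 0.0, 0.0]
math_w = [0.4, 0.0, 0.3, 0.01]
code_w = [0.3, 0.02, 0.4, 0.0]

d_math = delta(base_w, math_w)
d_code = delta(base_w, code_w)

similarity = cosine_similarity(d_math, d_code)
overlap = top_k_overlap(d_math, d_code, k=2)
```

Here both experts move mostly the same two weights in the same direction, so the cosine similarity is positive and the top-2 footprints coincide, which is the kind of pattern the finding describes as a shared adaptation region.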

This review was generated by AI and reviewed by human editors.