QUICK REVIEW

[Paper Review] Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning

Yiqun Chen, Jinyuan Feng|arXiv (Cornell University)|Jan 29, 2026

Reinforcement Learning in Robotics|Cited by 0

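One-Sentence Summary

Introduces SCMA, a multi-agent reinforcement learning framework that jointly trains three roles (Reasoning, Segmentation, Scoring) to compress the reasoning process without adding test-time overhead, improving accuracy while shortening responses across several models and datasets.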

ABSTRACT

The inference overhead induced by redundant reasoning undermines the interactive experience and severely bottlenecks the deployment of Large Reasoning Models. Existing reinforcement learning (RL)-based solutions tackle this problem by coupling a length penalty with outcome-based rewards. This simplistic reward weighting struggles to reconcile brevity with accuracy, as enforcing brevity may compromise critical reasoning logic. In this work, we address this limitation by proposing a multi-agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic. Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents: a Segmentation Agent for decomposing the reasoning process into logical chunks, and a Scoring Agent for quantifying the significance of each chunk. The Segmentation and Scoring agents collaboratively define an importance-weighted length penalty during training, incentivizing a Reasoning Agent to prioritize essential logic without introducing inference overhead during deployment. Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%. Furthermore, ablation studies and qualitative analysis validate that the synergistic optimization within the MARL framework fosters emergent behaviors, yielding more powerful LRMs compared to vanilla RL paradigms.

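Motivation and Objectives

Address the efficiency bottleneck in Large Reasoning Models by reducing redundant CoT steps.
Propose a MARL framework that uses specialized agents to decompose, evaluate, and prune reasoning chunks without sacrificing correctness.
Design a shared reward with an importance-weighted length penalty to selectively remove redundancy.
Show that joint MARL optimization yields shorter reasoning paths and higher accuracy across model scales.
Analyze the emergent fine-grained compression behavior, and show that the gains come from training with no deployment overhead.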

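Proposed Method

Formalize SCMA as a MARL system built on a shared base large language model with three agents: Reasoning (generates y), Segmentation (splits y into chunks), and Scoring (assigns an importance w_i to each chunk).
Replace the naive length penalty with an importance-weighted one, R(y|x) = R_acc(y|x) - alpha * f(sum_i phi(w_i) * |s_i|), where s_i is the i-th chunk and phi(w_i) maps its importance to a penalty weight.
Train with Multi-Agent GRPO, sharing parameters theta and using a common global reward to co-evolve the Reasoning, Segmentation, and Scoring policies.
Define structured observations and actions for each agent, using the prompts P_reason, P_seg, and P_score together with XML-like constraints that enforce correct formatting and cooperation.
Show that the objective is equivalent to maximizing expected accuracy under a weighted length constraint, and provide a format reward to stabilize MARL training.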
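As a rough illustration of the importance-weighted penalty above, the following Python sketch computes a reward from chunks that have already been segmented and scored. Several pieces are assumptions made for illustration and are not taken from the paper: the mapping phi(w) = 1 - w with w in [0, 1], the choice of f as a clipped ratio against a hypothetical token budget, whitespace token counting for |s_i|, and all class and function names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Chunk:
    text: str          # one segment s_i (would come from the Segmentation agent)
    importance: float  # score w_i (would come from the Scoring agent), assumed in [0, 1]


def phi(w: float) -> float:
    """Assumed mapping: the more important a chunk, the smaller its penalty weight."""
    return 1.0 - min(max(w, 0.0), 1.0)


def weighted_length(chunks: List[Chunk]) -> float:
    """sum_i phi(w_i) * |s_i|, with |s_i| approximated by a whitespace token count."""
    return sum(phi(c.importance) * len(c.text.split()) for c in chunks)


def reward(chunks: List[Chunk], correct: bool,
           alpha: float = 0.5, budget: float = 100.0) -> float:
    """R = R_acc - alpha * f(weighted length); f is an assumed clipped ratio."""
    r_acc = 1.0 if correct else 0.0
    penalty = min(weighted_length(chunks) / budget, 1.0)
    return r_acc - alpha * penalty


# Hypothetical trace: one essential step and one redundant re-check.
trace = [
    Chunk("Let x be the number of apples so 3 x = 12", importance=1.0),
    Chunk("wait let me double check that again just to be sure", importance=0.0),
]
naive_length = sum(len(c.text.split()) for c in trace)
print(naive_length, weighted_length(trace), reward(trace, correct=True))
```

In this toy trace only the low-importance chunk contributes to the penalty, whereas a plain length penalty would charge for every token, including the essential step.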

Figure 1: Overview of SCMA compared to general RL with a length penalty. (Left) General RL computes rewards by directly penalizing the length of the thinking process. (Right) SCMA employs an importance-weighted length penalty within a multi-agent system.
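The method list mentions Multi-Agent GRPO with a common global reward. A minimal sketch of how such a shared reward could become a training signal is shown below, using the group-relative standardization that GRPO is known for. How SCMA actually assigns credit across the three agents is not described in this summary, so the broadcast step, the use of the population standard deviation, and the function names are assumptions.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """GRPO-style advantage: standardize each reward within its sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero-variance group
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_agents(
    advantages: List[float],
    roles: Sequence[str] = ("reasoning", "segmentation", "scoring"),
) -> Dict[str, List[float]]:
    """Assumption: every role in rollout i gets the same advantage, since the reward is shared."""
    return {role: list(advantages) for role in roles}


# Hypothetical shared rewards for a group of four rollouts of the same prompt.
group_rewards = [0.95, 0.60, 0.10, 0.80]
per_agent = broadcast_to_agents(group_relative_advantages(group_rewards))
for role, adv in per_agent.items():
    print(role, [round(a, 3) for a in adv])
```

Because all three roles share one parameter set theta and one reward, a rollout that is both correct and concise pushes the reasoning, segmentation, and scoring behaviors in the same direction under this sketch.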

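Experimental Results

Research Questions

RQ1: Can SCMA outperform existing RL baselines with length penalties across multiple datasets and model scales in achieving concise yet accurate reasoning?
RQ2: How does the penalty weight alpha affect the trade-off between reasoning length and accuracy, as well as the stability of SCMA training?
RQ3: Is multi-agent co-optimization essential for fine-grained compression, or can a single-agent approach achieve the same effect?
RQ4: How do fine-grained segmentation and scoring emerge during SCMA training to enable semantic-level compression?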

Key Findings

Method	GSM8K_Acc	GSM8K_Tokens	MATH500_Acc	MATH500_Tokens	AIME24_Acc	AIME24_Tokens	AIME25_Acc	AIME25_Tokens	AMC23_Acc	AMC23_Tokens	Overall_Acc	Overall_Tokens
SCMA (Ours) - Qwen3-8B	94.99	369	89.20	1999	60.00	6475	43.33	7402	89.60	3599	75.42	3889
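As a quick arithmetic check on the row above, the sketch below recomputes unweighted means over the five benchmarks. The reported Overall_Acc (75.42) matches the unweighted mean of the five accuracies. The reported Overall_Tokens (3889) does not match the unweighted mean of the five token counts (3968.8), so that column may be aggregated differently; this summary does not say how.

```python
# Values copied from the SCMA (Ours) - Qwen3-8B row: (accuracy, tokens).
row = {
    "GSM8K": (94.99, 369),
    "MATH500": (89.20, 1999),
    "AIME24": (60.00, 6475),
    "AIME25": (43.33, 7402),
    "AMC23": (89.60, 3599),
}

mean_acc = sum(acc for acc, _ in row.values()) / len(row)
mean_tokens = sum(tok for _, tok in row.values()) / len(row)

print(round(mean_acc, 2))     # 75.42, equal to the reported Overall_Acc
print(round(mean_tokens, 1))  # 3968.8, versus the reported Overall_Tokens of 3889
```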

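SCMA reduces response length by 11.1% to 39.0% while improving accuracy by 4.33% to 10.02%.
Even on a smaller base model such as Qwen3-8B, SCMA combines high accuracy with short responses: the row above reports a GSM8K token count of 369 and an overall token count of 3889, at an overall accuracy of 75.42.
Training with MARL collaboration avoids the collapse observed in RL with length penalty (RL+LP) methods and yields more stable gains in both efficiency and accuracy.
Ablations show that performance degrades when joint optimization is removed or when smaller segmentation/scoring modules are used, underscoring the value of co-learning.
There is evidence of emergent fine-grained compression: by step 40, segmentation becomes content-adaptive, chunks become semantically denser, average scores rise, and the number of chunks decreases.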

This review was generated by AI and reviewed by human editors.