QUICK REVIEW

[Paper Review] Training LLMs for Divide-and-Conquer Reasoning Elevates Test-Time Scalability

Xiao Liang, Zhong-Zhi Li|arXiv (Cornell University)|Feb 2, 2026

Topic Modeling|Cited by 0

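One-Sentence Summary

This paper proposes DAC-RL, an end-to-end reinforcement learning framework that trains large language models to perform divide-and-conquer (DAC) reasoning, raising the reasoning ceiling and test-time scalability relative to chain-of-thought (CoT). On competition-level benchmarks, DAC-RL improves substantially over CoT in both Pass@1 and Pass@32.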

ABSTRACT

Large language models (LLMs) have demonstrated strong reasoning capabilities through step-by-step chain-of-thought (CoT) reasoning. Nevertheless, at the limits of model capability, CoT often proves insufficient, and its strictly sequential nature constrains test-time scalability. A potential alternative is divide-and-conquer (DAC) reasoning, which decomposes a complex problem into subproblems to facilitate more effective exploration of the solution. Although promising, our analysis reveals a fundamental misalignment between general-purpose post-training and DAC-style inference, which limits the model's capacity to fully leverage this potential. To bridge this gap and fully unlock LLMs' reasoning capabilities on the most challenging tasks, we propose an end-to-end reinforcement learning (RL) framework to enhance their DAC-style reasoning capacity. At each step, the policy decomposes a problem into a group of subproblems, solves them sequentially, and addresses the original one conditioned on the subproblem solutions, with both decomposition and solution integrated into RL training. Under comparable training, our DAC-style framework endows the model with a higher performance ceiling and stronger test-time scalability, surpassing CoT by 8.6% in Pass@1 and 6.3% in Pass@32 on competition-level benchmarks.

Research Motivation and Objectives

Motivate the need for DAC-style reasoning as a scalable alternative to sequential CoT for challenging tasks.
Identify the misalignment between general-purpose post-training and DAC-style inference that limits DAC's potential.
Propose an end-to-end RL framework to train LLMs for DAC reasoning.
Demonstrate that DAC-RL raises the performance ceiling and improves test-time scalability on math benchmarks.

Proposed Method

Formalize DAC reasoning as a division step (generate subproblems) followed by conquering (solve subproblems then original problem).
Introduce a unified RL objective that jointly optimizes division and conquering rewards (Eq. 1).
Define a division reward that combines format validity, quantity validity, and helpfulness (Eq. 2).
Train with two-stage DAC: generate Gd subproblem groups and Gc conquering solutions per group; use final answer correctness as the conquering reward (Eq. 3).
Evaluate on competition-level benchmarks (AIME 2024/2025, Beyond-AIME, HMMT 2025) with Pass@k metrics; compare four settings: Init-CoT, Init-DAC, RL-CoT, and RL-DAC (the model trained with DAC-RL).
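The two-stage rollout described above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the `divide` and `conquer` callables, the `DivisionGroup` container, the subproblem-count bounds, and the exact way the three division-reward terms are combined are all assumptions made for the example (the summary does not reproduce Eqs. 1-3). In particular, "helpfulness" is approximated here as "at least one conquering solution in the group is correct", and `conquer` stands in for solving the subproblems sequentially and then the original problem.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interfaces (not from the paper):
#   divide(problem)              -> list of subproblem strings
#   conquer(problem, subproblems) -> final answer string for the original problem
Divider = Callable[[str], List[str]]
Conqueror = Callable[[str, List[str]], str]


@dataclass
class DivisionGroup:
    subproblems: List[str]
    answers: List[str] = field(default_factory=list)
    conquer_rewards: List[float] = field(default_factory=list)
    division_reward: float = 0.0


def dac_rollout(problem: str, reference: str, divide: Divider, conquer: Conqueror,
                g_d: int, g_c: int, min_subs: int = 2, max_subs: int = 5) -> List[DivisionGroup]:
    """Sample g_d subproblem groups and g_c conquering solutions per group."""
    groups = []
    for _ in range(g_d):
        subs = divide(problem)
        group = DivisionGroup(subproblems=subs)
        for _ in range(g_c):
            answer = conquer(problem, subs)
            group.answers.append(answer)
            # Conquering reward: correctness of the final answer.
            correct = answer.strip() == reference.strip()
            group.conquer_rewards.append(1.0 if correct else 0.0)
        # Division reward (assumed form): all three checks must pass.
        format_ok = all(isinstance(s, str) and s.strip() for s in subs)
        quantity_ok = min_subs <= len(subs) <= max_subs  # bounds are assumed
        helpful = any(r > 0 for r in group.conquer_rewards)  # assumed proxy
        group.division_reward = 1.0 if (format_ok and quantity_ok and helpful) else 0.0
        groups.append(group)
    return groups
```

In an actual RL setup, the per-group rewards collected this way would feed a policy-gradient update for both the division and the conquering outputs; that step is omitted here because the summary does not specify the objective.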

Experimental Results

Research Questions

RQ1: Can end-to-end RL training unlock DAC-style reasoning in LLMs beyond what post-training CoT provides?
RQ2: Does DAC-style training yield higher performance ceilings and better test-time scalability than CoT on frontier math benchmarks?
RQ3: How do subproblem division quality and conquering quality jointly impact final problem-solving performance?
RQ4: What is the effect of deep DAC training and cold-start distillation on DAC capabilities?
RQ5: What is the optimal test-time DAC configuration (division vs. conquering allocation) for scalability?

Key Findings

Model	AIME 2024 Pass@1	AIME 2024 Pass@32	AIME 2025 Pass@1	AIME 2025 Pass@32	Beyond-AIME Pass@1	Beyond-AIME Pass@32	HMMT 2025 Pass@1	HMMT 2025 Pass@32	Average Pass@1	Average Pass@32
Qwen2.5-7B-Instruct Init-CoT	9.8	26.7	6.8	36.7	3.8	23.0	2.0	10.0	5.6	24.1
Qwen2.5-7B-Instruct Init-DAC	0.5	13.3	0.2	6.7	0.7	10.0	0.2	6.7	0.4	9.2
Qwen2.5-7B-Instruct RL-CoT	13.5	34.5	11.4	30.8	5.1	25.5	2.7	13.1	8.2	27.0
Qwen2.5-7B-Instruct RL-DAC	15.5	39.1	15.5	34.2	7.0	27.4	4.8	20.8	10.4	30.4
Qwen3-4B-Instruct-2507 Init-CoT	62.6	90.0	45.7	76.7	32.1	65.0	30.3	56.7	42.7	72.1
Qwen3-4B-Instruct-2507 Init-DAC	59.6	90.0	43.2	73.3	29.6	61.0	28.2	63.3	40.2	71.9
Qwen3-4B-Instruct-2507 RL-CoT	45.9	85.8	52.1	77.4	30.4	58.1	21.8	54.4	37.5	69.0
Qwen3-4B-Instruct-2507 RL-DAC	63.9	87.7	54.2	78.8	34.6	67.9	31.9	66.6	46.1	75.3
Qwen3-4B-Instruct-2507 (Deep) RL-D-CoT	64.4	84.8	58.8	87.9	38.9	69.5	37.6	65.5	49.9	76.9
Qwen3-4B-Instruct-2507 (Deep) RL-D-DAC	66.3	91.6	61.5	87.6	38.8	70.7	38.7	76.4	51.3	81.6
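
The Pass@1 and Pass@32 columns above report the percentage of problems solved within k sampled attempts. The summary does not say how the paper estimates these values; the sketch below shows one widely used unbiased estimator, which computes Pass@k from n samples of which c are correct. The function name and argument checks are illustrative assumptions.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate Pass@k for one problem from n samples, c of them correct.

    Uses 1 - C(n - c, k) / C(n, k): the probability that a random
    size-k subset of the n samples contains at least one correct one.
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset has a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

A benchmark-level score would then be the mean of `pass_at_k` over all problems, expressed as a percentage.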

DAC-style training yields a higher ceiling than CoT under comparable RL training: for Qwen3-4B-Instruct-2507, RL-DAC exceeds RL-CoT by 8.6 points in average Pass@1 (46.1 vs. 37.5) and 6.3 points in average Pass@32 (75.3 vs. 69.0).
RL-DAC outperforms RL-CoT even when initial DAC performance is low: for Qwen2.5-7B-Instruct, Init-DAC averages only 0.4 Pass@1 against 5.6 for Init-CoT, yet after RL training RL-DAC reaches 10.4 against 8.2 for RL-CoT.
Deep DAC training further enhances reasoning and test-time scalability, especially on harder problems: RL-D-DAC averages 51.3 Pass@1 and 81.6 Pass@32, against 49.9 and 76.9 for RL-D-CoT.
Mix-RL (combining CoT and DAC) can boost CoT performance on simpler tasks while still enabling DAC reasoning on hard tasks.
Test-time DAC configurations show that, under a fixed sampling budget, allocating more subproblem groups (n) and fewer conquering solutions per group (m) improves performance.
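
The headline gains in the findings above can be recomputed from the Average columns of the results table, and the test-time trade-off between n and m can be enumerated for a given budget. The snippet below is a small arithmetic check: the averages are copied from the table, while the helper names and the assumption that the budget equals n × m total conquering samples are illustrative and not taken from the paper.

```python
# (average Pass@1, average Pass@32), copied from the results table.
AVERAGES = {
    ("Qwen2.5-7B-Instruct", "RL-CoT"): (8.2, 27.0),
    ("Qwen2.5-7B-Instruct", "RL-DAC"): (10.4, 30.4),
    ("Qwen3-4B-Instruct-2507", "RL-CoT"): (37.5, 69.0),
    ("Qwen3-4B-Instruct-2507", "RL-DAC"): (46.1, 75.3),
    ("Qwen3-4B-Instruct-2507 (Deep)", "RL-D-CoT"): (49.9, 76.9),
    ("Qwen3-4B-Instruct-2507 (Deep)", "RL-D-DAC"): (51.3, 81.6),
}


def gain(model: str, dac: str, cot: str) -> tuple:
    """Return (Pass@1 gain, Pass@32 gain) of the DAC row over the CoT row, in points."""
    d1, d32 = AVERAGES[(model, dac)]
    c1, c32 = AVERAGES[(model, cot)]
    return round(d1 - c1, 1), round(d32 - c32, 1)


def budget_splits(budget: int) -> list:
    """List every (n, m) with n * m == budget (assumed budget definition)."""
    return [(n, budget // n) for n in range(1, budget + 1) if budget % n == 0]


print(gain("Qwen3-4B-Instruct-2507", "RL-DAC", "RL-CoT"))  # (8.6, 6.3)
print(budget_splits(32))  # from (1, 32) up to (32, 1)
```

Under the assumed budget definition, the finding on test-time configurations corresponds to favoring splits toward the end of this list, where n is large and m is small.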

This review was generated by AI and reviewed by human editors.