QUICK REVIEW

[Paper Review] Composer 2 Technical Report

Cursor Research | arXiv (Cornell University) | Mar 25, 2026
Software Engineering Research | Cited by 0
One-Sentence Summary

TL;DR: Composer 2 is a frontier-level coding model for agentic software engineering, trained with continued pretraining followed by asynchronous reinforcement learning, and it achieves strong scores on CursorBench and public benchmarks.

ABSTRACT

Composer 2 is a specialized model designed for agentic software engineering. The model demonstrates strong long-term planning and coding intelligence while maintaining the ability to efficiently solve problems for interactive use. The model is trained in two phases: first, continued pretraining to improve the model's knowledge and latent coding ability, followed by large-scale reinforcement learning to improve end-to-end coding performance through stronger reasoning, accurate multi-step execution, and coherence on long-horizon realistic coding problems. We develop infrastructure to support training in the same Cursor harness that is used by the deployed model, with equivalent tools and structure, and use environments that match real problems closely. To measure the ability of the model on increasingly difficult tasks, we introduce a benchmark derived from real software engineering problems in large codebases including our own. Composer 2 is a frontier-level coding model and demonstrates a process for training strong domain-specialized models. On our CursorBench evaluations the model achieves a major improvement in accuracy compared to previous Composer models (61.3). On public benchmarks the model scores 61.7 on Terminal-Bench and 73.7 on SWE-bench Multilingual in our harness, comparable to state-of-the-art systems.

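Research Motivation and Objectives

  • Objective: 1) Advance the understanding of scaling laws for domain-specific coding models through continued pretraining and reinforcement learning. 2) Develop infrastructure and benchmarks that reflect real-world software engineering tasks, reducing train-test mismatch. 3) Demonstrate performance gains on the internal CursorBench and on public SWE benchmarks. 4) Show how to balance coding accuracy with efficient deployment in production.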

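Proposed Method

  • Method: 1) Continued pretraining on a large, code-dominated data mix to improve coding knowledge and latent ability. 2) Asynchronous reinforcement learning using policy gradients with multiple samples per prompt to improve end-to-end coding performance. 3) Self-summarization, which chains multiple generation rounds through summaries to handle long-horizon tasks. 4) CursorBench-based evaluation that reflects real-world, underspecified developer tasks and measures code quality, execution efficiency, and interaction behavior. 5) Infrastructure innovations including Context Parallelism, a decoupled MoE design, and specialized quantization and kernel implementations for scalable training.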
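Method item 2 mentions policy gradients with multiple samples per prompt. One common way to use several samples of the same prompt is to subtract the group's mean reward from each sample's reward, so that a rollout is reinforced only if it beat its siblings. The sketch below illustrates that idea under stated assumptions: the mean baseline, the function name `group_advantages`, and the toy rewards are all illustrative and are not taken from the report.

```python
from statistics import mean


def group_advantages(rewards_by_prompt):
    """Return per-sample advantages: reward minus the mean reward of its prompt group.

    Assumption: a simple mean baseline per prompt; the report may use a different estimator.
    """
    advantages = {}
    for prompt_id, rewards in rewards_by_prompt.items():
        baseline = mean(rewards)  # shared baseline for all samples of this prompt
        advantages[prompt_id] = [r - baseline for r in rewards]
    return advantages


# Made-up rewards: four sampled rollouts for each of two hypothetical prompts.
rewards = {
    "fix-bug": [1.0, 0.0, 0.0, 1.0],
    "add-test": [0.0, 0.0, 0.0, 1.0],
}
adv = group_advantages(rewards)
for prompt_id, values in adv.items():
    print(prompt_id, values)
```

In a policy-gradient update, each rollout's log-probability would be weighted by its advantage, so the single successful "add-test" rollout receives a larger positive weight than either successful "fix-bug" rollout.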
Figure 1: Composer 2 improves greatly from previous Composer models, achieving performance competitive with state-of-the-art models. By specializing entirely on coding ability, Composer attains such performance while being lower cost to serve than state-of-the-art model API pricing. See Section 5 fo

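Experimental Results

Research Questions

  • Research Questions: 1) How does continued pretraining affect downstream RL performance for coding agents? 2) Which training and inference architectures achieve the best balance among accuracy, latency, and stability? 3) Can self-summarization and long-horizon chaining improve performance on extended coding tasks without relying on very large contexts? 4) How does CursorBench differ from public benchmarks in reflecting the real coding tasks agents face?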
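The self-summarization chaining raised in question 3 can be pictured as a loop that keeps generating until the working context exceeds a budget, then replaces that context with a summary and continues. The following is a minimal sketch under stated assumptions: `run_with_self_summary`, `make_toy_model`, `toy_summarize`, the word-count budget, and the `"DONE"` stop marker are hypothetical stand-ins, not the report's implementation, and the sketch does not model the KV-cache behavior the summary mentions.

```python
def run_with_self_summary(task, generate, summarize, budget_words=30, max_rounds=10):
    """Chain generation rounds, compacting the context into a summary when it grows too large."""
    context = [task]
    n_summaries = 0
    for _ in range(max_rounds):
        step = generate(context)
        context.append(step)
        if step == "DONE":
            break
        used = sum(len(entry.split()) for entry in context)
        if used > budget_words:
            # Replace the accumulated history with the task plus one summary entry.
            context = [task, summarize(context)]
            n_summaries += 1
    return context, n_summaries


def make_toy_model(total_steps):
    """Hypothetical stand-in for a model: emits fixed-size steps, then 'DONE'."""
    state = {"n": 0}

    def generate(context):
        state["n"] += 1
        if state["n"] > total_steps:
            return "DONE"
        return f"step {state['n']}: " + "token " * 10

    return generate


def toy_summarize(context):
    """Hypothetical stand-in for a model-written summary."""
    return f"summary of {len(context) - 1} earlier entries"


final_context, n_summaries = run_with_self_summary(
    "Fix the failing test", make_toy_model(total_steps=5), toy_summarize
)
print(n_summaries, final_context)
```

The point of the design is that the context handed to each round stays bounded even though the total amount of generated text keeps growing.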

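Key Findings

  • Key Findings: 1) Composer 2 improves substantially over earlier models on CursorBench (61.3) and reaches scores on Terminal-Bench (61.7) and SWE-bench Multilingual (73.7) that are comparable to state-of-the-art systems. 2) Over the course of RL training, both average and best-of-K performance improve, indicating broader coverage of correct solutions rather than mere reweighting of known trajectories. 3) Continued pretraining correlates with downstream RL reward and with decreasing evaluation loss, supporting the planned two-stage training strategy. 4) Self-summarization enables more efficient long-horizon reasoning, using fewer tokens while retaining the KV cache, and improves performance on hard tasks. 5) The infrastructure combines advanced parallelism (Context Parallelism), MoE decoupling, and specialized low-precision kernels to achieve scalable training and robust inference.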
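Finding 2 rests on the difference between average and best-of-K scores. As a rough illustration, the sketch below computes both from binary pass/fail outcomes with K samples per task. The outcome tables are made-up toy data, not results from the report, and `mean_and_best_of_k` is a hypothetical helper name.

```python
def mean_and_best_of_k(outcomes_by_task):
    """Given K binary outcomes per task, return (average pass rate, fraction of tasks solved at least once)."""
    n_tasks = len(outcomes_by_task)
    mean_score = sum(sum(o) / len(o) for o in outcomes_by_task) / n_tasks
    best_of_k = sum(1 for o in outcomes_by_task if any(o)) / n_tasks
    return mean_score, best_of_k


# Toy data: 3 tasks, K = 4 samples each (1 = solved, 0 = failed).
early = [[1, 1, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]]
reweighted = [[1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]  # same tasks solved, just more often
broader = [[1, 1, 1, 0], [0, 1, 0, 0], [0, 0, 0, 0]]     # a previously unsolved task is now solved

for name, table in [("early", early), ("reweighted", reweighted), ("broader", broader)]:
    mean_score, best = mean_and_best_of_k(table)
    print(f"{name}: mean={mean_score:.3f} best-of-4={best:.3f}")
```

In this toy setup the "reweighted" and "broader" tables have the same average, but only "broader" raises best-of-K. That is the kind of signal that distinguishes wider coverage of correct solutions from reweighting of trajectories the model could already produce.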
Figure 2: Continued pretraining translates to downstream RL performance. Left: We study this relationship on a smaller Qwen model, examining checkpoints trained on a varying number of tokens. Right: The model undergoes a steady decrease in training perplexity.

This review was generated by AI and reviewed by a human editor.