QUICK REVIEW

[Paper Review] SWE-Replay: Efficient Test-Time Scaling for Software Engineering Agents

Yifeng Ding, Lingming Zhang|arXiv (Cornell University)|Jan 29, 2026

Software Engineering Research | Cited by 0

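One-Sentence Summary

SWE-Replay reuses archived trajectories to scale modern SWE agents efficiently at test time, cutting sampling cost across multiple benchmarks and backends while maintaining or improving solution quality.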

ABSTRACT

Test-time scaling has been widely adopted to enhance the capabilities of Large Language Model (LLM) agents in software engineering (SWE) tasks. However, the standard approach of repeatedly sampling trajectories from scratch is computationally expensive. While recent methods have attempted to mitigate costs using specialized value agents, they can suffer from model miscalibration and fail to generalize to modern agents that synthesize custom bash scripts as tools. In this paper, we introduce SWE-Replay, the first efficient and generalizable test-time scaling technique for modern agents without reliance on potentially noisy value estimates. SWE-Replay optimizes the scaling process by recycling trajectories from prior trials, dynamically choosing to either explore from scratch or exploit archived experience by branching at critical intermediate steps. This selection of intermediate steps is driven by the potential and reasoning significance of repository exploration, rather than external LLM-based quality estimates. Our evaluation shows that, on SWE-Bench Verified, SWE-Replay consistently outperforms naive scaling, reducing costs by up to 17.4% while maintaining or even improving performance by up to 3.8%. Further evaluation on SWE-Bench Pro and Multilingual validates the generalizability of SWE-Replay, establishing it as a robust foundation for efficient test-time scaling of software engineering agents.

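Motivation and Goals

Motivate the need for efficient test-time scaling for modern SWE agents and codebases.
Introduce SWE-Replay, a generalizable trajectory-reuse method that does not rely on LLM-based quality estimates.
Demonstrate cost and performance gains on SWE-Bench Verified, Pro, and Multilingual.
Analyze SWE-Replay's components (selection, grouping, filtering) and their contributions to performance.
Provide empirical and theoretical intuition for why replay-based exploration improves efficiency.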

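Proposed Method

Maintain an archive of sampled trajectories and, at each iteration, decide whether to explore from scratch or to exploit an archived trajectory by branching at a critical intermediate step.
Represent each step by an abstract repository state (the set of files explored before that step) and sample states with a rarity-based softmax, which encourages exploring less-visited regions.
Use the number of reasoning paragraphs as a proxy for reasoning intensity, so that branching prioritizes steps that demand reasoning.
Restore the environment state preceding the selected step by applying stored diffs where possible; otherwise, fall back to replaying actions, keeping overhead to a minimum.
Branch at the selected critical step by replacing it with a newly sampled step, then continue exploring to form a new trajectory that is added to the archive.
Compare SWE-Replay against naive scaling and an LLM-as-a-judge baseline to assess efficiency and performance.
Run ablation studies to validate the roles of trajectory filtering, state abstraction, and reasoning-based step selection.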
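
The state-sampling and step-selection items above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Step` record, the function names, the `temperature` parameter, the use of negative visit counts as the rarity score, and the two-stage way rarity and reasoning length are combined are all assumptions made for this example.

```python
import math
import random
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One archived agent step (hypothetical record layout)."""
    trajectory_id: int
    index: int                 # position of the step inside its trajectory
    explored_files: frozenset  # files explored before this step (abstract state)
    reasoning: str             # free-text reasoning emitted at this step


def group_by_state(steps):
    """Group steps that share the same abstract repository state."""
    groups = defaultdict(list)
    for step in steps:
        groups[step.explored_files].append(step)
    return dict(groups)


def rarity_softmax(groups, temperature=1.0):
    """Softmax over negative visit counts (assumed rarity score).

    States reached by fewer archived steps receive more probability mass.
    """
    states = list(groups)
    logits = [-len(groups[s]) / temperature for s in states]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return {s: w / total for s, w in zip(states, weights)}


def count_reasoning_paragraphs(text):
    """Count blank-line-separated paragraphs as a reasoning-intensity proxy."""
    return sum(1 for block in text.split("\n\n") if block.strip())


def select_branch_step(steps, rng, temperature=1.0):
    """Pick a state by rarity, then a step in it weighted by reasoning length."""
    groups = group_by_state(steps)
    probs = rarity_softmax(groups, temperature)
    states = list(probs)
    state = rng.choices(states, weights=[probs[s] for s in states], k=1)[0]
    candidates = groups[state]
    # The +1 keeps steps with empty reasoning selectable (an assumption).
    weights = [1 + count_reasoning_paragraphs(c.reasoning) for c in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]
```

Calling `select_branch_step(archived_steps, random.Random(0))` returns one archived `Step` to branch from. Passing an explicit `random.Random` instance keeps the sampling reproducible, and a lower `temperature` concentrates the choice on the rarest states.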

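Experimental Results

Research Questions

RQ1: Does SWE-Replay consistently reduce trajectory sampling cost relative to naive scaling across multiple SWE benchmarks and backends?
RQ2: Does SWE-Replay generalize across agent frameworks and languages (e.g., SWE-Bench Verified, Pro, Multilingual)?
RQ3: How does each component (trajectory filtering, state abstraction, reasoning-based step selection) affect performance and efficiency?
RQ4: How does the diversity of explored repository files under SWE-Replay change compared with naive scaling?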

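Key Findings

SWE-Replay reduces the cost of naive test-time scaling by up to 17.4% while maintaining or improving performance by up to 3.8%.
On SWE-Bench Pro and Multilingual, SWE-Replay achieves performance gains of up to 22.6% and cost reductions of up to 9.0%, demonstrating generalizability to diverse SWE problems.
SWE-Replay shifts exploration toward long-tail repository files, increasing file diversity compared with naive scaling.
Theoretical intuition shows that, under reasonable assumptions, SWE-Replay's replay strategy is at least as good as random selection in success probability, which helps justify its efficiency gains.
Ablation studies show that removing any component (trajectory filtering, state grouping, or reasoning-based step selection) degrades performance and efficiency, confirming that the full pipeline is needed.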
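
One way to make the long-tail file diversity finding concrete is to measure how much exploration falls outside the most-visited files. The sketch below is only an illustrative metric, not necessarily the one used in the paper: `file_visit_counts`, `tail_share`, and the `head_size` parameter are hypothetical names, and each trajectory is assumed to be given as a list of explored file paths.

```python
from collections import Counter


def file_visit_counts(trajectories):
    """Count, for each file, how many trajectories explored it at least once."""
    counts = Counter()
    for files in trajectories:
        counts.update(set(files))  # set() so repeats within a trajectory count once
    return counts


def tail_share(trajectories, head_size=1):
    """Fraction of file visits that fall outside the `head_size` most-visited files."""
    counts = file_visit_counts(trajectories)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    head = sum(n for _, n in counts.most_common(head_size))
    return (total - head) / total
```

A higher `tail_share` for one set of trajectories than for another, computed on the same task, would indicate that exploration is spread more widely over rarely visited files.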

This review was generated by AI and reviewed by human editors.