QUICK REVIEW

[Paper Review] SWE-World: Building Software Engineering Agents in Docker-Free Environments

Shuang Sun, Huatong Song|arXiv (Cornell University)|Feb 3, 2026

Software Engineering Research|0 citations

TL;DR

SWE-World introduces a Docker-free surrogate environment for training and evaluating software engineering agents by replacing containerized execution with learned LLM-based models, enabling scalable SFT, RL, and test-time scaling.

ABSTRACT

Recent advances in large language models (LLMs) have enabled software engineering agents to tackle complex code modification tasks. Most existing approaches rely on execution feedback from containerized environments, which require dependency-complete setup and physical execution of programs and tests. While effective, this paradigm is resource-intensive and difficult to maintain, substantially complicating agent training and limiting scalability. We propose SWE-World, a Docker-free framework that replaces physical execution environments with a learned surrogate for training and evaluating software engineering agents. SWE-World leverages LLM-based models trained on real agent-environment interaction data to predict intermediate execution outcomes and final test feedback, enabling agents to learn without interacting with physical containerized environments. This design preserves the standard agent-environment interaction loop while eliminating the need for costly environment construction and maintenance during agent optimization and evaluation. Furthermore, because SWE-World can simulate the final evaluation outcomes of candidate trajectories without real submission, it enables selecting the best solution among multiple test-time attempts, thereby facilitating effective test-time scaling (TTS) in software engineering tasks. Experiments on SWE-bench Verified demonstrate that SWE-World raises Qwen2.5-Coder-32B from 6.2% to 52.0% via Docker-free SFT, 55.0% with Docker-free RL, and 68.2% with further TTS. The code is available at https://github.com/RUCAIBox/SWE-World

Motivation & Objective

Reduce the reliance of SWE agents on resource-intensive Docker-based environments.
Propose a Docker-free surrogate environment that predicts execution feedback and test outcomes.
Enable scalable training (SFT and RL) and test-time scaling without physical containers.
Leverage real-world SWE data to improve agent learning efficiency.

Proposed method

Partition agent actions into lightweight navigation/editing handled by a deterministic sandbox and code-execution actions handled by SWT, a learned transition model.
Train SWT to predict step-level execution feedback from repository-level actions using context that includes instance metadata, agent patch, and execution content.
Train SWR, a learned reward model, to simulate the final test evaluation and produce structured test feedback plus a binary reward, using an evaluation context that includes the unit tests.
Collect training data from real Docker rollouts to supervise SWT and SWR via SFT using Qwen-based backbones.
Use reverse-reasoning distillation to generate CoT-augmented training data for SWT and SWR to improve reasoning about repository behavior.
Conduct Docker-free RL using GRPO with SWT providing transition feedback and SWR providing terminal rewards.
Implement test-time scaling (TTS) by evaluating multiple candidate trajectories with SWR-powered verification to select the best.
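
The action partition and the SWR-based selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the action names, the `SurrogateEnv` class, the in-memory file dictionary, and the callables standing in for SWT and SWR are all hypothetical, and the scoring function assumes SWR verdicts have been reduced to a single number per candidate (for example, by averaging several binary predictions).

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical action kinds; the real action schema is not given in this review.
EXECUTION_ACTIONS = {"run_tests", "run_script"}


@dataclass
class Action:
    kind: str     # e.g. "view_file", "edit_file", "run_tests"
    payload: str  # file path, "path\nnew contents", or a command line


class SurrogateEnv:
    """Routes each action to a deterministic sandbox or to a learned transition model."""

    def __init__(self, files: Dict[str, str], transition_model: Callable[[str], str]):
        self.files = files                        # in-memory stand-in for the repository
        self.transition_model = transition_model  # stand-in for SWT (an LLM call in practice)
        self.history: List[str] = []

    def step(self, action: Action) -> str:
        if action.kind in EXECUTION_ACTIONS:
            # Nothing is executed: the model predicts the feedback from the context.
            context = "\n".join(self.history + [action.payload])
            obs = self.transition_model(context)
        elif action.kind == "view_file":
            obs = self.files.get(action.payload, "<no such file>")
        elif action.kind == "edit_file":
            # Payload convention (assumed): first line is the path, the rest is the new text.
            path, _, new_text = action.payload.partition("\n")
            self.files[path] = new_text
            obs = f"edited {path}"
        else:
            obs = f"unknown action: {action.kind}"
        self.history.append(f"{action.kind}: {action.payload}")
        return obs


def select_best(candidates: List[str], reward_model: Callable[[str], float]) -> str:
    """Test-time scaling: score every candidate patch and keep the highest-scoring one."""
    return max(candidates, key=reward_model)
```

In this sketch only the execution branch depends on a learned model, so navigation and editing stay exact and cheap, while `select_best` needs no real test run to rank candidates.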

Experimental results

Research questions

RQ1: Can a learned surrogate environment approximate Docker-based execution feedback sufficiently for training SWE agents?
RQ2: How well do SFT and RL perform for SWE tasks when trained entirely with Docker-free feedback?
RQ3: Does Docker-free training plus TTS match or exceed Docker-based baselines on real SWE benchmarks?
RQ4: What data and model scales are needed to achieve competitive SWE performance without containers?

Key findings

Docker-free training with SWE-World substantially improves agent performance on SWE-bench Verified, e.g., Qwen2.5-Coder-32B from 6.2% to 52.0% (SFT) and 55.0% (RL).
SWE-World with TTS reaches 68.2% resolve rate, surpassing prior Docker-based results in some settings.
SWT (transition model) and SWR (reward model) provide competitive, interpretable surrogate feedback and evaluation signals, with SWR achieving higher accuracy and precision than baselines.
A broad SWE-World dataset (16.6K tasks, 3,763 repos) enables scalable, Docker-free training by leveraging real-world data.
Docker-free RL reduces infrastructure needs by eliminating container rollout during training, while maintaining competitive performance with traditional Docker pipelines.
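
To make the Docker-free RL signal concrete, the sketch below shows the group-relative advantage computation commonly associated with GRPO, applied to binary terminal rewards of the kind SWR is described as producing. The review does not specify which GRPO variant the paper uses, so the standard-deviation normalization and the `eps` constant are assumptions, and `group_relative_advantages` is a hypothetical helper name.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each rollout's reward against the other rollouts for the same task."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    # eps keeps the division defined when every rollout receives the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts of one task, each scored 0 or 1 by the (simulated) reward model.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

When all rollouts in a group receive the same reward, every advantage is zero, so such groups contribute no learning signal under this formulation.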

This review was created by AI and reviewed by human editors.