QUICK REVIEW

[Paper Digest] SWE-QA-Pro: A Representative Benchmark and Scalable Training Recipe for Repository-Level Code Understanding

Songcheng Cai, Zhiheng Lyu|arXiv (Cornell University)|Mar 17, 2026
Software Engineering Research | Cited by 0
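One-Sentence Summary

SWE-QA-Pro introduces a repository-level QA benchmark built from issue-driven topics drawn from long-tail repositories with executable environments, together with a two-stage agentic training pipeline (SFT followed by RLAIF) that lets small open-source models surpass several strong baselines on the benchmark, including GPT-4o.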

ABSTRACT

Agentic repository-level code understanding is essential for automating complex software engineering tasks, yet the field lacks reliable benchmarks. Existing evaluations often overlook the long tail topics and rely on popular repositories where Large Language Models (LLMs) can cheat via memorized knowledge. To address this, we introduce SWE-QA-Pro, a benchmark constructed from diverse, long-tail repositories with executable environments. We enforce topical balance via issue-driven clustering to cover under-represented task types and apply a rigorous difficulty calibration process: questions solvable by direct-answer baselines are filtered out. This results in a dataset where agentic workflows significantly outperform direct answering (e.g., a ~13-point gap for Claude Sonnet 4.5), confirming the necessity of agentic codebase exploration. Furthermore, to tackle the scarcity of training data for such complex behaviors, we propose a scalable synthetic data pipeline that powers a two-stage training recipe: Supervised Fine-Tuning (SFT) followed by Reinforcement Learning from AI Feedback (RLAIF). This approach allows small open models to learn efficient tool usage and reasoning. Empirically, a Qwen3-8B model trained with our recipe surpasses GPT-4o by 2.3 points on SWE-QA-Pro and substantially narrows the gap to state-of-the-art proprietary models, demonstrating both the validity of our evaluation and the effectiveness of our agentic training workflow.

Research Motivation and Objectives

  • Motivate the need for a repository-level QA benchmark that emphasizes tool use and codebase exploration over memorized knowledge.
  • Construct SWE-QA-Pro from diverse long-tail repositories with executable environments to cover under-represented task types.
  • Calibrate difficulty to filter out questions solvable by direct answers and ensure genuine agentic reasoning is required.
  • Propose a scalable two-stage training recipe (SFT followed by RLAIF) to enable small open models to learn repository-grounded tool usage and reasoning.
  • Demonstrate that agentic training improves performance beyond direct-answer baselines and narrows gaps to state-of-the-art models.

Proposed Method

  • Construct benchmark via issue-driven clustering of 1.7M issues across 3,468 repositories, then human-grounded QA per topic with tool-enabled drafting and validation.
  • Use a multi-stage filtering/difficulty calibration that compares direct-answer baselines with tool-using runs to remove trivially solvable items.
  • Provide executable sandboxes from SWE-Rebench to ensure end-to-end exploration is possible for each item.
  • Synthesize training data with Claude Code-assisted generation, yielding 1,464 training questions; the evaluation set covers 26 repositories.
  • Train small models with a two-stage recipe: supervised fine-tuning (SFT) on tool-invocation trajectories, then Reinforcement Learning from AI Feedback (RLAIF) using a judge-based reward that emphasizes correctness and grounding.
  • Evaluate with a rigorously designed LLM-as-Judge protocol, including explicit file-path/line-number references and a separate evaluation judge.
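The topical-balance idea behind issue-driven clustering can be illustrated with a small sketch. This is not the authors' pipeline: the digest does not say how balance is enforced, so the per-topic cap, the function name `balance_by_topic`, and the topic labels below are all hypothetical. The sketch only shows one plausible way to stop frequent topics from crowding out under-represented ones.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def balance_by_topic(
    items: List[Tuple[str, str]],  # (question_id, topic_label); labels are made up
    per_topic_cap: int,            # hypothetical knob, not taken from the paper
) -> List[str]:
    """Keep at most `per_topic_cap` question ids per topic, preserving input order."""
    kept_per_topic: Dict[str, int] = defaultdict(int)
    selected: List[str] = []
    for question_id, topic in items:
        if kept_per_topic[topic] < per_topic_cap:
            kept_per_topic[topic] += 1
            selected.append(question_id)
    return selected


# Placeholder candidates: one over-represented topic and two rare ones.
candidates = [
    ("q1", "build"), ("q2", "build"), ("q3", "build"),
    ("q4", "concurrency"), ("q5", "io"),
]
balanced = balance_by_topic(candidates, per_topic_cap=2)
print(balanced)
```

With a cap of 2, the third "build" question is dropped while both rare topics keep their single question, which is the balancing effect the clustering step is meant to support.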
(a) Benchmark Construction Pipeline
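The difficulty-calibration step in this pipeline, which removes questions that a direct-answer baseline can already solve, might look roughly like the sketch below. The score scale, the threshold `max_direct_score`, and the names `Candidate`, `calibrate`, and `mean_gap` are assumptions made for illustration; the digest does not give the actual filtering rule or cut-off.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    question_id: str
    direct_score: float  # judge score without tools (hypothetical 0-10 scale)
    agent_score: float   # judge score with tool-using exploration (same scale)


def calibrate(candidates: List[Candidate], max_direct_score: float) -> List[Candidate]:
    """Drop questions whose direct-answer score reaches the threshold."""
    return [c for c in candidates if c.direct_score < max_direct_score]


def mean_gap(items: List[Candidate]) -> float:
    """Average (agent - direct) score over the given items; 0.0 if empty."""
    if not items:
        return 0.0
    return sum(c.agent_score - c.direct_score for c in items) / len(items)


# Placeholder scores, not results from the paper.
pool = [
    Candidate("a", direct_score=8.0, agent_score=9.0),  # answerable from memory
    Candidate("b", direct_score=3.0, agent_score=7.0),
    Candidate("c", direct_score=2.0, agent_score=6.0),
]
kept = calibrate(pool, max_direct_score=6.0)
print([c.question_id for c in kept], mean_gap(kept))
```

Filtering on the direct-answer score alone is the simplest reading of the description; a stricter variant could also require that the tool-using run actually succeeds on each kept item.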

Experimental Results

Research Questions

  • RQ1: What is the diversification and coverage of a benchmark built from long-tail, executable repositories for repository-level QA?
  • RQ2: Does enforcing tool-using interaction (as opposed to direct knowledge answering) yield a measurable performance gap that reflects genuine repository reasoning?
  • RQ3: Can a scalable agentic training pipeline (SFT -> RLAIF) train small open models to outperform knowledge-only baselines on repository-grounded QA?
  • RQ4: How far can agentic training close the gap to state-of-the-art proprietary models on SWE-QA-Pro?
  • RQ5: What are the qualitative strengths/weaknesses of models in tool usage and multi-file reasoning across repository clusters?

Key Findings

  • A substantial performance gap exists between direct-answer baselines and agent-based reasoning on SWE-QA-Pro, demonstrating the necessity of repository exploration.
  • A Qwen3-8B model trained with the SFT→RLAIF recipe surpasses GPT-4o on SWE-QA-Pro and narrows the gap to proprietary models.
  • The agentic workflow enables iterative, tool-enabled exploration without a pre-built index, outperforming many baselines that rely on retrieval.
  • Training with RL after SFT yields larger gains in correctness and completeness than increasing SFT data alone.
  • Claude Sonnet 4.5 achieves the highest overall score, with SWE-QA-Pro 8B (SFT+RL) approaching the performance of larger agentic models such as Devstral-Small-2-24B-Instruct.
  • Tool usage efficacy matters: models with more effective tool use and grounded reasoning achieve higher scores, not merely larger tool call counts.
(b) Training Recipe
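For the RLAIF stage of this recipe, a judge-based reward that emphasizes correctness and grounding could be collapsed into a scalar along the lines of the sketch below. The rubric fields, the [0, 1] scale, the weights, and the names `JudgeScores` and `judge_reward` are assumptions for illustration only; the digest does not specify the actual reward formula.

```python
from dataclasses import dataclass


@dataclass
class JudgeScores:
    correctness: float   # each field assumed to lie in [0, 1]
    grounding: float     # e.g. whether cited file paths / line numbers support the answer
    completeness: float


def judge_reward(
    scores: JudgeScores,
    w_correct: float = 0.5,   # hypothetical weights that favor
    w_ground: float = 0.3,    # correctness and grounding
    w_complete: float = 0.2,
) -> float:
    """Weighted sum of judge rubric scores; rejects out-of-range inputs."""
    for name in ("correctness", "grounding", "completeness"):
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return (
        w_correct * scores.correctness
        + w_ground * scores.grounding
        + w_complete * scores.completeness
    )


# Placeholder judge output for one sampled trajectory.
example = JudgeScores(correctness=1.0, grounding=0.5, completeness=0.0)
print(judge_reward(example))
```

Note that the reward depends only on the judged answer quality and not on the number of tool calls, in line with the finding that effective tool use matters more than raw call counts.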


This digest was generated by AI and reviewed by human editors.