QUICK REVIEW

[Paper Review] Architecture-Aware Multi-Design Generation for Repository-Level Feature Addition

Mingwei Liu, Zhenxi Chen|arXiv (Cornell University)|Mar 2, 2026

Software Engineering Research|Cited by 0

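One-Sentence Summary

RAIM is a repository-level, architecture-aware framework that generates patches from multiple implementation designs and selects among them via impact analysis, achieving state-of-the-art results on NoCode-bench Verified.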

ABSTRACT

Implementing new features across an entire codebase presents a formidable challenge for Large Language Models (LLMs). This proactive task requires a deep understanding of the global system architecture to prevent unintended disruptions to legacy functionalities. Conventional pipeline and agentic frameworks often fall short in this area because they suffer from architectural blindness and rely on greedy single-path code generation. To overcome these limitations, we propose RAIM, a multi-design and architecture-aware framework for repository-level feature addition. This framework introduces a localization mechanism that conducts multi-round explorations over a repository-scale code graph to accurately pinpoint dispersed cross-file modification targets. Crucially, RAIM shifts away from linear patching by generating multiple diverse implementation designs. The system then employs a rigorous impact-aware selection process based on static and dynamic analysis to choose the most architecturally sound patch and avoid system regressions. Comprehensive experiments on the NoCode-bench Verified dataset demonstrate that RAIM establishes a new state-of-the-art performance with a 39.47% success rate, achieving a 36.34% relative improvement over the strongest baseline. Furthermore, the approach exhibits robust generalization across various foundation models and empowers open-weight models like DeepSeek-v3.2 to surpass baseline systems powered by leading proprietary models. Detailed ablation studies confirm that the multi-design generation and impact validation modules are critical to effectively managing complex dependencies and reducing code errors. These findings highlight the vital role of structural awareness in automated software evolution.

Motivation and Objectives

Motivate automated, repository-level feature addition as a proactive software evolution task requiring architectural awareness.
Propose RAIM to address architectural blindness and linear generation in existing methods.
Develop a four-stage framework: repository-level code graph construction, architecture-aware localization, multi-design patch generation, and impact-aware patch selection.
Demonstrate RAIM's effectiveness and generalization across multiple LLMs and open-weight models on NoCode-bench Verified.

Proposed Method

Construct a repository-level code graph to capture semantic and structural relationships.
Perform architecture-aware file and function localization via multi-round searches on the code graph.
Generate multiple diverse implementation designs and corresponding patches.
Evaluate candidate patches with static change impact analysis and dynamic test execution to select an optimal patch.
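
The steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not RAIM's implementation: the code graph is a plain adjacency dict, `localize` is a bounded breadth-first expansion standing in for multi-round search, and the ranking rule in `select_patch` (fewest failing tests first, then smallest static impact set) is assumed for illustration. All names and data here are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

# Hypothetical code graph: each node maps to the nodes it depends on.
CODE_GRAPH = {
    "api.create_user": ["models.User", "validators.check_email"],
    "models.User": ["db.save"],
    "validators.check_email": [],
    "db.save": [],
    "cli.main": ["api.create_user"],
}


def localize(graph, seeds, rounds):
    """Expand from seed nodes for a fixed number of rounds (one hop per round)."""
    seen = set(seeds)
    frontier = deque(seeds)
    for _ in range(rounds):
        next_frontier = deque()
        while frontier:
            node = frontier.popleft()
            for neighbor in graph.get(node, []):
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen


def impacted_nodes(graph, changed):
    """Static impact: nodes that transitively depend on any changed node."""
    reverse = {}
    for src, dsts in graph.items():
        for dst in dsts:
            reverse.setdefault(dst, []).append(src)
    impacted = set()
    stack = list(changed)
    while stack:
        node = stack.pop()
        for dependent in reverse.get(node, []):
            if dependent not in impacted and dependent not in changed:
                impacted.add(dependent)
                stack.append(dependent)
    return impacted


@dataclass(frozen=True)
class Candidate:
    name: str
    changed: frozenset
    failed_tests: int  # stand-in for the result of dynamic test execution


def select_patch(graph, candidates):
    """Assumed rule: fewest failing tests, then smallest static impact set."""
    return min(
        candidates,
        key=lambda c: (c.failed_tests, len(impacted_nodes(graph, c.changed))),
    )


candidates = [
    Candidate("design_a", frozenset({"db.save"}), failed_tests=0),
    Candidate("design_b", frozenset({"validators.check_email"}), failed_tests=0),
    Candidate("design_c", frozenset({"validators.check_email"}), failed_tests=1),
]
best = select_patch(CODE_GRAPH, candidates)
```

In this toy setup, `design_a` and `design_b` both pass all tests, so the tie is broken by how many other nodes each change could affect. A real system would weigh these signals differently; the point is only that selection combines a static signal with a dynamic one.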

Experimental Results

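Research Questions

RQ1: How does RAIM perform on repository-level feature addition compared with state-of-the-art baselines?
RQ2: Can RAIM generalize across different large language models (LLMs) and handle cross-file feature addition effectively?
RQ3: What does each RAIM component (localization, multi-design generation, impact analysis) contribute to overall performance?
RQ4: How effective is the patch selection strategy at balancing feature correctness against architectural integrity?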

Key Findings

Method	Model	RT (%)	FV-Micro (%)	FV-Macro (%)	Success Rate (%)
OpenHands	Qwen3-235B	47.37	1.96	14.03	7.89
OpenHands	DeepSeek-R1	46.49	0.47	10.86	7.02
OpenHands	DeepSeek-v3	49.12	1.68	18.29	11.40
OpenHands	Gemini-2.5-Pro	61.40	0.01	0.29	0.00
OpenHands	Claude-4-Sonnet	69.30	11.25	36.48	25.44
Agentless	Qwen3-235B	76.32	8.75	22.39	13.16
Agentless	GPT-5-Chat	82.46	8.50	33.01	18.42
Agentless	DeepSeek-R1	73.68	10.87	35.52	25.44
Agentless	DeepSeek-v3	78.95	7.96	32.80	21.05
Agentless	DeepSeek-v3.2	28.95	9.46	37.42	28.95
Agentless	DeepSeek-v3.2-thinking	79.82	8.41	37.02	27.19
Agentless	Gemini-2.5-Pro	74.56	6.22	20.55	12.28
Agentless	Claude-4-Sonnet	79.82	8.47	38.48	28.07
RAIM	Qwen3-235B	79.82	9.76	27.45	16.67
RAIM	GPT-5-Chat	89.47	13.43	32.33	21.93
RAIM	DeepSeek-v3	81.58	15.14	35.64	25.44
RAIM	DeepSeek-R1	77.19	12.47	41.79	29.82
RAIM	DeepSeek-v3.2	85.96	16.01	45.58	34.21
RAIM	DeepSeek-v3.2-thinking	78.07	11.93	41.74	29.82
RAIM	Gemini-2.5-Pro	82.46	17.16	52.09	39.47

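With Gemini-2.5-Pro, RAIM reaches a new state-of-the-art success rate of 39.47% on NoCode-bench Verified, a 36.34% relative improvement over the previous best result.
RAIM also lets open-weight models such as DeepSeek-v3.2 reach a 34.21% success rate, surpassing several baselines that use stronger proprietary models.
Ablation studies show that both multi-design generation and impact validation are critical for managing complex dependencies and reducing code errors.
RAIM generalizes robustly across 7 LLMs and achieves notable gains on complex tasks such as cross-file modification.
The approach emphasizes architectural awareness and change impact analysis to prevent regressions in production-grade software.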
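
The 36.34% relative improvement in the first finding is consistent with the table if the strongest baseline is taken to be the 28.95% success rate (Agentless with DeepSeek-v3.2). That identification is inferred from the table rather than stated in the summary. A quick check, with a hypothetical helper name:

```python
def relative_improvement(new, baseline):
    """Relative gain of `new` over `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


raim_best = 39.47           # RAIM with Gemini-2.5-Pro, from the table
strongest_baseline = 28.95  # assumed: Agentless with DeepSeek-v3.2, from the table

gain = relative_improvement(raim_best, strongest_baseline)
print(f"{gain:.2f}%")  # prints 36.34%
```

In absolute terms, the same gap is 10.52 percentage points.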


This review was generated by AI and reviewed by a human editor.