QUICK REVIEW

[Paper Review] SOEN-101: Code Generation by Emulating Software Process Models Using Large Language Model Agents

Feng Lin, Dong Jae Kim | arXiv (Cornell University) | 2024-03-23

Digital Rights Management and Security | Citations: 8

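One-Line Summary

LCG (named FlowGen in the abstract below) generates code by emulating the Waterfall, TDD, and Scrum processes with multiple LLM agents; the Scrum-based variant shows the largest Pass@1 gains and the most stable results across model variants.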

ABSTRACT

Software process models are essential to facilitate collaboration and communication among software teams to solve complex development tasks. Inspired by these software engineering practices, we present FlowGen - a code generation framework that emulates software process models based on multiple Large Language Model (LLM) agents. We emulate three process models, FlowGenWaterfall, FlowGenTDD, and FlowGenScrum, by assigning LLM agents to embody roles (i.e., requirement engineer, architect, developer, tester, and scrum master) that correspond to everyday development activities and organize their communication patterns. The agents work collaboratively using chain-of-thought and prompt composition with continuous self-refinement to improve the code quality. We use GPT3.5 as our underlying LLM and several baselines (RawGPT, CodeT, Reflexion) to evaluate code generation on four benchmarks: HumanEval, HumanEval-ET, MBPP, and MBPP-ET. Our findings show that FlowGenScrum excels compared to other process models, achieving a Pass@1 of 75.2, 65.5, 82.5, and 56.7 in HumanEval, HumanEval-ET, MBPP, and MBPP-ET, respectively (an average of 15% improvement over RawGPT). Compared with other state-of-the-art techniques, FlowGenScrum achieves a higher Pass@1 in MBPP compared to CodeT, with both outperforming Reflexion. Notably, integrating CodeT into FlowGenScrum resulted in statistically significant improvements, achieving the highest Pass@1 scores. Our analysis also reveals that the development activities impacted code smell and exception handling differently, with design and code review adding more exception handling and reducing code smells. Finally, FlowGen models maintain stable Pass@1 scores across GPT3.5 versions and temperature values, highlighting the effectiveness of software process models in enhancing the quality and stability of LLM-generated code.

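Motivation and Goals

Asks whether modeling software development as a multi-agent process improves the quality and reliability of generated code.
Proposes LCG, an agent-based framework that emulates Waterfall, TDD, and Scrum for code generation.
Investigates how development activities and process models affect code correctness and code smells.
Evaluates stability across LLM model versions and temperature settings.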

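Proposed Method

Defines the development roles of requirement engineer, architect, developer, and tester (plus scrum master for Scrum) as LLM agents.
Implements three interaction patterns: Waterfall (ordered flow), TDD (ordered flow with tests written before the implementation), and Scrum (unordered flow with sprint-like meetings).
For each task, applies chain-of-thought reasoning, prompt composition, and self-refinement to improve the artifacts iteratively.
Evaluates with zero-shot prompts on four benchmarks (HumanEval, HumanEval-ET, MBPP, MBPP-ET), using Pass@1 as the main metric.
Compares against the plain GPT-3.5 baseline (GPT) under the same conditions, and also analyzes code smells and exception handling.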
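
The ordered Waterfall pattern described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the role names come from the list above, while `run_waterfall`, `fake_llm`, and the prompt wording are hypothetical, and the model call is replaced by a stub so the sketch runs offline.

```python
from typing import Callable, List, Tuple

# Role order for a Waterfall-style flow. The roles are taken from the review;
# everything else in this sketch is an illustrative assumption.
WATERFALL_ROLES: List[str] = [
    "requirement engineer",
    "architect",
    "developer",
    "tester",
]


def run_waterfall(task: str, llm: Callable[[str], str]) -> List[Tuple[str, str]]:
    """Pass the task through each role in order, feeding earlier artifacts forward."""
    artifacts: List[Tuple[str, str]] = []
    context = f"Task: {task}"
    for role in WATERFALL_ROLES:
        prompt = f"You are the {role}.\n{context}\nProduce your artifact."
        output = llm(prompt)
        artifacts.append((role, output))
        # Prompt composition: later roles see everything produced so far.
        context += f"\n[{role}] {output}"
    return artifacts


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call (hypothetical); echoes the role name."""
    first_line = prompt.splitlines()[0]
    role = first_line[len("You are the "):].rstrip(".")
    return f"<{role} output>"


if __name__ == "__main__":
    for role, artifact in run_waterfall("add two numbers", fake_llm):
        print(f"{role}: {artifact}")
```

A TDD-style flow would presumably move the tester ahead of the developer, and a Scrum-style flow would replace the fixed order with meeting rounds; neither is shown here because the review does not give those details.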

Figure 1: An overview of $\textit{LCG}_{\textit{Waterfall}}$, $\textit{LCG}_{\textit{TDD}}$, and $\textit{LCG}_{\textit{Scrum}}$.

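Experimental Results

Research Questions

RQ1: How does emulating different software process models (Waterfall, TDD, Scrum) affect code generation accuracy (Pass@1) compared with the GPT baseline?
RQ2: Which development activities affect code quality attributes such as reliability and code smells?
RQ3: How stable are LCG results across GPT-3.5 model versions and temperature settings?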

Key Results

Model	HumanEval	HumanEval-ET	MBPP	MBPP-ET
GPT	64.4 ± 3.7	49.8 ± 3.0	77.5 ± 0.8	53.9 ± 0.7
LCG_Waterfall	69.5 ± 2.3	59.4 ± 2.5	76.3 ± 0.9	51.1 ± 1.7
LCG_TDD	69.8 ± 2.2	60.0 ± 2.1	76.8 ± 0.9	52.8 ± 0.7
LCG_Scrum	75.2 ± 1.1	65.5 ± 1.9	82.5 ± 0.6	56.7 ± 1.4
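
As a reading aid for the table, the sketch below shows one way a "mean ± deviation" Pass@1 entry can be produced. It assumes one generated solution per problem and that the ± value is a standard deviation over repeated runs; the review does not say whether the sample or population formula was used, so the use of `statistics.stdev` is an assumption. The outcomes are made up for illustration and are not data from the paper.

```python
from statistics import mean, stdev
from typing import List


def pass_at_1(passed: List[bool]) -> float:
    """Percentage of problems whose single generated solution passed all tests."""
    if not passed:
        raise ValueError("need at least one problem")
    return 100.0 * sum(passed) / len(passed)


# Made-up outcomes for three runs over four problems (not data from the paper).
runs = [
    [True, True, False, True],
    [True, False, False, True],
    [True, True, True, True],
]

scores = [pass_at_1(run) for run in runs]

if __name__ == "__main__":
    # Sample standard deviation across runs; an assumption about the ± column.
    print(f"{mean(scores):.1f} ± {stdev(scores):.1f}")
```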

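LCG_Scrum achieves the highest Pass@1 on every benchmark: 75.2 (HumanEval), 65.5 (HumanEval-ET), 82.5 (MBPP), 56.7 (MBPP-ET).
Relative to the GPT baseline, LCG_Scrum gains between 5.2% (MBPP-ET) and 31.5% (HumanEval-ET) in Pass@1; LCG_Waterfall and LCG_TDD improve on HumanEval and HumanEval-ET but fall slightly below the baseline on MBPP and MBPP-ET.
LCG_Scrum is also the most stable, with an average standard deviation of about 1.3 across the four benchmarks.
Removing the testing activity sharply reduces Pass@1 (by 17.0% to 56.1%) and increases code smells.
Design and code review reduce refactor and warning smells and contribute to better exception handling.
Output quality of plain GPT varies considerably across model versions, whereas LCG stays stable across versions and temperature settings.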
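
The gain and stability figures quoted above can be rechecked from the results table. The sketch below copies the table values and computes the relative Pass@1 gain over the GPT baseline and the mean of the four standard deviations per model; the function names are hypothetical, and reading "gain" as a relative (not absolute) difference is an assumption that happens to reproduce the 5.2% and 31.5% endpoints.

```python
from typing import Dict, List, Tuple

BENCHMARKS = ["HumanEval", "HumanEval-ET", "MBPP", "MBPP-ET"]

# (mean Pass@1, standard deviation) per benchmark, copied from the results table.
TABLE: Dict[str, List[Tuple[float, float]]] = {
    "GPT": [(64.4, 3.7), (49.8, 3.0), (77.5, 0.8), (53.9, 0.7)],
    "LCG_Waterfall": [(69.5, 2.3), (59.4, 2.5), (76.3, 0.9), (51.1, 1.7)],
    "LCG_TDD": [(69.8, 2.2), (60.0, 2.1), (76.8, 0.9), (52.8, 0.7)],
    "LCG_Scrum": [(75.2, 1.1), (65.5, 1.9), (82.5, 0.6), (56.7, 1.4)],
}


def relative_gains(model: str, baseline: str = "GPT") -> List[float]:
    """Relative Pass@1 change versus the baseline, in percent, per benchmark."""
    return [
        round(100.0 * (m - b) / b, 1)
        for (m, _), (b, _) in zip(TABLE[model], TABLE[baseline])
    ]


def mean_std(model: str) -> float:
    """Average of the per-benchmark standard deviations for one model."""
    stds = [s for _, s in TABLE[model]]
    return sum(stds) / len(stds)


if __name__ == "__main__":
    for model in TABLE:
        if model != "GPT":
            print(model, dict(zip(BENCHMARKS, relative_gains(model))))
        print(model, f"mean std = {mean_std(model):.2f}")
```

For LCG_Scrum this gives gains of 16.8, 31.5, 6.5, and 5.2 percent and a mean standard deviation of 1.25, in line with the figures quoted above; LCG_Waterfall and LCG_TDD come out negative on MBPP and MBPP-ET.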

Figure 2: Pass@1 across GPT3.5 versions.

This review was generated by AI and checked by a human editor.