QUICK REVIEW

[Paper Review] Can AI Agents Generate Microservices? How Far are We?

Bassam Adnan, Matteo Esposito|arXiv (Cornell University)|2026. 03. 09.

Software System Performance and Reliability | Citations: 0

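One-Line Summary

AI agents can generate functional microservices with maintainable code and achieve high integration test pass rates in the clean state scenario, but fully autonomous generation remains out of reach and still requires human oversight.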

ABSTRACT

LLMs have advanced code generation, but their use for generating microservices with explicit dependencies and API contracts remains understudied. We examine whether AI agents can generate functional microservices and how different forms of contextual information influence their performance. We assess 144 generated microservices across 3 agents, 4 projects, 2 prompting strategies, and 2 scenarios. Incremental generation operates within existing systems and is evaluated with unit tests. Clean state generation starts from requirements alone and is evaluated with integration tests. We analyze functional correctness, code quality, and efficiency. Minimal prompts outperformed detailed ones in incremental generation, with 50-76% unit test pass rates. Clean state generation produced higher integration test pass rates (81-98%), indicating strong API contract adherence. Generated code showed lower complexity than human baselines. Generation times varied widely across agents, averaging 6-16 minutes per service. AI agents can produce microservices with maintainable code, yet inconsistent correctness and reliance on human oversight show that fully autonomous microservice generation is not yet achievable.

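Research Motivation and Goals

Evaluate whether AI agents can generate functional microservices with explicit API contracts under different levels of contextual information.
Assess functional correctness through automated tests in both the incremental and the clean state generation settings.
Compare the code quality and software metrics of AI-generated microservices against human baselines.
Assess the efficiency of different AI agents across scenarios in terms of time, tokens, and cost.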

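Proposed Method

Evaluate 144 generated microservices across 3 agents, 4 projects, 2 prompting strategies, and 2 scenarios.
Validate incremental generation, which operates within existing systems, using unit tests; validate clean state generation, which starts from requirements alone, using integration tests.
Measure functional correctness (test pass rate), code quality (SLOC, Cyclomatic Complexity, and Cognitive Complexity via SonarQube), and efficiency (tokens, time, cost).
Apply two prompting strategies (P1: minimal context; P2: includes an implementation summary) across the four conditions.
Perform non-parametric statistical analysis at alpha = 0.01, using tests such as Anderson-Darling, Wilcoxon signed-rank, and Dunn-All, to compare scenarios and prompts.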
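
The size of the experimental design follows from the factors listed above. As a rough sketch, the snippet below enumerates the 3 x 4 x 2 x 2 grid and compares it with the 144 reported services. The agent names are the ones mentioned in the results of this review; the project labels are placeholders, and reading the outcome as "three services per grid cell" is plain arithmetic, not something the review states explicitly.

```python
from itertools import product

# Agent names as they appear in this review's results.
AGENTS = ["Claude Code", "Codex", "Code Qwen"]
# Placeholder labels: the review does not name the four projects.
PROJECTS = ["project_1", "project_2", "project_3", "project_4"]
PROMPTS = ["P1", "P2"]                      # minimal context vs. implementation summary
SCENARIOS = ["incremental", "clean_state"]  # unit tests vs. integration tests

TOTAL_SERVICES = 144  # reported in the abstract

# Every (agent, project, prompt, scenario) combination in the design.
grid = list(product(AGENTS, PROJECTS, PROMPTS, SCENARIOS))

# How many generated services each combination would account for on average.
services_per_cell, remainder = divmod(TOTAL_SERVICES, len(grid))

print(len(grid), services_per_cell, remainder)  # 48 3 0
```

The even split suggests three generated services per combination, though whether these are distinct services or repeated runs cannot be determined from this summary.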

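Experimental Results

Research Questions

RQ 1: Can AI agents generate microservices with acceptable functional correctness and code quality under different context scenarios?
RQ 1.1: How do AI agents perform in incremental generation, where existing system context is available?
RQ 1.2: How do AI agents perform in clean state generation, from requirements alone?
RQ 2: How do AI agents differ in efficiency, in terms of time, tokens, and cost, when generating microservices?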

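Key Results

In incremental generation, average unit test pass rates range from roughly 50% to 76%, depending on the agent and prompting strategy.
In clean state generation, integration test pass rates are higher, averaging 81-98%, which indicates strong API contract adherence.
Generated code is generally less complex than the human baselines (lower Cyclomatic Complexity and Cognitive Complexity).
Generating a microservice takes 6-16 minutes on average, depending on the agent; Codex can be particularly slow, with outliers of up to 1.74 hours in the worst case.
Cost per microservice varies by agent: Code Qwen is the most cost-efficient (about $2.98 per service) and Claude Code the most expensive (about $13.28 per service).
P1 prompts (minimal context) outperform P2 prompts in incremental generation, whereas for some agents clean state generation benefits from the added guidance; overall, human oversight is still needed to ensure API contract adherence.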
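
To put the efficiency figures above in proportion, the following sketch derives a few ratios from the numbers quoted in this review. It uses only those reported values; the variable names are illustrative, and the ratios are back-of-the-envelope arithmetic, not results reported by the paper.

```python
# Approximate per-service costs quoted in this review (USD).
COST_PER_SERVICE_USD = {"Code Qwen": 2.98, "Claude Code": 13.28}

# Reported range of average generation times (minutes) and the worst-case outlier (hours).
AVG_MINUTES_RANGE = (6, 16)
WORST_CASE_HOURS = 1.74

# How many times more expensive the priciest agent is than the cheapest.
cost_ratio = COST_PER_SERVICE_USD["Claude Code"] / COST_PER_SERVICE_USD["Code Qwen"]

# Worst-case outlier expressed in minutes, and relative to the slowest average.
worst_case_minutes = WORST_CASE_HOURS * 60
slowdown_vs_slowest_avg = worst_case_minutes / AVG_MINUTES_RANGE[1]

print(f"cost ratio: {cost_ratio:.2f}x")
print(f"worst case: {worst_case_minutes:.1f} min")
print(f"vs. 16-minute average: {slowdown_vs_slowest_avg:.1f}x")
```

On these figures, Claude Code costs roughly 4.5 times as much per service as Code Qwen, and the worst-case run lasts about 104 minutes, several times longer than even the slowest reported average.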

Figure 3 : Code Quality Metrics comparison: Lines of Code (LoC), Cyclomatic Complexity (CycC), and Cognitive Complexity (CogC). Top row shows comparison by agent, bottom row shows comparison by configuration.

This review was generated by AI and reviewed by a human editor.