QUICK REVIEW

[Paper Review] An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Max Schäfer, Sarah Nadi|arXiv (Cornell University)|2023. 02. 13.

Software Testing and Debugging Techniques | Citations: 53

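One-Line Summary

This paper presents TestPilot, an adaptive LLM-based tool that generates JavaScript unit tests without additional training and achieves high coverage with diverse, non-copied tests across 25 npm packages. It also compares TestPilot with Nessie and examines the effect of different prompt components and LLMs.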

ABSTRACT

Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without additional training or manual effort, providing the LLM with the signature and implementation of the function under test, along with usage examples extracted from documentation. We also attempt to repair failed generated tests by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%, significantly improving on Nessie, a recent feedback-directed JavaScript test generation technique, which achieves only 51.3% statement coverage and 25.6% branch coverage. We also find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.

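Motivation and Objectives

Automate unit test generation to reduce the manual effort required of developers.
Evaluate whether off-the-shelf LLMs can generate effective unit tests without fine-tuning.
Assess the coverage and quality of LLM-generated tests (presence of assertions, and of non-trivial assertions).
Analyze how individual prompt components affect test generation effectiveness.
Compare TestPilot against an existing test generation technique and across several LLMs.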

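Proposed Method

Prompt-based test generation with an LLM (gpt3.5-turbo), using prompts that include function signatures, implementations, documentation, and usage examples.
Adaptive re-prompting: when a generated test fails, the model is re-prompted with the failing test and its error message so that it can repair the test.
A five-part TestPilot architecture: API Explorer, Documentation Miner, Prompt Generator, Test Validator, and Prompt Refiner.
Dynamic API exploration for JavaScript: the package is inspected at runtime to identify testable functions.
Mocha-based test generation and execution to validate and refine the generated tests.
Comparative experiments against Nessie and with alternative LLMs (code-cushman-002 and StarCoder).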

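Experimental Results

Research Questions

RQ1: What statement and branch coverage do the tests generated by TestPilot achieve?
RQ2: How effective are TestPilot's prompts when individual information components (function body, usage examples, doc comments) are removed or included?
RQ3: How does TestPilot perform with different LLMs (gpt3.5-turbo, code-cushman-002, StarCoder)?
RQ4: How similar are the generated tests to existing tests (that is, were they memorized or copied from training data)?
RQ5: Do the generated tests contain non-trivial assertions that actually exercise the functionality under test?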

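Key Results

Across 25 npm packages, TestPilot achieves a median statement coverage of 70.2% and a median branch coverage of 52.8%.
Nessie, the baseline, achieves 51.3% statement coverage and 25.6% branch coverage.
92.8% of TestPilot's tests have a similarity of <= 50% to existing tests, and none are exact copies.
60.0% of the tests have a similarity of <= 40% to existing tests.
Adaptive re-prompting repairs about 15.6% of the failing tests.
Results are similar with code-cushman-002 (68.2% statement, 51.2% branch) and somewhat worse with StarCoder (54.0% statement, 37.5% branch).
All five prompt components are essential for generating high-quality tests; removing any one of them reduces effectiveness.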


This review was generated by AI and reviewed by a human editor.