QUICK REVIEW

[Paper Review] Can Large Language Models Write Good Property-Based Tests?

Vasudev Vikram, Caroline Lemieux|arXiv (Cornell University)|2023. 07. 10.

Software Engineering Research · Citations: 13

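One-Line Summary

This paper investigates GPT-4-based prompting for synthesizing property-based tests (PBTs) from API documentation and presents PBT-GPT, which introduces three prompting strategies together with a methodology for evaluating generator and property quality.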

ABSTRACT

Property-based testing (PBT), while an established technique in the software testing research community, is still relatively underused in real-world software. Pain points in writing property-based tests include implementing diverse random input generators and thinking of meaningful properties to test. Developers, however, are more amenable to writing documentation; plenty of library API documentation is available and can be used as natural language specifications for PBTs. As large language models (LLMs) have recently shown promise in a variety of coding tasks, we investigate using modern LLMs to automatically synthesize PBTs using two prompting techniques. A key challenge is to rigorously evaluate the LLM-synthesized PBTs. We propose a methodology to do so considering several properties of the generated tests: (1) validity, (2) soundness, and (3) property coverage, a novel metric that measures the ability of the PBT to detect property violations through generation of property mutants. In our evaluation on 40 Python library API methods across three models (GPT-4, Gemini-1.5-Pro, Claude-3-Opus), we find that with the best model and prompting approach, a valid and sound PBT can be synthesized in 2.4 samples on average. We additionally find that our metric for determining soundness of a PBT is aligned with human judgment of property assertions, achieving a precision of 100% and recall of 97%. Finally, we evaluate the property coverage of LLMs across all API methods and find that the best model (GPT-4) is able to automatically synthesize correct PBTs for 21% of properties extractable from API documentation.
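The property coverage idea in the abstract can be illustrated with a small, framework-free sketch. It is a simplified reading and not the paper's tooling: Python's built-in `sorted` stands in for an API under test, the two mutants and the helper names (`passes`, `property_coverage`) are hypothetical, and coverage is taken to be the fraction of property mutants that a test detects.

```python
import random
from collections import Counter


def mutant_unordered(xs):
    """Property mutant: keeps every element but violates 'output is ordered'."""
    return list(xs)


def mutant_dedup(xs):
    """Property mutant: output is ordered but duplicates are dropped."""
    return sorted(set(xs))


def pbt_ordered(impl, xs):
    """Weak test: asserts only the ordering property."""
    out = impl(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))


def pbt_ordered_and_permutation(impl, xs):
    """Stronger test: ordering plus 'same multiset of elements as the input'."""
    out = impl(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))
    assert Counter(out) == Counter(xs)


def passes(pbt, impl, trials=200, seed=0):
    """Run the test on random small int lists; False as soon as an assertion fails."""
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 10))]
        try:
            pbt(impl, xs)
        except AssertionError:
            return False
    return True


def property_coverage(pbt, mutants):
    """Fraction of property mutants on which the test fails (is 'killed')."""
    killed = sum(not passes(pbt, m) for m in mutants)
    return killed / len(mutants)


MUTANTS = [mutant_unordered, mutant_dedup]
```

In this sketch both tests pass on the unmodified `sorted`, which is the soundness check, but only the stronger test detects the mutant that silently drops duplicates.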

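Research Motivation and Goals

Explain why PBT is underused in real-world software and identify the pain points in writing generators and deriving meaningful properties.
Propose PBT-GPT, an approach that uses an LLM to synthesize PBT components from API documentation.
Introduce three prompting strategies for generator and property synthesis: independent, consecutive, and together.
Develop an evaluation methodology covering generator validity and diversity and property validity, soundness, and strength.
Present preliminary results on Python library APIs such as numpy, networkx, and datetime to illustrate the potential benefits and limitations.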

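Proposed Method

Design prompt templates that combine the API documentation, system and user instructions, and an output format for a generator, properties, or both.
Define three prompting methods: prompting for the generator and the properties independently, prompting consecutively with the earlier output as context, and prompting for both together to produce a combined test.
Characterize the failure modes of PBT-GPT and propose an evaluation methodology focused on generator validity and diversity and on property validity, soundness, and strength.
Implement and evaluate PBT-GPT on sample Python APIs, using Hypothesis as the PBT framework.
Discuss mitigation strategies and human-in-the-loop approaches for improving generator validity and diversity and property validity, soundness, and strength.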
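The three prompting strategies differ mainly in what each request to the model contains. The sketch below shows one way the prompts could be assembled. The instruction wording, the `Prompt` class, and the function names are hypothetical and written for illustration; they are not the paper's actual templates.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical instruction wording, written for this sketch only.
SYSTEM = "You are an expert Python tester who writes Hypothesis property-based tests."
GEN_TASK = "Write a Hypothesis strategy that generates valid inputs for {api}."
PROP_TASK = "Write assertions that check documented properties of {api}."


@dataclass
class Prompt:
    system: str
    user: str


def _user(doc: str, *parts: str) -> str:
    """Join the API documentation with the task text and any extra context."""
    return "\n\n".join(["API documentation:\n" + doc, *parts])


def independent(api: str, doc: str) -> List[Prompt]:
    """Two separate prompts; neither sees the other's output."""
    return [
        Prompt(SYSTEM, _user(doc, GEN_TASK.format(api=api))),
        Prompt(SYSTEM, _user(doc, PROP_TASK.format(api=api))),
    ]


def consecutive(api: str, doc: str, generator_code: str) -> Prompt:
    """Second-stage prompt: the already synthesized generator is supplied as context."""
    context = "Generator synthesized earlier:\n" + generator_code
    return Prompt(SYSTEM, _user(doc, context, PROP_TASK.format(api=api)))


def together(api: str, doc: str) -> Prompt:
    """One prompt that asks for the generator and the properties as a single test."""
    task = GEN_TASK.format(api=api) + " " + PROP_TASK.format(api=api)
    return Prompt(SYSTEM, _user(doc, task, "Return one complete test function."))
```

Under these assumptions, `independent` produces components that must be stitched together afterwards, `consecutive` lets the property prompt refer to the generator's output shape, and `together` yields one test in a single request.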

Figure 1 : Truncated Numpy documentation for the numpy.cumsum API method. The documentation has natural language descriptions of properties about the result shape/size and additional information about the last element of the result.
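The two kinds of property named in the caption (result size and last element) can be turned into a property-based test roughly as follows. This is a minimal, dependency-free sketch: `itertools.accumulate` stands in for `numpy.cumsum` on flat integer lists, and a hand-written loop stands in for Hypothesis. All function names are hypothetical. The inputs are restricted to integers, where the last cumulative value equals the plain sum exactly.

```python
import random
from itertools import accumulate


def cumsum(values):
    """Stand-in for numpy.cumsum on a flat list of ints (assumption: NumPy not used here)."""
    return list(accumulate(values))


def gen_int_list(rng, max_len=20):
    """Generator: random-length lists of ints, including the empty list."""
    return [rng.randint(-1000, 1000) for _ in range(rng.randint(0, max_len))]


def check_cumsum_properties(xs):
    """Assert the size property and the last-element property for one input."""
    out = cumsum(xs)
    # Property 1: the result has the same number of elements as the input.
    assert len(out) == len(xs)
    # Property 2: for ints, the last element equals the sum of the whole input.
    if xs:
        assert out[-1] == sum(xs)


def run_pbt(trials=200, seed=0):
    """Draw random inputs and check every property on each; returns the trial count."""
    rng = random.Random(seed)
    for _ in range(trials):
        check_cumsum_properties(gen_int_list(rng))
    return trials
```

With Hypothesis, the loop in `run_pbt` would typically be replaced by decorating the property function with `@given(st.lists(st.integers()))`, which also adds input shrinking on failure.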

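Experimental Results

Research Questions

RQ1: Can an LLM synthesize usable property-based tests from API documentation?
RQ2: How do different prompting strategies affect the quality of the generated PBT components (generators and properties)?
RQ3: What failure modes commonly appear in LLM-generated PBTs, and how can they be mitigated?
RQ4: How can generator validity and diversity and property validity, soundness, and strength be evaluated for synthesized PBTs?
RQ5: What preliminary results are observed when PBT-GPT is applied to the numpy, networkx, and datetime APIs?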

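Key Results

PBT-GPT yields promising preliminary results for generators and properties derived from the API documentation of numpy, networkx, and datetime.
The three prompting strategies (independent, consecutive, and together) present different trade-offs between generator synthesis and property synthesis.
Generators can suffer from validity and diversity problems, and properties can be invalid, unsound, or weak, so mitigation or a human in the loop is needed.
An evaluation methodology focused on generator validity, generator diversity, property validity, property soundness, and property strength is proposed and demonstrated on examples.
Mitigation strategies are presented, such as continued prompting to repair invalid generators or properties and using examples to improve soundness and strength.
The initial results suggest that LLM-synthesized PBTs can be a useful starting point that developers then refine.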
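The generator-side problems mentioned above can be made concrete with a small measurement sketch. It assumes a hypothetical generator of `(year, month, day)` arguments for `datetime.date`. Validity is measured as the fraction of generated inputs the API accepts, and the number of distinct day values reached is used as a crude diversity proxy. That proxy is an assumption of this sketch, not the paper's metric.

```python
import random
from datetime import date


def naive_gen(rng):
    """Hypothetical flawed generator: the day is drawn without regard to month length."""
    return rng.randint(1, 9999), rng.randint(1, 12), rng.randint(1, 31)


def safe_gen(rng):
    """Always valid, but never produces days 29 to 31, so it is less diverse."""
    return rng.randint(1, 9999), rng.randint(1, 12), rng.randint(1, 28)


def measure(gen, trials=1000, seed=0):
    """Return (validity, diversity): accepted fraction and count of distinct valid days."""
    rng = random.Random(seed)
    valid, days_seen = 0, set()
    for _ in range(trials):
        year, month, day = gen(rng)
        try:
            date(year, month, day)  # raises ValueError for dates such as Feb 30
        except ValueError:
            continue
        valid += 1
        days_seen.add(day)
    return valid / trials, len(days_seen)
```

The two generators show the trade-off: `naive_gen` reaches every day of the month but occasionally produces rejected inputs, while `safe_gen` is always valid at the cost of never exercising month-end dates.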

Figure 4 : An example prompt for synthesizing the generator function of a networkx.Graph object.


This review was generated by AI and reviewed by a human editor.