QUICK REVIEW

[Paper Review] Can Large Language Models Write Good Property-Based Tests?

Vasudev Vikram, Caroline Lemieux | arXiv (Cornell University) | Jul 10, 2023

Software Engineering Research | Citations: 13

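One-sentence summary

This paper investigates GPT-4-based prompting for synthesizing property-based tests (PBTs) from API documentation, and proposes PBT-GPT, which combines three prompting strategies with a methodology for evaluating the quality of the generated generators and properties.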

ABSTRACT

Property-based testing (PBT), while an established technique in the software testing research community, is still relatively underused in real-world software. Pain points in writing property-based tests include implementing diverse random input generators and thinking of meaningful properties to test. Developers, however, are more amenable to writing documentation; plenty of library API documentation is available and can be used as natural language specifications for PBTs. As large language models (LLMs) have recently shown promise in a variety of coding tasks, we investigate using modern LLMs to automatically synthesize PBTs using two prompting techniques. A key challenge is to rigorously evaluate the LLM-synthesized PBTs. We propose a methodology to do so considering several properties of the generated tests: (1) validity, (2) soundness, and (3) property coverage, a novel metric that measures the ability of the PBT to detect property violations through generation of property mutants. In our evaluation on 40 Python library API methods across three models (GPT-4, Gemini-1.5-Pro, Claude-3-Opus), we find that with the best model and prompting approach, a valid and sound PBT can be synthesized in 2.4 samples on average. We additionally find that our metric for determining soundness of a PBT is aligned with human judgment of property assertions, achieving a precision of 100% and recall of 97%. Finally, we evaluate the property coverage of LLMs across all API methods and find that the best model (GPT-4) is able to automatically synthesize correct PBTs for 21% of properties extractable from API documentation.

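Motivation and objectives

Motivates the work by noting that property-based testing (PBT) is underused in production software, and identifies the main obstacles: writing input generators and coming up with meaningful properties.
Proposes PBT-GPT, a method that uses an LLM to synthesize PBT components from API documentation.
Introduces three prompting strategies for synthesizing generators and properties: independent, consecutive, and joint.
Develops an evaluation methodology covering generator validity and diversity and property validity, soundness, and strength.
Presents preliminary results on Python library APIs (numpy, networkx, datetime) to illustrate the potential benefits and limitations.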

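Proposed method

Designs prompt templates that contain the API documentation, system/user instructions, and a specified output format for the generator, the properties, or both.
Defines three prompting methods: prompting for the generator and the properties independently, prompting for them consecutively with the earlier output supplied as context, and prompting for both jointly to produce a complete test.
Characterizes the failure modes of PBT-GPT and proposes an evaluation methodology focused on generator quality (validity and diversity) and property quality (validity, soundness, and strength).
Implements and evaluates PBT-GPT on sample Python APIs, using Hypothesis as the PBT framework.
Discusses mitigation strategies and human-in-the-loop approaches for improving generator validity and diversity and property soundness.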
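
The three prompting methods differ only in how many LLM calls are made and what context each call sees. The following is a minimal sketch of that difference, not the paper's actual prompts: the task wording in `TASKS`, the function names, and the `llm` callable (any function mapping a prompt string to a completion string) are all hypothetical.

```python
from typing import Callable, Dict

# Hypothetical task instructions; the paper's real prompt text is not reproduced here.
TASKS = {
    "generator": "Write a Hypothesis strategy that generates inputs for this API method.",
    "properties": "Write assertions for properties of this API method stated in the documentation.",
    "joint": "Write a complete Hypothesis property-based test (generator and assertions).",
}


def build_prompt(api_doc: str, task: str, context: str = "") -> str:
    """Assemble a prompt from the API documentation, optional context, and a task."""
    parts = ["API documentation:", api_doc.strip()]
    if context:
        parts += ["Previously synthesized generator:", context.strip()]
    parts += ["Task:", TASKS[task]]
    return "\n".join(parts)


def independent(llm: Callable[[str], str], api_doc: str) -> Dict[str, str]:
    """Two separate calls; neither call sees the other's output."""
    return {
        "generator": llm(build_prompt(api_doc, "generator")),
        "properties": llm(build_prompt(api_doc, "properties")),
    }


def consecutive(llm: Callable[[str], str], api_doc: str) -> Dict[str, str]:
    """Two calls; the properties prompt includes the generator just produced."""
    generator = llm(build_prompt(api_doc, "generator"))
    properties = llm(build_prompt(api_doc, "properties", context=generator))
    return {"generator": generator, "properties": properties}


def joint(llm: Callable[[str], str], api_doc: str) -> Dict[str, str]:
    """One call that asks for the generator and the properties together."""
    return {"test": llm(build_prompt(api_doc, "joint"))}
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client library.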

Figure 1 : Truncated Numpy documentation for the numpy.cumsum API method. The documentation has natural language descriptions of properties about the result shape/size and additional information about the last element of the result.
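
To make the two kinds of documented property concrete (result size and last element), here is a hedged sketch of the shape such a test could take. It is not output from PBT-GPT. To stay dependency-free it uses `itertools.accumulate` as a stand-in for `numpy.cumsum` on 1-D integer input and a hand-written random generator in place of a Hypothesis strategy; all function names are hypothetical. Integers are used because the last-element property is only exact when no floating-point rounding is involved.

```python
import random
from itertools import accumulate


def cumsum(values):
    """Dependency-free stand-in for numpy.cumsum on a 1-D integer input."""
    return list(accumulate(values))


def gen_int_list(rng, max_len=20, bound=1000):
    """Generator: a random-length list of bounded integers (may be empty)."""
    length = rng.randint(0, max_len)
    return [rng.randint(-bound, bound) for _ in range(length)]


def check_cumsum_properties(values):
    """Property assertions for a cumulative sum over a 1-D integer input."""
    result = cumsum(values)
    # Property 1: the result has the same size as the input.
    assert len(result) == len(values)
    # Property 2: for integers, the last element equals the total sum.
    if values:
        assert result[-1] == sum(values)
    # Property 3 (from the definition of a cumulative sum): each element is
    # the previous element plus the matching input value.
    for i in range(1, len(values)):
        assert result[i] == result[i - 1] + values[i]


def run_pbt(num_examples=200, seed=0):
    """Check the properties on many generated inputs; return how many ran."""
    rng = random.Random(seed)
    for _ in range(num_examples):
        check_cumsum_properties(gen_int_list(rng))
    return num_examples
```

Calling `run_pbt()` raises `AssertionError` on the first generated input that violates a property and otherwise returns the number of inputs checked.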

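Experimental results

Research questions

RQ1: Can LLMs synthesize practical property-based tests from API documentation?
RQ2: How do the different prompting strategies affect the quality of the generated PBT components (generators and properties)?
RQ3: What are the common failure modes of LLM-generated PBTs, and how can they be mitigated?
RQ4: How should generator validity and diversity and property validity, soundness, and strength be evaluated for synthesized PBTs?
RQ5: What preliminary results are observed when PBT-GPT is applied to the numpy, networkx, and datetime APIs?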

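Key findings

PBT-GPT shows promising preliminary results for generators and properties derived from the numpy, networkx, and datetime API documentation.
The three prompting strategies (independent, consecutive, joint) offer different trade-offs between generator synthesis and property synthesis.
Generators can suffer from validity and diversity problems, and properties can be invalid, unsound, or weak; these need to be addressed through mitigation or human intervention.
An evaluation methodology focused on generator validity, generator diversity, property validity, property soundness, and property strength is proposed and demonstrated on examples.
Mitigation strategies include continued prompting to repair invalid generators or properties, and using samples to improve soundness and strength.
The initial results suggest that LLM-synthesized PBTs can serve as a useful starting point for developers to iterate on.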
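
One plausible way to turn the quality dimensions above into numbers is sketched below. This is an assumption for illustration, not necessarily the paper's exact procedure: validity is the fraction of generator runs that yield an acceptable value, diversity is the fraction of runs that yield a distinct value, soundness means the property never fails on the reference implementation, and strength is the fraction of deliberately buggy implementations (mutants) the property rejects. All names are hypothetical, and `sorted` is used only as a convenient example API.

```python
import random


def generator_quality(gen, is_valid, runs=500, seed=0):
    """Estimate validity and diversity of a generator over repeated runs."""
    rng = random.Random(seed)
    valid, distinct = 0, set()
    for _ in range(runs):
        try:
            value = gen(rng)
        except Exception:
            continue  # a crashing run counts as invalid
        if is_valid(value):
            valid += 1
            distinct.add(repr(value))
    return {"validity": valid / runs, "diversity": len(distinct) / runs}


def property_quality(prop, gen, reference, mutants, runs=500, seed=0):
    """Check soundness on the reference and strength against the mutants."""

    def passes(impl):
        rng = random.Random(seed)  # same inputs for every implementation
        for _ in range(runs):
            try:
                prop(impl, gen(rng))
            except Exception:
                return False  # assertion failure or crash
        return True

    killed = sum(1 for mutant in mutants if not passes(mutant))
    strength = killed / len(mutants) if mutants else 0.0
    return {"sound": passes(reference), "strength": strength}


def gen_list(rng):
    """Example generator: short lists of small integers."""
    return [rng.randint(0, 9) for _ in range(rng.randint(0, 5))]


def weak_property(impl, xs):
    """Only checks that the length is preserved."""
    assert len(impl(xs)) == len(xs)


def strong_property(impl, xs):
    """Checks the length and that the output is in non-decreasing order."""
    out = impl(xs)
    assert len(out) == len(xs)
    assert all(a <= b for a, b in zip(out, out[1:]))


# Two hand-written mutants of sorted(): one does not sort, one drops an element.
MUTANTS = [lambda xs: list(xs), lambda xs: sorted(xs)[:-1]]
```

With these definitions, `property_quality(weak_property, gen_list, sorted, MUTANTS)` reports a sound property that rejects only the element-dropping mutant, while `strong_property` rejects both; this is the sense in which a sound property can still be weak.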

Figure 4 : An example prompt for synthesizing the generator function of a networkx.Graph object.

This review was generated by AI and checked by a human editor.