QUICK REVIEW

[Paper Review] An Empirical Evaluation of Using Large Language Models for Automated Unit Test Generation

Max Schäfer, Sarah Nadi|arXiv (Cornell University)|Feb 13, 2023

Software Testing and Debugging Techniques | Cited by 53

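One-Line Summary

This paper proposes TestPilot, an adaptive LLM-based tool that generates JavaScript unit tests without any additional training. Across 25 npm packages it achieves high coverage and produces tests that are diverse rather than copied from existing ones. The paper also compares TestPilot with Nessie and examines how different prompt components and different LLMs affect the results.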

ABSTRACT

Unit tests play a key role in ensuring the correctness of software. However, manually creating unit tests is a laborious task, motivating the need for automation. Large Language Models (LLMs) have recently been applied to this problem, utilizing additional training or few-shot learning on examples of existing tests. This paper presents a large-scale empirical evaluation on the effectiveness of LLMs for automated unit test generation without additional training or manual effort, providing the LLM with the signature and implementation of the function under test, along with usage examples extracted from documentation. We also attempt to repair failed generated tests by re-prompting the model with the failing test and error message. We implement our approach in TestPilot, a test generation tool for JavaScript that automatically generates unit tests for all API functions in an npm package. We evaluate TestPilot using OpenAI's gpt3.5-turbo LLM on 25 npm packages with a total of 1,684 API functions. The generated tests achieve a median statement coverage of 70.2% and branch coverage of 52.8%, significantly improving on Nessie, a recent feedback-directed JavaScript test generation technique, which achieves only 51.3% statement coverage and 25.6% branch coverage. We also find that 92.8% of TestPilot's generated tests have no more than 50% similarity with existing tests (as measured by normalized edit distance), with none of them being exact copies. Finally, we run TestPilot with two additional LLMs, OpenAI's older code-cushman-002 LLM and the open LLM StarCoder. Overall, we observed similar results with the former (68.2% median statement coverage), and somewhat worse results with the latter (54.0% median statement coverage), suggesting that the effectiveness of the approach is influenced by the size and training set of the LLM, but does not fundamentally depend on the specific model.

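Motivation and Objectives

Motivate the automation of unit test creation as a way to reduce developer effort.
Evaluate whether a pre-trained LLM can generate effective unit tests without fine-tuning.
Evaluate the coverage and quality (assertions, non-trivial assertions) of LLM-generated tests.
Analyze how individual prompt components affect the effectiveness of test generation.
Compare TestPilot against an existing test generation technique and across multiple LLMs.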

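Proposed Method

Prompt-based test generation with an LLM (gpt3.5-turbo), using prompts that contain the function signature and implementation, documentation, and usage examples.
Adaptive re-prompting: when a generated test fails, the model is prompted again with the failing test and its error message in order to repair the test.
Five-part TestPilot architecture: API Explorer, Documentation Miner, Prompt Generator, Test Validator, and Prompt Refiner.
Dynamic API discovery for JavaScript: the package is inspected at run time to identify the functions that can be tested.
Mocha-based test generation and execution, used to validate and refine the generated tests.
Comparison experiments against Nessie and against alternative LLMs (code-cushman-002 and StarCoder).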
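
The generate, validate, and re-prompt cycle listed above can be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not TestPilot's actual code (the tool itself targets JavaScript). The names `build_prompt` and `generate_with_repair`, the `complete` and `run_test` callables, the `max_repairs` parameter, and the prompt layout are all hypothetical choices made for this sketch.

```python
from typing import Callable, Optional, Tuple


def build_prompt(signature: str, doc: str = "", examples: str = "", body: str = "") -> str:
    """Assemble a test-generation prompt from the available components.

    The comment-style layout is an assumption for illustration only.
    """
    parts = []
    if doc:
        parts.append(f"// Documentation: {doc}")
    if examples:
        parts.append(f"// Usage example: {examples}")
    if body:
        parts.append(f"// Implementation: {body}")
    parts.append(f"// Write a Mocha unit test for: {signature}")
    return "\n".join(parts)


def generate_with_repair(
    prompt: str,
    complete: Callable[[str], str],
    run_test: Callable[[str], Optional[str]],
    max_repairs: int = 1,
) -> Tuple[str, bool]:
    """Generate a test; if it fails, re-prompt with the failing test and error.

    complete: hypothetical stand-in for an LLM call (prompt -> test source).
    run_test: hypothetical stand-in for a test runner; returns None when the
              test passes, or an error message when it fails.
    """
    test = complete(prompt)
    error = run_test(test)
    repairs = 0
    while error is not None and repairs < max_repairs:
        # The repair prompt carries the original context plus the failure.
        repair_prompt = (
            f"{prompt}\n"
            f"// The following test failed:\n{test}\n"
            f"// Error: {error}\n"
            f"// Write a fixed version of the test."
        )
        test = complete(repair_prompt)
        error = run_test(test)
        repairs += 1
    return test, error is None
```

Passing the model and the test runner in as callables keeps the loop independent of any particular LLM API or test framework, so the sketch can be exercised with simple stubs.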

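Experimental Results

Research Questions

RQ1: What level of statement coverage and branch coverage do the tests generated by TestPilot achieve?
RQ2: How effective are TestPilot's prompts when individual information components (function body, usage examples, doc comments) are excluded or included?
RQ3: How does TestPilot perform with different LLMs (GPT-3.5-turbo, code-cushman-002, StarCoder)?
RQ4: How similar are the generated tests to existing tests (that is, are they memorized or copied from training data)?
RQ5: Do the generated tests contain non-trivial assertions that actually exercise the functionality under test?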
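
To make the RQ2 setup more concrete, one plausible way to enumerate prompt variants is a leave-one-out scheme, sketched below in Python. This is an assumption about how such an ablation could be organized, not a description of the paper's exact experimental protocol. The component labels and the `ablation_configs` helper are hypothetical.

```python
from typing import Dict, List

# Optional prompt components named in RQ2 (the labels are this sketch's own).
COMPONENTS = ["body", "usage_examples", "doc_comments"]


def ablation_configs(components: List[str]) -> Dict[str, List[str]]:
    """Return the full configuration plus one leave-one-out variant per component."""
    configs = {"full": list(components)}
    for dropped in components:
        configs[f"no_{dropped}"] = [c for c in components if c != dropped]
    return configs
```

Each configuration would then be used to build prompts, and the coverage obtained under each variant compared against the full configuration.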

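Key Findings

Across 25 npm packages, TestPilot achieves a median statement coverage of 70.2% and a median branch coverage of 52.8%.
By comparison, Nessie achieves 51.3% statement coverage and 25.6% branch coverage.
92.8% of TestPilot's tests have at most 50% similarity with existing tests, and none are exact copies.
60.0% of the tests have at most 40% similarity with existing tests.
Adaptive re-prompting repairs roughly 15.6% of the failing tests.
Results with other LLMs are qualitatively similar: code-cushman-002 reaches 68.2% statement and 51.2% branch coverage, while StarCoder reaches a lower 54.0% statement and 37.5% branch coverage.
All five prompt components are essential for high-quality test generation; removing any one of them reduces effectiveness.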
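
The similarity figures above are based on normalized edit distance. A minimal Python sketch of such a measure follows. It assumes character-level Levenshtein distance divided by the length of the longer string, and it takes the maximum over all existing tests; the paper's exact normalization and preprocessing may differ. The function names are hypothetical.

```python
from typing import List


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1.0 for identical strings, 0.0 for strings with nothing in common."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def max_similarity(generated: str, existing: List[str]) -> float:
    """Similarity of a generated test to its closest existing test."""
    return max((similarity(generated, t) for t in existing), default=0.0)
```

Under this definition, a generated test counts as an exact copy only when its maximum similarity is 1.0, and a value of at most 0.5 means that at least half of the longer text would have to change to turn one test into the other.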

This review was generated by AI and checked by a human editor.