QUICK REVIEW

[Paper Review] No More Manual Tests? Evaluating and Improving ChatGPT for Unit Test Generation

Zhiqiang Yuan, Yiling Lou|arXiv (Cornell University)|May 7, 2023

Software Testing and Debugging Techniques | Cited by 52

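One-Line Summary

This paper empirically evaluates ChatGPT for Java unit test generation, analyzes its strengths and limitations, and introduces ChatTester, which improves correctness through initial test generation followed by iterative refinement.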

ABSTRACT

Unit testing is essential in detecting bugs in functionally-discrete program units. Manually writing high-quality unit tests is time-consuming and laborious. Although traditional techniques can generate tests with reasonable coverage, they exhibit low readability and cannot be directly adopted by developers. Recent work has shown the large potential of large language models (LLMs) in unit test generation, which can generate more human-like and meaningful test code. ChatGPT, the latest LLM incorporating instruction tuning and reinforcement learning, has performed well in various domains. However, it remains unclear how effective ChatGPT is in unit test generation. In this work, we perform the first empirical study to evaluate ChatGPT's capability of unit test generation. Specifically, we conduct a quantitative analysis and a user study to systematically investigate the quality of its generated tests regarding the correctness, sufficiency, readability, and usability. The tests generated by ChatGPT still suffer from correctness issues, including diverse compilation errors and execution failures. Still, the passing tests generated by ChatGPT resemble manually-written tests by achieving comparable coverage, readability, and even sometimes developers' preference. Our findings indicate that generating unit tests with ChatGPT could be very promising if the correctness of its generated tests could be further improved. Inspired by our findings above, we propose ChatTESTER, a novel ChatGPT-based unit test generation approach, which leverages ChatGPT itself to improve the quality of its generated tests. ChatTESTER incorporates an initial test generator and an iterative test refiner. Our evaluation demonstrates the effectiveness of ChatTESTER by generating 34.3% more compilable tests and 18.7% more tests with correct assertions than the default ChatGPT.

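Motivation and Objectives

Evaluate the correctness, sufficiency, readability, and usability of unit tests generated by ChatGPT.
Identify the common causes of compilation errors and execution failures in ChatGPT-generated tests.
Compare ChatGPT against the traditional Evosuite baseline and the learning-based AthenaTest baseline.
Propose ChatTester to improve test correctness and demonstrate the resulting gains.
Provide practical guidelines for using ChatGPT in unit test generation.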

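Methodology

Build a benchmark of 1,000 focal-method/test-method pairs drawn from 1,748 pairs spanning 185 Java projects.
Design a two-part basic prompt (a natural-language task description plus code context) and query ChatGPT (gpt-3.5-turbo).
Evaluate syntactic, compilation, and execution correctness using a Java parser and test execution, and classify the error types.
Evaluate sufficiency with Jacoco by measuring statement coverage, branch coverage, and the number of assertions.
Conduct a user study with Java developers from industry to assess readability and usability relative to manually written tests.
Develop ChatTester, which combines an intention-based initial test generator with an iterative validate-and-fix refiner, to improve correctness.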
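To make the two-part prompt concrete, the following is a minimal sketch of how a task description and code context might be combined into a single query string. The function name `build_basic_prompt`, the instruction wording, and the example `Calculator` class are all illustrative assumptions; the review does not quote the paper's actual prompt text.

```python
def build_basic_prompt(focal_method: str, class_context: str) -> str:
    """Assemble a two-part prompt: task description first, then code context."""
    # Part 1: natural-language task description (wording is illustrative,
    # not the prompt used in the paper).
    task_description = (
        "Write a JUnit unit test for the focal method below. "
        "Return only compilable Java code."
    )
    # Part 2: code context, i.e. the focal method plus its surrounding class.
    code_context = (
        "// Focal class context\n"
        f"{class_context.strip()}\n\n"
        "// Focal method\n"
        f"{focal_method.strip()}"
    )
    return f"{task_description}\n\n{code_context}"


# Hypothetical focal method and class, used only to show the prompt shape.
example_prompt = build_basic_prompt(
    focal_method="public int add(int a, int b) { return a + b; }",
    class_context="public class Calculator { }",
)
print(example_prompt)
```

The resulting string would then be sent to the model as the user message; the API call itself is omitted here because it depends on the client library and credentials in use.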

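Experimental Results

Research Questions

RQ1: How correct are ChatGPT-generated tests (syntax, compilation, execution), and what are the common error types?
RQ2: Are ChatGPT-generated tests sufficient in terms of coverage and assertions?
RQ3: How readable are ChatGPT-generated tests compared with manually written tests?
RQ4: Would developers use ChatGPT-generated tests directly in real projects?
RQ5: Does ChatTester improve correctness over default ChatGPT, and how much does each component contribute?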

Key Findings

Method	Syntactic correctness (%)	Compilation correctness (%)	Execution correctness, passing (%)
ChatGPT	100.0	42.1	24.8
AthenaTest	54.8	18.8	14.4
Evosuite	100.0	66.8	59.7
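The three columns can be read as successive filters: a test must parse to compile, and must compile to pass. Assuming each rate is a percentage of all generated tests (an assumption that matches the 57.9% and 17.3% figures reported for ChatGPT), the share of tests lost at each stage follows by subtraction. The sketch below does that arithmetic; the function and key names are hypothetical.

```python
# Correctness rates (% of all generated tests), copied from the table above.
rates = {
    "ChatGPT": (100.0, 42.1, 24.8),
    "AthenaTest": (54.8, 18.8, 14.4),
    "Evosuite": (100.0, 66.8, 59.7),
}


def failure_breakdown(syntax_ok, compile_ok, pass_ok):
    """Return the share of tests lost at each stage, in percentage points."""
    return {
        "syntax_errors": round(100.0 - syntax_ok, 1),
        "compile_errors": round(syntax_ok - compile_ok, 1),
        "execution_failures": round(compile_ok - pass_ok, 1),
    }


breakdown = {name: failure_breakdown(*r) for name, r in rates.items()}
for name, losses in breakdown.items():
    print(name, losses)
```

For ChatGPT this gives 0.0 points lost to syntax errors, 57.9 to compilation errors, and 17.3 to execution failures. The breakdowns for AthenaTest and Evosuite rest on the same nesting assumption and are not stated in the review itself.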

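ChatGPT-generated tests show high syntactic correctness, but their overall pass rate is low because of compilation and assertion errors.
Only 24.8% of ChatGPT tests pass execution; 57.9% have compilation errors and 17.3% fail at run time because of incorrect assertions.
Among the tests that pass, ChatGPT achieves high statement coverage (82.3%) and branch coverage (65.6%), with readability and usability on par with manual tests.
ChatGPT-generated tests resemble manual tests in number of assertions and overall structure, which suggests they could be practical once the correctness problems are resolved.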
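AthenaTest is less correct than ChatGPT on every measure, and both baselines are reported to reach lower coverage. Evosuite, however, has higher compilation and execution correctness than ChatGPT (66.8% and 59.7% versus 42.1% and 24.8%) because it filters its tests after generation.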
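ChatTester, which combines intention-based initial test generation with an iterative validate-and-fix refiner, produces 34.3% more compilable tests and 18.7% more tests with correct assertions than default ChatGPT.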
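The validate-and-fix idea behind the refiner can be sketched as a bounded loop: check the candidate test, and if it fails, hand the error back for a repair attempt. This is a minimal sketch, not ChatTester's implementation. The `validate` and `fix` callables are hypothetical stand-ins for compiling and running a test and for prompting ChatGPT with the error message, and the default budget of three rounds is an arbitrary assumption.

```python
from typing import Callable, Optional, Tuple

# validate(test_code) returns None when the test compiles and passes, otherwise
# an error message. Hypothetical stand-in for compiling and running the test.
Validator = Callable[[str], Optional[str]]
# fix(test_code, error_message) returns revised test code. Hypothetical
# stand-in for a ChatGPT call prompted with the error message.
Fixer = Callable[[str, str], str]


def refine_test(initial_test: str, validate: Validator, fix: Fixer,
                max_iterations: int = 3) -> Tuple[str, bool, int]:
    """Validate a test and request fixes until it passes or the budget is spent.

    Returns (final_test, passed, number_of_fix_rounds_used).
    """
    test = initial_test
    for rounds_used in range(max_iterations + 1):
        error = validate(test)
        if error is None:
            return test, True, rounds_used
        if rounds_used == max_iterations:
            break  # budget exhausted; return the last failing candidate
        test = fix(test, error)
    return test, False, max_iterations


# Toy stand-ins: the "compiler" only checks for one import, and the "model"
# repairs the test by adding it.
def toy_validate(test_code: str) -> Optional[str]:
    if "import org.junit.Test;" not in test_code:
        return "error: cannot find symbol: class Test"
    return None


def toy_fix(test_code: str, error_message: str) -> str:
    return "import org.junit.Test;\n" + test_code


final_test, passed, rounds = refine_test(
    "public class CalculatorTest { @Test public void testAdd() { } }",
    toy_validate,
    toy_fix,
)
print(passed, rounds)  # prints: True 1
```

Passing the validator and fixer in as callables keeps the loop independent of any particular build tool or model client, so the same control flow can be exercised with toy functions as shown.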

This review was generated by AI and checked by a human editor.