QUICK REVIEW

[Paper Review] Large Language Models as Test Case Generators: Performance Evaluation and Enhancement

Kefan Li, Yuan Yuan|arXiv (Cornell University)|Apr 20, 2024

Software Testing and Debugging Techniques|7 citations

TL;DR

The paper evaluates LLMs as test case generators and introduces TestChain, a multi-agent framework with a Python interpreter, to improve accuracy and robustness of generated test cases, especially on harder problems.

ABSTRACT

Code generation with Large Language Models (LLMs) has been extensively studied and achieved remarkable progress. As a complementary aspect to code generation, test case generation is of crucial importance in ensuring the quality and reliability of code. However, using LLMs as test case generators has been much less explored. Current research along this line primarily focuses on enhancing code generation with assistance from test cases generated by LLMs, while the performance of LLMs in test case generation alone has not been comprehensively examined. To bridge this gap, we conduct extensive experiments to study how well LLMs can generate high-quality test cases. We find that as the problem difficulty increases, state-of-the-art LLMs struggle to generate correct test cases, largely due to their inherent limitations in computation and reasoning. To mitigate this issue, we further propose a multi-agent framework called TestChain that decouples the generation of test inputs and test outputs. Notably, TestChain uses a ReAct format conversation chain for LLMs to interact with a Python interpreter in order to provide more accurate test outputs. Our results indicate that TestChain outperforms the baseline by a large margin. Particularly, in terms of the accuracy of test cases, TestChain using GPT-4 as the backbone achieves a 13.84% improvement over the baseline on the LeetCode-hard dataset.

Motivation & Objective

Assess how well state-of-the-art LLMs generate function-level unit test cases across datasets of varying difficulty.
Identify the main error types limiting test-case correctness (e.g., Assertion Errors, Runtime Errors, Timeouts).
Propose and validate a decoupled, multi-agent architecture (TestChain) that leverages a Python interpreter to improve input-output mapping.
Analyze the impact of 0-shot vs 1-shot prompts and the role of tool-assisted reasoning in test-case generation.

Proposed method

Conduct large-scale experiments with four LLMs (StarChat, CodeLlama, GPT-3.5, GPT-4) on HumanEval-no-exp and LeetCode-no-exp datasets.
Evaluate test cases using accuracy, line coverage, and a new Code-with-Bugs (CwB) metric for HumanEval.
Analyze error types (Assertion, Runtime, Timeout) to understand failure modes.
Propose TestChain with two agents (Designer for inputs, Calculator for outputs) and a Python interpreter via a ReAct-style interaction to generate and verify test cases.
Compare TestChain against a 1-shot Test Agent baseline and against a modified TestChain variant without the Python interpreter, to separate the value of decoupling from the value of tool interaction.
Report results for 0-shot vs 1-shot prompts where applicable.
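
As a rough illustration of the decoupling described above, the sketch below separates input design from output computation and obtains each expected output by executing code rather than by asking the model to state it. This is not the authors' implementation: build_test_cases, run_in_interpreter, stub_designer, and stub_calculator are hypothetical names, the two agents are stubbed with plain functions instead of LLM calls, and the multi-turn ReAct conversation is collapsed into a single execute step.

```python
from typing import Callable, List


def run_in_interpreter(code: str) -> str:
    """Execute a snippet and return repr() of its `result` variable.

    Stand-in for the Python interpreter tool. NOTE: exec() is unsandboxed;
    a real system should isolate untrusted, model-written code.
    """
    namespace: dict = {}
    exec(code, namespace)
    return repr(namespace["result"])


def build_test_cases(
    problem: str,
    func_name: str,
    designer: Callable[[str], List[str]],
    calculator: Callable[[str, str], str],
) -> List[str]:
    """Assemble assert-style tests from separately produced inputs and outputs."""
    tests = []
    for args in designer(problem):          # step 1: inputs only
        code = calculator(problem, args)    # step 2: code that computes the output
        try:
            expected = run_in_interpreter(code)
        except Exception:
            continue  # drop inputs whose output could not be computed
        tests.append(f"assert {func_name}({args}) == {expected}")
    return tests


# Hypothetical stubs standing in for the two LLM agents.
def stub_designer(problem: str) -> List[str]:
    return ["[3, 1, 2]", "[]"]


def stub_calculator(problem: str, args: str) -> str:
    return f"result = sorted({args})"
```

With the stubs above, build_test_cases("Sort a list of integers.", "sort_list", stub_designer, stub_calculator) yields two assert statements whose expected values come from executed code, which is the property the no-interpreter variant lacks.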

Experimental results

Research questions

RQ1: Can current LLMs generate high-quality test cases for function-level unit testing across easy and hard Python problems?
RQ2: What are the dominant error types when LLMs generate test cases, and how do they scale with problem difficulty?
RQ3: Does decoupling test input generation from output generation improve test-case accuracy and strength?
RQ4: Does incorporating a Python interpreter via a ReAct-style interaction significantly enhance the correctness of generated test cases?
RQ5: How do prompt strategies (0-shot vs 1-shot) interact with model capability in this task?
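
The error types in RQ2 can be made concrete with a small harness. The sketch below is an assumption about how such labeling might be done, not the paper's evaluation code: it runs a reference solution plus one assert-style test in a fresh Python process and labels the outcome as a pass, an assertion error, a runtime error, or a timeout. The names classify_test_case and suite_accuracy and the 5-second default timeout are hypothetical choices.

```python
import subprocess
import sys
from typing import List


def classify_test_case(solution_src: str, test_src: str, timeout_s: float = 5.0) -> str:
    """Run one assert-style test against a reference solution and label the outcome."""
    program = solution_src + "\n" + test_src + "\n"
    try:
        proc = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return "timeout"
    if proc.returncode == 0:
        return "pass"
    # An uncaught exception prints a traceback whose last line names its type.
    stderr = proc.stderr.strip()
    last_line = stderr.splitlines()[-1] if stderr else ""
    if last_line.startswith("AssertionError"):
        return "assertion_error"
    return "runtime_error"


def suite_accuracy(solution_src: str, tests: List[str], timeout_s: float = 5.0) -> float:
    """Fraction of generated tests that the reference solution passes."""
    labels = [classify_test_case(solution_src, t, timeout_s) for t in tests]
    return labels.count("pass") / len(labels) if labels else 0.0
```

Under this labeling, an assertion error means the test ran but its expected output disagreed with the reference solution, while a runtime error means the test itself failed to execute cleanly.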

Key findings

Model	Method	HumanEval-no-exp Accuracy (%)	HumanEval-no-exp Line Cov (%)	HumanEval-no-exp CwB (%)	LeetCode-no-exp Accuracy (%)	LeetCode-no-exp Line Cov (%)
GPT-3.5	Test Agent (1-shot)	74.02	74.69	74.15	38.97	73.66
GPT-3.5	TestChain	80.85	77.53	80.80	48.72	80.23
GPT-4	Test Agent (1-shot)	84.63	77.04	83.11	57.95	88.47
GPT-4	TestChain	90.24	80.00	88.66	71.79	90.60

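LLMs generate mostly correct test cases for the easier HumanEval-no-exp problems, but accuracy drops sharply on the harder LeetCode-no-exp problems (for example, the GPT-4 baseline falls from 84.63% to 57.95%).
Line coverage stays comparatively high even on LeetCode-no-exp (73.66% to 90.60%) while accuracy there is much lower (38.97% to 71.79%), so correctness rather than coverage is the bottleneck on hard problems; CwB, reported for HumanEval-no-exp only, improves by varying amounts across models.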
TestChain consistently outperforms the Test Agent (1-shot) baseline across metrics and datasets.
TestChain with GPT-4 achieves 71.79% accuracy on LeetCode-no-exp, compared to 57.95% for the 1-shot Test Agent baseline, a gain of 13.84 percentage points.
Using a Calculator agent with a Python interpreter reduces Assertion Errors and improves correctness and coverage relative to the 1-shot Test Agent baseline and the no-interpreter variant.
A modified TestChain without the Python interpreter performs worse than the full TestChain, highlighting the crucial role of tool-assisted computation.
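
The size of the improvement is easier to compare across rows as a difference in percentage points. The short sketch below recomputes those gaps from the accuracy columns of the table above; the values are copied from the table, and the names ACCURACY and accuracy_gains are hypothetical.

```python
# Accuracy (%) copied from the table above, as (Test Agent 1-shot, TestChain).
ACCURACY = {
    ("GPT-3.5", "HumanEval-no-exp"): (74.02, 80.85),
    ("GPT-3.5", "LeetCode-no-exp"): (38.97, 48.72),
    ("GPT-4", "HumanEval-no-exp"): (84.63, 90.24),
    ("GPT-4", "LeetCode-no-exp"): (57.95, 71.79),
}


def accuracy_gains(table):
    """Return the TestChain-minus-baseline gap in percentage points per row."""
    return {key: round(chain - base, 2) for key, (base, chain) in table.items()}


if __name__ == "__main__":
    for (model, dataset), gain in accuracy_gains(ACCURACY).items():
        print(f"{model} on {dataset}: +{gain:.2f} points")
```

For both models the gap is larger on LeetCode-no-exp than on HumanEval-no-exp, and the GPT-4 LeetCode-no-exp gap equals the 13.84 figure that the abstract attributes to the LeetCode-hard dataset.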

This review was created by AI and reviewed by human editors.