QUICK REVIEW

[Paper Review] Is Your Code Generated by ChatGPT Really Correct? Rigorous Evaluation of Large Language Models for Code Generation

Jiawei Liu, Chunqiu Steven Xia | arXiv (Cornell University) | 2023. 05. 02.

Software Engineering Research | Citations: 171

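One-Line Summary

Summary: This paper presents EvalPlus, an automated framework that augments code generation benchmarks with large numbers of type-aware test inputs in order to rigorously evaluate the functional correctness of LLM-generated code. It shows that existing benchmarks underestimate how often that code is wrong.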

ABSTRACT

Program synthesis has been long studied with recent approaches focused on directly using the power of Large Language Models (LLMs) to generate code. Programming benchmarks, with curated synthesis problems and test-cases, are used to measure the performance of various LLMs on code synthesis. However, these test-cases can be limited in both quantity and quality for fully assessing the functional correctness of the generated code. Such limitation in the existing benchmarks begs the following question: In the era of LLMs, is the code generated really correct? To answer this, we propose EvalPlus -- a code synthesis evaluation framework to rigorously benchmark the functional correctness of LLM-synthesized code. EvalPlus augments a given evaluation dataset with large amounts of test-cases newly produced by an automatic test input generator, powered by both LLM- and mutation-based strategies. While EvalPlus is general, we extend the test-cases of the popular HumanEval benchmark by 80x to build HumanEval+. Our extensive evaluation across 26 popular LLMs (e.g., GPT-4 and ChatGPT) demonstrates that HumanEval+ is able to catch significant amounts of previously undetected wrong code synthesized by LLMs, reducing the pass@k by up-to 19.3-28.9%. We also surprisingly found that test insufficiency can lead to mis-ranking. For example, both WizardCoder-CodeLlama and Phind-CodeLlama now outperform ChatGPT on HumanEval+, while none of them could on HumanEval. Our work not only indicates that prior popular code synthesis evaluation results do not accurately reflect the true performance of LLMs for code synthesis, but also opens up a new direction to improve such programming benchmarks through automated testing. We have open-sourced our tools, enhanced datasets as well as all LLM-generated code at https://github.com/evalplus/evalplus to facilitate and accelerate future LLM-for-code research.

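Research Motivation and Goals

Identify the limitations of existing code generation benchmarks in accurately measuring the correctness of LLM-generated code.
Develop EvalPlus, which automatically generates diverse, valid test inputs and augments benchmarks such as HumanEval for more rigorous evaluation.
Demonstrate how the strengthened test suite affects the pass@k metric across 26 LLMs, and reveal the resulting ranking changes.
Highlight correctness problems in the ground-truth solutions of widely used benchmarks and propose improved evaluation based on automated testing.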

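Proposed Method

Augment existing code benchmarks with automatically generated test inputs at scale, using LLM-based seed generation followed by mutation-based diversification.
Apply type-aware mutation to derive many valid inputs from high-quality seeds.
Perform differential testing against the ground-truth implementation, which serves as the correctness oracle.
Apply test-suite reduction via set covering to preserve test effectiveness with fewer tests.
Annotate each task with program contracts to filter out invalid inputs and guide test generation.
Re-implement and verify the ground-truth solutions to ensure the evaluation is reliable.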
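The method steps above (type-aware mutation, contract filtering, and differential testing against a ground truth) can be illustrated with a minimal sketch. This is not the EvalPlus implementation: the task, the `contract`, `ground_truth`, and `candidate` functions, the mutation rules, and the seed inputs are all hypothetical, chosen only to show how the pieces might fit together.

```python
import copy
import random


def mutate(value, rng):
    """Return a structurally similar variant of `value`, chosen by its type."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return not value
    if isinstance(value, int):
        return value + rng.choice([-1, 1])
    if isinstance(value, float):
        return value + rng.choice([-1.0, 1.0])
    if isinstance(value, str):
        if value and rng.random() < 0.5:
            i = rng.randrange(len(value))
            return value[:i] + value[i + 1:]  # drop one character
        return value + rng.choice("abc")  # append one character
    if isinstance(value, list):
        out = copy.deepcopy(value)
        if out and rng.random() < 0.5:
            i = rng.randrange(len(out))
            out[i] = mutate(out[i], rng)  # mutate one element
        elif out:
            out.append(copy.deepcopy(rng.choice(out)))  # duplicate an element
        return out
    return value  # unknown types are left unchanged


def contract(xs):
    """Hypothetical precondition: a non-empty list of (non-bool) ints."""
    return (
        isinstance(xs, list)
        and len(xs) > 0
        and all(isinstance(x, int) and not isinstance(x, bool) for x in xs)
    )


def ground_truth(xs):
    """Hypothetical reference solution: largest element."""
    return max(xs)


def candidate(xs):
    """Hypothetical model output: only correct when the input is sorted."""
    return xs[-1]


def generate_inputs(seeds, contract, budget, rng):
    """Grow a pool of contract-satisfying inputs by mutating pool members."""
    pool = [s for s in seeds if contract(s)]
    if not pool:
        return pool
    for _ in range(budget):
        child = mutate(rng.choice(pool), rng)
        if contract(child) and child not in pool:
            pool.append(child)
    return pool


def differential_test(candidate, ground_truth, inputs):
    """Return the inputs on which the candidate disagrees with the ground truth."""
    failures = []
    for x in inputs:
        expected = ground_truth(copy.deepcopy(x))
        try:
            actual = candidate(copy.deepcopy(x))
        except Exception:
            failures.append(x)
            continue
        if actual != expected:
            failures.append(x)
    return failures


if __name__ == "__main__":
    rng = random.Random(0)
    seeds = [[1, 2, 3], [5]]  # both seeds happen to be sorted
    pool = generate_inputs(seeds, contract, budget=50, rng=rng)
    print(len(pool), "inputs;", len(differential_test(candidate, ground_truth, pool)), "failures")
```

In this toy setup the candidate passes on both seed inputs because they are sorted, which mirrors the situation where a small hand-written test set lets wrong code through. Mutated inputs that break the ordering are the ones that can expose the disagreement with the ground truth.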

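Experimental Results

Research Questions

RQ1: How adequate are current code generation benchmarks at detecting incorrect LLM-generated code?
RQ2: Does adding large-scale, diverse test cases (HumanEval+) change the measured performance (pass@k) of various LLMs relative to the original HumanEval?
RQ3: Can automated test input generation reveal ranking changes among LLMs that are invisible on the standard benchmark?
RQ4: How does test-suite reduction affect evaluation effectiveness and cost?
RQ5: Do the ground-truth solutions of popular benchmarks contain defects, and can automated testing expose them?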
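Several of these questions are phrased in terms of pass@k. The review does not spell out how it is computed; a common choice is the unbiased estimator introduced with the original HumanEval benchmark, which, for n samples per problem of which c pass all tests, computes 1 - C(n-c, k) / C(n, k). The sketch below assumes that estimator, and the sample counts in the usage example are made up for illustration.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k).

    n: samples generated for the problem
    c: samples that pass every test
    k: number of samples the user is assumed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k draw contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 6 of 10 samples pass the base tests,
    # but only 4 of 10 survive a stricter, augmented test suite.
    print(pass_at_k(10, 6, 1), pass_at_k(10, 4, 1))
```

A stricter test suite can only lower c for a given set of samples, so the per-problem pass@k can only stay the same or fall. The benchmark score is the mean of this value over all problems.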

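Key Results

HumanEval+ substantially lowers pass@k relative to the base HumanEval for many models (by up to 19.3-28.9% across different values of k).
Evaluating on HumanEval+ changes relative rankings: some open-source models outperform ChatGPT on HumanEval+ even though they do not on HumanEval.
EvalPlus revealed defects in the HumanEval ground-truth solutions, affecting roughly 11% of the problems.
Test-suite reduction can achieve similar effectiveness with about 47x fewer tests, depending on the reduction strategy (in the case of HumanEval+ mini).
Lower temperatures help for small k and higher temperatures help for large k. The best temperature is fairly stable before and after moving to HumanEval+, but ranking changes still appear for some models.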
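The test-suite reduction result rests on set covering. As a rough illustration, the classic greedy approximation repeatedly picks the test that satisfies the most still-unsatisfied requirements. The sketch below assumes each test is described by the set of "requirements" it satisfies (for example, IDs of faulty programs it exposes); the test names, requirement names, and the choice of the greedy heuristic are assumptions for illustration, not details taken from the paper.

```python
def reduce_suite(detects: dict[str, set[str]]) -> list[str]:
    """Greedy set cover: choose tests until every requirement is covered.

    detects maps a test name to the set of requirements it satisfies.
    Ties are broken by test name so the result is deterministic.
    """
    uncovered = set().union(*detects.values())
    chosen = []
    while uncovered:
        # max() keeps the first maximal item, so sorting gives a stable tie-break
        best = max(sorted(detects), key=lambda t: len(detects[t] & uncovered))
        chosen.append(best)
        uncovered -= detects[best]
    return chosen


if __name__ == "__main__":
    # Hypothetical data: which faulty samples each test exposes.
    detects = {
        "t1": {"bugA", "bugB"},
        "t2": {"bugB"},
        "t3": {"bugC"},
        "t4": {"bugA", "bugB", "bugC"},
    }
    print(reduce_suite(detects))  # ['t4']
```

The greedy heuristic does not guarantee the smallest possible suite, but the reduced suite still satisfies every requirement the full suite did, which is why effectiveness against the chosen requirements is preserved.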


This review was generated by AI and reviewed by a human editor.