QUICK REVIEW

[Paper Review] What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Shihan Dou, Haoxiang Jia|arXiv (Cornell University)|2024. 07. 08.

Hate Speech and Cyberbullying Detection | Citations: 6

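One-Line Summary

This paper empirically analyzes code generated by seven LLMs across three benchmarks, builds a bug taxonomy, constructs a real-world benchmark, and proposes a self-critique method that fixes bugs without fine-tuning.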

ABSTRACT

The increasing development of LLMs in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and four popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyzed the root cause for common bug types. To better understand the performance of LLMs in real-world projects, we also manually created a real-world benchmark RWPB. We analyzed bugs on RWPB to highlight distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Our comprehensive and extensive study provides insights into the current limitations of LLM-based code generation and opportunities for enhancing the accuracy and quality of the generated code.

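Motivation and Objectives

Evaluate the correctness and characteristics of code generated by leading closed-source and open-source LLMs on Python tasks.
Characterize the types and distribution of bugs in the generated code.
Compare performance on standard benchmarks with performance on a manually curated real-world benchmark (RWPB).
Propose a training-free self-critique approach that mitigates bugs and improves the pass rate.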

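Proposed Method

Evaluate seven LLMs (three closed-source, four open-source) on 1,164 programming problems from HumanEval+, MBPP+, and APPS+.
Measure the length, cyclomatic complexity, and API usage of the generated code and compare them with the canonical solutions.
Develop a two-stage bug annotation process, a script-based initial classification followed by manual refinement, to classify bugs into 3 primary types and 12 sub-types.
Construct a real-world benchmark (RWPB) from 140 GitHub tasks to compare the real-world bug distribution with that of existing benchmarks.
Introduce an iterative self-critique method in which the LLM critiques and corrects its own code based on bug types and compiler feedback, without additional training.
Report the pass-rate improvement and analyze how task complexity affects LLM performance.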
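
The three code metrics named above (length, cyclomatic complexity, API usage) can be approximated with Python's standard `ast` module. The following is a minimal sketch, not the paper's measurement script: the function name `code_metrics`, the set of node types counted as decision points, and the choice to count every call expression as an "API call" are all assumptions made for illustration.

```python
import ast

# Node types treated as decision points. This set is an assumption;
# dedicated tools differ in the details of what they count.
_DECISION_NODES = (
    ast.If, ast.For, ast.AsyncFor, ast.While,
    ast.ExceptHandler, ast.IfExp, ast.Assert, ast.comprehension,
)


def code_metrics(source: str) -> dict:
    """Return rough length, cyclomatic complexity, and call count for Python source."""
    tree = ast.parse(source)
    complexity = 1  # a straight-line program has exactly one path
    api_calls = 0
    for node in ast.walk(tree):
        if isinstance(node, _DECISION_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # "a and b and c" adds two extra branches
            complexity += len(node.values) - 1
        elif isinstance(node, ast.Call):
            # Assumption: every call expression counts as one API use
            api_calls += 1
    # Length = non-blank lines that are not pure comments
    lines = [
        ln for ln in source.splitlines()
        if ln.strip() and not ln.strip().startswith("#")
    ]
    return {
        "lines": len(lines),
        "cyclomatic_complexity": complexity,
        "api_calls": api_calls,
    }


sample = (
    "def f(xs):\n"
    "    # sum small positives\n"
    "    t = 0\n"
    "    for x in xs:\n"
    "        if x > 0 and x < 10:\n"
    "            t += abs(x)\n"
    "    return t\n"
)
metrics = code_metrics(sample)
```

Running the same function on a generated solution and on its canonical solution gives the kind of side-by-side comparison the study describes.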

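Experimental Results

Research Questions

RQ1: How effective are LLMs at code generation, and how does task complexity affect their performance?
RQ2: What are the root causes and distribution of bugs in LLM-generated code across benchmarks?
RQ3: How can a real-world benchmark be built to minimize data leakage, and how do real-world bugs compare with benchmark bugs?
RQ4: Can a training-free self-critique approach mitigate bugs and improve the correctness of generated code?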

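Key Findings

Closed-source LLMs outperform open-source LLMs, particularly on complex tasks (GPT-4 and Claude-3 rank highest, while Phi-3 lags behind).
Generated code tends to be shorter than the canonical solutions but has higher cyclomatic complexity and similar API usage.
Incorrect code tends to contain more comments than correct code, suggesting that comments correlate with complexity rather than correctness.
Functional bugs are the dominant problem, though syntax and runtime bugs also occur; complex problems lead to timeouts or suboptimal algorithms.
On the real-world benchmark RWPB, Claude-3 reaches 45.7% accuracy and Phi-3 reaches 22%, and the bug distribution differs from that of existing benchmarks.
The self-critique method raises the pass rate by 29.2% after two iterations, without additional training.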
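
The self-critique loop can be pictured roughly as follows. This is a hedged sketch under several assumptions, not the authors' implementation: the prompts are invented, `model` is a hypothetical callable that maps a prompt string to a completion, feedback comes from running an assertion-based test in-process with `exec` (a real setup would need a sandbox and a timeout), and `coarse_bug_type` only maps an error message onto the three primary bug categories mentioned in this review.

```python
from typing import Callable, Tuple


def run_candidate(code: str, test: str) -> Tuple[bool, str]:
    """Execute code plus its test; return (passed, feedback message)."""
    try:
        compiled = compile(code + "\n" + test, "<candidate>", "exec")
    except SyntaxError as exc:
        return False, f"{type(exc).__name__}: {exc.msg}"
    try:
        # Unsandboxed exec: acceptable for a sketch, unsafe for untrusted code.
        exec(compiled, {"__name__": "__candidate__"})
    except Exception as exc:
        return False, f"{type(exc).__name__}: {exc}"
    return True, ""


def coarse_bug_type(feedback: str) -> str:
    """Map an error message to a coarse bug category (assumed mapping)."""
    if feedback.startswith(("SyntaxError", "IndentationError", "TabError")):
        return "syntax bug"
    if feedback.startswith("AssertionError"):
        return "functional bug"
    return "runtime bug"


def self_critique_repair(
    task: str,
    code: str,
    test: str,
    model: Callable[[str], str],  # hypothetical LLM interface
    max_iters: int = 2,
) -> Tuple[str, bool]:
    """Critique and repair code for up to max_iters rounds, with no training."""
    for _ in range(max_iters):
        passed, feedback = run_candidate(code, test)
        if passed:
            return code, True
        bug_type = coarse_bug_type(feedback)
        # Step 1: ask the model to critique its own code.
        critique = model(
            f"Task:\n{task}\n\nCode:\n{code}\n\n"
            f"Bug type: {bug_type}\nFeedback: {feedback}\n\n"
            "Critique the code and explain the likely cause of the bug."
        )
        # Step 2: ask the model to correct the code using that critique.
        code = model(
            f"Task:\n{task}\n\nCode:\n{code}\n\nCritique:\n{critique}\n\n"
            "Return only the corrected code."
        )
    passed, _ = run_candidate(code, test)
    return code, passed


def stub_model(prompt: str) -> str:
    """Stand-in for an LLM so the sketch runs offline."""
    if "Return only the corrected code" in prompt:
        return "def add(a, b):\n    return a + b\n"
    return "The function subtracts instead of adding."


buggy = "def add(a, b):\n    return a - b\n"
test_case = "assert add(2, 3) == 5\n"
fixed, ok = self_critique_repair("Add two numbers.", buggy, test_case, stub_model)
```

Separating the critique prompt from the repair prompt mirrors the "critique, then correct" order described in the abstract; collapsing the two into one prompt would be a different design.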


This review was generated by AI and reviewed by a human editor.