QUICK REVIEW

[Paper Review] Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues

Yue Liu, Thanh Le-Cong|arXiv (Cornell University)|2023. 07. 24.

Software Engineering Research | Citations: 15

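One-Line Summary

This paper systematically evaluates 4,066 ChatGPT-generated Java and Python programs written for 2,033 LeetCode problems. It characterizes their code quality issues, analyzes the factors that affect correctness, and tests prompt-based self-repair guided by static analysis and runtime feedback.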

ABSTRACT

We systematically study the quality of 4,066 ChatGPT-generated code implemented in two popular programming languages, i.e., Java and Python, for 2,033 programming tasks. The goal of this work is three folds. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, time that tasks are introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments highlight that out of 4,066 programs generated by ChatGPT, 2,756 programs are deemed correct, 1,082 programs provide wrong outputs, and 177 programs contain compilation or runtime errors. Additionally, we further analyze other characteristics of the generated code through static analysis tools, such as code style and maintainability, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT's self-repairing ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.

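Research Motivation and Goals

Evaluate the correctness of ChatGPT's code generation across task difficulty, programming language, and task age (when the task was introduced).
Characterize the code quality issues that are prevalent in ChatGPT-generated code, using static analysis and runtime data.
Explore prompting strategies that use static analysis and runtime feedback to fix code quality issues.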

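Proposed Method

Build a time-aware benchmark of 2,033 LeetCode problems, with Python and Java templates and public test suites.
Generate code with ChatGPT for each problem and language (zero-shot, temperature 0).
Evaluate correctness as pass@1 against the LeetCode test suites.
Apply static analysis tools (Python: Pylint, Flake8; Java: PMD, Checkstyle) to classify code quality issues.
Use open card sorting to group the issues into four themes: compilation/runtime errors, wrong outputs, code style/maintainability, and performance/efficiency.
Evaluate self-repair ability by testing repair prompts that request a fix with and without static analysis or runtime feedback.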
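Two of the steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' pipeline: the `Submission` record, its fields, and the wording of the repair prompt are all hypothetical. The one fact it relies on is that, with a single sample per task (temperature 0), pass@1 reduces to the fraction of tasks whose program passes every test.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    """One generated program and its judged outcome (hypothetical record)."""
    task_id: str
    language: str           # "python" or "java"
    code: str
    passed_all_tests: bool  # verdict from the task's test suite


def pass_at_1(submissions: list[Submission]) -> float:
    """With one sample per task, pass@1 is the share of passing tasks."""
    if not submissions:
        return 0.0
    return sum(s.passed_all_tests for s in submissions) / len(submissions)


def build_repair_prompt(code: str, feedback: Optional[str] = None) -> str:
    """Assemble a repair request, with or without tool/runtime feedback.

    The wording is illustrative only; it is not the prompt used in the paper.
    """
    parts = ["Fix the code quality issues in the following program.", "", code]
    if feedback:
        # Feedback could be a static analysis warning or a runtime error message.
        parts += ["", "Feedback:", feedback]
    return "\n".join(parts)
```

Calling `build_repair_prompt(code)` gives the no-feedback variant, while passing a warning string as `feedback` gives the feedback-guided variant, so the two conditions differ only in the text appended to the prompt.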

Experimental Results

Research Questions

RQ1: How effective is ChatGPT at code generation for programming tasks?
RQ2: What issues commonly appear in ChatGPT-generated code?
RQ3: Can ChatGPT fix code quality issues when prompted?

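Key Results

Of the ChatGPT-generated programs, 66% in Python and 69% in Java are functionally correct (they pass all test cases).
Code quality issues appear even in passing code: 53% of passing Java programs and 37% of passing Python programs show style/maintainability issues.
ChatGPT can fix some issues when given static analysis and runtime error feedback, and its effectiveness varies by language and issue type.
According to static analysis, 47% of passing Java tasks and 63% of passing Python tasks have clean code, and cleanliness decreases as difficulty increases.
Overall, 1,930 of the 4,066 generated snippets show code style/maintainability issues, and 1,082 produce wrong outputs.
The study offers a roadmap for improving AI-based code generation and releases its dataset and replication package.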
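As a quick sanity check, the raw counts quoted in the abstract can be converted into shares of the 4,066 programs. The sketch below uses only those counts; the variable names are illustrative. Note that the maintainability count overlaps with the outcome categories, since passing programs can still have style issues.

```python
TOTAL = 4066  # programs generated across Java and Python

# Counts as quoted in the abstract.
counts = {
    "correct": 2756,
    "wrong output": 1082,
    "compilation or runtime error": 177,
    "maintainability issue": 1930,  # overlaps with the outcome categories
}

shares = {label: round(100 * n / TOTAL, 1) for label, n in counts.items()}
for label, pct in shares.items():
    print(f"{label}: {pct}%")

# The three outcome counts do not add up to the total on their own.
outcome_sum = (
    counts["correct"]
    + counts["wrong output"]
    + counts["compilation or runtime error"]
)
unaccounted = TOTAL - outcome_sum
print(f"not covered by the three outcome counts: {unaccounted}")
```

The overall correct share of about 67.8% sits between the per-language figures of 66% and 69%, as expected. The three outcome counts sum to 4,015, so 51 programs fall outside the categories named in the abstract; this summary does not say how they were classified.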


This review was generated by AI and checked by a human editor.