QUICK REVIEW

[Paper Review] Refining ChatGPT-Generated Code: Characterizing and Mitigating Code Quality Issues

Yue Liu, Thanh Le-Cong|arXiv (Cornell University)|Jul 24, 2023

Software Engineering Research | Citations: 15

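One-Sentence Summary

This paper systematically evaluates 4,066 ChatGPT-generated Java and Python programs for 2,033 LeetCode tasks, characterizes their code quality issues, analyzes the factors that affect correctness, and tests prompt-based self-repair driven by static analysis and runtime feedback.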

ABSTRACT

We systematically study the quality of 4,066 ChatGPT-generated code implemented in two popular programming languages, i.e., Java and Python, for 2,033 programming tasks. The goal of this work is three folds. First, we analyze the correctness of ChatGPT on code generation tasks and uncover the factors that influence its effectiveness, including task difficulty, programming language, time that tasks are introduced, and program size. Second, we identify and characterize potential issues with the quality of ChatGPT-generated code. Last, we provide insights into how these issues can be mitigated. Experiments highlight that out of 4,066 programs generated by ChatGPT, 2,756 programs are deemed correct, 1,082 programs provide wrong outputs, and 177 programs contain compilation or runtime errors. Additionally, we further analyze other characteristics of the generated code through static analysis tools, such as code style and maintainability, and find that 1,930 ChatGPT-generated code snippets suffer from maintainability issues. Subsequently, we investigate ChatGPT's self-repairing ability and its interaction with static analysis tools to fix the errors uncovered in the previous step. Experiments suggest that ChatGPT can partially address these challenges, improving code quality by more than 20%, but there are still limitations and opportunities for improvement. Overall, our study provides valuable insights into the current limitations of ChatGPT and offers a roadmap for future research and development efforts to enhance the code generation capabilities of AI models like ChatGPT.

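Motivation and Objectives

Evaluate ChatGPT's correctness with respect to task difficulty, programming language, and the time at which the task was introduced.
Characterize common code quality issues in ChatGPT-generated code using static analysis and runtime data.
Investigate prompting strategies that use static analysis and runtime feedback to fix code quality issues.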

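Proposed Method

Build a time-ordered benchmark of 2,033 LeetCode tasks with Python and Java templates and public test suites.
Generate code with ChatGPT (zero-shot, temperature 0) for each task and language.
Evaluate correctness as pass@1 against the LeetCode test suites.
Apply static analysis tools (Python: Pylint, Flake8; Java: PMD, Checkstyle) to classify code quality issues.
Use open card sorting to group the issues into four themes: compilation/runtime errors, wrong outputs, code style/maintainability, and performance/efficiency.
Test repair prompts with and without static analysis/runtime feedback to evaluate self-repair ability.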
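
The evaluation and repair steps above can be illustrated with a minimal sketch. It assumes one greedy sample per task, so pass@1 reduces to the share of programs that pass every test case. The `Submission` record, the field names, and the prompt wording are hypothetical and chosen only for illustration; they are not the prompts or data structures used in the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Submission:
    """One generated program and its evaluation outcome (hypothetical record)."""
    task_id: str
    language: str
    code: str
    tests_passed: int
    tests_total: int
    feedback: Optional[str] = None  # e.g. a static analysis warning or runtime error


def pass_at_1(submissions: List[Submission]) -> float:
    """Share of submissions that pass all of their test cases.

    With a single sample per task (temperature 0), this is pass@1.
    """
    if not submissions:
        return 0.0
    solved = sum(
        1 for s in submissions
        if s.tests_total > 0 and s.tests_passed == s.tests_total
    )
    return solved / len(submissions)


def build_repair_prompt(sub: Submission, with_feedback: bool) -> str:
    """Assemble a repair prompt, optionally appending tool feedback.

    The wording is an assumption for illustration, not the paper's prompt.
    """
    lines = [
        f"The following {sub.language} code has a quality issue. Please fix it.",
        "",
        sub.code,
    ]
    if with_feedback and sub.feedback:
        lines += ["", "Feedback from tools:", sub.feedback]
    return "\n".join(lines)


subs = [
    Submission("t1", "Python", "def f(): return 1", 3, 3),
    Submission("t2", "Python", "def g(): x = 1", 2, 3,
               feedback="W0612: Unused variable 'x'"),
]
print(pass_at_1(subs))                       # 0.5
print(build_repair_prompt(subs[1], True))
```

Comparing the prompt built with `with_feedback=True` against the one built with `with_feedback=False` mirrors the with/without-feedback comparison described in the last step.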

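Experimental Results

Research Questions

RQ1: How effective is ChatGPT at generating code for programming tasks?
RQ2: What are the common issues in ChatGPT-generated code?
RQ3: Can code quality issues be fixed through prompting?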

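Key Findings

66% of the ChatGPT-generated Python programs and 66% of the Java programs are functionally correct (pass all test cases).
Even among passing programs, 53% of the Java and 37% of the Python programs have code quality issues, chiefly in style and maintainability.
ChatGPT can partially fix issues when given static analysis and runtime error feedback, but effectiveness varies by language and issue type.
Static analysis reports clean code for 47% of passing Java tasks and 63% of passing Python tasks, and cleanliness decreases as task difficulty increases.
Overall, 1,930 of the 4,066 generated snippets have code style/maintainability issues and 1,082 produce wrong outputs.
The study provides a roadmap for improving AI-driven code generation and releases its dataset and replication package.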
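
As a quick arithmetic check on the counts reported above, the snippet below converts them to shares of the 4,066 generated programs and confirms that the "clean" percentages are the complements of the "has issues" percentages for passing code. The variable and function names are illustrative only.

```python
TOTAL_SNIPPETS = 4066  # generated programs across Java and Python


def share(count: int, total: int = TOTAL_SNIPPETS) -> float:
    """Percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


style_share = share(1930)   # snippets with style/maintainability issues
wrong_share = share(1082)   # snippets producing wrong outputs

# Passing programs with quality issues vs. clean passing programs.
clean_java = 100 - 53
clean_python = 100 - 37

print(style_share)   # 47.5
print(wrong_share)   # 26.6
print(clean_java, clean_python)   # 47 63
```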

This review was generated by AI and checked by a human editor.