QUICK REVIEW

[Paper Review] What's Wrong with Your Code Generated by Large Language Models? An Extensive Study

Shihan Dou, Haoxiang Jia | arXiv (Cornell University) | Jul 8, 2024

Hate Speech and Cyberbullying Detection | Cited by: 6

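One-Sentence Summary

This paper empirically analyzes code generated by seven LLMs across three benchmarks, builds a taxonomy of bugs, constructs a real-world benchmark, and proposes a self-critique method that fixes bugs without fine-tuning.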

ABSTRACT

The increasing development of LLMs in code generation has drawn significant attention among researchers. To enhance LLM-based code generation ability, current efforts are predominantly directed towards collecting high-quality datasets and leveraging diverse training technologies. However, there is a notable lack of comprehensive studies examining the limitations and boundaries of existing methods. To bridge this gap, we conducted an extensive empirical study evaluating the performance of three leading closed-source LLMs and four popular open-source LLMs on three commonly used benchmarks. Our investigation, which evaluated the length, cyclomatic complexity and API number of the generated code, revealed that these LLMs face challenges in generating successful code for more complex problems, and tend to produce code that is shorter yet more complicated as compared to canonical solutions. Additionally, we developed a taxonomy of bugs for incorrect codes that includes three categories and 12 sub-categories, and analyzed the root cause for common bug types. To better understand the performance of LLMs in real-world projects, we also manually created a real-world benchmark RWPB. We analyzed bugs on RWPB to highlight distinct differences in bug distributions between actual scenarios and existing benchmarks. Finally, we propose a novel training-free iterative method that introduces self-critique, enabling LLMs to critique and correct their generated code based on bug types and compiler feedback. Our comprehensive and extensive study provides insights into the current limitations of LLM-based code generation and opportunities for enhancing the accuracy and quality of the generated code.

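Motivation and Objectives

Evaluate the correctness and characteristics of code generated by leading closed-source and open-source LLMs on Python tasks.
Characterize the types and distribution of bugs in generated code.
Compare performance on standard benchmarks with performance on a manually curated real-world benchmark (RWPB).
Propose a training-free self-critique approach that mitigates bugs and raises the pass rate.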

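Proposed Method

Evaluate seven LLMs (three closed-source, four open-source) on 1,164 programming problems drawn from HumanEval+, MBPP+, and APPS+.
Measure the length, cyclomatic complexity, and API usage of the generated code and compare them with the canonical solutions.
Develop a two-stage bug annotation process, combining a script-based initial classification with manual review, that sorts bugs into three primary types and 12 sub-types.
Build a real-world benchmark (RWPB) from 140 GitHub tasks to compare real-world bug distributions with those of the standard benchmarks.
Introduce an iterative self-critique method, requiring no additional training, in which the LLM critiques and corrects its own code based on the bug taxonomy and compiler feedback.
Report the improvement in pass rate and analyze how task complexity affects LLM performance.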
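
The measurement step in the list above can be illustrated with a short sketch. This is not the authors' tooling: it approximates cyclomatic complexity by counting decision points in the Python AST and approximates API usage by counting call expressions. The function name `code_metrics`, the set of counted node types, and both sample snippets are assumptions made for illustration.

```python
import ast

# Node types treated as decision points. This selection is an assumption;
# dedicated tools such as radon or lizard apply somewhat different rules.
_BRANCH_NODES = (ast.If, ast.For, ast.While, ast.AsyncFor,
                 ast.ExceptHandler, ast.IfExp)


def code_metrics(source: str) -> dict:
    """Return non-blank line count, approximate cyclomatic complexity, and call count."""
    tree = ast.parse(source)
    complexity = 1  # straight-line code has exactly one path
    calls = 0
    for node in ast.walk(tree):
        if isinstance(node, ast.comprehension):
            # one loop plus one branch per `if` filter in the comprehension
            complexity += 1 + len(node.ifs)
        elif isinstance(node, _BRANCH_NODES):
            complexity += 1
        elif isinstance(node, ast.BoolOp):
            # each extra operand of and/or adds a short-circuit branch
            complexity += len(node.values) - 1
        elif isinstance(node, ast.Call):
            calls += 1  # rough proxy for "API number"
    lines = sum(1 for line in source.splitlines() if line.strip())
    return {"lines": lines, "complexity": complexity, "calls": calls}


# Hypothetical canonical solution and hypothetical generated solution.
canonical = """
def total_positive(xs):
    total = 0
    for x in xs:
        if x > 0:
            total += x
    return total
"""

generated = """
def total_positive(xs):
    return sum(x for x in xs if x > 0 and x is not None)
"""

print(code_metrics(canonical))  # {'lines': 6, 'complexity': 3, 'calls': 0}
print(code_metrics(generated))  # {'lines': 2, 'complexity': 4, 'calls': 1}
```

In this toy pair the generated version is shorter but scores higher on the approximate complexity measure, which is the kind of contrast the comparison with canonical solutions is meant to surface.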

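Experimental Results

Research Questions

RQ1: How effective are LLMs at code generation, and how does task complexity affect their performance?
RQ2: What are the root causes and distribution of bugs in LLM-generated code across the benchmarks?
RQ3: How can a real-world benchmark that minimizes data leakage be constructed, and how do real-world bugs compare with benchmark bugs?
RQ4: Does a training-free self-critique approach help mitigate bugs and improve the correctness of generated code?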

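Key Findings

Closed-source LLMs outperform open-source ones, most clearly on complex tasks (GPT-4 and Claude-3 lead; Phi-3 lags).
Generated code tends to be shorter than the canonical solutions but has higher cyclomatic complexity; API usage is similar.
Incorrect code often contains more comments than correct code, suggesting that comments correlate with complexity rather than correctness.
Functional bugs are the dominant problem, though syntax and runtime bugs also occur. Complex problems lead to timeouts and suboptimal algorithms.
On the real-world benchmark, Claude-3 reaches 45.7% accuracy and Phi-3 reaches 22%; the bug distribution on RWPB differs from that on the standard benchmarks.
The self-critique method raises the pass rate by 29.2% after two iterations, with no additional training.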
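
The bug triage and the two-iteration self-critique loop mentioned in these findings can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `triage`, `self_critique`, and `fake_llm_revise` are hypothetical names, the four labels are a simplification of the paper's taxonomy (timeouts are split out here only for clarity), and the model call is replaced by a stub. The generated code runs in a plain subprocess, which is not a sandbox.

```python
import subprocess
import sys
from typing import Callable


def triage(code: str, test: str, timeout: float = 5.0) -> tuple[str, str]:
    """Run code plus an assert-based test; return (bug_type, feedback)."""
    try:
        compile(code, "<generated>", "exec")  # catches syntax errors without running
    except SyntaxError as exc:
        return "syntax", f"SyntaxError: {exc}"
    try:
        # Assumption: a plain subprocess is acceptable here; untrusted code
        # would need real isolation.
        proc = subprocess.run(
            [sys.executable, "-c", code + "\n" + test],
            capture_output=True, text=True, timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return "timeout", f"no result within {timeout} seconds"
    if proc.returncode == 0:
        return "pass", ""
    stderr_lines = proc.stderr.strip().splitlines()
    last = stderr_lines[-1] if stderr_lines else "unknown error"
    # A failed assert means the code ran but gave a wrong answer.
    kind = "functional" if last.startswith("AssertionError") else "runtime"
    return kind, last


def self_critique(code: str, test: str,
                  revise: Callable[[str, str, str], str],
                  max_iters: int = 2) -> tuple[str, str]:
    """Feed bug type and interpreter feedback to `revise` for up to max_iters rounds."""
    for _ in range(max_iters):
        bug_type, feedback = triage(code, test)
        if bug_type == "pass":
            break
        code = revise(code, bug_type, feedback)
    return code, triage(code, test)[0]


def fake_llm_revise(code: str, bug_type: str, feedback: str) -> str:
    # Hypothetical stand-in for a model call that would receive the code,
    # the bug type, and the feedback in its prompt.
    return code.replace("a - b", "a + b")


buggy = "def add(a, b):\n    return a - b\n"
test = "assert add(2, 3) == 5\n"
fixed, status = self_critique(buggy, test, fake_llm_revise)
print(status)
```

Passing the bug type alongside the raw error message mirrors the idea of critiquing code against a taxonomy rather than against the interpreter output alone; how the prompt is actually worded in the paper is not covered by this review.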

This review was generated by AI and checked by a human editor.