QUICK REVIEW

[Paper Review] ClassEval: A Manually-Crafted Benchmark for Evaluating LLMs on Class-level Code Generation

Xueying Du, Mingwei Liu|arXiv (Cornell University)|Aug 3, 2023

Software Engineering Research | Cited by 21

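One-line summary

This paper introduces ClassEval, a manually constructed benchmark of 100 Python tasks for class-level code generation, and analyzes 11 state-of-the-art LLMs under three generation strategies. The study reveals a marked gap between function-level and class-level performance and shows that the best generation strategy depends on the model.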

ABSTRACT

In this work, we make the first attempt to evaluate LLMs in a more challenging code generation scenario, i.e. class-level code generation. We first manually construct the first class-level code generation benchmark ClassEval of 100 class-level Python code generation tasks with approximately 500 person-hours. Based on it, we then perform the first study of 11 state-of-the-art LLMs on class-level code generation. Based on our results, we have the following main findings. First, we find that all existing LLMs show much worse performance on class-level code generation than on standalone method-level code generation benchmarks like HumanEval; and the method-level coding ability cannot equivalently reflect the class-level coding ability among LLMs. Second, we find that GPT-4 and GPT-3.5 still exhibit dominant superiority over other LLMs on class-level code generation, and the second-tier models include Instruct-Starcoder, Instruct-Codegen, and Wizardcoder with very similar performance. Third, we find that generating the entire class all at once (i.e. holistic generation strategy) is the best generation strategy only for GPT-4 and GPT-3.5, while method-by-method generation (i.e. incremental and compositional) is the better strategy for the other models with limited ability of understanding long instructions and utilizing the middle information. Lastly, we find the limited model ability of generating method-dependent code and discuss the frequent error types in generated classes. Our benchmark is available at https://github.com/FudanSELab/ClassEval.

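Motivation and objectives

Build a challenging, manually constructed benchmark for class-level Python code generation in order to evaluate how well models generate interdependent class methods.
Ensure high test sufficiency by providing a dedicated unit-test suite that verifies the correctness of each generated class.
Evaluate a diverse set of LLMs under holistic, incremental, and compositional generation strategies to understand which strategy suits which model.
Analyze the error types and model limitations that arise when generating method-dependent, context-rich class code.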

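Proposed method

ClassEval: a hand-crafted set of 100 class-level Python tasks, comprising about 412 methods, with high test coverage.
Class skeletons designed on contract-programming principles, each containing class-level imports, a constructor, and method contracts.
Tests written at two levels, method-level and class-level, so that interactions between methods are also verified.
Canonical solutions written and validated against the full test suite to guarantee a high-quality baseline.
Pass@k used as the correctness metric, with 11 LLMs evaluated under three generation strategies (holistic, incremental, compositional).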
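
The review names Pass@k as the metric but does not say how it is computed. The sketch below assumes the standard unbiased estimator introduced alongside HumanEval, pass@k = 1 - C(n-c, k) / C(n, k), where n samples are drawn per task and c of them pass every test. The function names and the example counts are illustrative and do not come from the paper.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k for one task: n samples drawn, c of them pass all tests."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that a random size-k subset contains only failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average Pass@k over tasks; each item is (n_samples, n_correct)."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    return sum(scores) / len(scores)


# Illustrative counts only (not results from the paper):
# three tasks, five samples each, with 0, 1, and 5 passing samples.
example = [(5, 0), (5, 1), (5, 5)]
print(mean_pass_at_k(example, k=1))
```

Under this assumption a class counts as passing only if the whole generated class passes its tests, which is stricter than scoring each method separately.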

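Experimental results

Research questions

RQ1: How do LLMs perform on class-level code generation compared with function-level benchmarks such as HumanEval?
RQ2: How do the different generation strategies (holistic, incremental, compositional) affect LLM performance on class-level tasks?
RQ3: How well can LLMs generate code that depends on other context within the same class?
RQ4: What are the common error types when generating class-level code, and how do they differ across models?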
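
To make the notion of in-class dependency in RQ3 concrete, here is a small hypothetical class. It is not taken from ClassEval; the class name, methods, and behavior are invented purely to illustrate what a field dependency and a method dependency look like.

```python
class Inventory:
    """Hypothetical example class, not part of the benchmark."""

    def __init__(self) -> None:
        # Field shared by every method below.
        self.items: dict = {}

    def add(self, name: str, qty: int) -> None:
        # Field dependency: reads and writes self.items.
        self.items[name] = self.items.get(name, 0) + qty

    def remove(self, name: str, qty: int) -> bool:
        # Field dependency: succeeds only if enough stock is recorded.
        if self.items.get(name, 0) < qty:
            return False
        self.items[name] -= qty
        return True

    def transfer(self, other: "Inventory", name: str, qty: int) -> bool:
        # Method dependency: correct only if remove() and add() are correct.
        if not self.remove(name, qty):
            return False
        other.add(name, qty)
        return True


# A class-level check exercises several methods together.
source, target = Inventory(), Inventory()
source.add("bolt", 3)
assert source.transfer(target, "bolt", 2)
assert source.items["bolt"] == 1 and target.items["bolt"] == 2
```

A model that generates `transfer` in isolation has to know that `remove` and `add` exist and what they return, which is the kind of context a standalone function-level task never requires.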

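Key findings

All LLMs perform worse on class-level generation than on standalone method-level benchmarks.
GPT-4 and GPT-3.5 outperform the other models; Instruct-StarCoder, Instruct-CodeGen, and WizardCoder form a second tier with very similar performance.
Holistic generation works best for GPT-4 and GPT-3.5, whereas the incremental and compositional strategies benefit the other models, which are less able to follow long instructions and use information in the middle of the context.
Many methods in ClassEval depend on fields or other methods; the benchmark distinguishes standalone methods from those with library, field, or method dependencies.
The benchmark emphasizes high test sufficiency (statement-level and branch-level coverage both above 98%) so that the correctness of generated classes can be verified reliably.
The paper catalogs the frequent error types and discusses the models' limited ability to generate method-dependent code.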
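
The three generation strategies differ mainly in what the model sees on each call. The sketch below is a simplified reading of the abstract, not the paper's actual prompts: `generate` is a hypothetical stand-in for any prompt-in, code-out model call, and the prompt wording is invented.

```python
from typing import Callable, List

# Hypothetical model interface: takes a prompt, returns generated code.
Generate = Callable[[str], str]


def holistic(skeleton: str, generate: Generate) -> str:
    """One call: the model receives the full skeleton and writes the whole class."""
    return generate(f"Complete the whole class:\n{skeleton}")


def incremental(class_header: str, method_stubs: List[str], generate: Generate) -> str:
    """One call per method; each call sees the methods generated so far."""
    body = class_header
    for stub in method_stubs:
        prompt = f"Class so far:\n{body}\nImplement the next method:\n{stub}"
        body += "\n" + generate(prompt)
    return body


def compositional(class_header: str, method_stubs: List[str], generate: Generate) -> str:
    """One call per method; calls are independent and the outputs are assembled."""
    methods = [
        generate(f"Class skeleton:\n{class_header}\nImplement only this method:\n{stub}")
        for stub in method_stubs
    ]
    return "\n".join([class_header, *methods])
```

Holistic generation puts the longest instruction in front of the model in a single call, which fits the finding that only GPT-4 and GPT-3.5 handle it best, while the method-by-method strategies keep each prompt short.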


This review was generated by AI and checked by a human editor.