QUICK REVIEW

[Paper Review] A Preliminary Analysis on the Code Generation Capabilities of GPT-3.5 and Bard AI Models for Java Functions

Giuseppe Destefanis, Silvia Bartolucci|arXiv (Cornell University)|May 16, 2023

Scientific Computing and Data Management | Cited by 17

One-sentence summary

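This paper compares GPT-3.5 and Bard at generating Java code from CodingBat function descriptions. GPT-3.5 produced correct code for about 90.6% of the descriptions, while Bard did so for 53.1%.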

ABSTRACT

This paper evaluates the capability of two state-of-the-art artificial intelligence (AI) models, GPT-3.5 and Bard, in generating Java code given a function description. We sourced the descriptions from CodingBat.com, a popular online platform that provides practice problems to learn programming. We compared the Java code generated by both models based on correctness, verified through the platform's own test cases. The results indicate clear differences in the capabilities of the two models. GPT-3.5 demonstrated superior performance, generating correct code for approximately 90.6% of the function descriptions, whereas Bard produced correct code for 53.1% of the functions. While both models exhibited strengths and weaknesses, these findings suggest potential avenues for the development and refinement of more advanced AI-assisted code generation tools. The study underlines the potential of AI in automating and supporting aspects of software development, although further research is required to fully realize this potential.

Motivation and objectives

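Evaluate how well GPT-3.5 and Bard generate Java code from function descriptions taken from CodingBat.com.
Assess the correctness of the generated code with CodingBat.com's real-time test system.
Identify the categories in which each model does well or struggles, as guidance for AI-assisted code generation.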

Proposed method

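Collect 64 Java function descriptions from five CodingBat sections (Warmup, String-3, Array-3, Functional-2, Recursion-2).
Prompt GPT-3.5 and Bard with each description to generate Java code.
Check the correctness of the generated code against CodingBat.com's test cases.
Apply McNemar's test to compare the two models' success rates on the same set of function descriptions.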
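Because both models are scored on the same descriptions, the outcomes are paired, and McNemar's test looks only at the discordant pairs (tasks that exactly one model solved). The following is a minimal sketch of the exact (binomial) form of the test. The review does not say which variant of the test the authors used, and the pass/fail lists below are hypothetical illustration data, not results from the paper.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks solved only by the first model
    c: tasks solved only by the second model
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis each discordant pair favours either
    # model with probability 1/2, so the smaller count is Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical paired outcomes (True = passed all tests). NOT data from the paper.
gpt_pass = [True, True, True, False, True, True, True, True]
bard_pass = [True, False, False, False, True, False, True, False]

# Count the discordant pairs in each direction.
b = sum(g and not r for g, r in zip(gpt_pass, bard_pass))
c = sum(r and not g for g, r in zip(gpt_pass, bard_pass))
p_value = mcnemar_exact(b, c)
print(f"b={b}, c={c}, p={p_value:.4f}")
```

Tasks that both models solve or both fail do not enter the statistic, which is why the test suits a comparison on a shared task set.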

Experimental results

Research questions

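RQ1: How do GPT-3.5 and Bard compare at generating correct Java code from function descriptions?
RQ2: Does GPT-3.5 consistently outperform Bard across problem categories, and is the difference statistically significant?
RQ3: Which problem categories are hardest for both models?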

Key findings

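GPT-3.5 generated correct code for about 90.6% of the 64 descriptions, while Bard did so for 53.1%.
GPT-3.5 outperformed Bard in four of the five problem categories.
Bard struggled in the more complex categories, notably String-3, Array-3, and Recursion-2, but generated correct code for every Functional-2 description.
Both models solved every Functional-2 task. In each of the other categories, neither model solved all of the problems.
McNemar's test showed a statistically significant difference in success rates in favor of GPT-3.5 (p = 0.0001768).
The failure cases contrast GPT-3.5's correct approach with Bard's incorrect solution (e.g., front3).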
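The reported percentages can be turned back into task counts with simple arithmetic. The sketch below assumes both rates are rounded shares of the same 64 descriptions. The resulting counts are inferred here, not quoted from the paper, and the helper name `implied_correct` is made up for illustration.

```python
TOTAL_TASKS = 64  # number of function descriptions reported in the review


def implied_correct(rate_percent: float, total: int = TOTAL_TASKS) -> int:
    """Nearest whole number of tasks consistent with a rounded percentage."""
    return round(rate_percent / 100 * total)


gpt_correct = implied_correct(90.6)
bard_correct = implied_correct(53.1)

# Sanity check: the inferred counts should round back to the reported rates.
gpt_rate = round(gpt_correct / TOTAL_TASKS * 100, 1)
bard_rate = round(bard_correct / TOTAL_TASKS * 100, 1)

print(gpt_correct, bard_correct, gpt_correct - bard_correct)
print(gpt_rate, bard_rate)
```

Under that assumption the rates correspond to 58 and 34 correct solutions, a gap of 24 tasks out of 64.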

This review was generated by AI and checked by a human editor.