QUICK REVIEW

[Paper Review] A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages

Alessio Buscemi|arXiv (Cornell University)|Aug 8, 2023

Artificial Intelligence in Healthcare and Education | Cited by 16

One-line summary

This paper evaluates ChatGPT 3.5's ability to generate executable code using 40 coding tasks across 10 languages, and analyzes timing, code length, and limitations.

ABSTRACT

Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training using large datasets in order to understand and produce language that closely resembles that of humans. These models have reached a level of proficiency where they are capable of successfully completing university exams across several disciplines and generating functional code to handle novel problems. This research investigates the coding proficiency of ChatGPT 3.5, a LLM released by OpenAI in November 2022, which has gained significant recognition for its impressive text generating and code creation capabilities. The skill of the model in creating code snippets is evaluated across 10 various programming languages and 4 different software domains. Based on the findings derived from this research, major unexpected behaviors and limitations of the model have been identified. This study aims to identify potential areas for development and examine the ramifications of automated code generation on the evolution of programming languages and on the tech industry.

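Motivation and objectives

Evaluate the code generation ability of ChatGPT 3.5 across 10 programming languages.
Assess the execution success rate and time performance on 40 tasks.
Analyze the code length, variability, and practical limitations of automated code generation.
Identify language-dependent strengths and weaknesses, along with ethical and technical concerns.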

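Proposed method

Query ChatGPT 3.5 through the OpenAI API (Turbo model, role set to "software developer", temperature 1).
Use a fixed corpus of 40 tasks spanning four categories: DS, Games, Security, and Algos.
Run each task in 10 languages, with 10 trials per language (4,000 tests in total).
Post-process the output to extract the code, tests, and language-specific formatting, then classify each result into one of six statuses.
Measure the time for each task and language relative to the per-language average across tasks (P_l).
Record code length (LoC and NoC) to assess length and variability.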
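
The bookkeeping behind these steps can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the task and language names are hypothetical placeholders, LoC is assumed to count non-blank lines, NoC is assumed to count characters, and "relative to P_l" is assumed to mean a ratio against the per-language mean.

```python
from itertools import product
from statistics import mean

# Hypothetical placeholders: the review does not list the actual tasks or languages.
TASKS = [f"task_{i:02d}" for i in range(40)]
LANGUAGES = [f"lang_{i}" for i in range(10)]
TRIALS = 10


def build_test_grid(tasks, languages, trials):
    """Enumerate every (task, language, trial) combination."""
    return list(product(tasks, languages, range(trials)))


def code_length(source: str) -> tuple[int, int]:
    """Return (LoC, NoC).

    Assumption: LoC counts non-blank lines and NoC counts all characters.
    """
    loc = sum(1 for line in source.splitlines() if line.strip())
    noc = len(source)
    return loc, noc


def relative_times(times_by_task: dict[str, float]) -> dict[str, float]:
    """Express each task's time as a ratio to the mean over all tasks
    in one language (the review's P_l). The ratio form is an assumption."""
    p_l = mean(times_by_task.values())
    return {task: t / p_l for task, t in times_by_task.items()}


grid = build_test_grid(TASKS, LANGUAGES, TRIALS)
print(len(grid))  # 40 tasks x 10 languages x 10 trials
```

A ratio above 1 from `relative_times` would mean a task took longer than that language's average, which makes tasks comparable across languages with different baseline speeds.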

Figure 1: Status of the output generated by ChatGPT for the 4,000 tests, grouped by programming language and category.

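Experimental results

Research questions

RQ1: How well does ChatGPT 3.5 generate correct, executable code across different programming languages?
RQ2: Which language-dependent factors (level of abstraction, popularity in the training data) affect code generation quality and success rate?
RQ3: What are the timing profiles of the generated code, and how does code length vary across languages?
RQ4: How do limitations and ethical considerations in automated code generation arise across tasks and languages?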

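Key findings

Of the 4,000 tests, 1,833 (45.8%) produced executable code, and results varied by language.
Julia had the highest execution success rate at 81.5%, while C++ had the lowest at 7.3%.
High-level, dynamically typed languages generally fared better than low-level, statically typed ones, and a language's popularity in the training corpus also influenced performance.
Time performance differed by language: for example, palindromeInteger in C++ was the fastest at 4.83 seconds, and randomForest in C was the slowest at 140.7 seconds.
Code length (LoC/NoC) showed no clear correlation with execution time and varied widely across languages.
ChatGPT 3.5 showed notable limitations, including inconsistent task understanding, cases of inappropriate compliance with instructions, and ethical concerns on certain tasks.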
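
The headline success rate follows directly from the test counts reported above. The short check below reproduces that arithmetic; the function name is a hypothetical helper, and only the counts come from the review.

```python
TOTAL_TESTS = 40 * 10 * 10  # tasks x languages x trials
EXECUTABLE = 1833           # executable outputs reported in the review


def success_rate(successes: int, total: int) -> float:
    """Percentage of tests that produced executable code."""
    return 100.0 * successes / total


overall = success_rate(EXECUTABLE, TOTAL_TESTS)
tests_per_language = TOTAL_TESTS // 10  # each language gets 40 tasks x 10 trials

print(f"{overall:.1f}%")   # 45.8%
print(tests_per_language)  # 400
```

Each per-language rate is therefore a fraction of 400 tests, which is worth keeping in mind when comparing the extremes reported for Julia and C++.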

This review was generated by AI and checked by a human editor.