QUICK REVIEW

[Paper Review] A Comparative Study of Code Generation using ChatGPT 3.5 across 10 Programming Languages

Alessio Buscemi|arXiv (Cornell University)|2023. 08. 08.

Artificial Intelligence in Healthcare and Education | Citations: 16

One-Line Summary

This paper evaluates ChatGPT 3.5's ability to generate executable code in 10 languages using 40 coding tasks, and analyzes time, code length, and limitations.

ABSTRACT

Large Language Models (LLMs) are advanced Artificial Intelligence (AI) systems that have undergone extensive training using large datasets in order to understand and produce language that closely resembles that of humans. These models have reached a level of proficiency where they are capable of successfully completing university exams across several disciplines and generating functional code to handle novel problems. This research investigates the coding proficiency of ChatGPT 3.5, a LLM released by OpenAI in November 2022, which has gained significant recognition for its impressive text generating and code creation capabilities. The skill of the model in creating code snippets is evaluated across 10 various programming languages and 4 different software domains. Based on the findings derived from this research, major unexpected behaviors and limitations of the model have been identified. This study aims to identify potential areas for development and examine the ramifications of automated code generation on the evolution of programming languages and on the tech industry.

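Research Motivation and Goals

Evaluate the code generation ability of ChatGPT 3.5 across 10 programming languages.
Evaluate execution success rate and time performance on 40 coding tasks.
Analyze the code length, variability, and practical constraints of automated code generation.
Identify language-dependent strengths and weaknesses, as well as ethical and technical concerns.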

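Proposed Method

Query ChatGPT 3.5 through the OpenAI API (Turbo model, role set to 'software developer', temperature 1).
Use a fixed corpus of 40 tasks spanning the DS, Games, Security, and Algos categories.
Test each task in 10 languages, with 10 runs per task and language (4,000 tests in total).
Post-process the output to extract the code, tests, and language-specific formatting, and classify each result into one of 6 statuses.
Measure time per task and language against the per-language task average (P_l).
Record LoC and NoC to assess code length and variability.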
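
The size of the test grid described above can be checked with a short Python sketch. It assumes the grid is a full cross product of tasks, languages, and repeated runs; the function name `build_test_grid` and the use of integer indices in place of real task and language names are illustrative choices, not taken from the paper.

```python
from collections import Counter
from itertools import product

# Counts stated in the review: 40 tasks, 10 languages, 10 runs each.
N_TASKS = 40
N_LANGUAGES = 10
N_RUNS = 10


def build_test_grid(n_tasks=N_TASKS, n_languages=N_LANGUAGES, n_runs=N_RUNS):
    """Return every (task, language, run) index triple in the grid.

    Hypothetical helper: indices stand in for the actual task and
    language names, which this sketch does not assume.
    """
    return list(product(range(n_tasks), range(n_languages), range(n_runs)))


grid = build_test_grid()
total_tests = len(grid)

# Number of tests that fall on each language index.
tests_per_language = Counter(language for _, language, _ in grid)

print(total_tests)
print(tests_per_language[0])
```

Under this assumption each language receives 400 tests, which is the denominator behind any per-language success rate.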

Figure 1: Status of the output generated by ChatGPT for the 4,000 tests, grouped by programming language and category.

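Experimental Results

Research Questions

RQ1: How well does ChatGPT 3.5 generate correct, executable code across different programming languages?
RQ2: Which language-dependent factors (such as level of abstraction and popularity in the training data) affect code generation quality and success rate?
RQ3: What are the time profiles and code length characteristics of the generated code across languages?
RQ4: What limitations and ethical considerations arise in automated code generation across tasks and languages?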

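Key Findings

Of the 4,000 tests, 1,833 (45.8%) produced executable code, and results differed by language.
Julia had the highest execution success rate (81.5%) and C++ the lowest (7.3%).
High-level, dynamically typed languages generally outperformed low-level, statically typed ones, and a language's popularity in the training corpus also affected performance.
Time performance varied by language; for example, palindromeInteger in C++ was the fastest (4.83 s) and randomForest in C the slowest (140.7 s).
Code length (LoC/NoC) showed no clear correlation with execution time and showed greater variability across languages.
ChatGPT 3.5 showed notable limitations, including inconsistent task understanding, partial non-compliance with instructions, and ethical issues on certain tasks.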
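
The overall success rate reported in the findings follows directly from the two counts given there. A minimal Python sketch of that arithmetic, where `success_rate` is a hypothetical helper name and only the counts 1,833 and 4,000 come from the review:

```python
def success_rate(executable: int, total: int) -> float:
    """Return the share of executable outputs as a percentage."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * executable / total


# Counts reported in the review: 1,833 executable outputs out of 4,000 tests.
overall = success_rate(1833, 4000)
overall_rounded = round(overall, 1)

print(f"{overall_rounded}%")
```

The same helper could be applied per language if the per-language counts were available; the review reports those only as percentages (81.5% for Julia, 7.3% for C++).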

This review was generated by AI and reviewed by a human editor.