QUICK REVIEW

[Paper Review] CodeGen: An Open Large Language Model for Code with Multi-Turn Program Synthesis

Erik Nijkamp, Bo Pang|arXiv (Cornell University)|2022. 03. 25.

Software Engineering Research|Citations: 234

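One-Line Summary

The CodeGen paper releases a family of open-source LLMs of up to 16.1B parameters trained on natural language and programming data, demonstrates multi-turn program synthesis, and introduces the Multi-Turn Programming Benchmark (MTPB).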

ABSTRACT

Program synthesis strives to generate a computer program as a solution to a given problem specification, expressed with input-output examples or natural language descriptions. The prevalence of large language models advances the state-of-the-art for program synthesis, though limited training resources and data impede open access to such models. To democratize this, we train and release a family of large language models up to 16.1B parameters, called CODEGEN, on natural language and programming language data, and open source the training library JAXFORMER. We show the utility of the trained model by demonstrating that it is competitive with the previous state-of-the-art on zero-shot Python code generation on HumanEval. We further investigate the multi-step paradigm for program synthesis, where a single program is factorized into multiple prompts specifying subproblems. To this end, we construct an open benchmark, Multi-Turn Programming Benchmark (MTPB), consisting of 115 diverse problem sets that are factorized into multi-turn prompts. Our analysis on MTPB shows that the same intent provided to CODEGEN in multi-turn fashion significantly improves program synthesis over that provided as a single turn. We make the training library JAXFORMER and model checkpoints available as open source contribution: https://github.com/salesforce/CodeGen.

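Research Motivation and Goals

Democratize access to large code models by releasing an open-source training library and model checkpoints.
Investigate whether multi-turn natural-language specifications improve program synthesis compared with single-turn prompts.
Quantitatively analyze how model scale and data scale affect multi-turn program synthesis capability.
Introduce and validate the Multi-Turn Programming Benchmark (MTPB) for evaluating multi-turn synthesis performance.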

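Proposed Method

Train autoregressive transformers on a mix of natural-language and programming-language corpora (ThePile, BigQuery, BigPython).
Train sequentially: CodeGen-NL on ThePile, then CodeGen-Multi on BigQuery, then CodeGen-Mono on BigPython.
Evaluate single-turn program synthesis on HumanEval and compare against open baselines and Codex-style models.
Propose a multi-turn prompting framework and construct the 115-problem MTPB, in which prompts are interleaved with subprograms.
Evaluate prompt understanding using prompt perplexity as a proxy for how well the model understands user intent.
Open-source the training library JAXformer and provide model checkpoints for reproducibility.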
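
The interleaving of prompts and subprograms in the multi-turn framework can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `generate` callable, the comment-style prompt formatting, and the `fake_generate` stand-in are all assumptions introduced here, and a real setup would call a CodeGen checkpoint in place of the stub.

```python
from typing import Callable, List


def multi_turn_synthesize(prompts: List[str], generate: Callable[[str], str]) -> str:
    """Interleave each natural-language prompt with the subprogram generated for it."""
    context = ""
    for prompt in prompts:
        # Assumption: each turn's prompt is written as a Python comment.
        context += f"# {prompt}\n"
        # The model sees all earlier prompts and subprograms before completing this turn.
        subprogram = generate(context)
        context += subprogram.rstrip("\n") + "\n"
    return context


def single_turn_prompt(prompts: List[str]) -> str:
    """Baseline: concatenate every prompt into one specification with no code in between."""
    return "".join(f"# {p}\n" for p in prompts)


def fake_generate(context: str) -> str:
    """Hypothetical stand-in for a model call; emits one trivial line per turn."""
    turn = sum(1 for line in context.splitlines() if line.startswith("#"))
    return f"step_{turn} = {turn}"


prompts = ["Define a list of numbers", "Sum the list"]
program = multi_turn_synthesize(prompts, fake_generate)
baseline = single_turn_prompt(prompts)
```

With the stub, `program` alternates comment lines and generated lines, while `baseline` contains only the two comment lines. The difference between the two contexts is the comparison the multi-turn versus single-turn experiments rest on.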

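Experimental Results

Research Questions

RQ1: Can large language models trained on natural language and code exhibit emergent multi-turn program synthesis capability as model and data scale grow?
RQ2: Does decomposing user intent into multiple natural-language turns improve program synthesis quality compared with a single-turn specification?
RQ3: How does the performance of the multi-turn paradigm vary with model size and the amount of code data?
RQ4: How does prompt perplexity affect the success rate of generated programs?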
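
For RQ4, prompt perplexity can be computed from the per-token log-probabilities a model assigns to the prompt tokens. The sketch below uses the standard definition (the exponential of the negative mean log-probability). It assumes natural-log probabilities are already available, and the function name is hypothetical. This review does not specify how the paper conditions or aggregates the values across turns.

```python
import math
from typing import Sequence


def prompt_perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity = exp(-mean natural-log probability) over the prompt tokens."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Illustrative values only: a prompt whose tokens each receive probability 0.5
# versus one whose tokens each receive probability 0.1.
confident = prompt_perplexity([math.log(0.5)] * 4)
uncertain = prompt_perplexity([math.log(0.1)] * 4)
```

A lower value means the model finds the prompt more predictable, which is the sense in which perplexity serves as a proxy for understanding user intent.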

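Key Results

CodeGen models perform competitively with or better than open-source baselines on Python code generation, and the larger monolingual Python models approach or surpass some Codex variants.
Multilingual training (CodeGen-Multi) yields substantial gains over the natural-language-only models, and Python-focused fine-tuning (CodeGen-Mono) improves synthesis performance further.
Across model sizes, multi-turn prompting substantially improves the pass rate over concatenating the same prompts into a single turn, especially on harder problems.
Prompt perplexity correlates with success: prompts with lower perplexity tend to yield higher functional correctness.
Program synthesis capability emerges and scales with model size and data size, suggesting a scaling law for multi-turn code generation.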
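
This review does not name the metric behind the HumanEval comparison. HumanEval results are conventionally reported as pass@k, so a sketch of the unbiased pass@k estimator popularized with that benchmark is given here as background, under that assumption. The function name and argument names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimate of pass@k from n samples, c of which pass the unit tests."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    # 1 minus the probability that a random size-k subset has no passing sample.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative counts only, not results from the paper.
estimate = pass_at_k(n=10, c=3, k=1)
```

For k = 1 the estimator reduces to the fraction of samples that pass, which is why it is often read as a simple per-problem pass rate.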

This review was generated by AI and reviewed by a human editor.