QUICK REVIEW

[Paper Review] Program Synthesis with Large Language Models

Jacob Austin | arXiv (Cornell University) | August 16, 2021

Software Engineering Research | References: 94 | Citations: 28

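One-Line Summary

This paper evaluates large Transformer language models with up to 137B parameters on Python code synthesis using two benchmarks (MBPP and MathQA-Python), shows that performance improves with scale in both the few-shot and fine-tuning regimes, explores dialog-based human feedback, and analyzes the limits of semantic grounding and execution prediction.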

ABSTRACT

This paper explores the limits of the current generation of large language models for program synthesis in general purpose programming languages. We evaluate a collection of such models (with between 244M and 137B parameters) on two new benchmarks, MBPP and MathQA-Python, in both the few-shot and fine-tuning regimes. Our benchmarks are designed to measure the ability of these models to synthesize short Python programs from natural language descriptions. The Mostly Basic Programming Problems (MBPP) dataset contains 974 programming tasks, designed to be solvable by entry-level programmers. The MathQA-Python dataset, a Python version of the MathQA benchmark, contains 23914 problems that evaluate the ability of the models to synthesize code from more complex text. On both datasets, we find that synthesis performance scales log-linearly with model size. Our largest models, even without finetuning on a code dataset, can synthesize solutions to 59.6 percent of the problems from MBPP using few-shot learning with a well-designed prompt. Fine-tuning on a held-out portion of the dataset improves performance by about 10 percentage points across most model sizes. On the MathQA-Python dataset, the largest fine-tuned model achieves 83.8 percent accuracy. Going further, we study the model's ability to engage in dialog about code, incorporating human feedback to improve its solutions. We find that natural language feedback from a human halves the error rate compared to the model's initial prediction. Additionally, we conduct an error analysis to shed light on where these models fall short and what types of programs are most difficult to generate. Finally, we explore the semantic grounding of these models by fine-tuning them to predict the results of program execution. We find that even our best models are generally unable to predict the output of a program given a specific input.

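Research Motivation and Goals

Investigate the ability of general-purpose large language models to synthesize short Python programs from natural language descriptions.
Build two Python code synthesis datasets, MBPP and MathQA-Python, to evaluate different linguistic and programming challenges.
Evaluate performance in the few-shot and fine-tuning settings across model scales (244M to 137B parameters).
Investigate how dialog and human feedback affect the quality of synthesized code.
Assess semantic grounding by attempting to predict program output from a given input, and analyze the limitations.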

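Proposed Method

Use dense, left-to-right, decoder-only Transformer language models trained on a large and broad collection of web data that includes sources containing code.
Evaluate synthesis with few-shot prompting and with fine-tuning on task-specific datasets (MBPP: 374 fine-tuning examples; MathQA-Python: a larger fine-tuning set).
Generate multiple samples per problem using temperature sampling, then execute the code and check it against the test cases to verify functional correctness.
Run prompt-design experiments that vary the number and choice of examples in the prompt to assess prompt sensitivity and the potential benefit of prompt tuning.
Analyze error types (runtime errors, syntax errors, failed test assertions) and examine how model size affects the error distribution.
Consider possible overlap between the prompts and the pre-training data, and run adversarial-style checks to assess generalization beyond the prompt and test configuration.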
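The execution-based check and the error categories described above can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation harness: the function names `classify_sample` and `is_solved`, the toy `add` task, and the rule that a problem counts as solved when any sample passes every test are assumptions made for this example. It also runs candidates in-process with no sandbox or timeout, which a real harness would need.

```python
from typing import List, Sequence


def classify_sample(program: str, test_asserts: Sequence[str]) -> str:
    """Run one candidate program against assert-style tests and label the outcome.

    Hypothetical helper: the labels mirror the three error types named in the
    review (syntax, runtime, failed test assertion) plus "passed".
    """
    try:
        compiled = compile(program, "<candidate>", "exec")
    except SyntaxError:
        return "syntax_error"

    namespace: dict = {}
    try:
        exec(compiled, namespace)      # define the candidate's functions
        for test in test_asserts:
            exec(test, namespace)      # each test is a single assert statement
    except AssertionError:
        return "assertion_failure"
    except Exception:
        return "runtime_error"
    return "passed"


def is_solved(samples: Sequence[str], test_asserts: Sequence[str]) -> bool:
    """Assumed criterion: solved if at least one sample passes all tests."""
    return any(classify_sample(s, test_asserts) == "passed" for s in samples)


# Toy task (not from MBPP): "write a function add(a, b) that returns the sum".
tests: List[str] = ["assert add(1, 2) == 3", "assert add(0, 0) == 0"]
samples: List[str] = [
    "def add(a, b):\n    return a - b\n",   # wrong result
    "def add(a, b)\n    return a + b\n",    # missing colon
    "def add(a, b):\n    return a + c\n",   # undefined name at call time
    "def add(a, b):\n    return a + b\n",   # correct
]

outcomes = [classify_sample(s, tests) for s in samples]
```

Counting the labels in `outcomes` per model size would give the kind of error distribution the review mentions; the tallies themselves depend entirely on the samples supplied.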

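Experimental Results

Research Questions

RQ1: How well do large language models synthesize Python programs from natural language descriptions across MBPP and MathQA-Python?
RQ2: How does model size affect few-shot and fine-tuned program synthesis performance?
RQ3: Can dialog and human feedback meaningfully improve synthesis accuracy?
RQ4: To what extent are the models semantically grounded in code, as measured by predicting program output for a given input?
RQ5: How robust are generated programs to adversarial or extended test cases beyond the tests in the prompt?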

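Key Results

Synthesis performance scales log-linearly with model size; the largest model solves up to 59.6% of MBPP problems with few-shot prompting and reaches about 83.8% accuracy on MathQA-Python after fine-tuning.
Fine-tuning generally yields a gain of about 10 percentage points on MBPP across model sizes, and a larger gain on MathQA-Python with its more extensive fine-tuning.
Dialog-based human feedback roughly halves the error rate, raising few-shot performance from about 30% to 65% over four turns of interaction.
The models tend to generalize to held-out test cases rather than simply parroting the prompt; some failures occur under adversarial test scenarios, but overlap between the pre-training data and the MBPP test set is relatively minimal.
Even the best models have limited semantic grounding and cannot reliably predict the result of executing a program on a specific input, which suggests a gap between generating syntax and genuine understanding.
BLEU score correlates poorly with synthesis success; the sampling strategy (temperature) has a large effect on performance, and under a strict evaluation budget greedy decoding is more effective.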
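The "halves the error rate" claim follows from the two success rates quoted above. As a small worked check, using the rounded 30% and 65% figures from this review (the helper name `error_rate_pct` is made up for the example):

```python
def error_rate_pct(success_pct: int) -> int:
    """Error rate in percent, given the percentage of problems solved."""
    return 100 - success_pct


success_before = 30   # approximate success rate before human feedback
success_after = 65    # approximate success rate after four dialog turns

error_before = error_rate_pct(success_before)   # 70
error_after = error_rate_pct(success_after)     # 35

# Ratio of remaining errors to initial errors: 35 / 70 = 0.5
remaining_error_fraction = error_after / error_before
```

The success rate more than doubles (30% to 65%), while the error rate drops from 70% to 35%, which is the halving the abstract refers to.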


This review was generated by AI and reviewed by a human editor.