QUICK REVIEW

[Paper Review] Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation

Zhiqiang Yuan, Junwei Liu|arXiv (Cornell University)|2023. 08. 02.

Software Engineering Research|Citations: 18

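One-Line Summary

This study evaluates 10 open-source instruction-tuned LLMs on four code-related tasks (defect detection, clone detection, assertion generation, and code summarization) in zero-shot, few-shot, and fine-tuning settings, and finds strong zero-shot performance, unstable few-shot behavior, and notable cost implications.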

ABSTRACT

In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension and generation tasks. We have the following main findings. First, for the zero-shot setting, instructed LLMs are very competitive on code comprehension and generation tasks and sometimes even better than small SOTA models specifically fine-tuned on each downstream task. We also find that larger instructed LLMs are not always better on code-related tasks. Second, for the few-shot setting, we find that adding demonstration examples substantially helps instructed LLMs perform better on most code comprehension and generation tasks; however, the examples would sometimes induce unstable or even worse performance. Furthermore, we find widely-used BM25-based shot selection strategy significantly outperforms the basic random selection or fixed selection only on generation problems. Third, for the fine-tuning setting, we find that fine-tuning could further improve the model performance on downstream code comprehension and generation tasks compared to the zero-shot/one-shot performance. In addition, after being fine-tuned on the same downstream task dataset, instructed LLMs outperform both the small SOTA models and similar-scaled LLMs without instruction tuning. Based on our findings, we further present practical implications on model and usage recommendation, performance and cost trade-offs, and future direction.

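Motivation and Objectives

Evaluate the zero-shot generalization of instruction-tuned LLMs on code-related tasks.
Evaluate few-shot in-context learning and shot selection strategies for code tasks.
Analyze the effect of fine-tuning on downstream code comprehension and generation tasks.
Provide practical guidance on model selection, cost-performance trade-offs, and future directions in code intelligence.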

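Proposed Method

Compare 10 open-source instruction-tuned LLMs (6B–16B) on four code tasks using standardized prompts.
Use three settings: zero-shot, one-shot (with three shot selection strategies: random, fixed, and BM25-based), and task-specific fine-tuning with LoRA.
Apply task-specific prompts for defect detection, clone detection, assertion generation, and code summarization.
Measure performance with task-appropriate metrics (accuracy, F1, exact match), and use ChatGPT as a judge for code summarization.
Evaluate memory and time costs during fine-tuning and inference.
Introduce a sampling scheme for the datasets (train/val/test) and a standard prompt design for each model.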
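
The three automatic metrics named above can be sketched in a few lines of plain Python. This is only an illustration, not the paper's evaluation code: the function names, the choice of label `1` as the positive class, and the whitespace normalization in `exact_match` are all assumptions made here.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of predictions equal to the gold label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        return 0.0
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def binary_f1(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> float:
    """Harmonic mean of precision and recall for the positive class."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # precision and recall are both 0 (or undefined)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def exact_match(references: Sequence[str], predictions: Sequence[str]) -> float:
    """Share of generated strings identical to the reference.

    Assumption: runs of whitespace are collapsed before comparing.
    """
    if len(references) != len(predictions):
        raise ValueError("references and predictions must have the same length")
    if not references:
        return 0.0

    def normalize(text: str) -> str:
        return " ".join(text.split())

    hits = sum(normalize(r) == normalize(p) for r, p in zip(references, predictions))
    return hits / len(references)
```

Accuracy and F1 suit the classification-style tasks, while exact match suits short generated outputs such as assertions. Free-form summaries are harder to score by string equality, which is presumably why the study also uses ChatGPT as a judge for code summarization.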

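Experimental Results

Research Questions

RQ1: How do instruction-tuned LLMs perform on code comprehension and generation tasks in the zero-shot setting?
RQ2: How do instruction-tuned LLMs perform in the few-shot setting, and what effect do shot selection strategies have?
RQ3: How do they perform after further fine-tuning on downstream tasks?
RQ4: What are the memory and time costs of using instruction-tuned LLMs during fine-tuning and inference?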

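Key Results

DD = defect detection, CD = clone detection, AG = assertion generation, CS = code summarization.

Model	DD (%)	CD (%)	AG (%)	CS (%)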
CodeGen-6B	0.3	1.4	0.0	0.0
ChatGLM-6B	7.1	17.5	1.7	45.0
Vicuna-7B	54.0	13.2	10.1	48.0
Alpaca-7B	45.8	22.1	5.3	32.0
Dolly-7B	33.1	21.3	1.9	12.0
StableLM-7B	44.3	24.3	1.1	30.0
CodeAlpaca-7B	51.9	1.4	4.4	9.0
Dolly-12B	33.8	23.5	1.0	5.0
Vicuna-13B	49.8	14.1	12.0	63.0
WizardCoder-15B	54.4	23.8	19.4	71.0
Instruct-CodeGen-16B	47.8	14.2	8.4	9.0
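
To read the table more easily, the short script below finds the top-scoring model in each column and checks whether the largest model leads anywhere. The scores are copied from the table as listed. Everything else is a convenience assumed here, including the helper names and the way model size is parsed from the name suffix.

```python
import re

TASKS = ("DD", "CD", "AG", "CS")

# Scores (%) copied from the table above, in column order DD, CD, AG, CS.
SCORES = {
    "CodeGen-6B": (0.3, 1.4, 0.0, 0.0),
    "ChatGLM-6B": (7.1, 17.5, 1.7, 45.0),
    "Vicuna-7B": (54.0, 13.2, 10.1, 48.0),
    "Alpaca-7B": (45.8, 22.1, 5.3, 32.0),
    "Dolly-7B": (33.1, 21.3, 1.9, 12.0),
    "StableLM-7B": (44.3, 24.3, 1.1, 30.0),
    "CodeAlpaca-7B": (51.9, 1.4, 4.4, 9.0),
    "Dolly-12B": (33.8, 23.5, 1.0, 5.0),
    "Vicuna-13B": (49.8, 14.1, 12.0, 63.0),
    "WizardCoder-15B": (54.4, 23.8, 19.4, 71.0),
    "Instruct-CodeGen-16B": (47.8, 14.2, 8.4, 9.0),
}


def size_in_billions(name: str) -> int:
    """Parse the parameter count from a name such as 'Vicuna-13B' (assumed naming scheme)."""
    match = re.search(r"-(\d+)B$", name)
    if match is None:
        raise ValueError(f"cannot parse model size from {name!r}")
    return int(match.group(1))


def best_per_task(scores: dict) -> dict:
    """Map each task to the model with the highest score in that column."""
    return {
        task: max(scores, key=lambda name: scores[name][i])
        for i, task in enumerate(TASKS)
    }


best = best_per_task(SCORES)
largest = max(SCORES, key=size_in_billions)

for task, name in best.items():
    print(f"{task}: {name} ({SCORES[name][TASKS.index(task)]}%)")
print(f"largest model: {largest}; leads a column: {largest in best.values()}")
```

With these numbers, WizardCoder-15B leads on DD, AG, and CS, StableLM-7B leads on CD, and the largest listed model, Instruct-CodeGen-16B, leads on none. This fits the finding that a larger model is not always better.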

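In the zero-shot setting, instruction-tuned LLMs are competitive with, and sometimes outperform, small SOTA models on several tasks, and a larger model size does not guarantee better zero-shot performance.
Few-shot demonstrations improve performance overall, but can cause instability or degradation as the input grows longer; BM25-based shot selection helps on generation tasks but is not markedly better on classification tasks.
Fine-tuning with LoRA further improves task performance, and fine-tuned instruction-tuned LLMs outperform both small SOTA models and similarly sized LLMs without instruction tuning.
Among models of similar size, memory cost is not always higher than that of small SOTA models, but time cost can be substantially higher in both fine-tuning and inference.
This study offers practical guidance on model selection, shot strategy, and cost-performance trade-offs for code-related tasks.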
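
The three one-shot selection strategies compared in the study (fixed, random, and BM25-based) can be illustrated with a small self-contained sketch. It is not the authors' implementation. The tokenizer, the `k1` and `b` defaults, the non-negative IDF variant, and the function names are assumptions chosen for illustration, and query terms are counted once each as a simplification.

```python
import math
import random
import re
from collections import Counter
from typing import List, Optional


def tokenize(code: str) -> List[str]:
    """Split code into lowercase identifier and number tokens (assumed tokenizer)."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code.lower())


def bm25_scores(query: str, corpus: List[str], k1: float = 1.5, b: float = 0.75) -> List[float]:
    """Okapi BM25 score of every corpus document against the query."""
    docs = [tokenize(doc) for doc in corpus]
    n_docs = len(docs)
    if n_docs == 0:
        return []
    avg_len = sum(len(doc) for doc in docs) / n_docs
    if avg_len == 0:
        return [0.0] * n_docs
    # Number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    query_terms = set(tokenize(query))  # simplification: each query term counted once

    scores = []
    for doc in docs:
        tf = Counter(doc)
        length_norm = k1 * (1 - b + b * len(doc) / avg_len)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            # IDF variant with +1 inside the log, so it never goes negative.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


def select_shot(query: str, pool: List[str], strategy: str,
                rng: Optional[random.Random] = None) -> int:
    """Return the index of the demonstration example to place in the prompt."""
    if not pool:
        raise ValueError("demonstration pool is empty")
    if strategy == "fixed":
        return 0  # always the same example
    if strategy == "random":
        return (rng or random.Random()).randrange(len(pool))
    if strategy == "bm25":
        scores = bm25_scores(query, pool)
        return max(range(len(pool)), key=scores.__getitem__)
    raise ValueError(f"unknown strategy: {strategy!r}")


pool = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def multiply(a, b): return a * b",
]
query = "def load_text(path): return open(path).read()"
print(select_shot(query, pool, "bm25"))
```

For this toy pool the BM25 strategy picks index 1, the file-reading example, because it shares the most informative tokens with the query. Retrieving a lexically similar demonstration is one plausible reason the strategy helps more on generation tasks, where the example's output can resemble the target.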

This review was generated by AI and reviewed by a human editor.