[Paper Review] WizardCoder: Empowering Code Large Language Models with Evol-Instruct
WizardCoder adapts Evol-Instruct to the code domain and uses it to instruction-tune a Code LLM (StarCoder).
Code Large Language Models (Code LLMs), such as StarCoder, have demonstrated exceptional performance in code-related tasks. However, most existing models are solely pre-trained on extensive raw code data without instruction fine-tuning. In this paper, we introduce WizardCoder, which empowers Code LLMs with complex instruction fine-tuning, by adapting the Evol-Instruct method to the domain of code. Through comprehensive experiments on four prominent code generation benchmarks, namely HumanEval, HumanEval+, MBPP, and DS-1000, we unveil the exceptional capabilities of our model. It surpasses all other open-source Code LLMs by a substantial margin. Moreover, our model even outperforms the largest closed LLMs, Anthropic's Claude and Google's Bard, on HumanEval and HumanEval+. Our code, model weights, and data are public at https://github.com/nlpxucan/WizardLM
Research Motivation and Goals
- Enhance Code LLMs with instruction tuning tailored to code tasks, since most existing models are only pre-trained on raw code.
- Leverage Evol-Instruct to generate more complex, diverse, code-focused instruction data.
- Show that enhanced instruction fine-tuning improves code generation benchmarks over baselines.
Proposed Method
- Adapt Evol-Instruct to code by refining the evolution prompts and adding code-specific constraints (debugging, time/space complexity).
- Start from StarCoder 15B and evolve Code Alpaca data to ~78k samples.
- Fine-tune StarCoder on the evolved data for 200 steps with batch size 512, sequence length 2048, learning rate 2e-5, and fp16 precision.
- Evaluate using HumanEval, HumanEval+, MBPP, and DS-1000 with greedy decoding and standard prompts.
- Compare against open-source and closed-source baselines to assess pass@1 and DS-1000 scores.
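The data-evolution step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the heuristic strings and `PROMPT_TEMPLATE` are hypothetical paraphrases of the constraint types this review mentions, `echo_rewrite` is a placeholder for a real LLM call, and merging every round with the seed data is an assumption made here.

```python
import random
from typing import Callable, List

# Hypothetical heuristics paraphrasing the constraint types named in the
# review (debugging, time/space complexity). NOT the paper's prompt text.
EVOLUTION_HEURISTICS: List[str] = [
    "Add a new constraint or requirement to the original problem.",
    "Provide a piece of erroneous code that has to be debugged.",
    "Propose a higher time or space complexity requirement.",
]

# Hypothetical prompt layout; the instruction is always the last line(s).
PROMPT_TEMPLATE = (
    "Please increase the difficulty of the given programming question.\n"
    "Method: {heuristic}\n\n"
    "Question:\n{instruction}\n"
)


def build_evolution_prompt(instruction: str, rng: random.Random) -> str:
    """Wrap one instruction in an evolution prompt with a random heuristic."""
    heuristic = rng.choice(EVOLUTION_HEURISTICS)
    return PROMPT_TEMPLATE.format(heuristic=heuristic, instruction=instruction)


def evolve_dataset(
    seed: List[str],
    rewrite: Callable[[str], str],
    rounds: int,
    rng_seed: int = 0,
) -> List[str]:
    """Evolve `seed` for `rounds` rounds and return seed plus all rounds.

    `rewrite` maps an evolution prompt to a harder instruction; in practice
    it would call an LLM. Merging all rounds is an assumption of this sketch.
    """
    if rounds < 0:
        raise ValueError("rounds must be non-negative")
    rng = random.Random(rng_seed)
    merged = list(seed)
    current = list(seed)
    for _ in range(rounds):
        # Each round rewrites the output of the previous round.
        current = [rewrite(build_evolution_prompt(x, rng)) for x in current]
        merged.extend(current)
    return merged


def echo_rewrite(prompt: str) -> str:
    """Placeholder for an LLM call: tags the instruction instead of rewriting it."""
    return prompt.splitlines()[-1] + " [harder]"


if __name__ == "__main__":
    data = evolve_dataset(["Reverse a string."], echo_rewrite, rounds=3)
    for item in data:
        print(item)
```

Under this merging assumption, a seed set of N instructions evolved for R rounds yields N × (R + 1) instructions before any filtering.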
Experimental Results
Research Questions
- RQ1: How does code-focused Evol-Instruct affect Code LLM performance on standard benchmarks?
- RQ2: Does WizardCoder close the gap to closed-source models in code generation tasks?
- RQ3: What is the impact of the number of data evolution rounds on pass@1 performance?
Key Results
| Model | Parameters | HumanEval | MBPP |
|---|---|---|---|
| LaMDA | 137B | 14.0 | - |
| AlphaCode | 1.1B | 17.1 | - |
| PaLM | 540B | 26.2 | 36.8 |
| PaLM-Coder | 540B | 36.0 | 47.0 |
| PaLM 2-S | - | 37.6 | 50.0 |
| Codex | 2.5B | 21.4 | - |
| Codex | 12B | 28.8 | - |
| Code-Cushman-001 | - | 33.5 | 45.9 |
| Code-Davinci-002 | - | 47.0 | 58.1 |
| GPT-3.5 | - | 48.1 | - |
| GPT-4 | - | 67.0 | - |
| LLaMA | 33B | 21.7 | 30.2 |
| LLaMA | 65B | 23.7 | 37.7 |
| CodeGen-Multi | 16B | 18.3 | 20.9 |
| CodeGen-Mono | 16B | 29.3 | 35.3 |
| CodeGeeX | 13B | 22.9 | 24.4 |
| StarCoder | 15B | 33.6 | 43.6 * |
| CodeT5+ | 16B | 30.9 | - |
| InstructCodeT5+ | 16B | 35.0 | - |
| WizardCoder | 15B | 57.3 (+22.3) | 51.8 (+8.2) |
- WizardCoder achieves SOTA among open-source Code LLMs on four benchmarks (HumanEval, HumanEval+, MBPP, DS-1000).
- On HumanEval, pass@1 improves by +22.3 points (57.3 vs 35.0) over InstructCodeT5+, the strongest open-source baseline in the table.
- On MBPP, pass@1 improves by +8.2 points (51.8 vs 43.6) over StarCoder, the base model that WizardCoder is fine-tuned from.
- WizardCoder outperforms Claude and Bard on HumanEval and HumanEval+ despite smaller size.
- A three-round Evol-Instruct data evolution yielded the highest pass@1 on HumanEval, guiding data selection.
- WizardCoder shows strong DS-1000 performance across most libraries.
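As a quick check on the reported gains, the sketch below recomputes them from the table and shows how pass@1 reduces to a simple pass rate when each problem gets exactly one greedy completion. The `SCORES` values are copied from the table above; the function names are illustrative and do not come from the paper's evaluation code.

```python
from typing import Dict, List


def pass_at_1_greedy(passed: List[bool]) -> float:
    """pass@1 in percent when each problem has one (greedy) completion.

    `passed[i]` is True if the single completion for problem i passed
    all of that problem's unit tests.
    """
    if not passed:
        raise ValueError("need at least one problem")
    return 100.0 * sum(passed) / len(passed)


# pass@1 scores (percent) copied from the results table.
SCORES: Dict[str, Dict[str, float]] = {
    "InstructCodeT5+": {"HumanEval": 35.0},
    "StarCoder": {"HumanEval": 33.6, "MBPP": 43.6},
    "WizardCoder": {"HumanEval": 57.3, "MBPP": 51.8},
}


def gain(benchmark: str, baseline: str) -> float:
    """Absolute pass@1 gain of WizardCoder over `baseline`, in points."""
    diff = SCORES["WizardCoder"][benchmark] - SCORES[baseline][benchmark]
    return round(diff, 1)


if __name__ == "__main__":
    print(gain("HumanEval", "InstructCodeT5+"))  # 22.3
    print(gain("MBPP", "StarCoder"))             # 8.2
    print(gain("HumanEval", "StarCoder"))        # 23.7
```

The two gains in the table use different reference models, so they are not directly comparable. Against StarCoder itself, the HumanEval gain works out to 23.7 points.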
This review was generated by AI and reviewed by a human editor.