QUICK REVIEW

[Paper Review] Code Llama: Open Foundation Models for Code

Baptiste Rozière, Jonas Gehring|arXiv (Cornell University)|Aug 24, 2023

Model-Driven Software Engineering Techniques | Cited by 392

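One-Sentence Summary

Code Llama is a family of open code foundation models (7B, 13B, 34B, and 70B) derived from Llama 2. It offers code generation and infilling, long-context support, and instruction-following variants, and achieves state-of-the-art performance among open models on several code benchmarks.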

ABSTRACT

We release Code Llama, a family of large language models for code based on Llama 2 providing state-of-the-art performance among open models, infilling capabilities, support for large input contexts, and zero-shot instruction following ability for programming tasks. We provide multiple flavors to cover a wide range of applications: foundation models (Code Llama), Python specializations (Code Llama - Python), and instruction-following models (Code Llama - Instruct) with 7B, 13B, 34B and 70B parameters each. All models are trained on sequences of 16k tokens and show improvements on inputs with up to 100k tokens. 7B, 13B and 70B Code Llama and Code Llama - Instruct variants support infilling based on surrounding content. Code Llama reaches state-of-the-art performance among open models on several code benchmarks, with scores of up to 67% and 65% on HumanEval and MBPP, respectively. Notably, Code Llama - Python 7B outperforms Llama 2 70B on HumanEval and MBPP, and all our models outperform every other publicly available model on MultiPL-E. We release Code Llama under a permissive license that allows for both research and commercial use.

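Research Motivation and Goals

Show that starting from Llama 2 and fine-tuning on code-heavy data yields stronger open code models.
Introduce infilling and long-context fine-tuning to enable in-editor coding and repository-scale reasoning.
Present language-specialized (Python) and instruction-following (Instruct) variants, with improvements in safety and helpfulness.
Evaluate on standard code benchmarks (HumanEval, MBPP, APPS) and a multilingual benchmark (MultiPL-E).
Release the models under a permissive license that allows both research and commercial use.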

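Proposed Method

Initialize from Llama 2 weights and train on a code-heavy dataset (~500B tokens; ~1T for the 70B model).
Apply a multitask objective that combines autoregressive and infilling prediction to the 7B, 13B, and 70B variants.
Extend the context length through long-context fine-tuning (LCFT), adjusting the RoPE rotation frequencies to support inputs of up to 100,000 tokens.
Develop the specialized Code Llama - Python models by training on a large volume of Python data.
Create Code Llama - Instruct through further fine-tuning on proprietary instruction data combined with a self-instruct pipeline (generated unit tests and solutions).
Evaluate with zero-shot and few-shot prompting on HumanEval, MBPP, APPS, and the multilingual MultiPL-E benchmark.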
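The RoPE adjustment in the long-context step can be illustrated with a small sketch. It assumes the standard rotary-embedding formula, in which pair i of a head rotates at frequency base^(-2i/d), and it compares two base periods: 10,000 (the common default) and 1,000,000 (a larger value of the kind used for long-context fine-tuning). The specific base values, the head dimension of 128, and the function names are assumptions for illustration; this summary does not state them.

```python
import math


def rope_frequencies(head_dim: int, base: float) -> list[float]:
    """Rotation frequency for each 2-D pair: theta_i = base ** (-2i / head_dim)."""
    if head_dim % 2 != 0:
        raise ValueError("head_dim must be even")
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Positions needed for the slowest-rotating pair to complete one full turn."""
    slowest = rope_frequencies(head_dim, base)[-1]
    return 2 * math.pi / slowest


if __name__ == "__main__":
    HEAD_DIM = 128  # assumed head dimension, for illustration only
    for base in (10_000.0, 1_000_000.0):  # assumed base periods
        wavelength = longest_wavelength(HEAD_DIM, base)
        print(f"base={base:>11,.0f}  longest wavelength ~ {wavelength:,.0f} positions")
```

A larger base lowers every rotation frequency, so the slowest pairs no longer wrap around within a 100,000-token input. This is one plausible reading of why adjusting the frequencies helps the model distinguish distant positions.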

Experimental Results

Research Questions

RQ1: Can Code Llama outperform other open code models on standard code benchmarks?
RQ2: Does infilling training provide practical benefits with acceptable trade-offs in autocomplete quality?
RQ3: How does long-context fine-tuning affect performance and extrapolate to 100k-token inputs?
RQ4: What gains arise from Python specialization and instruction-following fine-tuning in terms of code generation quality, safety, and usefulness across languages?
RQ5: How do the models perform in multilingual coding scenarios compared to other open models?

Key Findings

Code Llama variants achieve state-of-the-art performance among open models on several benchmarks (HumanEval, MBPP, MultiPL-E).
Code Llama - Python 7B can outperform Llama 2 70B on HumanEval and MBPP in Python tasks.
Infilling-enabled models achieve strong results on infilling benchmarks while incurring only modest drops in autoregressive generation metrics.
Long-context fine-tuning enables stable generation and extrapolation up to 100,000 tokens, with moderate impact on standard benchmarks.
Code Llama - Instruct improves safety and helpfulness benchmarks with only modest code-generation cost.
Across languages, Code Llama outperforms Llama 2 models of the same size, and Code Llama 7B competes with larger public models on multilingual tasks.
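The infilling capability behind these findings can be sketched as a data transformation: a document is cut into a prefix, a middle, and a suffix, then rearranged so that the model sees the surrounding content first and predicts the missing middle last. The sketch below assumes a prefix-suffix-middle ordering. The sentinel strings, the function names, and the `infill_rate` parameter are hypothetical placeholders; real tokenizers define their own special tokens, and this summary does not specify the exact format.

```python
import random

# Hypothetical sentinel strings; an actual tokenizer defines its own special tokens.
PRE, SUF, MID, EOT = "<PRE>", "<SUF>", "<MID>", "<EOT>"


def split_document(text: str, rng: random.Random) -> tuple[str, str, str]:
    """Cut the text at two random character positions into prefix, middle, suffix."""
    a, b = sorted(rng.randint(0, len(text)) for _ in range(2))
    return text[:a], text[a:b], text[b:]


def to_infilling_example(text: str, rng: random.Random) -> str:
    """Rearrange a document so the middle span comes last (prefix-suffix-middle)."""
    prefix, middle, suffix = split_document(text, rng)
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}{EOT}"


def make_training_example(text: str, rng: random.Random, infill_rate: float = 0.5) -> str:
    """Mix objectives: infilling format with probability infill_rate (illustrative
    value), otherwise the plain left-to-right document."""
    if rng.random() < infill_rate:
        return to_infilling_example(text, rng)
    return text


def build_infilling_prompt(prefix: str, suffix: str) -> str:
    """At inference time, stop after the middle sentinel so the model fills the gap."""
    return f"{PRE}{prefix}{SUF}{suffix}{MID}"


if __name__ == "__main__":
    rng = random.Random(0)
    source = "def add(a, b):\n    return a + b\n"
    print(make_training_example(source, rng))
    print(build_infilling_prompt("def add(a, b):\n    return ", "\n"))
```

Because the rearranged sequence is still trained with ordinary next-token prediction, the same model can serve both left-to-right completion and fill-in-the-gap editing, which is consistent with the modest trade-off reported above.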


This review was generated by AI and reviewed by human editors.