QUICK REVIEW

[Paper Review] Granite Code Models: A Family of Open Foundation Models for Code Intelligence

Mayank Mishra, Matt Stallone|arXiv (Cornell University)|May 7, 2024

Software Engineering Research|Cited by 10

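One-Sentence Summary

Granite Code Models are a family of open-source, decoder-only code LLMs (3B to 34B parameters) trained on code in 116 programming languages that perform strongly among open-source models on code generation, fixing, and explanation, and are released under the Apache 2.0 license for both research and commercial use.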

ABSTRACT

Large Language Models (LLMs) trained on code are revolutionizing the software development process. Increasingly, code LLMs are being integrated into software development environments to improve the productivity of human programmers, and LLM-based agents are beginning to show promise for handling complex tasks autonomously. Realizing the full potential of code LLMs requires a wide range of capabilities, including code generation, fixing bugs, explaining and documenting code, maintaining repositories, and more. In this work, we introduce the Granite series of decoder-only code models for code generative tasks, trained with code written in 116 programming languages. The Granite Code models family consists of models ranging in size from 3 to 34 billion parameters, suitable for applications ranging from complex application modernization tasks to on-device memory-constrained use cases. Evaluation on a comprehensive set of tasks demonstrates that Granite Code models consistently reach state-of-the-art performance among available open-source code LLMs. The Granite Code model family was optimized for enterprise software development workflows and performs well across a range of coding tasks (e.g. code generation, fixing and explanation), making it a versatile all around code model. We release all our Granite Code models under an Apache 2.0 license for both research and commercial use.

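Research Motivation and Goals

Motivate the need for efficient, enterprise-oriented code LLMs with broad capabilities beyond code generation.
Introduce a family of open Granite Code models (Base and Instruct) at 3B, 8B, 20B, and 34B parameters.
Describe the data collection, model architecture, training and instruction tuning, and evaluation on diverse coding tasks.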

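Proposed Method

Two-phase pretraining on 3.5–4.5T tokens of code and text covering 116 languages (phase 1: code only; phase 2: code plus natural-language data).
Decoder-only Transformer architecture with pre-normalization and size-specific design choices (RoPE, GQA/MQA, swiglu, RMSNorm/LN).
Causal language modeling (CLM) objective combined with a Fill-In-the-Middle (FIM) objective through the mixed loss L = alpha*L_CLM + (1-alpha)*L_FIM, with alpha=0.5 during pretraining and alpha=1 during instruction tuning, which drops the FIM term.
Instruction tuning on filtered CommitPack data, NL-instruction datasets, MathInstruct/MetaMathQA, and synthetic code datasets to improve reasoning and instruction following.
Extensive benchmarking (HumanEvalPack, MBPP(+), RepoBench, ReCode, and others) across many programming languages, including reasoning-oriented evaluations and comparisons with open-source code LLMs.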
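
The training objective in the method list can be illustrated with a small sketch. It shows one common way to build a FIM example (prefix-suffix-middle ordering) and how the weight alpha blends the two loss terms. The sentinel strings, the character-level split, and the function names `to_fim` and `mixed_loss` are assumptions made for illustration; the review does not specify the actual special tokens or the splitting procedure.

```python
import random

# Hypothetical sentinel strings; the review does not name the real special tokens.
FIM_PREFIX, FIM_SUFFIX, FIM_MIDDLE = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def to_fim(document: str, rng: random.Random) -> str:
    """Rearrange a document into prefix-suffix-middle order for FIM training."""
    # Two random cut points split the text into prefix / middle / suffix.
    i, j = sorted(rng.randrange(len(document) + 1) for _ in range(2))
    prefix, middle, suffix = document[:i], document[i:j], document[j:]
    # The model sees prefix and suffix first and learns to generate the middle.
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


def mixed_loss(loss_clm: float, loss_fim: float, alpha: float) -> float:
    """Blend the two objectives: L = alpha * L_CLM + (1 - alpha) * L_FIM."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * loss_clm + (1.0 - alpha) * loss_fim


if __name__ == "__main__":
    sample = "def add(a, b):\n    return a + b\n"
    print(to_fim(sample, random.Random(0)))
    print(mixed_loss(2.0, 4.0, alpha=0.5))  # pretraining weight: both terms count equally
    print(mixed_loss(2.0, 4.0, alpha=1.0))  # instruction tuning: only the CLM term remains
```

With alpha=1 the FIM term is multiplied by zero, which matches the statement that instruction tuning uses the causal objective alone.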

Figure 1: Comparison of Granite-8B-Code (Base/Instruct) with other open source (code) LLMs of similar size on HumanEvalPack (Muennighoff et al., 2023), spanning 3 coding tasks and 6 programming languages. See Tables 3, 10, 11 for more details. Best viewed in color.

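Experimental Results

Research Questions

RQ1: Among open-source models, can Granite Code models achieve state-of-the-art or competitive performance across a broad range of code tasks (generation, fixing, explanation, editing, translation)?
RQ2: How do the base and instruction-tuned Granite Code models perform on languages other than Python and across benchmarks?
RQ3: Which data collection, filtering, and training strategies enable trustworthy, license-friendly use of open code LLMs in enterprise settings?
RQ4: Do two-phase training and instruction tuning improve reasoning and problem solving on code tasks?
RQ5: How do Granite Code models compare with larger or general-purpose open LLMs on code-related tasks?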

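Key Findings

Granite-8B-Code-Base outperforms the competing CodeGemma-8B by about 12 points on HumanEvalPack (33.2% vs. 21.3%) while being trained on fewer tokens (4.5T vs. 7.5T).
Granite-8B-Code-Base performs well on CodeFix/CodeExplain and is competitive across the languages of HumanEvalPack and MultiPL-E.
Instruction-tuned Granite Code models outperform CodeLlama instruct models of comparable size, and Granite-3B/8B/20B surpass larger CodeLlama variants in several settings.
On HumanEvalSynthesize (6 languages), Granite-3B/8B/20B-Code-Base are the best among base models; even Granite-3B-Code-Instruct sometimes beats larger CodeLlama-Instruct models.
On MultiPL-E (18 languages), Granite-8B-Code-Base beats CodeLlama-7B on 16 of 18 languages, and Granite-34B-Code-Base generally outperforms CodeLlama-34B on many of them.
On MBPP and MBPP+, Granite-8B-Code-Base is competitive, and the 20B and 34B variants score higher than their competitors.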
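
To put the headline comparison in proportion, the short calculation below derives the absolute gap, the relative gain, and the token-budget ratio from the figures quoted in the findings. It uses only numbers stated in this review; the variable names are illustrative.

```python
# Figures quoted in the findings above.
granite_8b_score, codegemma_8b_score = 33.2, 21.3  # HumanEvalPack averages, in percent
granite_tokens, codegemma_tokens = 4.5, 7.5        # training tokens, in trillions
multipl_e_wins, multipl_e_languages = 16, 18       # Granite-8B-Code-Base vs. CodeLlama-7B

# Absolute gap in percentage points.
gap_points = round(granite_8b_score - codegemma_8b_score, 1)
# Relative improvement over the CodeGemma-8B score, in percent.
relative_gain = round(100 * (granite_8b_score / codegemma_8b_score - 1), 1)
# Fraction of the CodeGemma-8B token budget used by Granite-8B.
token_ratio = round(granite_tokens / codegemma_tokens, 2)
# Share of MultiPL-E languages won, in percent.
win_share = round(100 * multipl_e_wins / multipl_e_languages, 1)

print(gap_points, relative_gain, token_ratio, win_share)
```

The "about 12 points" in the text is the absolute gap; measured relative to the CodeGemma-8B score the improvement is considerably larger, and it comes with roughly 60% of the token budget.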

Figure 2: An overview of depth upscaling (Kim et al., 2024) for efficient training of Granite-34B-Code. We utilize the 20B model after 1.6T tokens to start training of 34B model with the same code pretraining data without any changes to the training and inference framework.
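
The depth upscaling idea in Figure 2 can be sketched as a list operation: duplicate the layer stack of the smaller model, trim the layers on either side of the seam, and concatenate the two copies. The function name `depth_upscale`, the layer count of 52, and the trim size of 8 are assumptions chosen for illustration and are not stated in this review; the layers are plain strings standing in for Transformer blocks.

```python
def depth_upscale(layers: list, drop: int) -> list:
    """Stack two copies of a layer stack, trimming `drop` layers at the seam."""
    if not 0 <= drop < len(layers):
        raise ValueError("drop must be in [0, len(layers))")
    bottom = layers[: len(layers) - drop]  # original copy without its final `drop` layers
    top = layers[drop:]                    # duplicate without its first `drop` layers
    # With real weights the duplicate would need to be a deep copy so that
    # the two halves can diverge during continued training.
    return bottom + top


# Assumed sizes for illustration only.
small = [f"layer_{i}" for i in range(52)]
big = depth_upscale(small, drop=8)
print(len(big))  # 88
```

The deeper stack starts from trained weights rather than random initialization, which is why training can resume from a checkpoint of the smaller model instead of starting from scratch.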


This review was generated by AI and reviewed by human editors.