QUICK REVIEW

[Paper Review] Evaluating Instruction-Tuned Large Language Models on Code Comprehension and Generation

Zhiqiang Yuan, Junwei Liu|arXiv (Cornell University)|Aug 2, 2023

Software Engineering Research | Cited by 18

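One-Sentence Summary

This study evaluates 10 open-source instruction-tuned LLMs on four code-related tasks (defect detection, clone detection, assertion generation, and code summarization) under zero-shot, few-shot, and fine-tuning settings, finding strong zero-shot performance, unstable gains from few-shot demonstrations, and notable cost implications.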

ABSTRACT

In this work, we evaluate 10 open-source instructed LLMs on four representative code comprehension and generation tasks. We have the following main findings. First, for the zero-shot setting, instructed LLMs are very competitive on code comprehension and generation tasks and sometimes even better than small SOTA models specifically fine-tuned on each downstream task. We also find that larger instructed LLMs are not always better on code-related tasks. Second, for the few-shot setting, we find that adding demonstration examples substantially helps instructed LLMs perform better on most code comprehension and generation tasks; however, the examples would sometimes induce unstable or even worse performance. Furthermore, we find widely-used BM25-based shot selection strategy significantly outperforms the basic random selection or fixed selection only on generation problems. Third, for the fine-tuning setting, we find that fine-tuning could further improve the model performance on downstream code comprehension and generation tasks compared to the zero-shot/one-shot performance. In addition, after being fine-tuned on the same downstream task dataset, instructed LLMs outperform both the small SOTA models and similar-scaled LLMs without instruction tuning. Based on our findings, we further present practical implications on model and usage recommendation, performance and cost trade-offs, and future direction.

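Research Motivation and Goals

Assess how well instruction-tuned LLMs generalize to code-related tasks in the zero-shot setting.
Assess few-shot in-context learning and shot-selection strategies for code tasks.
Examine how further fine-tuning on downstream tasks affects code comprehension and generation performance.
Provide practical guidance on model selection, cost-performance trade-offs, and future directions for code intelligence.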

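Proposed Method

Compare 10 open-source instructed LLMs (6B–16B) on four code tasks using standardized prompts.
Use three settings: zero-shot, one-shot (with three shot-selection strategies), and task-specific fine-tuning with LoRA.
Use task-specific prompts for defect detection, clone detection, assertion generation, and code summarization.
Measure performance with task-appropriate metrics (accuracy, F1, exact match), and use ChatGPT as the judge for evaluating code summarization.
Measure memory and time costs during fine-tuning and inference.
Introduce a sampling scheme for the datasets (train/validation/test) and design standardized prompts for each model.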
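The one-shot setting compares three ways of picking the demonstration example. The sketch below illustrates how fixed, random, and BM25-based selection could differ in practice. It is a minimal, self-contained illustration and not the authors' implementation: the tokenizer, the `k1` and `b` defaults, the choice of index 0 for the fixed strategy, and the function names `tokenize`, `bm25_scores`, and `select_shot` are all assumptions made for this example.

```python
import math
import random
import re
from collections import Counter


def tokenize(code: str) -> list[str]:
    """Split text into lowercase identifier/number tokens (a crude assumption)."""
    return re.findall(r"[A-Za-z_]\w*|\d+", code.lower())


def bm25_scores(query: str, corpus: list[str], k1: float = 1.5, b: float = 0.75) -> list[float]:
    """Score every document in a non-empty corpus against the query with Okapi BM25."""
    docs = [tokenize(d) for d in corpus]
    n = len(docs)
    avg_len = sum(len(d) for d in docs) / n
    # Document frequency: the number of documents containing each term.
    df = Counter(term for d in docs for term in set(d))
    query_terms = set(tokenize(query))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores


def select_shot(query: str, pool: list[str], strategy: str, seed: int = 0) -> int:
    """Return the index of the demonstration example to place in the prompt."""
    if strategy == "fixed":
        return 0  # assumption: always reuse the first example in the pool
    if strategy == "random":
        return random.Random(seed).randrange(len(pool))
    if strategy == "bm25":
        scores = bm25_scores(query, pool)
        return max(range(len(pool)), key=scores.__getitem__)
    raise ValueError(f"unknown strategy: {strategy}")


# Hypothetical demonstration pool and query, for illustration only.
pool = [
    "def add(a, b): return a + b",
    "def read_file(path): return open(path).read()",
    "def mul(a, b): return a * b",
]
query = "def read_lines(path): return open(path).readlines()"
chosen = select_shot(query, pool, "bm25")
```

With BM25, the demonstration that shares rare tokens with the query (here `path` and `open`) is preferred over examples that only share common tokens such as `def` and `return`, which is one plausible reason retrieval-based selection helps more on generation tasks.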

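Experimental Results

Research Questions

RQ1: How do instruction-tuned LLMs perform on code comprehension and generation tasks in the zero-shot setting?
RQ2: How do instruction-tuned LLMs perform in the few-shot setting, and how do shot-selection strategies affect the results?
RQ3: How do instruction-tuned LLMs perform after further fine-tuning on downstream tasks?
RQ4: What are the memory and time costs of using instruction-tuned LLMs during fine-tuning and inference?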

Key Findings

Model	Defect Detection (%)	Clone Detection (%)	Assertion Generation (%)	Code Summarization (%)
CodeGen-6B	0.3	1.4	0.0	0.0
ChatGLM-6B	7.1	17.5	1.7	45.0
Vicuna-7B	54.0	13.2	10.1	48.0
Alpaca-7B	45.8	22.1	5.3	32.0
Dolly-7B	33.1	21.3	1.9	12.0
StableLM-7B	44.3	24.3	1.1	30.0
CodeAlpaca-7B	51.9	1.4	4.4	9.0
Dolly-12B	33.8	23.5	1.0	5.0
Vicuna-13B	49.8	14.1	12.0	63.0
WizardCoder-15B	54.4	23.8	19.4	71.0
Instruct-CodeGen-16B	47.8	14.2	8.4	9.0
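The numbers in the table above can be checked against the abstract's claim that larger instructed LLMs are not always better. The following sketch copies the table values verbatim and looks for cases where a smaller model outscores a larger one. The parameter counts are read off the model names, and the helper names `best_per_task` and `size_inversions` are hypothetical, introduced only for this illustration.

```python
# Rows copied from the table above:
# (model, size in billions taken from the model name,
#  defect detection, clone detection, assertion generation, code summarization)
ROWS = [
    ("CodeGen-6B", 6, 0.3, 1.4, 0.0, 0.0),
    ("ChatGLM-6B", 6, 7.1, 17.5, 1.7, 45.0),
    ("Vicuna-7B", 7, 54.0, 13.2, 10.1, 48.0),
    ("Alpaca-7B", 7, 45.8, 22.1, 5.3, 32.0),
    ("Dolly-7B", 7, 33.1, 21.3, 1.9, 12.0),
    ("StableLM-7B", 7, 44.3, 24.3, 1.1, 30.0),
    ("CodeAlpaca-7B", 7, 51.9, 1.4, 4.4, 9.0),
    ("Dolly-12B", 12, 33.8, 23.5, 1.0, 5.0),
    ("Vicuna-13B", 13, 49.8, 14.1, 12.0, 63.0),
    ("WizardCoder-15B", 15, 54.4, 23.8, 19.4, 71.0),
    ("Instruct-CodeGen-16B", 16, 47.8, 14.2, 8.4, 9.0),
]
TASKS = ["defect detection", "clone detection",
         "assertion generation", "code summarization"]


def best_per_task(rows):
    """Map each task name to the (model, score) pair with the highest score."""
    best = {}
    for i, task in enumerate(TASKS):
        winner = max(rows, key=lambda r: r[2 + i])
        best[task] = (winner[0], winner[2 + i])
    return best


def size_inversions(rows, task_index):
    """List (smaller_model, larger_model) pairs where the smaller one scores higher."""
    pairs = []
    for small in rows:
        for large in rows:
            if small[1] < large[1] and small[2 + task_index] > large[2 + task_index]:
                pairs.append((small[0], large[0]))
    return pairs


best = best_per_task(ROWS)
defect_inversions = size_inversions(ROWS, 0)
```

On these figures, WizardCoder-15B leads three of the four columns, while StableLM-7B leads clone detection, and Vicuna-7B scores higher than Vicuna-13B on defect detection.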

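In the zero-shot setting, instructed LLMs are competitive with small SOTA models on several tasks and sometimes surpass them; a larger model does not guarantee better zero-shot performance.
In the few-shot setting, demonstration examples improve performance overall, but can cause instability and degraded performance on longer inputs; BM25-based shot selection helps on generation tasks but shows no clear advantage over the other strategies on classification tasks.
Fine-tuning with LoRA further improves task performance; fine-tuned instructed LLMs outperform both the small SOTA models and similar-scale models without instruction tuning.
For models of comparable scale, memory cost is not always higher than that of small SOTA models, but time cost during fine-tuning and inference can be significantly higher.
The study offers practical guidance on model selection, shot strategy, and cost-performance trade-offs for code-related tasks.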
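The memory finding is easier to interpret with the arithmetic behind LoRA in mind: instead of updating a full weight matrix, LoRA trains two low-rank factors next to the frozen weight. The sketch below computes the trainable parameter count for a single weight matrix. The matrix size and rank used in the example are hypothetical values chosen for illustration; this summary does not state the LoRA hyperparameters used in the paper.

```python
def lora_trainable_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters LoRA adds to one frozen d_out x d_in weight matrix.

    LoRA learns B (d_out x rank) and A (rank x d_in), so the update B @ A
    has the same shape as the frozen weight but far fewer free parameters.
    """
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Ratio of LoRA parameters to the parameters of the full weight matrix."""
    return lora_trainable_params(d_out, d_in, rank) / (d_out * d_in)


# Hypothetical example: one 4096 x 4096 projection adapted with rank 8.
added = lora_trainable_params(4096, 4096, 8)   # 65536
fraction = trainable_fraction(4096, 4096, 8)   # 0.00390625, i.e. 1/256
```

Because only the low-rank factors receive gradients and optimizer state, the fine-tuning memory footprint can stay modest even for multi-billion-parameter models, whereas every forward pass still runs through the full frozen model, which is consistent with time cost remaining high.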


This review was generated by AI and checked by a human editor.