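[Paper Digest] A Survey of Large Language Models for Code: Evolution, Benchmarking, and Future Trends
This paper systematically evaluates Code LLMs (language models specialized for software engineering) against general LLMs. It catalogs 149 Code LLM works (134 after screening), analyzes the performance differences, and maps the strengths of each LLM across software engineering tasks.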
General large language models (LLMs), represented by ChatGPT, have demonstrated significant potential in tasks such as code generation in software engineering. This has led to the development of specialized LLMs for software engineering, known as Code LLMs. A considerable portion of Code LLMs is derived from general LLMs through model fine-tuning. As a result, Code LLMs are often updated frequently and their performance can be influenced by the base LLMs. However, there is currently a lack of systematic investigation into Code LLMs and their performance. In this study, we conduct a comprehensive survey and analysis of the types of Code LLMs and their differences in performance compared to general LLMs. We aim to address three questions: (1) What LLMs are specifically designed for software engineering tasks, and what is the relationship between these Code LLMs? (2) Do Code LLMs really outperform general LLMs in software engineering tasks? (3) Which LLMs are more proficient in different software engineering tasks? To answer these questions, we first collect relevant literature and work from five major databases and open-source communities, resulting in 134 works for analysis. Next, we categorize the Code LLMs based on their publishers and examine their relationships with general LLMs and among themselves. Furthermore, we investigate the performance differences between general LLMs and Code LLMs in various software engineering tasks to demonstrate the impact of base models and Code LLMs. Finally, we comprehensively maintained the performance of LLMs across multiple mainstream benchmarks to identify the best-performing LLMs for each software engineering task. Our research not only assists developers of Code LLMs in choosing base models for the development of more advanced LLMs but also provides insights for practitioners to better understand key improvement directions for Code LLMs.
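Motivation and Goals
- Determine which LLMs are designed for software engineering and how they relate to one another.
- Assess whether Code LLMs outperform general LLMs on software engineering tasks.
- Catalog which LLMs perform best on different software engineering tasks and benchmarks.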
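Proposed Method
- Collect literature from four sources (GitHub, dblp, Google Scholar, arXiv) using predefined keywords.
- Use closed card sorting to classify each paper as relevant or irrelevant.
- Screen the collected papers and apply snowballing, arriving at 134 works; manually extract development relationships and performance results.
- Compare Code LLMs with general LLMs using reported experimental results on software engineering tasks.
- Compile and analyze the performance scores of 126 Code LLMs on major benchmarks (e.g., HumanEval).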
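HumanEval results are conventionally reported as pass@k, the probability that at least one of k sampled completions passes the unit tests. The sketch below shows the standard unbiased estimator for that metric. It illustrates how such benchmark scores are usually computed; it is not taken from the survey, and the function name `pass_at_k` is an assumption made for this example.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of completions sampled for the problem
    c: number of those completions that pass all unit tests
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k completions fail, every draw of k contains a pass.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn completions are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each item is (n, c) for one problem."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

A benchmark-level score is the mean of the per-problem estimates, which is what `mean_pass_at_k` computes.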
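Experimental Results
Research Questions
- RQ1: Which LLMs are designed for software engineering tasks, and how are they related?
- RQ2: Do Code LLMs outperform general LLMs on software engineering tasks?
- RQ3: Which LLMs perform better on different software engineering tasks?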
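Key Findings
- Code LLMs fine-tuned on SE tasks generally outperform their base models.
- At comparable parameter counts, Code LLMs tend to outperform general LLMs.
- State-of-the-art Code LLMs (e.g., CodeFuse-CodeLlama-34B) can outperform GPT-4 on code generation benchmarks in some settings, while GPT-4 remains competitive on other tasks.
- The study compiles performance scores for 126 Code LLMs and analyzes how they perform on major software engineering benchmarks and tasks.
- The survey is the first systematic evaluation of Code LLMs, organized by developer affiliation and task performance.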
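Identifying the strongest model per benchmark (RQ3) is a simple aggregation once the scores are in one table. The sketch below is a minimal illustration of that step. The `Score` record, the `best_per_benchmark` helper, and every model name, benchmark name, and number in the usage example are hypothetical placeholders, not results from the survey.

```python
from typing import Iterable, NamedTuple


class Score(NamedTuple):
    model: str       # model identifier (placeholder values below)
    benchmark: str   # benchmark or task name
    value: float     # reported score, higher is better


def best_per_benchmark(scores: Iterable[Score]) -> dict[str, Score]:
    """Return the highest-scoring entry for each benchmark."""
    best: dict[str, Score] = {}
    for s in scores:
        current = best.get(s.benchmark)
        if current is None or s.value > current.value:
            best[s.benchmark] = s
    return best


if __name__ == "__main__":
    # Hypothetical placeholder data, for illustration only.
    table = [
        Score("model-a", "benchmark-x", 0.41),
        Score("model-b", "benchmark-x", 0.57),
        Score("model-a", "benchmark-y", 0.66),
        Score("model-b", "benchmark-y", 0.52),
    ]
    for name, entry in best_per_benchmark(table).items():
        print(f"{name}: {entry.model} ({entry.value:.2f})")
```

Scores are only comparable within one benchmark and one evaluation setting (for example, the same k in pass@k), so a real table would need those fields in the grouping key as well.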
This digest was generated by AI and reviewed by a human editor.