QUICK REVIEW

[Paper Digest] Unifying the Perspectives of NLP and Software Engineering: A Survey on Language Models for Code

Ziyin Zhang, Kai Chen|arXiv (Cornell University)|Nov 14, 2023

Semantic Web and Ontologies|Cited by 17

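ONE-SENTENCE SUMMARY

This paper surveys the field of language models for code, combining the perspectives of natural language processing and software engineering to provide a taxonomy, coverage of tasks, and insights into code-specific features and future directions.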

ABSTRACT

In this work we systematically review the recent advancements in software engineering with language models, covering 70+ models, 40+ evaluation tasks, 180+ datasets, and 900 related works. Unlike previous works, we integrate software engineering (SE) with natural language processing (NLP) by discussing the perspectives of both sides: SE applies language models for development automation, while NLP adopts SE tasks for language model evaluation. We break down code processing models into general language models represented by the GPT family and specialized models that are specifically pretrained on code, often with tailored objectives. We discuss the relations and differences between these models, and highlight the historical transition of code modeling from statistical models and RNNs to pretrained Transformers and LLMs, which is exactly the same course that had been taken by NLP. We also go beyond programming and review LLMs' application in other software engineering activities including requirement engineering, testing, deployment, and operations in an endeavor to provide a global view of NLP in SE, and identify key challenges and potential future directions in this domain. We keep the survey open and updated on GitHub at https://github.com/codefuse-ai/Awesome-Code-LLM.

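MOTIVATION AND OBJECTIVES

Bring the NLP and software engineering communities closer together in research on language models for code.
Provide a comprehensive taxonomy of code-oriented LLMs, from general language models to code-specific models.
Summarize downstream tasks, evaluation benchmarks, and the code features used in training and evaluation.
Highlight the challenges, opportunities, and future directions of code language modeling in SE workflows.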

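PROPOSED METHOD

Propose a taxonomy of pretrained language models for code that distinguishes general-domain LMs, models pretrained on code, and specialized architectures.
Discuss code-specific features (such as abstract syntax trees (ASTs), control flow graphs (CFGs), and unit tests) and training objectives borrowed from NLP (such as infilling and instruction tuning).
Review downstream tasks and evaluation benchmarks across five input/output modalities (text-to-code, code-to-code, code-to-text, code-to-pattern, and text-to-text).
Synthesize the historical evolution of code processing from statistical and RNN approaches to Transformer-based LLMs.
Outline integration and implementation considerations for autonomous agents and production deployment in SE settings.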
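
To make the notion of a code-specific feature concrete, the following is a minimal sketch of reading structure from an AST using Python's standard `ast` module. The function name `ast_node_types` and the choice of node-type counts as a "feature" are illustrative assumptions, not a method described in the survey.

```python
import ast
from collections import Counter


def ast_node_types(source: str) -> Counter:
    """Count AST node types in a Python snippet.

    Hypothetical helper: node-type counts are only one simple way
    to expose syntactic structure that plain token sequences hide.
    """
    tree = ast.parse(source)  # parse source text into an AST
    return Counter(type(node).__name__ for node in ast.walk(tree))


SNIPPET = "def add(a, b):\n    return a + b\n"
features = ast_node_types(SNIPPET)

# The snippet contains one function definition, one return statement,
# one binary operation, and two function parameters.
print(features["FunctionDef"], features["Return"], features["BinOp"], features["arg"])
```

A model that consumes ASTs would typically use the tree itself (or a linearization of it) rather than counts; the counts here only show that the structure is cheaply recoverable from source text.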

EXPERIMENTAL RESULTS

Research Questions

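RQ1: What are the different model families and training paradigms used for code modeling, and how do they relate to the NLP and SE traditions?
RQ2: What are the typical downstream tasks, benchmarks, and evaluation metrics for code language models, and how have they evolved with the emergence of new capabilities such as infilling and instruction tuning?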

Key Findings

The survey covers 70+ models, 40+ evaluation tasks, 180+ datasets, and 900 related works, underscoring the breadth of the field.
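Code modeling has undergone a historical shift from statistical and RNN models to pretrained Transformers and large language models, mirroring the trajectory of NLP.
Code-specific features such as ASTs, CFGs, and unit tests are increasingly incorporated into the training and evaluation of code LLMs.
Recent advances include instruction tuning, infilling objectives, scaling laws, architectural improvements, and research on autonomous agents for code modeling.
The engineering demands of SE serve as a real-world testbed that pushes LLM development toward production readiness.
This work emphasizes the ongoing alignment between NLP techniques and SE needs, advocates a unified perspective, and is kept up to date through an open-source GitHub resource.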
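
As an illustration of the infilling objective listed among the findings, the sketch below turns a code string into a (prompt, target) pair by cutting out a random middle span. The sentinel strings `<PRE>`, `<SUF>`, and `<MID>`, the function names, and the prefix-suffix-middle ordering are assumptions chosen for illustration; actual models define their own special tokens and formats.

```python
import random
from typing import Tuple

# Hypothetical sentinel strings; real models use their own special tokens.
PRE, SUF, MID = "<PRE>", "<SUF>", "<MID>"


def to_infilling_example(code: str, rng: random.Random) -> Tuple[str, str]:
    """Split code at two random cut points and build (prompt, target).

    The prompt shows the prefix and suffix; the target is the missing middle.
    """
    if not code:
        raise ValueError("code must be non-empty")
    i, j = sorted(rng.sample(range(len(code) + 1), 2))  # two distinct cut points
    prefix, middle, suffix = code[:i], code[i:j], code[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}", middle


def reconstruct(prompt: str, target: str) -> str:
    """Invert the transformation (assumes code contains no sentinel strings)."""
    body = prompt[len(PRE):-len(MID)]
    prefix, suffix = body.split(SUF, 1)
    return prefix + target + suffix


CODE = "def add(a, b):\n    return a + b\n"
prompt, target = to_infilling_example(CODE, random.Random(0))
print(reconstruct(prompt, target) == CODE)
```

Training on such pairs asks the model to generate the middle span given context on both sides, which is the setting an editor faces when completing code in the middle of a file.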

This digest was generated by AI and reviewed by human editors.