QUICK REVIEW

[Paper Review] Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

Ahmad Mohsin, Helge Janicke|arXiv (Cornell University)|Jun 18, 2024

Software Engineering Research | Cited by 7

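One-Sentence Summary

This paper proposes an In-Context Learning (ICL) security framework for teaching and evaluating four diverse LLMs (prompt-driven code generators, PDCGs, and coding copilots, CCPs) across C++, C#, and Python, using security patterns together with SAST and code review to assess and improve secure code generation.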

ABSTRACT

Large Language Models (LLMs) such as ChatGPT and GitHub Copilot have revolutionized automated code generation in software engineering. However, as these models are increasingly utilized for software development, concerns have arisen regarding the security and quality of the generated code. These concerns stem from LLMs being primarily trained on publicly available code repositories and internet-based textual data, which may contain insecure code. This presents a significant risk of perpetuating vulnerabilities in the generated code, creating potential attack vectors for exploitation by malicious actors. Our research aims to tackle these issues by introducing a framework for secure behavioral learning of LLMs through In-Context Learning (ICL) patterns during the code generation process, followed by rigorous security evaluations. To achieve this, we have selected four diverse LLMs for experimentation. We have evaluated these coding LLMs across three programming languages and identified security vulnerabilities and code smells. The code is generated through ICL with curated problem sets and undergoes rigorous security testing to evaluate the overall quality and trustworthiness of the generated code. Our research indicates that ICL-driven one-shot and few-shot learning patterns can enhance code security, reducing vulnerabilities in various programming scenarios. Developers and researchers should know that LLMs have a limited understanding of security principles. This may lead to security breaches when the generated code is deployed in production systems. Our research highlights LLMs are a potential source of new vulnerabilities to the software supply chain. It is important to consider this when using LLMs for code generation. This research article offers insights into improving LLM security and encourages proactive use of LLMs for code generation to ensure software system safety.

Research Motivation and Objectives

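Address the security risks of LLM-generated code that arise from training on public code repositories and insecure examples.
Develop and evaluate In-Context Learning (ICL) security patterns that teach LLMs secure coding behavior.
Compare prompt-driven code generators (PDCGs) and coding copilots (CCPs) across multiple languages.
Create and analyze datasets and security evaluation workflows for LLM-generated code.
Provide insights and datasets that advance secure code generation in AI-assisted development.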

Proposed Method

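Curate diverse programming problem sets in C++, C#, and Python, covering data structures and algorithms (DS&A), API development, and the MVC design pattern.
Develop In-Context Learning (ICL) security patterns that combine zero-shot, one-shot, and few-shot learning scenarios with chain-of-thought reasoning.
Generate code for the problems with four LLMs (PDCGs: ChatGPT-4, Google Bard; CCPs: GitHub Copilot, Amazon CodeWhisperer).
Evaluate the generated code with Static Application Security Testing (SAST) and manual security review to identify vulnerabilities and hidden code smells.
Compute security risk metrics at the source-code level and analyze which code smells persist after ICL.
Release a security instruction dataset to support future security training of LLMs.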
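
The zero-shot, one-shot, and few-shot scenarios listed above differ only in how many secure demonstrations are placed in front of the task. The following is a minimal sketch of that idea, not the paper's actual prompt templates: the `SecurityExample` class, the `build_icl_prompt` function, the prompt wording, and the two demonstrations are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityExample:
    """Hypothetical ICL demonstration: a task, a secure solution, and a rationale."""
    task: str
    secure_code: str
    rationale: str


def build_icl_prompt(problem: str, examples: list[SecurityExample], shots: int) -> str:
    """Assemble a prompt with `shots` demonstrations (0 = zero-shot, 1 = one-shot, 2+ = few-shot)."""
    if shots < 0 or shots > len(examples):
        raise ValueError("shots must be between 0 and len(examples)")
    parts = ["Write secure code for the task below."]
    for number, example in enumerate(examples[:shots], start=1):
        parts.append(
            f"Example {number}\n"
            f"Task: {example.task}\n"
            f"Secure solution:\n{example.secure_code}\n"
            f"Why it is secure: {example.rationale}"
        )
    # Chain-of-thought cue (assumed wording): ask for reasoning before code.
    parts.append(
        f"Task: {problem}\n"
        "First reason step by step about possible vulnerabilities, then write the code."
    )
    return "\n\n".join(parts)


# Illustrative demonstrations, not taken from the paper's dataset.
EXAMPLES = [
    SecurityExample(
        task="Look up a user by name in SQLite.",
        secure_code='cursor.execute("SELECT * FROM users WHERE name = ?", (name,))',
        rationale="The query is parameterized, so user input is never spliced into SQL.",
    ),
    SecurityExample(
        task="Read an API key.",
        secure_code='api_key = os.environ["API_KEY"]',
        rationale="The secret comes from the environment instead of being hard-coded.",
    ),
]

zero_shot = build_icl_prompt("Build a login endpoint.", EXAMPLES, shots=0)
one_shot = build_icl_prompt("Build a login endpoint.", EXAMPLES, shots=1)
few_shot = build_icl_prompt("Build a login endpoint.", EXAMPLES, shots=2)
```

Keeping the task text and the reasoning cue identical across the three prompts means that any difference in the generated code can be attributed to the demonstrations alone.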

Figure 1: The LLM code generation process: simplified Transformer architecture for code generation with potential security risks

Experimental Results

Research Questions

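RQ1: In the zero-shot setting, how secure is the code that different LLMs generate across a variety of programming challenges?
RQ2: After one-shot and few-shot learning with ICL security patterns, can LLMs understand and apply best practices and resolve vulnerabilities?
RQ3: Under ICL, how do PDCG LLMs (ChatGPT-4, Google Bard) and CCP LLMs (GitHub Copilot, CodeWhisperer) differ in generating secure code?
RQ4: Which security code smells persist after ICL security patterns are applied, and what risks do they carry?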
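
One simple way to frame RQ4 is as a set comparison between the smells found before and after ICL. The sketch below assumes each smell is identified by a label such as a CWE ID; the `compare_smells` function and the sample labels are hypothetical and are not results from the paper.

```python
def compare_smells(before: set[str], after: set[str]) -> dict[str, set[str]]:
    """Split findings into resolved, persistent, and newly introduced smells."""
    return {
        "resolved": before - after,      # present before ICL, gone afterwards
        "persistent": before & after,    # still present after ICL
        "introduced": after - before,    # appeared only after ICL
    }


# Made-up labels for illustration only.
zero_shot_smells = {"CWE-89", "CWE-798", "CWE-20"}
few_shot_smells = {"CWE-20", "CWE-117"}

report = compare_smells(zero_shot_smells, few_shot_smells)
```

The "persistent" bucket is the one RQ4 asks about; the "introduced" bucket is worth tracking too, since a new prompt can add smells as well as remove them.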

Key Findings

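ICL-driven one-shot and few-shot learning patterns can improve code security and reduce vulnerabilities across a range of programming scenarios.
Evaluating four different LLMs across three languages reveals differences between PDCGs and CCPs in secure code generation.
Security testing combines SAST tools with manual code review to identify vulnerabilities and hidden code smells.
A security risk assessment framework was developed to quantify code-level security threats.
The study proposes releasing a security instruction dataset to guide future research on secure code generation.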
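
This summary does not give the formula behind the paper's security risk assessment, so the following is only a sketch of what quantifying code-level threats from SAST output could look like. The severity weights, the `Finding` class, the `risk_score` function, and the sample findings are all assumptions for illustration; the score here is a severity-weighted finding count per 1,000 lines of code.

```python
from dataclasses import dataclass

# Hypothetical severity weights; the paper's actual weighting is not given here.
SEVERITY_WEIGHTS = {"low": 1, "medium": 3, "high": 5}


@dataclass(frozen=True)
class Finding:
    """One SAST or manual-review finding (illustrative structure)."""
    cwe_id: str
    severity: str  # one of the keys in SEVERITY_WEIGHTS


def risk_score(findings: list[Finding], lines_of_code: int) -> float:
    """Return the severity-weighted finding count per 1,000 lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    weighted_total = sum(SEVERITY_WEIGHTS[f.severity] for f in findings)
    return 1000 * weighted_total / lines_of_code


# Made-up findings for a 200-line generated file.
sample_findings = [
    Finding(cwe_id="CWE-89", severity="high"),
    Finding(cwe_id="CWE-798", severity="medium"),
]
score = risk_score(sample_findings, lines_of_code=200)  # (5 + 3) * 1000 / 200 = 40.0
```

Normalizing by code size keeps scores comparable between short and long generated programs, which matters when comparing models that produce outputs of different lengths.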

Figure 2: Prompt Driven Code Generators with BLLMs


This review was generated by AI and reviewed by a human editor.