QUICK REVIEW

[Paper Review] Can We Trust Large Language Models Generated Code? A Framework for In-Context Learning, Security Patterns, and Code Evaluations Across Diverse LLMs

Ahmad Mohsin, Helge Janicke | arXiv (Cornell University) | 2024-06-18

Software Engineering Research | Citations: 7

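One-Line Summary

This paper proposes an in-context learning security framework for guiding and evaluating four diverse LLMs (PDCGs and CCPs) across C++, C#, and Python, using security patterns together with SAST and code review to assess and improve secure code generation.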

ABSTRACT

Large Language Models (LLMs) such as ChatGPT and GitHub Copilot have revolutionized automated code generation in software engineering. However, as these models are increasingly utilized for software development, concerns have arisen regarding the security and quality of the generated code. These concerns stem from LLMs being primarily trained on publicly available code repositories and internet-based textual data, which may contain insecure code. This presents a significant risk of perpetuating vulnerabilities in the generated code, creating potential attack vectors for exploitation by malicious actors. Our research aims to tackle these issues by introducing a framework for secure behavioral learning of LLMs through In-Context Learning (ICL) patterns during the code generation process, followed by rigorous security evaluations. To achieve this, we have selected four diverse LLMs for experimentation. We have evaluated these coding LLMs across three programming languages and identified security vulnerabilities and code smells. The code is generated through ICL with curated problem sets and undergoes rigorous security testing to evaluate the overall quality and trustworthiness of the generated code. Our research indicates that ICL-driven one-shot and few-shot learning patterns can enhance code security, reducing vulnerabilities in various programming scenarios. Developers and researchers should know that LLMs have a limited understanding of security principles. This may lead to security breaches when the generated code is deployed in production systems. Our research highlights LLMs are a potential source of new vulnerabilities to the software supply chain. It is important to consider this when using LLMs for code generation. This research article offers insights into improving LLM security and encourages proactive use of LLMs for code generation to ensure software system safety.

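Research Motivation and Objectives

Address the security risks of LLM-generated code that stem from training on public repositories and vulnerable samples.
Develop and evaluate In-Context Learning (ICL) security patterns that teach LLMs secure coding behavior.
Compare PDCGs and CCPs across multiple programming languages.
Build and analyze a dataset and a security evaluation workflow for LLM-generated code.
Provide insights and a dataset to advance secure code generation in AI-assisted development.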

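Proposed Method

Curate a diverse set of programming problems in C++, C#, and Python, covering data structures and algorithms (DS&A), API development, and the MVC design pattern.
Develop In-Context Learning (ICL) security patterns for zero-shot, one-shot, and few-shot learning scenarios, including chain-of-thought reasoning.
Generate code for every problem with four LLMs (PDCGs: ChatGPT-4, Google Bard; CCPs: GitHub Copilot, Amazon CodeWhisperer).
Evaluate the generated code with static application security testing (SAST) and manual security review to identify vulnerabilities and hidden code smells.
Compute security risk metrics at the source-code level and analyze which code smells persist after ICL.
Publish a security-instructions dataset to support future security training of LLMs.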
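The zero-shot, one-shot, and few-shot scenarios differ only in how many security demonstrations precede the task in the prompt. The following is a minimal sketch of that idea, not the authors' implementation: the `SecurityPattern` structure, the `build_prompt` helper, the prompt wording, and the two demonstrations are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SecurityPattern:
    """One in-context demonstration (hypothetical structure, not from the paper)."""
    task: str
    reasoning: str    # chain-of-thought style note on why the code is safe
    secure_code: str


def build_prompt(problem: str, patterns: list[SecurityPattern], shots: int) -> str:
    """Assemble a zero-shot (shots=0), one-shot (1), or few-shot (>1) prompt."""
    if shots < 0:
        raise ValueError("shots must be non-negative")
    parts = ["You are a coding assistant. Write secure code."]
    # Prepend the first `shots` demonstrations; none are added when shots == 0.
    for i, p in enumerate(patterns[:shots], start=1):
        parts.append(
            f"### Example {i}\nTask: {p.task}\n"
            f"Reasoning: {p.reasoning}\nSecure code:\n{p.secure_code}"
        )
    parts.append(f"### Task\n{problem}")
    return "\n\n".join(parts)


# Illustrative demonstrations only; the paper's actual patterns are not shown in this review.
PATTERNS = [
    SecurityPattern(
        task="Look up a user by name in SQLite.",
        reasoning="Never concatenate user input into SQL; bind it as a parameter.",
        secure_code='cur.execute("SELECT id FROM users WHERE name = ?", (name,))',
    ),
    SecurityPattern(
        task="Store a user password.",
        reasoning="Use a salted, slow key-derivation function instead of a bare hash.",
        secure_code='digest = hashlib.pbkdf2_hmac("sha256", password, salt, 600_000)',
    ),
]

problem = "Write a Python function that deletes a user record by email."
zero_shot = build_prompt(problem, PATTERNS, shots=0)
one_shot = build_prompt(problem, PATTERNS, shots=1)
few_shot = build_prompt(problem, PATTERNS, shots=2)
```

Each resulting string would be sent to the model under test. Holding the task text constant while varying only `shots` makes it possible to attribute differences in the generated code to the demonstrations.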

Figure 1 : The LLM code generation process: simplified Transformer architecture for code generation with potential security risks

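Experimental Results

Research Questions

RQ1: How secure is the code that different LLMs generate across diverse programming challenges in a zero-shot scenario?
RQ2: To what extent do LLMs understand and apply best practices and address vulnerabilities after one-shot or few-shot learning with ICL security patterns?
RQ3: How do PDCG LLMs (ChatGPT-4, Google Bard) compare with CCP LLMs (GitHub Copilot, CodeWhisperer) at generating secure code under ICL?
RQ4: Which security code smells and associated risks persist after ICL security patterns are applied?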

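Key Results

ICL-driven one-shot and few-shot learning patterns can strengthen code security and reduce vulnerabilities across programming scenarios.
Evaluating four diverse LLMs across three languages reveals differences in secure code generation between PDCGs and CCPs.
Security testing combines SAST tools with manual code review to identify vulnerabilities and hidden code smells.
A security risk assessment framework is developed to quantify code-level security threats.
The study proposes releasing a security-instructions dataset to guide future research on secure code generation.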
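This review does not give the formula behind the risk assessment, so the sketch below only illustrates one plausible way to quantify code-level risk and to track which weaknesses persist after ICL. The severity weights, the per-1,000-lines normalization, the `Finding` type, and the toy inputs are all assumptions, not the paper's metric or its results.

```python
from dataclasses import dataclass

# Hypothetical severity weights; the paper's actual scoring is not given in this review.
SEVERITY_WEIGHT = {"low": 1, "medium": 3, "high": 7}


@dataclass(frozen=True)
class Finding:
    """A single SAST or manual-review finding (hypothetical structure)."""
    cwe: str
    severity: str  # "low" | "medium" | "high"


def risk_score(findings: list[Finding], lines_of_code: int) -> float:
    """Severity-weighted findings per 1,000 lines of code."""
    if lines_of_code <= 0:
        raise ValueError("lines_of_code must be positive")
    weighted = sum(SEVERITY_WEIGHT[f.severity] for f in findings)
    return 1000 * weighted / lines_of_code


def persistent_cwes(before: list[Finding], after: list[Finding]) -> set[str]:
    """CWE IDs reported both before and after ICL prompting."""
    return {f.cwe for f in before} & {f.cwe for f in after}


# Toy inputs for illustration only; these are not results from the paper.
zero_shot = [Finding("CWE-89", "high"), Finding("CWE-798", "high"), Finding("CWE-20", "medium")]
few_shot = [Finding("CWE-20", "medium")]

print(risk_score(zero_shot, 200))            # 85.0
print(risk_score(few_shot, 200))             # 15.0
print(persistent_cwes(zero_shot, few_shot))  # {'CWE-20'}
```

Normalizing by code size keeps longer solutions from looking riskier simply because they contain more lines, and the set intersection is a simple stand-in for the persistence analysis asked about in RQ4.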

Figure 2 : Prompt Driven Code Generators with BLLMs


This review was generated by AI and checked by a human editor.