QUICK REVIEW

[Paper Review] Lost at C: A User Study on the Security Implications of Large Language Model Code Assistants

Gustavo Sandoval, Hammond Pearce|arXiv (Cornell University)|Aug 20, 2022

Software Engineering Research | Cited by 39

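ONE-SENTENCE SUMMARY

This paper reports a security-focused user study (N=58) that tests whether Codex-based AI code suggestions increase security vulnerabilities when students implement a singly-linked shopping list in C; the security impact turns out to be small, with AI-assisted users producing critical security bugs at a rate no more than 10% higher than the control group.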

ABSTRACT

Large Language Models (LLMs) such as OpenAI Codex are increasingly being used as AI-based coding assistants. Understanding the impact of these tools on developers' code is paramount, especially as recent work showed that LLMs may suggest cybersecurity vulnerabilities. We conduct a security-driven user study (N=58) to assess code written by student programmers when assisted by LLMs. Given the potential severity of low-level bugs as well as their relative frequency in real-world projects, we tasked participants with implementing a singly-linked 'shopping list' structure in C. Our results indicate that the security impact in this setting (low-level C with pointer and array manipulations) is small: AI-assisted users produce critical security bugs at a rate no greater than 10% more than the control, indicating the use of LLMs does not introduce new security risks.

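MOTIVATION AND OBJECTIVES

Assess whether AI code assistants affect security in a coding task that approximates real-world work.
Compare the incidence of security vulnerabilities in AI-assisted and unassisted code.
Analyze where security vulnerabilities originate in AI-assisted coding tasks (human-written vs. AI-suggested code).
Release open data to support reproduction and broader analysis.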

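PROPOSED METHOD

Randomized controlled design with two groups: a control group (no Codex access) and an assisted group (Codex access).
A cloud-based IDE logs each user's interactions with Codex so that acceptance of AI suggestions can be analyzed.
Participants complete 12 C functions for a singly-linked shopping list, a task chosen to expose memory-related vulnerabilities.
Security and functionality are evaluated using the CWE classification, combining static and run-time analysis with manual review.
An Autopilot condition, in which Codex generates the code entirely on its own, serves as a comparison (30 solutions across three Codex models).
The statistical framework includes comparison tests for effectiveness and a non-inferiority test with a 10% margin on the vulnerability rate.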
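
To make the task concrete, here is a minimal sketch of the kind of singly-linked shopping list the participants worked on. It is not the study's actual skeleton code: the struct layout, the `ITEM_NAME_CAP` limit, and the function names `list_add_item`, `list_remove_item`, and `list_free` are all assumptions made for illustration. The CWE identifiers in the comments name well-known memory-safety classes that such code can run into; they are not taken from the study's results.

```c
#include <stdlib.h>
#include <string.h>

#define ITEM_NAME_CAP 32 /* hypothetical fixed buffer size for item names */

typedef struct node {
    char name[ITEM_NAME_CAP];
    unsigned int quantity;
    struct node *next;
} node;

/* Append an item at the tail of the list.
 * Returns 0 on success, -1 on invalid input or allocation failure. */
int list_add_item(node **head, const char *name, unsigned int quantity) {
    if (head == NULL || name == NULL) {
        return -1;
    }
    size_t len = strlen(name);
    if (len >= ITEM_NAME_CAP) {
        return -1; /* reject instead of writing past the buffer (CWE-787 class) */
    }
    node *n = malloc(sizeof *n);
    if (n == NULL) {
        return -1; /* avoid dereferencing a failed allocation (CWE-476 class) */
    }
    memcpy(n->name, name, len + 1); /* length checked above; copies the terminator */
    n->quantity = quantity;
    n->next = NULL;

    node **cur = head;
    while (*cur != NULL) {
        cur = &(*cur)->next;
    }
    *cur = n;
    return 0;
}

/* Remove the first item whose name matches.
 * Returns 0 if an item was removed, -1 if none was found. */
int list_remove_item(node **head, const char *name) {
    if (head == NULL || name == NULL) {
        return -1;
    }
    for (node **cur = head; *cur != NULL; cur = &(*cur)->next) {
        if (strcmp((*cur)->name, name) == 0) {
            node *victim = *cur;
            *cur = victim->next; /* unlink before freeing (CWE-416 class) */
            free(victim);
            return 0;
        }
    }
    return -1;
}

/* Free every node and leave the caller's head pointer set to NULL. */
void list_free(node **head) {
    if (head == NULL) {
        return;
    }
    while (*head != NULL) {
        node *next = (*head)->next;
        free(*head);
        *head = next;
    }
}
```

A caller would start from `node *head = NULL;`, add items with `list_add_item(&head, "milk", 2)`, and release everything with `list_free(&head)`. The pointer-to-pointer traversal is one design choice among several; it removes the special case for an empty list, which is a common place for NULL-handling mistakes.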

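EXPERIMENTAL RESULTS

Research Questions

RQ1: Do AI code assistants help novice programmers write more functional code?
RQ2: Do AI-assisted solutions have an acceptable incidence of security vulnerabilities compared with unassisted code?
RQ3: Where do vulnerabilities originate in an LLM-assisted system (human-written code vs. AI-suggested code)?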

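Key Findings

AI-assisted users produced code with functionality and productivity gains consistent with prior work.
Measured per line of code, the AI-assisted group's security bug rate exceeded the control group's by no more than 10%.
63% of vulnerabilities appeared in human-written code and 36% in AI-suggested code.
Vulnerabilities were classified with the CWE taxonomy, combining manual analysis with static and run-time checks.
The study's data and materials are released as open source.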
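
The 10% margin in the findings above can be read as a simple comparison of bug rates per line of code. The sketch below shows only that margin check on point estimates, assuming the margin is relative to the control rate; the study itself applies a statistical non-inferiority test, which this does not reproduce. The function names and any numbers used with them are hypothetical.

```c
#include <stdbool.h>

/* Bugs per line of code. Returns 0.0 for an empty code base
 * so that the caller never divides by zero. */
double bug_rate(unsigned long bugs, unsigned long lines_of_code) {
    if (lines_of_code == 0) {
        return 0.0;
    }
    return (double)bugs / (double)lines_of_code;
}

/* Point-estimate margin check (assumption: relative margin).
 * True when the assisted rate is at most (1 + margin) times the control rate.
 * This ignores sampling uncertainty, which a real non-inferiority test
 * accounts for. */
bool within_relative_margin(double assisted_rate, double control_rate, double margin) {
    return assisted_rate <= control_rate * (1.0 + margin);
}
```

With made-up counts of 20 bugs in 1000 control lines and 21 bugs in 1000 assisted lines, `within_relative_margin(bug_rate(21, 1000), bug_rate(20, 1000), 0.10)` holds, while 25 assisted bugs in the same number of lines would fall outside the margin.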


This review was generated by AI and checked by a human editor.