QUICK REVIEW

[Paper Review] Do you still need a manual smart contract audit?

Isaac David, Liyi Zhou|arXiv (Cornell University)|Jun 21, 2023

Internet Traffic Analysis and Secure E-voting|Cited by 19

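ONE-SENTENCE SUMMARY

This paper evaluates GPT-4-32k and Claude-v1.3-100k on automated security audits of 52 vulnerable DeFi smart contracts. The models identify the correct vulnerability type in 40% of cases but produce many false positives, and mutation testing shows a best-case true positive rate of 78.7%, which suggests that large language models can assist manual audits but cannot replace them.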

ABSTRACT

We investigate the feasibility of employing large language models (LLMs) for conducting the security audit of smart contracts, a traditionally time-consuming and costly process. Our research focuses on the optimization of prompt engineering for enhanced security analysis, and we evaluate the performance and accuracy of LLMs using a benchmark dataset comprising 52 Decentralized Finance (DeFi) smart contracts that have previously been compromised. Our findings reveal that, when applied to vulnerable contracts, both GPT-4 and Claude models correctly identify the vulnerability type in 40% of the cases. However, these models also demonstrate a high false positive rate, necessitating continued involvement from manual auditors. The LLMs tested outperform a random model by 20% in terms of F1-score. To ensure the integrity of our study, we conduct mutation testing on five newly developed and ostensibly secure smart contracts, into which we manually insert two and 15 vulnerabilities each. This testing yielded a remarkable best-case 78.7% true positive rate for the GPT-4-32k model. We tested both, asking the models to perform a binary classification on whether a contract is vulnerable, and a non-binary prompt. We also examined the influence of model temperature variations and context length on the LLM's performance. Despite the potential for many further enhancements, this work lays the groundwork for a more efficient and economical approach to smart contract security audits.

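MOTIVATION AND OBJECTIVES

Assess the feasibility of using large language models (LLMs) to perform security audits of smart contracts.
Determine which vulnerability types LLMs can detect, and measure their accuracy against known attacks.
Analyze how context length, temperature, and prompt design affect LLM performance.
Evaluate robustness through mutation testing, injecting vulnerabilities into newly written secure contracts.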

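PROPOSED METHOD

Use the GPT-4-32k and Claude-v1.3-100k APIs to run single-shot binary classification over 38 vulnerability types on 52 vulnerable DeFi contracts.
Supply the smart contract source code as context and prompt the model to answer YES/NO for each vulnerability type.
Aggregate the answers to count true positives, false positives, true negatives, and false negatives.
Run mutation testing by inserting 2 or 15 vulnerabilities into each of five newly written, ostensibly secure contracts, then re-evaluate with both binary and non-binary prompts.
Examine how context length (token limit) and temperature affect model performance.
Present two chain-of-thought case studies that show the prompts and the results.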
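The per-type YES/NO loop and the tallying step can be sketched roughly as follows. This is a minimal illustration, not the authors' harness: `build_prompt`, `parse_answer`, `tally_audit`, the prompt wording, the toy contract, and the vulnerability type names are all assumptions made for this example, and `ask_model` stands in for whatever API client is actually used.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, Set


def build_prompt(source: str, vuln_type: str) -> str:
    """Hypothetical prompt wording; the paper's exact prompt is not reproduced here."""
    return (
        f"Is the following smart contract vulnerable to '{vuln_type}'? "
        "Reply with YES or NO only.\n\n" + source
    )


def parse_answer(reply: str) -> bool:
    """Treat any reply that starts with YES (case-insensitive) as a positive."""
    return reply.strip().upper().startswith("YES")


def tally_audit(
    contracts: Dict[str, str],
    ground_truth: Dict[str, Set[str]],
    vuln_types: Iterable[str],
    ask_model: Callable[[str], str],
) -> Counter:
    """Ask one YES/NO question per (contract, vulnerability type) and tally outcomes."""
    counts = Counter(tp=0, fp=0, tn=0, fn=0)
    types = list(vuln_types)
    for name, source in contracts.items():
        actual = ground_truth.get(name, set())
        for vuln_type in types:
            predicted = parse_answer(ask_model(build_prompt(source, vuln_type)))
            real = vuln_type in actual
            if predicted:
                counts["tp" if real else "fp"] += 1
            else:
                counts["fn" if real else "tn"] += 1
    return counts


def always_yes(prompt: str) -> str:
    """Stand-in model that flags everything; a real client would call an LLM API."""
    return "YES"


demo = tally_audit(
    contracts={"Toy": "contract Toy { }"},           # illustrative source only
    ground_truth={"Toy": {"reentrancy"}},            # illustrative label only
    vuln_types=["reentrancy", "integer overflow", "price oracle manipulation"],
    ask_model=always_yes,
)
print(dict(demo))
```

A model that answers YES to everything never misses a vulnerability but pays for it in false positives, which is the trade-off the confusion-matrix counts are meant to expose.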

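EXPERIMENTAL RESULTS

Research questions

RQ1: Can GPT-4-32k and Claude-v1.3-100k reliably identify vulnerability types in DeFi smart contracts?
RQ2: What hit rate do the LLMs achieve on contracts known to be vulnerable, and how many false positives do they produce?
RQ3: How do context length and temperature affect LLM performance in smart contract auditing?
RQ4: Does mutation testing reveal how robust LLM-based auditing is to unseen vulnerabilities?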

Key findings

Model	True positives	True negatives	False positives	False negatives
claude-v1.3-100k	26	1290	578	47
gpt-4-32k	32	1128	740	41
random	36.54	960.62	962.42	39.21
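As a rough cross-check, precision, recall, and F1 can be derived directly from the counts in the table. The sketch below pools each model's counts into a single confusion matrix; this is an assumption about aggregation, so the pooled F1 values (roughly 0.076 to 0.077 for the two LLMs) need not match the averaged F1 figures quoted in this review digit for digit. The `Confusion` class and `TABLE` name are introduced only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    """Confusion-matrix counts for one model (floats allow the random baseline)."""
    tp: float
    tn: float
    fp: float
    fn: float

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        # Equivalent to the harmonic mean of precision and recall.
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)


# Counts copied from the table above.
TABLE = {
    "claude-v1.3-100k": Confusion(tp=26, tn=1290, fp=578, fn=47),
    "gpt-4-32k": Confusion(tp=32, tn=1128, fp=740, fn=41),
    "random": Confusion(tp=36.54, tn=960.62, fp=962.42, fn=39.21),
}

for name, c in TABLE.items():
    print(f"{name}: precision={c.precision:.3f} recall={c.recall:.3f} f1={c.f1:.3f}")

# Pooled figures for the two LLMs only.
llms = [TABLE["claude-v1.3-100k"], TABLE["gpt-4-32k"]]
combined_tp = sum(c.tp for c in llms)
combined_pos = sum(c.tp + c.fn for c in llms)
total_fp = sum(c.fp for c in llms)
print(f"combined hit rate: {combined_tp}/{combined_pos} = {combined_tp / combined_pos:.1%}")
print(f"total false positives: {total_fp}")
```

The low precision relative to recall is what drives the small F1 values: both models flag many more vulnerability types than are actually present.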

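The LLMs identified the correct vulnerability type in 40% of cases across the 52 attacked DeFi contracts.
The two models together produced 1318 false positives (578 for Claude-v1.3-100k and 740 for GPT-4-32k), so substantial manual verification is still required.
Combined, GPT-4-32k and Claude-v1.3-100k reach a vulnerability-type hit rate of 58/146 (40%); the average F1 is 0.077 for GPT-4-32k and 0.076 for Claude-v1.3-100k.
In mutation testing on the five synthetic contracts, GPT-4-32k reaches a best-case true positive rate of 78.7% under some conditions.
Non-binary prompts generally yield a higher true positive rate than binary prompts, which suggests that richer answers help surface vulnerabilities.
Longer context tends to reduce performance, although Claude shows comparatively better true positive results at longer context lengths.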


This review was generated by AI and checked by a human editor.