QUICK REVIEW

[Paper Review] How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation

Cen Zhang, Yaowen Zheng|arXiv (Cornell University)|Jul 24, 2023

Software Engineering Research|References: 27|Cited by: 8

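One-Sentence Summary

This paper empirically studies how effectively large language models (LLMs) generate fuzz drivers for C library APIs, analyzing prompting strategies, model types, and temperature settings, and comparing the results against drivers used in industry.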

ABSTRACT

LLM-based (Large Language Model) fuzz driver generation is a promising research area. Unlike traditional program analysis-based method, this text-based approach is more general and capable of harnessing a variety of API usage information, resulting in code that is friendly for human readers. However, there is still a lack of understanding regarding the fundamental issues on this direction, such as its effectiveness and potential challenges. To bridge this gap, we conducted the first in-depth study targeting the important issues of using LLMs to generate effective fuzz drivers. Our study features a curated dataset with 86 fuzz driver generation questions from 30 widely-used C projects. Six prompting strategies are designed and tested across five state-of-the-art LLMs with five different temperature settings. In total, our study evaluated 736,430 generated fuzz drivers, with 0.85 billion token costs ($8,000+ charged tokens). Additionally, we compared the LLM-generated drivers against those utilized in industry, conducting extensive fuzzing experiments (3.75 CPU-year). Our study uncovered that:
- While LLM-based fuzz driver generation is a promising direction, it still encounters several obstacles towards practical applications;
- LLMs face difficulties in generating effective fuzz drivers for APIs with intricate specifics. Three featured design choices of prompt strategies can be beneficial: issuing repeat queries, querying with examples, and employing an iterative querying process;
- While LLM-generated drivers can yield fuzzing outcomes that are on par with those used in the industry, there are substantial opportunities for enhancement, such as extending contained API usage, or integrating semantic oracles to facilitate logical bug detection.
Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry.

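Research Motivation and Goals

Evaluate the effectiveness of zero-shot, LLM-based fuzz driver generation on widely used C APIs.
Identify the main challenges and bottlenecks in using LLMs to generate high-quality fuzz drivers.
Assess how different prompting strategies, models, and temperature settings affect the success rate.
Compare LLM-generated fuzz drivers with drivers used in industry under realistic fuzzing scenarios.
Provide actionable recommendations for improving practical fuzz driver generation and its integration with OSS-Fuzz-Gen.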

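Proposed Method

Curate a dataset of 86 fuzz driver generation questions from 30 OSS-Fuzz C projects.
Design and evaluate six fuzz driver generation strategies across five state-of-the-art LLMs and five temperature settings.
Automatically validate generated drivers through compilation, short-term fuzzing, and semantic checks of API usage.
Measure efficiency and cost in tokens and compute, and analyze performance under different configurations.
Compare LLM-generated drivers with industry drivers in long-running fuzzing campaigns.
Apply the insights to OSS-Fuzz-Gen to support practical fuzz driver generation in industry.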
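
The automatic validation step listed above (compilation, then short-term fuzzing, then a semantic check of API usage) can be pictured as a staged pipeline that stops at the first failing stage. The following is only a minimal sketch under stated assumptions: `validate_driver`, `ValidationResult`, and the three `fake_*` checks are hypothetical names invented for illustration, and the string-matching checks stand in for a real compiler, a real fuzzer run, and the paper's actual semantic checkers, none of which are specified in this summary. `target_api` is likewise a placeholder API name.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A check takes driver source code and returns (passed, diagnostic message).
Check = Callable[[str], Tuple[bool, str]]


@dataclass
class ValidationResult:
    effective: bool               # True only if every stage passed
    failed_stage: Optional[str]   # name of the first failing stage, if any
    message: str                  # diagnostic from that stage


def validate_driver(source: str, stages: List[Tuple[str, Check]]) -> ValidationResult:
    """Run the checks in order and stop at the first failure."""
    for name, check in stages:
        passed, message = check(source)
        if not passed:
            return ValidationResult(False, name, message)
    return ValidationResult(True, None, "all stages passed")


# Hypothetical stand-ins: real checks would invoke a compiler, run a fuzzer
# for a short time, and inspect how the target API is actually used.
def fake_compile(source: str) -> Tuple[bool, str]:
    ok = "LLVMFuzzerTestOneInput" in source
    return ok, ("" if ok else "error: missing fuzz entry point")


def fake_short_fuzz(source: str) -> Tuple[bool, str]:
    ok = "abort(" not in source
    return ok, ("" if ok else "crash caused by the driver itself")


def fake_semantic_check(source: str) -> Tuple[bool, str]:
    ok = "target_api(" in source
    return ok, ("" if ok else "target API is never called")


STAGES = [
    ("compile", fake_compile),
    ("short_fuzz", fake_short_fuzz),
    ("semantic", fake_semantic_check),
]

good_driver = (
    "int LLVMFuzzerTestOneInput(const uint8_t *d, size_t n)"
    " { target_api(d, n); return 0; }"
)
bad_driver = "int main(void) { return 0; }"

good_result = validate_driver(good_driver, STAGES)
bad_result = validate_driver(bad_driver, STAGES)
print(good_result)
print(bad_result)
```

Ordering the stages from cheapest to most expensive means a driver that does not compile never consumes fuzzing time, which matters when hundreds of thousands of candidates are being evaluated.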

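Experimental Results

Research Questions

RQ1: To what extent can current LLMs generate effective fuzz drivers for software testing?
RQ2: What are the main challenges in using LLMs to generate effective fuzz drivers?
RQ3: How effective are the different prompting strategies, and what characterizes each of them?
RQ4: How do LLM-generated drivers perform compared with drivers used in industry?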

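Main Findings

LLM-based fuzz driver generation shows strong potential but faces practical challenges such as high generation cost and semantic correctness problems.
Three design choices improve effectiveness: repeated queries, the use of extended API information and examples, and iterative querying with repair.
The best configuration (gpt-4-0613, ALL-ITER-K, temperature 0.5) solved about 91% of the questions (78/86).
Lower temperatures (notably 0.5 or 0.0) generally work better for this task; temperatures that are too high degrade performance.
Open-source LLMs can match or even exceed some proprietary models (for example, wizardcoder-15b-v1.0 comes close to gpt-3.5-turbo-0613), but performance varies widely across configurations.
LLM-generated fuzz drivers are limited in their API usage, which points to needed improvements such as broader API coverage and semantic oracles.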
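
Two of the design choices named in these findings, repeated queries and iterative querying with repair, combine naturally into a nested loop: an outer loop that restarts from the original prompt, and an inner loop that feeds the validator's diagnostic back to the model. The sketch below is an illustration under assumptions, not the paper's implementation: `generate_with_repair`, `stub_llm`, `stub_validate`, the prompt wording, and the default values of `k` and `max_rounds` are all hypothetical, and the stubs replace a real LLM call and a real validator.

```python
from typing import Callable, Optional, Tuple


def generate_with_repair(
    query_llm: Callable[[str], str],
    validate: Callable[[str], Tuple[bool, str]],
    base_prompt: str,
    k: int = 3,            # number of independent restarts (repeated queries)
    max_rounds: int = 2,   # repair rounds allowed after each initial attempt
) -> Tuple[Optional[str], int]:
    """Return (driver, attempts); driver is None if no attempt validated."""
    attempts = 0
    for _ in range(k):
        prompt = base_prompt  # each restart begins from the original prompt
        for _ in range(max_rounds + 1):  # one initial attempt plus repairs
            driver = query_llm(prompt)
            attempts += 1
            ok, diagnostic = validate(driver)
            if ok:
                return driver, attempts
            # Feed the failed driver and its diagnostic back to the model.
            prompt = (
                f"{base_prompt}\n\nPrevious attempt:\n{driver}\n\n"
                f"It failed with:\n{diagnostic}\nPlease fix it."
            )
    return None, attempts


# Hypothetical stubs: the "model" only produces a valid driver once it has
# been shown an error message, and the "validator" only checks for the
# libFuzzer entry point name.
def stub_llm(prompt: str) -> str:
    if "It failed with" in prompt:
        return "int LLVMFuzzerTestOneInput(const uint8_t *d, size_t n) { return 0; }"
    return "int main(void) { return 0; }"


def stub_validate(driver: str) -> Tuple[bool, str]:
    if "LLVMFuzzerTestOneInput" in driver:
        return True, ""
    return False, "undefined reference to LLVMFuzzerTestOneInput"


driver, attempts = generate_with_repair(
    stub_llm, stub_validate, "Write a fuzz driver for the target API."
)
print(attempts)
print(driver)
```

With these stubs the first attempt fails and the first repair round succeeds, so the loop stops after two model calls. In a real setting each call costs tokens, so `k` and `max_rounds` trade success rate against the generation cost highlighted in the findings.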

This review was generated by AI and checked by a human editor.