[Paper Review] How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation
This paper empirically investigates how well large language models (LLMs) can generate fuzz drivers for C library APIs, analyzing prompting strategies, model types, and temperatures, and comparing the results against industrial drivers.
Large language model (LLM)-based fuzz driver generation is a promising research area. Unlike traditional program analysis-based methods, this text-based approach is more general and capable of harnessing a variety of API usage information, resulting in code that is friendly for human readers. However, there is still a lack of understanding regarding the fundamental issues in this direction, such as its effectiveness and potential challenges. To bridge this gap, we conducted the first in-depth study targeting the important issues of using LLMs to generate effective fuzz drivers. Our study features a curated dataset with 86 fuzz driver generation questions from 30 widely-used C projects. Six prompting strategies are designed and tested across five state-of-the-art LLMs with five different temperature settings. In total, our study evaluated 736,430 generated fuzz drivers, with 0.85 billion token costs ($8,000+ charged tokens). Additionally, we compared the LLM-generated drivers against those utilized in industry, conducting extensive fuzzing experiments (3.75 CPU-year). Our study uncovered that:
- While LLM-based fuzz driver generation is a promising direction, it still encounters several obstacles towards practical applications;
- LLMs face difficulties in generating effective fuzz drivers for APIs with intricate specifics. Three featured design choices of prompt strategies can be beneficial: issuing repeat queries, querying with examples, and employing an iterative querying process;
- While LLM-generated drivers can yield fuzzing outcomes that are on par with those used in the industry, there are substantial opportunities for enhancement, such as extending contained API usage, or integrating semantic oracles to facilitate logical bug detection.
Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry.
Research Motivation and Objectives
- Assess the effectiveness of zero-shot LLM-based fuzz driver generation for widely-used C APIs.
- Identify the main challenges and bottlenecks in generating high-quality fuzz drivers with LLMs.
- Evaluate how different prompting strategies, models, and temperature settings affect success rates.
- Compare LLM-generated fuzz drivers with industry-used drivers in real fuzzing scenarios.
- Provide actionable recommendations to improve practical fuzz driver generation and integration with OSS-Fuzz-Gen.
Proposed Method
- Assemble a dataset of 86 fuzz driver generation questions from 30 OSS-Fuzz C projects.
- Design and evaluate six fuzz driver generation strategies across five state-of-the-art LLMs with five temperatures.
- Automatically validate generated drivers via compilation, short-term fuzzing, and API-usage semantic checks.
- Measure efficiency and cost in tokens and compute, analyzing performance across configurations.
- Compare LLM-generated drivers against industry drivers using long-running fuzzing runs.
- Implement insights into OSS-Fuzz-Gen to support practical fuzz driver generation in industry.
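The combination of automatic validation and iterative querying with fixes described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the model call and all three checks are stand-ins, and the names `generate_with_fixes`, `fake_model`, `demo_parse`, `demo_free`, and `demo.h` are hypothetical. A real pipeline would invoke a compiler, a fuzzer, and an API-usage checker where the stubs sit.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A check returns an error message, or None when the driver passes.
Check = Callable[[str], Optional[str]]


@dataclass
class ValidationResult:
    ok: bool
    stage: str    # name of the failing stage, or "passed"
    message: str  # error text that can be fed back to the model


def validate(driver: str, checks: List[Tuple[str, Check]]) -> ValidationResult:
    """Run the checks in order and stop at the first failure."""
    for stage, check in checks:
        error = check(driver)
        if error is not None:
            return ValidationResult(False, stage, error)
    return ValidationResult(True, "passed", "")


def generate_with_fixes(query_model, checks, question, max_rounds=3):
    """Query, validate, and re-query with the failure message until a driver passes."""
    prompt = question
    for round_no in range(1, max_rounds + 1):
        driver = query_model(prompt)
        result = validate(driver, checks)
        if result.ok:
            return driver, round_no
        # Feed the failing stage and its message back into the next prompt.
        prompt = (f"{question}\n\nPrevious attempt:\n{driver}\n\n"
                  f"It failed at the {result.stage} stage: {result.message}")
    return None, max_rounds


# --- Stand-ins so the sketch runs without a compiler, fuzzer, or model ---
def fake_compile(driver):
    return None if "#include" in driver else "missing header include"


def fake_short_fuzz(driver):
    return None if "LLVMFuzzerTestOneInput" in driver else "no fuzz entry point"


def fake_semantic_check(driver):
    return None if "demo_free" in driver else "parsed object is never released"


CHECKS = [
    ("compile", fake_compile),
    ("short-fuzz", fake_short_fuzz),
    ("semantic", fake_semantic_check),
]

# Hypothetical driver texts for a made-up API (demo_parse / demo_free).
BROKEN = ("int LLVMFuzzerTestOneInput(const uint8_t *d, size_t n) "
          "{ demo_parse(d, n); return 0; }")
FIXED = ('#include "demo.h"\n'
         "int LLVMFuzzerTestOneInput(const uint8_t *d, size_t n) "
         "{ demo_free(demo_parse(d, n)); return 0; }")


def fake_model(prompt):
    # Pretend the model repairs its driver once it sees a failure report.
    return FIXED if "failed at" in prompt else BROKEN


driver, rounds = generate_with_fixes(
    fake_model, CHECKS, "Write a fuzz driver for demo_parse.")
print(rounds)
```

Ordering the checks from cheapest to most expensive means a driver that does not compile never reaches the fuzzing stage, and the failing stage name gives the follow-up prompt something concrete to address.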
Experimental Results
Research Questions
- RQ1: To what extent can current LLMs generate effective fuzz drivers for software testing?
- RQ2: What are the primary challenges associated with generating effective fuzz drivers using LLMs?
- RQ3: What are the effectiveness and characteristics for different prompting strategies?
- RQ4: How do LLM-generated drivers perform compared to those used in the industry?
Key Findings
- LLM-based fuzz driver generation shows strong potential but faces practical challenges such as high generation costs and semantic correctness issues.
- Three design choices improve effectiveness: repeated queries, using extended API information and examples, and iterative querying with fixes.
- The best-performing configuration (model gpt-4-0613, strategy ALL-ITER-K, temperature 0.5) solved about 91% of the questions (78/86).
- Lower temperatures (especially 0.5 or 0.0) generally yield better results for this task; higher temperatures reduce performance.
- Open-source LLMs can match or exceed some proprietary models (e.g., wizardcoder-15b-v1.0 approaching gpt-3.5-turbo-0613), though performance varies widely by configuration.
- LLMs generate fuzz drivers with limited API usage, indicating room for improvements such as broader API coverage and semantic oracles.
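As a quick sanity check on the figures quoted in this review, the solve rate and the rough per-driver generation cost can be recomputed from the reported totals. The averages below are back-of-the-envelope estimates only: the source reports the cost as "$8,000+", so the dollar figure is a lower bound, and it says nothing about how tokens were distributed across drivers.

```python
# Figures taken from the review text above.
solved_questions = 78
total_questions = 86
generated_drivers = 736_430
total_tokens = 0.85e9        # "0.85 billion token costs"
min_cost_usd = 8_000         # reported as "$8,000+", so a lower bound

# Share of questions solved by the best-performing configuration.
solve_rate = solved_questions / total_questions

# Average cost per generated driver (not per solved question).
tokens_per_driver = total_tokens / generated_drivers
usd_per_driver = min_cost_usd / generated_drivers

print(f"solve rate: {solve_rate:.1%}")
print(f"average tokens per driver: {tokens_per_driver:.0f}")
print(f"average cost per driver: at least ${usd_per_driver:.4f}")
```

The per-driver cost is small, but it is multiplied across hundreds of thousands of candidates, which is consistent with the finding that generation cost is a practical obstacle.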
This review was generated by AI and checked by a human editor.