QUICK REVIEW

[Paper Review] How Effective Are They? Exploring Large Language Model Based Fuzz Driver Generation

Cen Zhang, Yaowen Zheng | arXiv (Cornell University) | 2023-07-24

Software Engineering Research | References: 27 | Citations: 8

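One-Line Summary

This paper empirically investigates how well large language models (LLMs) can generate fuzz drivers for C library APIs, analyzes the effect of prompting strategy, model type, and temperature, and compares the results against drivers used in industry.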

ABSTRACT

LLM-based (Large Language Model) fuzz driver generation is a promising research area. Unlike traditional program analysis-based method, this text-based approach is more general and capable of harnessing a variety of API usage information, resulting in code that is friendly for human readers. However, there is still a lack of understanding regarding the fundamental issues on this direction, such as its effectiveness and potential challenges. To bridge this gap, we conducted the first in-depth study targeting the important issues of using LLMs to generate effective fuzz drivers. Our study features a curated dataset with 86 fuzz driver generation questions from 30 widely-used C projects. Six prompting strategies are designed and tested across five state-of-the-art LLMs with five different temperature settings. In total, our study evaluated 736,430 generated fuzz drivers, with 0.85 billion token costs ($8,000+ charged tokens). Additionally, we compared the LLM-generated drivers against those utilized in industry, conducting extensive fuzzing experiments (3.75 CPU-year). Our study uncovered that:

- While LLM-based fuzz driver generation is a promising direction, it still encounters several obstacles towards practical applications;
- LLMs face difficulties in generating effective fuzz drivers for APIs with intricate specifics. Three featured design choices of prompt strategies can be beneficial: issuing repeat queries, querying with examples, and employing an iterative querying process;
- While LLM-generated drivers can yield fuzzing outcomes that are on par with those used in the industry, there are substantial opportunities for enhancement, such as extending contained API usage, or integrating semantic oracles to facilitate logical bug detection.

Our insights have been implemented to improve the OSS-Fuzz-Gen project, facilitating practical fuzz driver generation in industry.

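Research Motivation and Goals

Evaluate the effectiveness of zero-shot, LLM-based fuzz driver generation for widely-used C APIs.
Identify the main challenges and bottlenecks in generating high-quality fuzz drivers with LLMs.
Assess how different prompting strategies, models, and temperature settings affect the success rate.
Compare LLM-generated fuzz drivers with drivers used in industry under realistic fuzzing scenarios.
Provide actionable recommendations for improving practical fuzz driver generation and its integration with OSS-Fuzz-Gen.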

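Proposed Method

Build a dataset of 86 fuzz driver generation questions drawn from 30 OSS-Fuzz C projects.
Design six fuzz driver generation strategies and evaluate them across five state-of-the-art LLMs and five temperature settings.
Validate each generated driver automatically through compilation, short-term fuzzing, and semantic checks on API usage.
Analyze performance per configuration, measuring efficiency and cost in tokens and computation.
Compare LLM-generated drivers with industry drivers using long-running fuzzing campaigns.
Implement the resulting insights in OSS-Fuzz-Gen to support real-world adoption.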
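
The automated validation step above (compilation, short-term fuzzing, semantic checks) can be pictured as a chain of checks that stops at the first failure. The sketch below is only an illustration under stated assumptions: the names `validate_driver`, `ValidationResult`, and the three stub stages are hypothetical and do not come from the paper, and the stubs merely inspect the source string where a real pipeline would invoke a compiler, a fuzzer, and per-API usage checkers. `LLVMFuzzerTestOneInput` is the libFuzzer entry point, assumed here as the driver format, and `target_api` is a made-up function name.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# A stage takes driver source code and returns (passed, diagnostic message).
Stage = Callable[[str], Tuple[bool, str]]


@dataclass
class ValidationResult:
    effective: bool               # True only if every stage passed
    failed_stage: Optional[str]   # name of the first failing stage, if any
    message: str                  # diagnostic from that stage ("" on success)


def validate_driver(source: str, stages: List[Tuple[str, Stage]]) -> ValidationResult:
    """Run the stages in order and stop at the first failure."""
    for name, check in stages:
        passed, message = check(source)
        if not passed:
            return ValidationResult(False, name, message)
    return ValidationResult(True, None, "")


# Hypothetical stand-ins for the real stages (assumptions, not the paper's code).
def stub_compile(source: str) -> Tuple[bool, str]:
    ok = "LLVMFuzzerTestOneInput" in source
    return ok, "" if ok else "error: missing fuzz entry point"


def stub_short_fuzz(source: str) -> Tuple[bool, str]:
    ok = "abort()" not in source
    return ok, "" if ok else "crash on first input"


def stub_semantic_check(source: str) -> Tuple[bool, str]:
    ok = "target_api(" in source
    return ok, "" if ok else "target API is never called"


STAGES = [
    ("compile", stub_compile),
    ("short_fuzz", stub_short_fuzz),
    ("semantic_check", stub_semantic_check),
]
```

Ordering the stages from cheapest to most expensive keeps the cost of rejecting a bad driver low, and the name of the failing stage together with its message is the kind of feedback an iterative prompting strategy could pass back to the model.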

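Experimental Results

Research Questions

RQ1: To what extent can current LLMs generate effective fuzz drivers for software testing?
RQ2: What are the main challenges in using LLMs to generate effective fuzz drivers?
RQ3: How effective are the different prompting strategies, and what are their characteristics?
RQ4: How do LLM-generated drivers perform compared with drivers used in industry?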

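Key Findings

LLM-based fuzz driver generation shows strong promise but faces practical challenges such as high generation cost and semantic correctness issues.
Three design choices improve effectiveness: repeated queries, queries with extended API information and examples, and iterative queries that include fixes.
The best configuration (gpt-4-0613, ALL-ITER-K, temperature 0.5) solved 78 of the 86 questions (about 91%).
Lower temperatures (notably 0.5 or 0.0) generally give better results on this task, while higher temperatures degrade performance.
Open-source LLMs can match or exceed some proprietary models (for example, wizardcoder-15b-v1.0 comes close to gpt-3.5-turbo-0613), although performance varies widely with configuration.
LLM-generated fuzz drivers exercise only a limited portion of the API, leaving room for improvement such as broader API coverage and semantic oracles.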
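
Two of the design choices listed above, repeated queries and iterative queries with fixes, can be combined in a small control loop. The following is a minimal sketch under assumptions and is not the paper's ALL-ITER-K implementation: `generate` and `validate` are hypothetical callables standing in for an LLM call and a driver validator, and the prompt wording, `repeats`, and `max_rounds` are illustrative values only.

```python
from typing import Callable, Optional, Tuple

# Hypothetical interfaces: `generate` maps a prompt to driver source code,
# and `validate` returns (is_effective, error_message).
Generate = Callable[[str], str]
Validate = Callable[[str], Tuple[bool, str]]


def generate_with_fixes(prompt: str, generate: Generate, validate: Validate,
                        max_rounds: int = 3) -> Optional[str]:
    """One iterative session: query, validate, then re-query with the error appended."""
    current_prompt = prompt
    for _ in range(max_rounds):
        driver = generate(current_prompt)
        ok, error = validate(driver)
        if ok:
            return driver
        # Feed the failed attempt and its diagnostic back to the model.
        current_prompt = (
            f"{prompt}\n\nPrevious attempt:\n{driver}\n\n"
            f"Error:\n{error}\n\nPlease fix the driver."
        )
    return None


def solve_question(prompt: str, generate: Generate, validate: Validate,
                   repeats: int = 5, max_rounds: int = 3) -> Optional[str]:
    """Repeat independent sessions; the question counts as solved if any one succeeds."""
    for _ in range(repeats):
        driver = generate_with_fixes(prompt, generate, validate, max_rounds)
        if driver is not None:
            return driver
    return None
```

In this sketch each repeat starts again from the original prompt, so repeats exploit sampling variance while the inner loop exploits validator feedback. Both multiply the number of model calls, which is one way to see why generation cost appears among the practical challenges.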


This review was generated by AI and checked by a human editor.