QUICK REVIEW

[Paper Review] UNIFUZZ: A Holistic and Pragmatic Metrics-Driven Platform for Evaluating Fuzzers

Yuwei Li, Shouling Ji|arXiv (Cornell University)|Oct 5, 2020

Software Reliability and Analysis Research|Cited by 30

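One-sentence summary

UNIFUZZ is an open-source, metrics-driven platform that standardizes fuzzer evaluation using realistic benchmarks and six categories of metrics. It shows that no single fuzzer dominates across all programs and highlights key factors that affect performance.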

ABSTRACT

A flurry of fuzzing tools (fuzzers) have been proposed in the literature, aiming at detecting software vulnerabilities effectively and efficiently. To date, it is however still challenging to compare fuzzers due to the inconsistency of the benchmarks, performance metrics, and/or environments for evaluation, which buries the useful insights and thus impedes the discovery of promising fuzzing primitives. In this paper, we design and develop UNIFUZZ, an open-source and metrics-driven platform for assessing fuzzers in a comprehensive and quantitative manner. Specifically, UNIFUZZ to date has incorporated 35 usable fuzzers, a benchmark of 20 real-world programs, and six categories of performance metrics. We first systematically study the usability of existing fuzzers, find and fix a number of flaws, and integrate them into UNIFUZZ. Based on the study, we propose a collection of pragmatic performance metrics to evaluate fuzzers from six complementary perspectives. Using UNIFUZZ, we conduct in-depth evaluations of several prominent fuzzers including AFL [1], AFLFast [2], Angora [3], Honggfuzz [4], MOPT [5], QSYM [6], T-Fuzz [7] and VUzzer64 [8]. We find that none of them outperforms the others across all the target programs, and that using a single metric to assess the performance of a fuzzer may lead to unilateral conclusions, which demonstrates the significance of comprehensive metrics. Moreover, we identify and investigate previously overlooked factors that may significantly affect a fuzzer's performance, including instrumentation methods and crash analysis tools. Our empirical results show that they are critical to the evaluation of a fuzzer. We hope that our findings can shed light on reliable fuzzing evaluation, so that we can discover promising fuzzing primitives to effectively facilitate fuzzer designs in the future.

Research motivation and objectives

Assess the performance of existing fuzzers on a unified, pragmatic platform.
Identify usability flaws and improve reproducibility of fuzzing experiments.
Propose a comprehensive, six-category metric suite to evaluate fuzzers.
Provide real-world benchmarks and tooling to streamline crash triage and CVE matching.
Reveal factors such as instrumentation and crash analysis that impact fuzzer performance.

Proposed method

Incorporate 35 usable fuzzers and a benchmark of 20 real-world programs within a Docker-based, open-source platform.
Develop a six-category metrics framework: quantity, quality, speed, stability, coverage, and overhead.
Triage crashes into unique bugs using ASan/GDB output, and match them to known CVEs using a CVE keyword database.
Standardize evaluation procedures with repeated experiments (30 repetitions) and statistical analysis (e.g., Mann-Whitney U, Vargha-Delaney A12).
Offer pragmatic tooling for crash analysis, including de-duplication, CVE matching, and severity assessment.
Compare eight prominent coverage-based fuzzers (AFL, AFLFast, Angora, Honggfuzz, MOPT, QSYM, T-Fuzz, VUzzer64) on the benchmark suite.
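
The statistical comparison named in the list above (Mann-Whitney U and Vargha-Delaney A12) can be sketched as follows. This is a minimal illustration in plain Python, not the platform's own code: the function names and the unique-bug counts are hypothetical, and only five runs per fuzzer are shown instead of 30 to keep the example short. For a p-value, a library routine such as `scipy.stats.mannwhitneyu` would normally be used.

```python
from itertools import product


def pairwise_wins(xs, ys):
    """Count (x, y) pairs where x > y, with each tie counted as 0.5."""
    if not xs or not ys:
        raise ValueError("both samples must be non-empty")
    return sum(
        1.0 if x > y else 0.5 if x == y else 0.0
        for x, y in product(xs, ys)
    )


def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for the first sample."""
    return pairwise_wins(xs, ys)


def vargha_delaney_a12(xs, ys):
    """Effect size: probability that a value drawn from xs exceeds one from ys."""
    return pairwise_wins(xs, ys) / (len(xs) * len(ys))


# Hypothetical unique-bug counts from repeated runs of two fuzzers
# on the same target program (illustrative numbers only).
fuzzer_a = [5, 7, 6, 8, 7]
fuzzer_b = [4, 5, 5, 6, 3]

u_stat = mann_whitney_u(fuzzer_a, fuzzer_b)
a12 = vargha_delaney_a12(fuzzer_a, fuzzer_b)
print(f"U = {u_stat}, A12 = {a12:.2f}")
```

An A12 of 0.5 means neither fuzzer tends to score higher; values close to 1 or 0 indicate that one fuzzer wins in most pairings of runs. Reporting the effect size next to the significance test shows how large a difference is, not only whether it is unlikely to be chance.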

Experimental results

Research questions

RQ1: How do different fuzzers compare under a uniform, pragmatic evaluation platform?
RQ2: What is the impact of diverse benchmarks and realistic environments on fuzzing performance conclusions?
RQ3: Which factors (e.g., instrumentation, crash analysis tools) significantly affect fuzzer effectiveness?
RQ4: Do single-metric evaluations misrepresent fuzzer capabilities across real-world programs?
RQ5: How can crash triage and CVE matching be systematically integrated into fuzzing evaluations?

Key findings

No fuzzer outperforms others across all target programs, indicating no universal superior method.
Relying on a single metric can yield biased or incomplete conclusions about fuzzer performance.
Instrumentation methods and crash analysis tools significantly influence evaluation outcomes.
A comprehensive, multi-metric framework provides more reliable guidance for fuzzer design and evaluation.
The platform discovers usability issues in existing fuzzers and fixes many flaws during integration.
Real-world benchmarks matter: performance patterns differ between real-world programs and synthetic benchmarks like LAVA-M.
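
To illustrate why the choice of crash analysis procedure can change the reported number of unique bugs, here is a minimal sketch of stack-trace de-duplication. It assumes crashes are bucketed by the top N function names of an ASan-style stack trace, a common heuristic; the review does not state the exact rule UNIFUZZ applies. The reports, function names, and helper names below are all hypothetical.

```python
import re
from collections import defaultdict

# Matches ASan-style frame lines, e.g.
# "    #0 0x4f2a1b in parse_chunk /src/img.c:88:5"
FRAME_RE = re.compile(r"^\s*#(\d+)\s+0x[0-9a-fA-F]+\s+in\s+(\S+)")


def top_frames(report, depth=3):
    """Return up to `depth` function names from the first stack in a report."""
    frames = []
    for line in report.splitlines():
        match = FRAME_RE.match(line)
        if not match:
            continue
        # Frame numbers restart at #0 for each stack; stop at the first restart.
        if int(match.group(1)) != len(frames):
            break
        frames.append(match.group(2))
        if len(frames) == depth:
            break
    return tuple(frames)


def dedupe(reports, depth=3):
    """Group crash names by the signature of their top `depth` frames."""
    buckets = defaultdict(list)
    for name, report in reports.items():
        buckets[top_frames(report, depth)].append(name)
    return dict(buckets)


# Hypothetical crash reports (illustrative only).
reports = {
    "crash_1": "#0 0x4f2a1b in parse_chunk /src/img.c:88:5\n"
               "#1 0x4f3000 in read_image /src/img.c:140:9\n"
               "#2 0x4f4000 in main /src/tool.c:20:3\n",
    "crash_2": "#0 0x4f2a2c in parse_chunk /src/img.c:91:7\n"
               "#1 0x4f3000 in read_image /src/img.c:140:9\n"
               "#2 0x4f4000 in main /src/tool.c:20:3\n",
    "crash_3": "#0 0x4f2a1b in parse_chunk /src/img.c:88:5\n"
               "#1 0x4f3500 in read_header /src/img.c:60:2\n"
               "#2 0x4f4000 in main /src/tool.c:20:3\n",
}

print(len(dedupe(reports, depth=3)), "unique bugs with 3 frames")
print(len(dedupe(reports, depth=1)), "unique bugs with 1 frame")
```

The same three crashes collapse into a different number of buckets depending on the depth used, so any bug count should be reported together with the triage rule that produced it.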

This review was generated by AI and checked by a human editor.