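[Paper Review] UNIFUZZ: A Holistic and Pragmatic Metrics-Driven Platform for Evaluating Fuzzers
UNIFUZZ is an open-source, metrics-driven platform that standardizes fuzzer evaluation through a pragmatic benchmark and six categories of metrics. It shows that no single fuzzer dominates across all programs, and it highlights key factors that affect fuzzer performance.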
A flurry of fuzzing tools (fuzzers) have been proposed in the literature, aiming at detecting software vulnerabilities effectively and efficiently. To date, it is however still challenging to compare fuzzers due to the inconsistency of the benchmarks, performance metrics, and/or environments for evaluation, which buries the useful insights and thus impedes the discovery of promising fuzzing primitives. In this paper, we design and develop UNIFUZZ, an open-source and metrics-driven platform for assessing fuzzers in a comprehensive and quantitative manner. Specifically, UNIFUZZ to date has incorporated 35 usable fuzzers, a benchmark of 20 real-world programs, and six categories of performance metrics. We first systematically study the usability of existing fuzzers, find and fix a number of flaws, and integrate them into UNIFUZZ. Based on the study, we propose a collection of pragmatic performance metrics to evaluate fuzzers from six complementary perspectives. Using UNIFUZZ, we conduct in-depth evaluations of several prominent fuzzers including AFL [1], AFLFast [2], Angora [3], Honggfuzz [4], MOPT [5], QSYM [6], T-Fuzz [7] and VUzzer64 [8]. We find that none of them outperforms the others across all the target programs, and that using a single metric to assess the performance of a fuzzer may lead to unilateral conclusions, which demonstrates the significance of comprehensive metrics. Moreover, we identify and investigate previously overlooked factors that may significantly affect a fuzzer's performance, including instrumentation methods and crash analysis tools. Our empirical results show that they are critical to the evaluation of a fuzzer. We hope that our findings can shed light on reliable fuzzing evaluation, so that we can discover promising fuzzing primitives to effectively facilitate fuzzer designs in the future.
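Research Motivation and Objectives
- Evaluate the performance of existing fuzzers on a unified, pragmatic platform.
- Uncover usability flaws and improve the reproducibility of fuzzing experiments.
- Propose a comprehensive, six-category metric system for evaluating fuzzers.
- Provide real-world benchmarks and tools that simplify crash triage and CVE matching.
- Reveal factors that affect fuzzer performance, such as instrumentation methods and crash analysis tools.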
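Proposed Method
- Integrate 35 usable fuzzers and a benchmark of 20 real-world programs into an open-source, Docker-based platform.
- Develop a six-category metric framework: quantity, quality, speed, stability, coverage, and overhead.
- Triage crashes into unique bugs using ASan/GDB, and match them to CVEs with a CVE keyword database.
- Standardize the evaluation procedure through repeated experiments (30 repetitions) and statistical analysis (e.g., the Mann-Whitney U test and the Vargha-Delaney A12 effect size).
- Provide pragmatic crash analysis tools covering deduplication, CVE matching, and severity assessment.
- Compare eight prominent coverage-based fuzzers (AFL, AFLFast, Angora, Honggfuzz, MOPT, QSYM, T-Fuzz, VUzzer64) on the benchmark suite.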
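The statistical step in the list above can be sketched in a few lines. This is a minimal sketch, not UNIFUZZ's own analysis code: it computes the Vargha-Delaney A12 effect size and a two-sided Mann-Whitney U p-value using a normal approximation with tie correction. The function names and the input format (one number per repetition, such as the unique bugs found in each run) are assumptions made for illustration.

```python
import math
from collections import Counter


def u_statistic(a, b):
    """U statistic for sample a: pairs where a beats b, ties count half."""
    return sum(1.0 if x > y else 0.5 if x == y else 0.0 for x in a for y in b)


def vargha_delaney_a12(a, b):
    """Probability that a random run from a scores higher than one from b."""
    return u_statistic(a, b) / (len(a) * len(b))


def mann_whitney_p(a, b):
    """Two-sided p-value via the normal approximation with tie correction."""
    m, n = len(a), len(b)
    total = m + n
    u = u_statistic(a, b)
    # Each group of t tied values reduces the variance of U.
    ties = sum(t ** 3 - t for t in Counter(list(a) + list(b)).values())
    variance = m * n / 12.0 * ((total + 1) - ties / (total * (total - 1)))
    if variance <= 0:  # every observation is identical
        return 1.0
    z = (u - m * n / 2.0) / math.sqrt(variance)
    return math.erfc(abs(z) / math.sqrt(2))
```

An A12 near 0.5 means neither fuzzer tends to beat the other, while values near 0 or 1 indicate a large effect. For small samples, an exact test such as the one offered by `scipy.stats.mannwhitneyu` is preferable to the normal approximation used here.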
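Experimental Results
Research Questions
- RQ1: How do different fuzzers compare on a unified, pragmatic evaluation platform?
- RQ2: How do diverse benchmarks and realistic environments affect conclusions about fuzzing performance?
- RQ3: Which factors (e.g., instrumentation, crash analysis tools) significantly affect a fuzzer's effectiveness?
- RQ4: Does evaluation with a single metric misrepresent a fuzzer's capability on real-world programs?
- RQ5: How can crash triage and CVE matching be systematically integrated into fuzzing evaluation?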
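To make the crash triage and CVE matching in RQ5 concrete, here is a minimal sketch of one common approach. It is not the platform's actual tooling. It assumes that crashes are deduplicated by the ASan bug type plus the top three stack frames (the `depth` parameter), and that the CVE keyword database is a plain dictionary mapping a CVE ID to a bug type and a function name. Both are assumptions for illustration, and the CVE ID shown in the usage note is a placeholder.

```python
import re
from collections import defaultdict

# Matches ASan stack frames such as "    #0 0x4f5a2b in parse_header /src/demo.c:120:5".
FRAME_RE = re.compile(r"^\s*#\d+\s+0x[0-9a-fA-F]+\s+in\s+(\S+)", re.MULTILINE)
# Matches the bug type in "ERROR: AddressSanitizer: heap-buffer-overflow ...".
TYPE_RE = re.compile(r"ERROR: AddressSanitizer: ([\w-]+)")


def signature(report, depth=3):
    """Return (bug type, first `depth` function names) for one ASan report."""
    kind = TYPE_RE.search(report)
    frames = FRAME_RE.findall(report)[:depth]
    return (kind.group(1) if kind else "unknown", tuple(frames))


def deduplicate(reports):
    """Group crash names by signature; each group counts as one unique bug."""
    buckets = defaultdict(list)
    for name, text in reports.items():
        buckets[signature(text)].append(name)
    return dict(buckets)


def match_cves(sig, cve_keywords):
    """Return CVE IDs whose bug type and function both appear in the signature."""
    kind, frames = sig
    return sorted(
        cve for cve, (cve_kind, func) in cve_keywords.items()
        if cve_kind == kind and func in frames
    )
```

For example, `match_cves(sig, {"CVE-EXAMPLE-1": ("heap-buffer-overflow", "parse_header")})` returns `["CVE-EXAMPLE-1"]` when `sig` is a heap-buffer-overflow whose top frames include `parse_header`. A shallower `depth` merges more crashes into one bug, while a deeper one splits them, so the choice directly affects the reported number of unique bugs.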
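Key Findings
- No fuzzer outperforms the others on all target programs, so no approach is universally superior.
- Relying on a single metric can lead to biased or incomplete conclusions about a fuzzer's performance.
- Instrumentation methods and crash analysis tools significantly affect evaluation results.
- A comprehensive, multi-metric framework provides more reliable guidance for designing and evaluating fuzzers.
- The platform uncovered usability issues in existing fuzzers, and many flaws were fixed during integration.
- Real-world benchmarks matter: performance patterns on real-world programs differ from those on synthetic benchmarks such as LAVA-M.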
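The single-metric finding above can be checked mechanically. The sketch below is a hypothetical helper, not part of UNIFUZZ: it picks the best fuzzer under each metric and reports whether all metrics agree. The metric names, fuzzer names, and numbers are placeholders, not measurements from the paper.

```python
def best_per_metric(results, lower_is_better=()):
    """Map each metric to the fuzzer that ranks first under that metric."""
    winners = {}
    for metric, scores in results.items():
        pick = min if metric in lower_is_better else max
        winners[metric] = pick(scores, key=scores.get)
    return winners


def metrics_agree(results, lower_is_better=()):
    """True only if every metric names the same fuzzer as the best one."""
    return len(set(best_per_metric(results, lower_is_better).values())) == 1


# Placeholder values for two hypothetical fuzzers.
results = {
    "unique_bugs": {"fuzzer_a": 5, "fuzzer_b": 3},
    "overhead": {"fuzzer_a": 9.0, "fuzzer_b": 2.0},
}
print(best_per_metric(results, lower_is_better=("overhead",)))
print(metrics_agree(results, lower_is_better=("overhead",)))
```

When `metrics_agree` returns `False`, any conclusion drawn from one metric alone depends on which metric was chosen, which is the situation the finding warns about.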
This review was generated by AI and checked by a human editor.