QUICK REVIEW

[Paper Review] WhiteFox: White-Box Compiler Fuzzing Empowered by Large Language Models

Chenyuan Yang, Yinlin Deng|arXiv (Cornell University)|Oct 24, 2023

Software Testing and Debugging Techniques | Citations: 9

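One-Sentence Summary

WhiteFox is a white-box compiler fuzzer that uses LLMs to analyze optimization source code and generate test inputs, achieving higher optimization coverage and finding many bugs across multiple compilers.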

ABSTRACT

Compiler correctness is crucial, as miscompilation can falsify program behaviors, leading to serious consequences. Fuzzing has been studied to uncover compiler defects. However, compiler fuzzing remains challenging: Existing arts focus on black- and grey-box fuzzing, which generates tests without sufficient understanding of internal compiler behaviors. Meanwhile, traditional white-box techniques, like symbolic execution, are computationally inapplicable to the giant codebase of compilers. Recent advances demonstrate that Large Language Models (LLMs) excel in code generation/understanding tasks. Nonetheless, guiding LLMs with compiler source-code information remains a missing piece of research in compiler testing. To this end, we propose WhiteFox, the first white-box compiler fuzzer using LLMs with source-code information to test compiler optimization, with a spotlight on detecting deep logic bugs in the deep learning (DL) compilers. WhiteFox adopts a multi-agent framework: an LLM-based analysis agent examines the low-level optimization source code and produces requirements on the high-level test programs that can trigger the optimization; an LLM-based generation agent produces test programs based on the summarized requirements. Additionally, optimization-triggering tests are used as feedback to enhance the generation on the fly. Our evaluation on the three most popular DL compilers (i.e., PyTorch Inductor, TensorFlow-XLA, and TensorFlow Lite) shows WhiteFox can generate high-quality test programs to exercise deep optimizations, practicing up to 8X more than state-of-the-art fuzzers. WhiteFox has found 101 bugs for the DL compilers, with 92 confirmed as previously unknown and 70 fixed. WhiteFox has been acknowledged by the PyTorch team and is being incorporated into its development workflow. Beyond DL compilers, WhiteFox can also be adapted for compilers in different domains.

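Motivation and Objectives

Motivate the need for reliable compiler optimizations and for scalable white-box fuzzing that goes beyond traditional symbolic and coverage-guided methods.
Propose a dual-LLM framework that translates optimization implementations into high-level test inputs.
Develop a feedback loop that uses optimization-triggering tests to iteratively improve test inputs.
Evaluate WhiteFox on multiple compilers, measuring optimization coverage and bug-finding ability.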

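Proposed Method

Use an analysis LLM to summarize low-level optimization code into high-level trigger requirements, expressed in a mix of natural language and pseudocode.
Use a generation LLM to create test programs (e.g., PyTorch models) that satisfy the summarized requirements.
Implement a feedback loop that adds optimization-triggering tests as few-shot examples to improve subsequent generation.
Instrument the compiler to detect when an optimization is triggered, and use crashes and result inconsistencies as test oracles.
Apply a Thompson Sampling-based multi-armed bandit to select effective triggering examples, balancing exploration and exploitation.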
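The crash and result-inconsistency oracles listed above can be illustrated with a minimal differential-testing sketch. It uses plain Python callables as stand-ins for the unoptimized and optimized executions of one test program. The names `classify`, `reference_model`, `buggy_compiled`, and `crashing_compiled` are hypothetical and are not part of WhiteFox or of any compiler API, and the tolerances are arbitrary assumptions.

```python
import math
from typing import Callable, Sequence

Model = Callable[[Sequence[float]], Sequence[float]]


def classify(reference: Model, optimized: Model, inputs: Sequence[float],
             rel_tol: float = 1e-5, abs_tol: float = 1e-8) -> str:
    """Return 'invalid', 'crash', 'inconsistent', or 'ok' for one test input."""
    try:
        expected = list(reference(inputs))
    except Exception:
        # The test program itself is broken, so it says nothing about the compiler.
        return "invalid"
    try:
        actual = list(optimized(inputs))
    except Exception:
        # The reference run succeeded but the optimized run failed.
        return "crash"
    if len(expected) != len(actual):
        return "inconsistent"
    for e, a in zip(expected, actual):
        if math.isnan(e) and math.isnan(a):
            continue  # treat matching NaNs as equal
        if not math.isclose(e, a, rel_tol=rel_tol, abs_tol=abs_tol):
            return "inconsistent"
    return "ok"


# Hypothetical stand-ins for an eager run and two faulty optimized runs.
def reference_model(xs: Sequence[float]) -> list:
    return [2.0 * x + 1.0 for x in xs]


def buggy_compiled(xs: Sequence[float]) -> list:
    # Simulated miscompilation: wrong only for negative inputs.
    return [2.0 * abs(x) + 1.0 for x in xs]


def crashing_compiled(xs: Sequence[float]) -> list:
    raise RuntimeError("simulated compiler crash")
```

Note that the simulated miscompilation only shows up for negative inputs, so the same program can pass on one input and be flagged as inconsistent on another.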

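Experimental Results

Research Questions

RQ1: Can LLMs translate low-level optimization implementations into high-level input requirements that effectively trigger compiler optimizations?
RQ2: Do LLM-guided test inputs achieve deeper optimization coverage than existing fuzzers across diverse compilers?
RQ3: How effective are the feedback loop and the example-selection strategy at improving subsequent test generation?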

Key Findings

In the experiments, WhiteFox exercises up to 8x more optimizations than state-of-the-art fuzzers.
The framework found 101 bugs in the DL compilers under test, of which 92 were confirmed as previously unknown and 70 have been fixed.
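WhiteFox shows high optimization coverage on four systems under test (SUTs): the three DL compilers and LLVM.
Combining natural-language and pseudocode summaries improves the extraction of optimization-triggering requirements.
A feedback loop that reuses triggering tests as few-shot examples, selected with Thompson Sampling, improves subsequent test generation.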
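To make the example-selection idea concrete, here is a minimal Thompson Sampling sketch over candidate few-shot examples. It is an illustration under stated assumptions, not the paper's implementation: each triggering test is treated as a bandit arm with a Beta posterior, and the reward is assumed to be binary (whether a test generated with that example triggered the optimization). `ExampleArm`, `select_examples`, `update`, and `simulate` are hypothetical names, and the trigger rates are made-up values for the simulation.

```python
import random
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class ExampleArm:
    """One optimization-triggering test kept as a candidate few-shot example."""
    test_id: str
    successes: int = 0  # generations that triggered the optimization
    failures: int = 0   # generations that did not

    def sample(self, rng: random.Random) -> float:
        # Draw from Beta(1 + successes, 1 + failures), i.e. a uniform prior.
        return rng.betavariate(1 + self.successes, 1 + self.failures)


def select_examples(arms: List[ExampleArm], k: int,
                    rng: random.Random) -> List[ExampleArm]:
    """Pick the k arms with the highest sampled success probability."""
    return sorted(arms, key=lambda arm: arm.sample(rng), reverse=True)[:k]


def update(arm: ExampleArm, triggered: bool) -> None:
    """Record the outcome of one generation that used this arm as an example."""
    if triggered:
        arm.successes += 1
    else:
        arm.failures += 1


def simulate(true_rates: Dict[str, float], rounds: int, seed: int = 0) -> Dict[str, int]:
    """Count how often each arm is chosen, given assumed hidden trigger rates."""
    rng = random.Random(seed)
    arms = [ExampleArm(test_id=name) for name in true_rates]
    pulls = {name: 0 for name in true_rates}
    for _ in range(rounds):
        (chosen,) = select_examples(arms, k=1, rng=rng)
        pulls[chosen.test_id] += 1
        update(chosen, rng.random() < true_rates[chosen.test_id])
    return pulls
```

Sampling from the posterior, rather than always taking the arm with the best observed rate, is what keeps some exploration alive: rarely used examples have wide posteriors and are still occasionally chosen, while consistently useful examples are chosen more and more often.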


This review was generated by AI and checked by a human editor.