QUICK REVIEW

[Paper Review] Improving Deep Learning Library Testing with Machine Learning

Facundo Molina, M M Abid Naziri | arXiv (Cornell University) | Feb 3, 2026

Software Testing and Debugging Techniques | Citations: 0

One-Line Summary

The paper trains ML classifiers to determine input validity for DL library APIs (TensorFlow and PyTorch), achieving over 91% generalization accuracy and boosting ACETest’s input pass rate from ~29% to ~60.7%.

ABSTRACT

Deep Learning (DL) libraries like TensorFlow and PyTorch simplify machine learning (ML) model development but are prone to bugs due to their complex design. Bug-finding techniques exist, but without precise API specifications, they produce many false alarms. Existing methods to mine API specifications lack accuracy. We explore using ML classifiers to determine input validity. We hypothesize that tensor shapes are a precise abstraction to encode concrete inputs and capture relationships of the data. Shape abstraction severely reduces problem dimensionality, which is important to facilitate ML training. Labeled data are obtained by observing runtime outcomes on a sample of inputs and classifiers are trained on sets of labeled inputs to capture API constraints. Our evaluation, conducted over 183 APIs from TensorFlow and PyTorch, shows that the classifiers generalize well on unseen data with over 91% accuracy. Integrating these classifiers into the pipeline of ACETest, a SoTA bug-finding technique, improves its pass rate from ~29% to ~61%. Our findings suggest that ML-enhanced input classification is an important aid to scale DL library testing.

Motivation and Objectives

Reduce the false alarms that API-level testing of DL libraries produces when complex input constraints are not precisely specified.
Investigate whether tensor-shape based abstractions can effectively encode inputs for ML models.
Assess generalization of ML classifiers to unseen input configurations.
Evaluate integration of ML classifiers within an API-level fuzzing pipeline (ACETest) to improve testing efficiency.

Proposed Method

Represent inputs for DL API testing using tensor shapes to capture data relationships.
Use AutoGluon AutoML to train classifiers (CatBoost, LightGBM, XGBoost, NeuralNetFastAI, ExtraTrees) on labeled inputs (valid/invalid).
Generate training inputs via random sampling and pairwise combinatorial testing, and label each input as valid or invalid by observing its runtime outcome.
Evaluate precision and recall on held-out data and across 183 APIs (98 PyTorch, 85 TensorFlow).
Integrate the best-performing ML models as pre-filters in ACETest to filter inputs before execution.
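
The steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_matmul` is a hypothetical stand-in for a library API, the zero-padded rank-plus-dimensions encoding is one plausible reading of "shape abstraction", and a 1-nearest-neighbor classifier replaces the AutoGluon models so the example runs without dependencies. Only random sampling is shown; pairwise generation is omitted.

```python
import random

MAX_RANK = 4  # assumption: shapes are zero-padded to a fixed rank


def encode_shapes(shapes, max_rank=MAX_RANK):
    """Flatten tensor shapes into one fixed-length feature vector.

    Each shape contributes its rank, then its dimensions zero-padded to max_rank.
    """
    features = []
    for shape in shapes:
        if len(shape) > max_rank:
            raise ValueError("shape rank exceeds max_rank")
        features.append(len(shape))
        features.extend(list(shape) + [0] * (max_rank - len(shape)))
    return features


def toy_matmul(a_shape, b_shape):
    """Hypothetical stand-in for a library API; raises on invalid shapes."""
    if len(a_shape) != 2 or len(b_shape) != 2:
        raise ValueError("expected 2-D operands")
    if a_shape[1] != b_shape[0]:
        raise ValueError("inner dimensions differ")
    return (a_shape[0], b_shape[1])


def label_input(api, shapes):
    """Label an input by its runtime outcome: 1 if the call succeeds, else 0."""
    try:
        api(*shapes)
    except Exception:
        return 0
    return 1


def random_shape(rng, max_rank=3, max_dim=4):
    """Draw a random shape with rank in [1, max_rank] and dims in [1, max_dim]."""
    rank = rng.randint(1, max_rank)
    return tuple(rng.randint(1, max_dim) for _ in range(rank))


def build_dataset(api, n_samples, seed=0):
    """Randomly sample shape pairs and label each one by executing the API."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_samples):
        shapes = (random_shape(rng), random_shape(rng))
        rows.append((encode_shapes(shapes), label_input(api, shapes)))
    return rows


def make_1nn(dataset):
    """Return a 1-nearest-neighbor predictor (stand-in for the AutoML models)."""
    def predict(features):
        nearest = min(
            dataset,
            key=lambda row: sum((a - b) ** 2 for a, b in zip(row[0], features)),
        )
        return nearest[1]
    return predict


def prefilter(candidates, predict_valid):
    """Keep only the candidates that the classifier predicts to be valid."""
    return [s for s in candidates if predict_valid(encode_shapes(s)) == 1]


if __name__ == "__main__":
    train = build_dataset(toy_matmul, 500)
    predict_valid = make_1nn(train)
    rng = random.Random(1)
    candidates = [(random_shape(rng), random_shape(rng)) for _ in range(200)]
    kept = prefilter(candidates, predict_valid)
    before = sum(label_input(toy_matmul, c) for c in candidates) / len(candidates)
    after = sum(label_input(toy_matmul, c) for c in kept) / max(len(kept), 1)
    print(f"pass rate before filter: {before:.2f}, after: {after:.2f}")
```

In a real pipeline the fuzzer's generated inputs would take the place of `candidates`, and only the inputs returned by `prefilter` would be executed against the library.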

Experimental Results

Research Questions

RQ1: How effective are ML models in learning input constraints of DL library APIs?
RQ2: Do ML models generalize outside training data sets?
RQ3: Do ML models improve test input generation for DL library APIs?

Key Findings

ML models achieve up to 88% precision and 82% recall (PyTorch) and up to 91% precision and 80% recall (TensorFlow) across random/pairwise strategies.
Models generalize to unseen data with over 91% accuracy across the 183 APIs; precision drops in only 20% of cases and recall in only 12%.
Pairwise data generation generally yields better precision/recall than random generation.
When integrated with ACETest, ML filtering raises the average pass rate from 29.1% to 60.7%.
Overall testing time for ACETest+ML is reduced (example: 7.4s vs 31s for 5K inputs in a case study); valid inputs per second improve.
The approach is effective especially for APIs with complex constraints, and artifacts are publicly available.
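
To make the reported metrics concrete, the sketch below computes precision, recall, and pass rate from labels. The label lists are hypothetical toy data, not results from the paper. It also illustrates one relationship worth keeping in mind: when a classifier is used as a pre-filter, the pass rate of the inputs that get executed equals the classifier's precision on the generator's inputs.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the 'valid' class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pass_rate(y_true):
    """Fraction of executed inputs that the API accepts."""
    return sum(y_true) / len(y_true) if y_true else 0.0


def filtered_pass_rate(y_true, y_pred):
    """Pass rate when only inputs predicted valid are executed."""
    executed = [t for t, p in zip(y_true, y_pred) if p == 1]
    return pass_rate(executed)


if __name__ == "__main__":
    # Hypothetical labels: 1 = valid input, 0 = invalid input.
    y_true = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
    y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]
    precision, recall = precision_recall(y_true, y_pred)
    print("precision:", precision, "recall:", recall)
    print("pass rate without filter:", pass_rate(y_true))
    print("pass rate with filter:", filtered_pass_rate(y_true, y_pred))
```

Recall matters for a different reason: a low-recall filter discards valid inputs, so the fuzzer explores less of the valid input space even though its pass rate looks high.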

This review was generated by AI and checked by a human editor.