[Paper Digest] Improving Deep Learning Library Testing with Machine Learning
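The paper trains machine learning classifiers to decide whether inputs to deep learning library APIs (TensorFlow and PyTorch) are valid, achieving over 91% accuracy on unseen data and raising ACETest's input pass rate from about 29% to 60.7%.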
Deep Learning (DL) libraries like TensorFlow and PyTorch simplify machine learning (ML) model development but are prone to bugs due to their complex design. Bug-finding techniques exist, but without precise API specifications, they produce many false alarms. Existing methods to mine API specifications lack accuracy. We explore using ML classifiers to determine input validity. We hypothesize that tensor shapes are a precise abstraction to encode concrete inputs and capture relationships of the data. Shape abstraction severely reduces problem dimensionality, which is important to facilitate ML training. Labeled data are obtained by observing runtime outcomes on a sample of inputs and classifiers are trained on sets of labeled inputs to capture API constraints. Our evaluation, conducted over 183 APIs from TensorFlow and PyTorch, shows that the classifiers generalize well on unseen data with over 91% accuracy. Integrating these classifiers into the pipeline of ACETest, a SoTA bug-finding technique, improves its pass rate from ~29% to ~61%. Our findings suggest that ML-enhanced input classification is an important aid to scale DL library testing.
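The shape abstraction and runtime labeling described in the abstract can be illustrated with a small sketch. Everything named here is an assumption made for illustration and is not taken from the paper: `toy_matmul` is a hypothetical stand-in for a real DL API, the `-1` padding value and `MAX_RANK` are arbitrary choices, and the paper's actual encoding may differ.

```python
import random

MAX_RANK = 4  # assumed fixed length of each encoded shape (illustrative choice)


def encode_shape(shape, max_rank=MAX_RANK):
    """Pad a tensor shape with -1 so every input maps to a fixed-length vector."""
    if len(shape) > max_rank:
        raise ValueError("shape rank exceeds max_rank")
    return list(shape) + [-1] * (max_rank - len(shape))


def toy_matmul(a_shape, b_shape):
    """Hypothetical stand-in for a DL API: enforces 2-D matmul shape rules."""
    if len(a_shape) != 2 or len(b_shape) != 2:
        raise ValueError("both inputs must be rank 2")
    if a_shape[1] != b_shape[0]:
        raise ValueError("inner dimensions must match")
    return (a_shape[0], b_shape[1])


def label_input(api, a_shape, b_shape):
    """Label an input by its runtime outcome: 1 if the call succeeds, else 0."""
    try:
        api(a_shape, b_shape)
        return 1
    except Exception:
        return 0


def sample_shape(rng, max_rank=3, max_dim=4):
    """Draw a random shape with rank in [1, max_rank] and dims in [1, max_dim]."""
    rank = rng.randint(1, max_rank)
    return tuple(rng.randint(1, max_dim) for _ in range(rank))


def build_dataset(api, n, seed=0):
    """Return n (features, label) rows; features are the two encoded shapes."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        a, b = sample_shape(rng), sample_shape(rng)
        rows.append((encode_shape(a) + encode_shape(b), label_input(api, a, b)))
    return rows
```

Each row has eight integer features regardless of the tensors' contents, which is the dimensionality reduction the abstract refers to: the classifier only sees shapes, not element values.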
Research Motivation and Objectives
- Motivate the need to reduce false alarms in API testing of DL libraries due to complex input constraints.
- Investigate whether tensor-shape based abstractions can effectively encode inputs for ML models.
- Assess generalization of ML classifiers to unseen input configurations.
- Evaluate integration of ML classifiers within an API-level fuzzing pipeline (ACETest) to improve testing efficiency.
Proposed Method
- Represent inputs for DL API testing using tensor shapes to capture data relationships.
- Use AutoGluon AutoML to train classifiers (CatBoost, LightGBM, XGBoost, NeuralNetFastAI, ExtraTrees) on labeled inputs (valid/invalid).
- Generate training data via random sampling and pairwise combinatorial testing to label inputs.
- Evaluate precision and recall on held-out data and across 183 APIs (98 PyTorch, 85 TensorFlow).
- Integrate the best-performing ML models as pre-filters in ACETest to filter inputs before execution.
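Two of the steps above, pairwise combinatorial generation and pre-filtering, can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the greedy covering strategy, the function names `pairwise_suite` and `prefilter`, and the `predict_valid` callable (standing in for a trained classifier) are all hypothetical.

```python
from itertools import combinations, product


def pairwise_suite(params):
    """Greedy 2-way covering suite over a dict of parameter name -> values.

    Every pair of values from two different parameters appears in at least
    one returned test. The greedy search enumerates the full Cartesian
    product, so it is only suitable for small parameter spaces.
    """
    names = list(params)
    # All (param=value, param=value) pairs that must be covered.
    uncovered = {
        ((a, va), (b, vb))
        for a, b in combinations(names, 2)
        for va in params[a]
        for vb in params[b]
    }
    candidates = [
        dict(zip(names, values))
        for values in product(*(params[n] for n in names))
    ]

    def pairs_of(test):
        return {((a, test[a]), (b, test[b])) for a, b in combinations(names, 2)}

    suite = []
    while uncovered:
        # Pick the candidate that covers the most still-uncovered pairs.
        best = max(candidates, key=lambda t: len(pairs_of(t) & uncovered))
        suite.append(best)
        uncovered -= pairs_of(best)
    return suite


def prefilter(inputs, predict_valid):
    """Keep only inputs the classifier predicts to be valid before execution."""
    return [x for x in inputs if predict_valid(x)]
```

A possible usage: `pairwise_suite({"rank_a": [1, 2, 3], "rank_b": [1, 2, 3], "dtype": ["float32", "int32"]})` returns a list of parameter dictionaries that covers every value pair without needing more tests than the 18 combinations of the full product. In a fuzzing loop, `prefilter` would sit between input generation and API execution.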
Experimental Results
Research Questions
- RQ1: How effective are ML models in learning input constraints of DL library APIs?
- RQ2: Do ML models generalize outside training data sets?
- RQ3: Do ML models improve test input generation for DL library APIs?
Key Findings
- ML models achieve up to 88% precision and 82% recall (PyTorch) and up to 91% precision and 80% recall (TensorFlow) across random/pairwise strategies.
- Models generalize to unseen data with over 91% accuracy in 183 operators, with only 20% of cases showing precision drops and 12% showing recall drops.
- Pairwise data generation generally yields better precision/recall than random generation.
- When integrated with ACETest, ML filtering raises the average pass rate from 29.1% to 60.7%.
- Overall testing time for ACETest+ML is reduced (example: 7.4s vs 31s for 5K inputs in a case study); valid inputs per second improve.
- The approach is effective especially for APIs with complex constraints, and artifacts are publicly available.
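The metrics reported above can be made concrete with a short sketch. It assumes that label 1 means "valid" and that an input "passes" exactly when it is valid; the function names and the toy labels in the usage note are illustrative and do not come from the paper.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the 'valid' class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pass_rate(outcomes):
    """Fraction of executed inputs that the API accepted."""
    outcomes = list(outcomes)
    return sum(1 for ok in outcomes if ok) / len(outcomes) if outcomes else 0.0


def filtered_pass_rate(y_true, y_pred):
    """Pass rate when only inputs predicted valid are executed."""
    return pass_rate(t == 1 for t, p in zip(y_true, y_pred) if p == 1)
```

With toy labels `y_true = [1, 1, 0, 0, 1]` and predictions `y_pred = [1, 0, 1, 0, 1]`, the unfiltered pass rate is 3/5 and the filtered pass rate is 2/3. Under the stated assumption, the filtered pass rate equals the classifier's precision on the generated inputs, which is why a precise pre-filter raises the pass rate, while recall determines how many valid inputs are discarded along the way.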
This digest was generated by AI and reviewed by a human editor.