[Paper Review] DeepMath - Deep Sequence Models for Premise Selection
The paper demonstrates that deep neural sequence models can effectively perform premise selection for large-scale automated theorem proving, outperforming handcrafted feature baselines and complementing them via ensembling.
We study the effectiveness of neural sequence models for premise selection in automated theorem proving, one of the main bottlenecks in the formalization of mathematics. We propose a two-stage approach for this task that yields good results for the premise selection task on the Mizar corpus while avoiding the hand-engineered features of existing state-of-the-art models. To our knowledge, this is the first time deep learning has been applied to theorem proving on a large scale.
Motivation & Objective
- Motivate premise selection as a bottleneck in large-scale automated theorem proving.
- Develop neural models that learn from formalized proofs without hand-engineered features.
- Propose a two-stage embedding approach including definition-aware embeddings to improve symbol generalization.
- Evaluate neural premise selectors on the Mizar Mathematical Library (MML) corpus and compare them with hand-crafted baselines.
Proposed method
- Represent conjectures and axioms as embeddings produced by stage-specific neural networks (character-level or word-level).
- Train a two-stage pipeline where stage 1 learns generic embeddings and stage 2 uses definition embeddings to integrate symbol definitions.
- Use a logistic classifier on concatenated conjecture-axiom embeddings to predict premise usefulness.
- Train with negative mining and asynchronous Adam optimization across multiple GPUs.
- Cache embeddings to allow evaluating many conjecture–axiom pairs efficiently.
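The scoring and caching steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `toy_embed` is a hypothetical stand-in for the trained character-level or word-level network, `DIM` is an arbitrary size, and the classifier weights are placeholders rather than learned values. The sketch shows why caching helps: each statement is embedded once, so only the cheap classifier runs for every conjecture-axiom pair.

```python
import math
from functools import lru_cache

DIM = 8  # hypothetical embedding size, chosen only for illustration


@lru_cache(maxsize=None)
def toy_embed(statement: str) -> tuple:
    """Stand-in for a learned sequence encoder: a normalized character histogram."""
    vec = [0.0] * DIM
    for ch in statement:
        vec[ord(ch) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return tuple(v / norm for v in vec)


def score_pair(conjecture: str, axiom: str, weights: list, bias: float) -> float:
    """Logistic classifier on the concatenated conjecture and axiom embeddings."""
    features = toy_embed(conjecture) + toy_embed(axiom)  # tuple concatenation
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def rank_premises(conjecture: str, axioms: list, weights: list, bias: float) -> list:
    """Return axioms sorted from highest to lowest predicted usefulness."""
    return sorted(axioms, key=lambda a: score_pair(conjecture, a, weights, bias), reverse=True)


# Placeholder weights; a real model would learn these from proof data.
weights = [0.1 * (i % 5 - 2) for i in range(2 * DIM)]
axioms = ["a + b = b + a", "a * 1 = a", "a + 0 = a"]
ranking = rank_premises("x + y = y + x", axioms, weights, bias=0.0)
```

Because `toy_embed` is memoized, the conjecture is embedded only on the first pair and reused for the rest, which mirrors the role of the embedding cache when many pairs share the same statements.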
Experimental results
Research questions
- RQ1: Can deep neural networks learn useful premise relevance signals from large formal corpora without hand-engineered features?
- RQ2: How do character-level, word-level, and definition-aware embeddings compare for premise selection?
- RQ3: Does combining neural predictions with traditional features yield complementary gains in ATP success?
- RQ4: What is the achievable improvement in automated theorem proving accuracy on Mizar with neural premise selection?
Key findings
- A two-stage neural approach (character-level and then word/definition embeddings) substantially improves premise selection over a k-NN baseline with hand-crafted features.
- The def-CNN-LSTM and def-CNN models outperform baselines, with the best ensemble proving 74.25% of theorems when given the top-k ranked premises (k up to 1024).
- The union of def-CNN and char-CNN matches or exceeds the other neural models and proves 69.8% of the test set; combining neural methods with k-NN proves 80.9% overall.
- Negative mining during training is crucial, nearly doubling the number of theorems proved at the top-16 cutoff.
- Word-level embeddings built from stage-1 character CNN embeddings significantly improve results, outperforming pure word-CNN or RNN variants.
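To make the negative-mining and top-k cutoff findings more concrete, here is a rough sketch under stated assumptions. `mine_hard_negatives` and `cutoff_candidates` are hypothetical helpers, the scores are made up, and doubling k between cutoffs is an assumption of this sketch rather than something stated in this review. The idea shown is that the highest-scoring statements that are not true premises become new negative training examples, and that proof attempts use successively larger prefixes of the ranked premise list.

```python
def mine_hard_negatives(scores: dict, true_premises: set, n: int) -> list:
    """Pick the n highest-scoring statements that are not actual premises."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return [s for s in ranked if s not in true_premises][:n]


def cutoff_candidates(ranked: list, start: int = 16, limit: int = 1024):
    """Yield (k, top-k prefix) pairs; doubling k is an assumption of this sketch."""
    k = start
    while k <= limit:
        yield k, ranked[:k]
        if k >= len(ranked):
            break  # larger cutoffs would only repeat the same premise set
        k *= 2


# Made-up classifier scores for a single conjecture.
scores = {"ax1": 0.91, "ax2": 0.15, "ax3": 0.78, "ax4": 0.66, "ax5": 0.05}
true_premises = {"ax1", "ax4"}

# Non-premises the current model ranks highest: the most informative negatives.
hard_negatives = mine_hard_negatives(scores, true_premises, n=2)

# Small cutoffs here only so the toy list of five axioms is exercised.
ranked = sorted(scores, key=scores.get, reverse=True)
cutoffs = [k for k, _ in cutoff_candidates(ranked, start=1, limit=8)]
```

In this toy run the mined negatives are the wrongly high-ranked `ax3` followed by `ax2`, and each cutoff would correspond to one prover attempt with that many premises.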
This review was created by AI and reviewed by human editors.