[Paper Review] Right for the Wrong Reasons: Diagnosing Syntactic Heuristics in Natural Language Inference
This paper introduces the HANS dataset to diagnose syntactic heuristics in NLI, shows state-of-the-art models rely on these fallible heuristics and perform poorly on HANS, and demonstrates that augmenting training with HANS-like data can reduce heuristic reliance.
A machine learning system can score well on a given test set by relying on heuristics that are effective for frequent example types but break down in more challenging cases. We study this issue within natural language inference (NLI), the task of determining whether one sentence entails another. We hypothesize that statistical NLI models may adopt three fallible syntactic heuristics: the lexical overlap heuristic, the subsequence heuristic, and the constituent heuristic. To determine whether models have adopted these heuristics, we introduce a controlled evaluation set called HANS (Heuristic Analysis for NLI Systems), which contains many examples where the heuristics fail. We find that models trained on MNLI, including BERT, a state-of-the-art model, perform very poorly on HANS, suggesting that they have indeed adopted these heuristics. We conclude that there is substantial room for improvement in NLI systems, and that the HANS dataset can motivate and measure progress in this area.
Motivation & Objective
- Diagnose whether natural language inference (NLI) models rely on shallow syntactic heuristics.
- Introduce HANS (Heuristic Analysis for NLI Systems) to test targeted heuristics.
- Evaluate leading NLI models on HANS to assess reliance on heuristics.
- Demonstrate that augmenting training with HANS-like examples can reduce heuristic-driven failures.
Proposed method
- Define three fallible syntactic heuristics: lexical overlap, subsequence, and constituent.
- Construct HANS from 30 templates spread across the three heuristics, generating 10,000 examples per heuristic (30,000 in total) while controlling for plausibility.
- Train four popular NLI models (DA, ESIM, SPINN, BERT) on MNLI and evaluate each of them on HANS.
- Label every HANS example as entailment or non-entailment so that heuristic-driven predictions can be checked against the correct answer.
- Assess whether augmenting MNLI with HANS-like examples improves performance on HANS and related structure-dependent tasks.
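To make the three heuristics concrete, here is a minimal Python sketch of checks that decide whether a premise/hypothesis pair falls under each one. It is an illustration written for this review, not the authors' code: the function names, the regex tokenizer, and the hand-listed constituents (standing in for a real parser) are all assumptions, and the sentence pairs are illustrative examples of the kind HANS targets. In each pair the heuristic would predict entailment even though the premise does not entail the hypothesis.

```python
import re


def tokenize(sentence):
    """Lowercase the sentence and keep only alphabetic word tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def lexical_overlap(premise, hypothesis):
    """True if every word of the hypothesis also occurs in the premise."""
    return set(tokenize(hypothesis)) <= set(tokenize(premise))


def is_subsequence(premise, hypothesis):
    """True if the hypothesis is a contiguous run of words in the premise."""
    p, h = tokenize(premise), tokenize(hypothesis)
    return any(p[i:i + len(h)] == h for i in range(len(p) - len(h) + 1))


def is_constituent(hypothesis, premise_constituents):
    """True if the hypothesis matches one of the premise's constituents.

    Assumption: constituents are supplied by hand here; a real check
    would take them from a syntactic parse of the premise.
    """
    h = tokenize(hypothesis)
    return any(tokenize(c) == h for c in premise_constituents)


# Illustrative pairs where the heuristic applies but entailment does not hold.
overlap_pair = ("The doctor was paid by the actor.", "The doctor paid the actor.")
subseq_pair = ("The doctor near the actor danced.", "The actor danced.")
const_pair = ("If the artist slept, the actor ran.", "The artist slept.")

print(lexical_overlap(*overlap_pair))   # hypothesis words all appear in premise
print(is_subsequence(*subseq_pair))     # hypothesis is a contiguous span
print(is_constituent(const_pair[1], ["the artist slept", "the actor ran"]))
```

The checks are nested by construction: any contiguous subsequence also has full lexical overlap, so a pair flagged by `is_subsequence` is also flagged by `lexical_overlap`, but not the reverse.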
Experimental results
Research questions
- RQ1: Do NLI models adopt the proposed syntactic heuristics in practice?
- RQ2: How do popular models perform on HANS subsets designed to test each heuristic?
- RQ3: Can training with HANS-like examples reduce reliance on these heuristics without harming MNLI performance?
- RQ4: What is the relative contribution of model architecture versus training data to heuristic susceptibility?
Key findings
- All four models perform well on MNLI but fail on the HANS cases where the heuristics wrongly predict entailment: accuracy on non-entailment cases is at or below chance.
- DA and ESIM score near zero on the non-entailment cases of every heuristic subset. For DA, a bag-of-words model with no access to word order, this points to reliance on lexical overlap.
- SPINN shows relatively better performance on subsequence and constituent cases, suggesting some structural benefit from tree-based representations but not universal robustness.
- BERT performs better than other models on constituent and lexical overlap cases but remains far from perfect on HANS.
- Augmenting MNLI with HANS-like examples markedly improves HANS performance across models, though the size of the gain varies by architecture; the effect on MNLI accuracy also differs from model to model.
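The findings above depend on reporting accuracy separately for each heuristic and each gold label, because a model that always predicts entailment looks perfect on the entailment half and fails completely on the other half. The Python sketch below shows one way to compute that breakdown. It assumes that three-way MNLI predictions are collapsed to two classes by treating both "neutral" and "contradiction" as non-entailment; the dictionary keys `heuristic` and `gold` and the toy data are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def to_binary(label):
    """Collapse a three-way MNLI label into the two HANS classes."""
    return "entailment" if label == "entailment" else "non-entailment"


def accuracy_by_subset(examples, predictions):
    """Return accuracy keyed by (heuristic, gold label).

    `examples` is a list of dicts with hypothetical keys "heuristic"
    and "gold"; `predictions` holds the model's three-way labels.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, prediction in zip(examples, predictions):
        key = (example["heuristic"], example["gold"])
        total[key] += 1
        correct[key] += int(to_binary(prediction) == example["gold"])
    return {key: correct[key] / total[key] for key in total}


# Toy data: a model that always answers "entailment".
examples = [
    {"heuristic": "lexical_overlap", "gold": "entailment"},
    {"heuristic": "lexical_overlap", "gold": "non-entailment"},
    {"heuristic": "subsequence", "gold": "non-entailment"},
    {"heuristic": "constituent", "gold": "entailment"},
]
predictions = ["entailment"] * len(examples)

for key, accuracy in sorted(accuracy_by_subset(examples, predictions).items()):
    print(key, accuracy)
```

On this toy input the always-entailment model reaches 1.0 on every entailment subset and 0.0 on every non-entailment subset, which is the pattern a heuristic-driven model would produce.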
This review was created by AI and reviewed by human editors.