[Paper Review] XPROAX-Local explanations for text classification with progressive neighborhood approximation
XPROAX proposes a local, model-agnostic explanation method for text classifiers. It approximates the neighborhood of an input progressively in a latent space in two stages: counterfactual instances serve as initial landmarks, and targeted sampling then refines them to generate meaningful factual and counterfactual instances. The authors report better explanation usefulness, stability, completeness, compactness, and correctness than state-of-the-art baselines such as LIME and XSPELLS.
The importance of the neighborhood for training a local surrogate model to approximate the local decision boundary of a black box classifier has already been highlighted in the literature. Several attempts have been made to construct a better neighborhood for high dimensional data, like texts, by using generative autoencoders. However, existing approaches mainly generate neighbors by selecting purely at random from the latent space and struggle under the curse of dimensionality to learn a good local decision boundary. To overcome this problem, we propose a progressive approximation of the neighborhood using counterfactual instances as initial landmarks and a careful 2-stage sampling approach to refine counterfactuals and generate factuals in the neighborhood of the input instance to be explained. Our work focuses on textual data and our explanations consist of both word-level explanations from the original instance (intrinsic) and the neighborhood (extrinsic) and factual and counterfactual instances discovered during the neighborhood generation process that further reveal the effect of altering certain parts in the input text. Our experiments on real-world datasets demonstrate that our method outperforms the competitors in terms of usefulness and stability (for the qualitative part) and completeness, compactness and correctness (for the quantitative part).
Motivation & Objective
- To address the lack of effective local neighborhood generation for text classification explanations due to high dimensionality and sparse data.
- To overcome limitations of random sampling in latent spaces used by existing methods like XSPELLS.
- To improve explanation quality by incorporating both intrinsic (original text words) and extrinsic (neighborhood words) word-level explanations.
- To develop a quantitative evaluation framework for explanation completeness, compactness, and correctness.
- To demonstrate that neighborhood exploration beyond the input text yields more comprehensive and stable explanations.
Proposed method
- XPROAX uses a generative autoencoder to map input texts into a neighborhood-preserving latent space.
- It initializes the neighborhood with counterfactual instances—texts that would change the model’s prediction—serving as landmarks.
- A two-stage sampling process progressively refines these counterfactuals: first generating more plausible counterfactuals, then generating factual instances in the local neighborhood.
- The method extracts word-level explanations from both the original input (intrinsic) and the generated neighborhood (extrinsic) for comprehensive insight.
- It constructs local surrogate models trained on the refined neighborhood to approximate the black-box decision boundary.
- An automatic evaluation framework quantifies explanations using completeness, compactness, and correctness metrics based on confidence drop after explanation-guided edits.
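The two-stage idea above can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D latent vectors, the toy `black_box` (a stand-in for decoding a latent vector and classifying the resulting text), the bisection used in stage 1, and the Gaussian jitter used in stage 2 are all assumptions chosen to keep the example self-contained.

```python
import random

def black_box(z):
    """Hypothetical stand-in for decode-then-classify: label 1 if z[0] + z[1] > 1."""
    return int(z[0] + z[1] > 1.0)

def interpolate(a, b, t):
    """Point at fraction t on the segment from a to b."""
    return [ai + t * (bi - ai) for ai, bi in zip(a, b)]

def refine_counterfactual(z_input, landmark, predict, steps=20):
    """Stage 1 (assumed): bisect between input and landmark to get a closer counterfactual.

    Returns the closest counterfactual found and the closest factual found.
    """
    y_input = predict(z_input)
    if predict(landmark) == y_input:
        raise ValueError("landmark must be a counterfactual of the input")
    lo, hi = 0.0, 1.0  # invariant: point at lo is factual, point at hi is counterfactual
    for _ in range(steps):
        mid = (lo + hi) / 2
        if predict(interpolate(z_input, landmark, mid)) != y_input:
            hi = mid
        else:
            lo = mid
    return interpolate(z_input, landmark, hi), interpolate(z_input, landmark, lo)

def sample_factuals(z_input, boundary_factual, predict, n, scale, rng):
    """Stage 2 (assumed): jitter points between the input and the boundary,
    keeping only candidates that receive the input's label."""
    y_input = predict(z_input)
    factuals, attempts = [], 0
    while len(factuals) < n and attempts < 100 * n:
        attempts += 1
        base = interpolate(z_input, boundary_factual, rng.random())
        candidate = [v + rng.gauss(0.0, scale) for v in base]
        if predict(candidate) == y_input:
            factuals.append(candidate)
    return factuals

def build_neighborhood(z_input, landmarks, predict, n_per_landmark=5, scale=0.05, seed=0):
    """Collect refined counterfactuals and nearby factuals around the input."""
    rng = random.Random(seed)
    factuals, counterfactuals = [], []
    for landmark in landmarks:
        cf, boundary_factual = refine_counterfactual(z_input, landmark, predict)
        counterfactuals.append(cf)
        factuals.extend(
            sample_factuals(z_input, boundary_factual, predict, n_per_landmark, scale, rng)
        )
    return factuals, counterfactuals

# Toy usage: the input has label 0, both landmarks have label 1.
z_input = [0.2, 0.3]
landmarks = [[1.5, 1.0], [0.9, 0.8]]
factuals, counterfactuals = build_neighborhood(z_input, landmarks, black_box)
```

In a real pipeline the labeled factuals and counterfactuals would then be decoded to text and used to fit the local surrogate model; that step is omitted here.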
Experimental results
Research questions
- RQ1: Can a progressive, landmark-based neighborhood approximation in the latent space improve the quality of local explanations for text classifiers?
- RQ2: How does incorporating extrinsic words from the neighborhood affect explanation stability and usefulness compared to relying only on intrinsic words?
- RQ3: To what extent does a structured sampling strategy outperform random sampling in latent space for generating faithful and meaningful neighbors?
- RQ4: How do the proposed quantitative metrics (completeness, compactness, and correctness) correlate with human evaluation of explanation quality?
- RQ5: Can the method maintain high fidelity and stability when applied to diverse text classification models and datasets?
Key findings
- XPROAX achieved the highest completeness in all experimental settings, with a mean confidence drop of 0.740 ± 0.22 in the Yelp-RF setting and 0.825 ± 0.35 in the Yelp-DNN setting.
- It achieved the highest compactness in three out of four settings, with a mean confidence drop per operation of 0.417 ± 0.33 on Yelp-RF and 0.302 ± 0.43 on Yelp-DNN.
- The method showed a significant improvement in correctness, with a ∆η (change in compactness when threshold increases from 0.1 to 0.3) of +0.153 on Yelp-RF and +0.206 on Yelp-DNN, outperforming XSPELLS and the baseline.
- On the Amazon datasets, XPROAX achieved a confidence drop of 0.506 ± 0.20 (completeness) and 0.354 ± 0.21 (compactness) with the RF model, and 0.665 ± 0.21 and 0.298 ± 0.25 with the DNN model.
- XPROAX outperformed LIME in completeness and compactness across all datasets, despite LIME achieving slightly higher correctness due to lower initial compactness.
- The results confirm that neighborhood exploration beyond the input text yields more comprehensive and stable explanations than methods relying solely on intrinsic words or random sampling in latent space.
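To make the reported numbers easier to read, here is a minimal sketch of how a confidence drop (completeness) and a confidence drop per operation (compactness) could be computed. It follows only the descriptions given in this review; the paper's exact definitions, including the threshold used for correctness, are not reproduced. The toy `confidence` function, the word-removal edit, and the example sentence are hypothetical.

```python
POSITIVE_WORDS = {"great", "tasty", "friendly"}  # hypothetical toy vocabulary

def confidence(text):
    """Toy confidence for the 'positive' class: hits / (hits + 1)."""
    hits = sum(tok.lower() in POSITIVE_WORDS for tok in text.split())
    return hits / (hits + 1)

def apply_edits(text, words_to_remove):
    """Remove the explanation words; each removed token counts as one operation."""
    kept, n_ops = [], 0
    for tok in text.split():
        if tok.lower() in words_to_remove:
            n_ops += 1
        else:
            kept.append(tok)
    return " ".join(kept), n_ops

def completeness(text, words_to_remove, conf_fn=confidence):
    """Confidence drop after applying all explanation-guided edits."""
    edited, _ = apply_edits(text, words_to_remove)
    return conf_fn(text) - conf_fn(edited)

def compactness(text, words_to_remove, conf_fn=confidence):
    """Confidence drop per edit operation (0.0 if no edit was applied)."""
    edited, n_ops = apply_edits(text, words_to_remove)
    if n_ops == 0:
        return 0.0
    return (conf_fn(text) - conf_fn(edited)) / n_ops

text = "great food and friendly staff with tasty desserts"
explanation_words = {"great", "tasty"}
drop = completeness(text, explanation_words)          # 0.75 - 0.5 = 0.25
drop_per_op = compactness(text, explanation_words)    # 0.25 / 2 = 0.125
```

Under this reading, a method scores high on completeness when its explanation-guided edits remove most of the classifier's confidence, and high on compactness when it does so with few edits.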
This review was created by AI and reviewed by human editors.