[Paper Review] Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns
ExpNet is a lightweight explainer that maps transformer attention patterns to token-level importance by supervised learning on human rationales, achieving cross-task generalization and outperforming baselines in F1 across SST-2, CoLA, and HateXplain.
Explainable AI (XAI) has become critical as transformer-based models are deployed in high-stakes applications including healthcare, legal systems, and financial services, where opacity hinders trust and accountability. Transformer self-attention mechanisms have proven valuable for model interpretability, with attention weights successfully used to understand model focus and behavior (Xu et al., 2015; Wiegreffe and Pinter, 2019). However, existing attention-based explanation methods rely on manually defined aggregation strategies and fixed attribution rules (Abnar and Zuidema, 2020a; Chefer et al., 2021), while model-agnostic approaches (LIME, SHAP) treat the model as a black box and incur significant computational costs through input perturbation. We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. Unlike prior methods, ExpNet learns how to combine attention features automatically rather than relying on predetermined rules. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques spanning four methodological families.
Research Motivation and Objectives
- Motivate the need for reliable, human-aligned explanations in transformer-based NLP models.
- Propose a supervised explainer that learns from attention patterns rather than fixed heuristics.
- Demonstrate cross-task generalization of explanations across diverse NLP tasks.
- Evaluate against a wide range of baselines and analyze architectural contributions.
Proposed Method
- Extract two-direction attention features from BERT’s final layer for each token: task-to-token (CLS to token) and token-to-task (token to CLS) across all heads.
- Represent each token by a 2H-dimensional feature vector combining all head-specific attention values.
- Train a lightweight MLP (ExpNet) to map token features to a binary importance score with a sigmoid output.
- Supervise ExpNet using human-provided word-level rationales aligned to token predictions, projecting word labels to subword tokens.
- Handle class imbalance with focal loss and apply correct-prediction filtering to train only on instances where the classifier is correct.
- Evaluate explanations with token-level F1 and AUROC, using a leave-one-task-out cross-task protocol on SST-2, CoLA, and HateXplain.
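The steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the ReLU hidden layer, the label of 0 for special tokens, and the focal-loss defaults (`gamma=2.0`, `alpha=0.25`) are all hypothetical choices, and the attention tensor is assumed to be indexed as `attn[head][query][key]` with `[CLS]` at position 0.

```python
import math

CLS = 0  # assumed position of the [CLS] token


def token_features(attn):
    """Build one 2H-dimensional vector per token from final-layer attention.

    attn[h][i][j] is the attention weight from query i to key j in head h.
    Each vector holds H CLS-to-token weights followed by H token-to-CLS weights.
    """
    num_heads, seq_len = len(attn), len(attn[0])
    feats = []
    for t in range(seq_len):
        cls_to_tok = [attn[h][CLS][t] for h in range(num_heads)]
        tok_to_cls = [attn[h][t][CLS] for h in range(num_heads)]
        feats.append(cls_to_tok + tok_to_cls)
    return feats


def project_word_labels(word_labels, word_ids):
    """Copy each word-level rationale label to all of its subword tokens.

    word_ids[k] is the word index of subword k, or None for special tokens,
    which receive label 0 here (an assumption).
    """
    return [0 if w is None else word_labels[w] for w in word_ids]


def mlp_score(x, w1, b1, w2, b2):
    """One-hidden-layer MLP with a sigmoid output (ReLU hidden layer assumed)."""
    hidden = [max(0.0, sum(wi * xi for wi, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))


def focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss for a predicted probability p and a label y in {0, 1}."""
    p = min(max(p, eps), 1.0 - eps)
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


def leave_one_task_out(tasks):
    """Yield (training tasks, held-out task) pairs for cross-task evaluation."""
    for held_out in tasks:
        yield [t for t in tasks if t != held_out], held_out


# Toy input: 2 heads, 3 tokens ([CLS] plus two subwords).
attn = [
    [[0.5, 0.3, 0.2], [0.6, 0.3, 0.1], [0.7, 0.1, 0.2]],
    [[0.2, 0.2, 0.6], [0.1, 0.8, 0.1], [0.4, 0.4, 0.2]],
]
feats = token_features(attn)
labels = project_word_labels([0, 1], [None, 0, 1, 1, None])
splits = list(leave_one_task_out(["SST-2", "CoLA", "HateXplain"]))
```

In this sketch, correct-prediction filtering would amount to skipping any training instance whose classifier prediction differs from its gold label before the features are computed.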

Experimental Results
Research Questions
- RQ1: Can a learned mapping from attention patterns to token importance generalize across NLP tasks?
- RQ2: How does ExpNet compare to model-agnostic and attention-based baselines in cross-task settings?
- RQ3: What architectural components of ExpNet contribute most to explanation quality across domains?
Key Findings
| Explainer | SST-2 F1 | CoLA F1 | HateXplain F1 |
|---|---|---|---|
| RandomBaseline | 0.258 | 0.387 | 0.293 |
| SHAP | 0.330±0.033 | 0.330±0.056 | 0.276±0.007 |
| LIME | 0.347±0.033 | 0.323±0.053 | 0.290±0.007 |
| Integrated Gradients | 0.287±0.032 | 0.342±0.054 | 0.345±0.008 |
| RawAt | 0.327±0.029 | 0.353±0.054 | 0.362±0.007 |
| Rollout | 0.133±0.025 | 0.347±0.055 | 0.356±0.007 |
| LRP | 0.339±0.030 | 0.355±0.054 | 0.372±0.008 |
| FullLRP | 0.218±0.030 | 0.336±0.055 | 0.346±0.007 |
| GAE | 0.350±0.030 | 0.354±0.053 | 0.391±0.007 |
| CAM | 0.243±0.032 | 0.355±0.054 | 0.332±0.007 |
| GradCAM | 0.234±0.031 | 0.356±0.056 | 0.396±0.008 |
| AttCAT | 0.280±0.034 | 0.345±0.055 | 0.340±0.007 |
| MGAE | 0.350±0.030 | 0.354±0.053 | 0.391±0.007 |
| ExpNet | 0.398±0.024 | 0.468±0.079 | 0.473±0.007 |
- ExpNet achieves the highest token F1 across all three tasks in cross-task evaluation.
- On CoLA, ExpNet reaches F1 = 0.468 ± 0.079, a 31% relative improvement over the best baseline (GradCAM: 0.356).
- On HateXplain, ExpNet reaches F1 = 0.473 ± 0.007, a 19% improvement over the strongest baseline (GradCAM: 0.396).
- On SST-2, ExpNet achieves F1 = 0.398 ± 0.024, 14% higher than the best baseline (GAE/MGAE: 0.350).
- ExpNet demonstrates cross-task generalization by outperforming task-tuned baselines when trained on two datasets and evaluated on the held-out third.
- ExpNet's AUROC is consistently competitive, often exceeding 0.7, and is the highest among all methods on SST-2.
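The relative improvements quoted above follow from the F1 values in the table. As a quick check, the short sketch below recomputes them; `relative_improvement` is a hypothetical helper name, and the numbers are copied from the table.

```python
def relative_improvement(score, baseline):
    """Relative gain of score over baseline, in percent."""
    return 100.0 * (score - baseline) / baseline


# (ExpNet F1, best baseline F1) per task, taken from the results table.
results = {
    "CoLA": (0.468, 0.356),        # best baseline: GradCAM
    "HateXplain": (0.473, 0.396),  # best baseline: GradCAM
    "SST-2": (0.398, 0.350),       # best baseline: GAE/MGAE
}

for task, (expnet, best) in results.items():
    print(f"{task}: {relative_improvement(expnet, best):.1f}%")
# CoLA: 31.5%
# HateXplain: 19.4%
# SST-2: 13.7%
```

These values agree with the 31%, 19%, and 14% figures reported in the findings.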

This review was generated by AI and checked by a human editor.