QUICK REVIEW

[Paper Review] Learning to Explain: Supervised Token Attribution from Transformer Attention Patterns

George A. Mihaila|arXiv (Cornell University)|Jan 20, 2026

Explainable Artificial Intelligence (XAI)|Citations: 0

One-Sentence Summary

ExpNet is a lightweight explainer that maps transformer attention patterns to token-level importance by supervised learning on human rationales, achieving cross-task generalization and outperforming baselines in F1 across SST-2, CoLA, and HateXplain.

ABSTRACT

Explainable AI (XAI) has become critical as transformer-based models are deployed in high-stakes applications including healthcare, legal systems, and financial services, where opacity hinders trust and accountability. Transformers' self-attention mechanisms have proven valuable for model interpretability, with attention weights successfully used to understand model focus and behavior (Xu et al., 2015; Wiegreffe and Pinter, 2019). However, existing attention-based explanation methods rely on manually defined aggregation strategies and fixed attribution rules (Abnar and Zuidema, 2020a; Chefer et al., 2021), while model-agnostic approaches (LIME, SHAP) treat the model as a black box and incur significant computational costs through input perturbation. We introduce Explanation Network (ExpNet), a lightweight neural network that learns an explicit mapping from transformer attention patterns to token-level importance scores. Unlike prior methods, ExpNet discovers optimal attention feature combinations automatically rather than relying on predetermined rules. We evaluate ExpNet in a challenging cross-task setting and benchmark it against a broad spectrum of model-agnostic methods and attention-based techniques spanning four methodological families.

Motivation and Objectives

Motivate the need for reliable, human-aligned explanations in transformer-based NLP models.
Propose a supervised explainer that learns from attention patterns rather than fixed heuristics.
Demonstrate cross-task generalization of explanations across diverse NLP tasks.
Evaluate against a wide range of baselines and analyze architectural contributions.

Proposed Method

Extract two-direction attention features from BERT’s final layer for each token: task-to-token (CLS to token) and token-to-task (token to CLS) across all heads.
Represent each token by a 2H-dimensional feature vector combining all head-specific attention values.
Train a lightweight MLP (ExpNet) to map token features to a binary importance score with a sigmoid output.
Supervise ExpNet using human-provided word-level rationales aligned to token predictions, projecting word labels to subword tokens.
Handle class imbalance with focal loss and apply correct-prediction filtering to train only on instances where the classifier is correct.
Evaluate explanations with token-level F1 and AUROC, using a leave-one-task-out cross-task protocol on SST-2, CoLA, and HateXplain.
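
The feature construction and scoring steps above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: the attention tensor is a toy H x T x T nested list standing in for BERT's final-layer attention, and the hidden-layer size, ReLU activation, random weights, and the function names `attention_features` and `mlp_score` are all assumptions made for the example.

```python
import math
import random

def attention_features(attn, cls_index=0):
    """Build one 2H-dimensional feature vector per token.

    attn[h][i][j] is the attention weight from token i to token j in head h
    (one layer only), so attn has shape H x T x T.
    """
    num_heads = len(attn)
    num_tokens = len(attn[0])
    features = []
    for t in range(num_tokens):
        # Task-to-token: how much [CLS] attends to token t, per head.
        cls_to_token = [attn[h][cls_index][t] for h in range(num_heads)]
        # Token-to-task: how much token t attends to [CLS], per head.
        token_to_cls = [attn[h][t][cls_index] for h in range(num_heads)]
        features.append(cls_to_token + token_to_cls)
    return features

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_score(x, w1, b1, w2, b2):
    """One hidden layer with ReLU (assumed), then a sigmoid output in (0, 1)."""
    hidden = [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w1, b1)
    ]
    return sigmoid(sum(w * h for w, h in zip(w2, hidden)) + b2)

# Toy attention: H = 2 heads, T = 3 tokens, token 0 plays the role of [CLS].
attn = [
    [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6]],
    [[0.1, 0.1, 0.8], [0.3, 0.3, 0.4], [0.7, 0.2, 0.1]],
]
feats = attention_features(attn)

# Untrained random weights, only to show the shapes involved.
rng = random.Random(0)
hidden_size = 4  # assumed; the review does not state the layer sizes
w1 = [[rng.uniform(-1, 1) for _ in range(len(feats[0]))] for _ in range(hidden_size)]
b1 = [0.0] * hidden_size
w2 = [rng.uniform(-1, 1) for _ in range(hidden_size)]
b2 = 0.0
scores = [mlp_score(x, w1, b1, w2, b2) for x in feats]
```

Each token ends up with 2H values (here 4), and the same small network is applied to every token independently, which is what keeps the explainer lightweight.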

Figure 1: ExpNet complete pipeline from BERT’s attention patterns to importance predictions.
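
The two training-time choices in the method, focal loss for class imbalance and correct-prediction filtering, can be illustrated as follows. This is a sketch under stated assumptions: it uses the standard binary focal loss form, the values `gamma=2.0` and `alpha=0.25` are common defaults rather than settings reported in this review, and the dictionary keys `predicted` and `gold` are hypothetical.

```python
import math

def binary_focal_loss(p, y, gamma=2.0, alpha=0.25, eps=1e-7):
    """Focal loss for one token: -alpha_t * (1 - p_t)^gamma * log(p_t).

    p is the predicted probability that the token is important, y is 0 or 1.
    gamma and alpha are assumed defaults, not values taken from the paper.
    """
    p = min(max(p, eps), 1.0 - eps)           # avoid log(0)
    p_t = p if y == 1 else 1.0 - p            # probability of the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def filter_correct(examples):
    """Keep only instances the classifier got right (hypothetical keys)."""
    return [ex for ex in examples if ex["predicted"] == ex["gold"]]

# A confident correct token contributes far less loss than a badly missed one.
easy = binary_focal_loss(0.9, 1)
hard = binary_focal_loss(0.1, 1)

examples = [
    {"id": "a", "predicted": 1, "gold": 1},
    {"id": "b", "predicted": 0, "gold": 1},
    {"id": "c", "predicted": 0, "gold": 0},
]
kept_ids = [ex["id"] for ex in filter_correct(examples)]
```

With `gamma=0` and `alpha=0.5` the function reduces to half the ordinary binary cross-entropy, which makes the down-weighting of easy tokens easy to see by comparison.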

Experimental Results

Research Questions

RQ1: Can a learned mapping from attention patterns to token importance generalize across NLP tasks?
RQ2: How does ExpNet compare to model-agnostic and attention-based baselines in cross-task settings?
RQ3: What architectural components of ExpNet contribute most to explanation quality across domains?
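
RQ1 and RQ2 are evaluated under a leave-one-task-out protocol: the explainer is trained on two datasets and evaluated on the held-out third. A minimal sketch of how those splits are enumerated is shown here; the function name `leave_one_task_out` is illustrative, and the actual training and evaluation calls are omitted.

```python
TASKS = ["SST-2", "CoLA", "HateXplain"]

def leave_one_task_out(tasks):
    """Yield (train_tasks, held_out_task) pairs, one per task."""
    for held_out in tasks:
        train = [t for t in tasks if t != held_out]
        yield train, held_out

splits = list(leave_one_task_out(TASKS))
for train, held_out in splits:
    print(f"train on {train}, evaluate on {held_out}")
```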

Key Findings

Explainer	SST-2 F1	CoLA F1	HateXplain F1
RandomBaseline	0.258	0.387	0.293
SHAP	0.330±0.033	0.330±0.056	0.276±0.007
LIME	0.347±0.033	0.323±0.053	0.290±0.007
Integrated Gradients	0.287±0.032	0.342±0.054	0.345±0.008
RawAt	0.327±0.029	0.353±0.054	0.362±0.007
Rollout	0.133±0.025	0.347±0.055	0.356±0.007
LRP	0.339±0.030	0.355±0.054	0.372±0.008
FullLRP	0.218±0.030	0.336±0.055	0.346±0.007
GAE	0.350±0.030	0.354±0.053	0.391±0.007
CAM	0.243±0.032	0.355±0.054	0.332±0.007
GradCAM	0.234±0.031	0.356±0.056	0.396±0.008
AttCAT	0.280±0.034	0.345±0.055	0.340±0.007
MGAE	0.350±0.030	0.354±0.053	0.391±0.007
ExpNet	0.398±0.024	0.468±0.079	0.473±0.007

ExpNet achieves the highest token F1 across all three tasks in cross-task evaluation.
On CoLA, ExpNet reaches F1 = 0.468 ± 0.079, a 31% relative improvement over the best explanation baseline (GradCAM: 0.356). The random baseline scores 0.387 on this task, above every other explainer but still below ExpNet.
On HateXplain, ExpNet reaches F1 = 0.473 ± 0.007, a 19% relative improvement over the strongest baseline (GradCAM: 0.396).
On SST-2, ExpNet achieves F1 = 0.398 ± 0.024, a 14% relative improvement over the best baseline (GAE/MGAE: 0.350).
ExpNet demonstrates cross-task generalization by outperforming task-tuned baselines when trained on two datasets and evaluated on the held-out third.
ExpNet's AUROC is consistently competitive, often above 0.7, and on SST-2 it is the highest among all methods.
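
The percentage gains quoted above are relative improvements, computed as (ExpNet F1 - baseline F1) / baseline F1. The short check below reproduces them from the table values; the helper name `relative_improvement` is illustrative.

```python
def relative_improvement(new, baseline):
    """Fractional gain of `new` over `baseline`."""
    return (new - baseline) / baseline

# (ExpNet F1, best explanation-baseline F1) taken from the results table.
results = {
    "SST-2": (0.398, 0.350),       # GAE / MGAE
    "CoLA": (0.468, 0.356),        # GradCAM
    "HateXplain": (0.473, 0.396),  # GradCAM
}

gains = {
    task: round(100 * relative_improvement(expnet, baseline))
    for task, (expnet, baseline) in results.items()
}
print(gains)  # {'SST-2': 14, 'CoLA': 31, 'HateXplain': 19}
```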

Figure 2: AUROC values across datasets show ExpNet consistently achieves competitive ranking performance (often above 0.7), generally assigning higher scores to important tokens and lower scores to unimportant ones more effectively than most baselines.
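
For readers unfamiliar with the two metrics reported here, the sketch below computes token-level F1 and AUROC on a toy example. It is illustrative only: the 0.5 threshold used to binarize scores for F1 is an assumption, and the scores and gold labels are made up rather than taken from the paper.

```python
def token_f1(pred, gold):
    """F1 over binary token labels (1 = important)."""
    tp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(pred, gold) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(pred, gold) if p == 0 and g == 1)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def auroc(scores, gold):
    """Probability that a random important token outscores a random
    unimportant one, counting ties as half (pairwise definition)."""
    pos = [s for s, g in zip(scores, gold) if g == 1]
    neg = [s for s, g in zip(scores, gold) if g == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one token of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Made-up importance scores and human rationale labels for five tokens.
scores = [0.9, 0.2, 0.6, 0.4, 0.7]
gold = [1, 0, 1, 0, 0]
pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

f1 = token_f1(pred, gold)   # 2 true positives, 1 false positive -> 0.8
auc = auroc(scores, gold)   # 5 of 6 positive/negative pairs ranked correctly
```

F1 depends on the chosen threshold, while AUROC only depends on how tokens are ranked, which is why the two metrics can disagree about which explainer is best.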

This review was generated by AI and checked by a human editor.