QUICK REVIEW

[Paper Review] Evaluating Explainable AI Attribution Methods in Neural Machine Translation via Attention-Guided Knowledge Distillation

Aria Nourbakhsh, Salima Lamsiyah|arXiv (Cornell University)|2026. 03. 11.

Explainable Artificial Intelligence (XAI) | Citations: 0

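One-line summary

This paper proposes a teacher-student framework for evaluating XAI attribution methods in Transformer-based NMT: teacher-derived attribution maps are injected into the attention of a student model, multiple attribution methods are compared across language pairs, and attention-derived attributions are shown to often yield the largest improvements.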

ABSTRACT

The study of the attribution of input features to the output of neural network models is an active area of research. While numerous Explainable AI (XAI) techniques have been proposed to interpret these models, the systematic and automated evaluation of these methods in sequence-to-sequence (seq2seq) models is less explored. This paper introduces a new approach for evaluating explainability methods in transformer-based seq2seq models. We use teacher-derived attribution maps as a structured side signal to guide a student model, and quantify the utility of different attribution methods through the student's ability to simulate targets. Using the Inseq library, we extract attribution scores over source-target sequence pairs and inject these scores into the attention mechanism of a student transformer model under four composition operators (addition, multiplication, averaging, and replacement). Across three language pairs (de-en, fr-en, ar-en) and attributions from Marian-MT and mBART models, Attention, Value Zeroing, and Layer Gradient $\times$ Activation consistently yield the largest gains in BLEU (and corresponding improvements in chrF) relative to baselines. In contrast, other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input $\times$ Gradient, GradientShap) lead to smaller and less consistent improvements. These results suggest that different attribution methods capture distinct signals and that attention-derived attributions better capture alignment between source and target representations in seq2seq models. Finally, we introduce an Attributor transformer that, given a source-target pair, learns to reconstruct the teacher's attribution map. Our findings demonstrate that the more accurately the Attributor can reproduce attribution maps, the more useful an injection of those maps is for the downstream task. The source code can be found on GitHub.

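Motivation and objectives

Promote automated, task-specific evaluation of Explainable AI attribution methods for seq2seq NMT.
Propose a teacher-student framework in which attribution maps guide a student transformer during training.
Systematically compare several attribution methods by injecting them into the attention mechanism.
Investigate which types of attribution best capture source-target alignment in Transformer NMT.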

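Proposed method

Extract attribution maps from teacher NMT models with eight XAI methods using the Inseq library.
Normalize the attribution maps and inject them into encoder-decoder attention through four composition operators (Addition, Multiplication, Averaging, Replacement).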
Train a student model from scratch, rather than starting from a pretrained one, on teacher-forced inputs with attribution-augmented attention.
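Evaluate student performance with MT quality metrics (BLEU, chrF) across language pairs and teacher models.
Introduce an Attributor transformer that learns to reconstruct the teacher's attribution maps, and check how its reconstruction accuracy correlates with downstream MT performance.
Analyze which attribution sources (e.g., teacher attention) and which network components (encoder attention) produce the strongest effect.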
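
As a rough illustration of the injection step listed above, the following Python sketch applies the four composition operators to a toy attention matrix. It is not the authors' implementation: the function names `normalize_rows` and `compose`, the absolute-value row normalization, and the choice to combine already-normalized weights without renormalizing afterwards are all assumptions made for this example, since the review does not state those details.

```python
import numpy as np

def normalize_rows(attr, eps=1e-9):
    # Assumption: take absolute values, then scale each target-token row to sum to 1.
    attr = np.abs(np.asarray(attr, dtype=float))
    return attr / (attr.sum(axis=-1, keepdims=True) + eps)

def compose(attn, attr, op):
    # attn, attr: arrays of shape (target_len, source_len).
    # Hypothetical helper: combines attention weights with an attribution map.
    if op == "addition":
        return attn + attr
    if op == "multiplication":
        return attn * attr
    if op == "averaging":
        return 0.5 * (attn + attr)
    if op == "replacement":
        return attr.copy()
    raise ValueError(f"unknown operator: {op}")

# Toy inputs (illustrative only): 2 target tokens attending over 2 source tokens.
attn = np.array([[0.5, 0.5], [0.9, 0.1]])
attr = normalize_rows([[1.0, 3.0], [2.0, 2.0]])

combined = {
    op: compose(attn, attr, op)
    for op in ("addition", "multiplication", "averaging", "replacement")
}
```

In this sketch only averaging and replacement keep each row summing to 1; addition and multiplication do not, so an actual model would need to decide whether to renormalize or to apply the operators before the softmax.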

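Experimental results

Research questions

RQ1: To what extent can attribution-guided attention priors help a student model reproduce gold translations in the teacher setting, or imitate the teacher in the oracle setting?
RQ2: Which XAI attribution methods produce the maps most helpful for simulating the teacher's input-output behavior in NMT?
RQ3: How does injecting attribution signals into attention affect translation quality across language pairs and models?
RQ4: Does the Attributor network's ability to reproduce attribution maps correlate with downstream MT performance when those maps are used?
RQ5: Which attention components (encoder vs. decoder) are most sensitive to externally injected attribution signals?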

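Key results

Attention-derived attributions, in particular Attention, Value Zeroing, and Layer Gradient × Activation, produced the largest BLEU and chrF gains over the baselines across language pairs and teacher models.
The other gradient-based methods (Saliency, Integrated Gradients, DeepLIFT, Input × Gradient, GradientShap) showed smaller and less consistent improvements.
Attribution maps tied to the teacher's attention tend to guide the student more effectively, which is consistent with the Attributor's ability to reproduce the top-3 salient scores per target token.
The degree to which the Attributor succeeds in reconstructing an attribution map correlates strongly with the student's MT performance when that map is used.
Injecting attribution signals has a stronger effect when applied to encoder attention than to decoder components.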
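
The top-3 comparison mentioned in the results can be pictured with a small Python sketch. The review only says that the Attributor is judged on reproducing the top-3 salient scores per target token, so the metric below (mean fraction of shared top-k source positions), the function name `topk_overlap`, and the toy matrices are assumptions for illustration, not the paper's definition or data.

```python
import numpy as np

def topk_overlap(pred, ref, k=3):
    # For each target token (row), compare the k highest-scoring source
    # positions in the predicted map with those in the reference map,
    # then average the shared fraction over all rows.
    pred_top = np.argsort(-np.asarray(pred), axis=-1)[:, :k]
    ref_top = np.argsort(-np.asarray(ref), axis=-1)[:, :k]
    shared = [len(set(p) & set(r)) / k for p, r in zip(pred_top, ref_top)]
    return float(np.mean(shared))

# Toy maps (illustrative only): 2 target tokens, 5 source tokens.
teacher_map = np.array([[0.40, 0.30, 0.20, 0.10, 0.00],
                        [0.00, 0.10, 0.20, 0.30, 0.40]])
attributor_map = np.array([[0.35, 0.05, 0.30, 0.20, 0.10],
                           [0.10, 0.00, 0.30, 0.20, 0.40]])

score = topk_overlap(attributor_map, teacher_map, k=3)
```

A score of this kind, computed per attribution method, is the sort of quantity one could then relate to the student's BLEU or chrF gains; no such numbers are reproduced here.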


This review was generated by AI and reviewed by a human editor.