QUICK REVIEW

[Paper Review] Do Attention Heads in BERT Track Syntactic Dependencies?

Phu Mon Htut, Jason Phang | arXiv (Cornell University) | Nov 27, 2019

Topic Modeling | 24 references | 88 citations

TL;DR

The paper analyzes whether individual attention heads in BERT, RoBERTa, and fine-tuned BERT variants implicitly capture syntactic dependency relations. It extracts dependencies from attention weights with two methods, maximum attention weight (Max) and maximum spanning tree (MST), and compares them to gold Universal Dependencies (UD) trees.

ABSTRACT

We investigate the extent to which individual attention heads in pretrained transformer language models, such as BERT and RoBERTa, implicitly capture syntactic dependency relations. We employ two methods, taking the maximum attention weight and computing the maximum spanning tree, to extract implicit dependency relations from the attention weights of each layer/head, and compare them to the ground-truth Universal Dependency (UD) trees. We show that, for some UD relation types, there exist heads that can recover the dependency type significantly better than baselines on parsed English text, suggesting that some self-attention heads act as a proxy for syntactic structure. We also analyze BERT fine-tuned on two datasets, the syntax-oriented CoLA and the semantics-oriented MNLI, to investigate whether fine-tuning affects the patterns of their self-attention, but we do not observe substantial differences in the overall dependency relations extracted using our methods. Our results suggest that these models have some specialist attention heads that track individual dependency types, but no generalist head that performs holistic parsing significantly better than a trivial baseline, and that analyzing attention weights directly may not reveal much of the syntactic knowledge that BERT-style models are known to learn.

Motivation & Objective

Assess whether self-attention heads in BERT/RoBERTa track syntactic dependency relations.
Determine if certain heads act as specialists for specific dependencies (e.g., nsubj, obj).
Evaluate whether fine-tuning on syntax- or semantics-related tasks alters attention-based syntactic signals.
Compare extraction methods to ground-truth UD trees without additional training.
Contrast specialist heads versus holistic parsing capabilities of the models.

Proposed method

Extract dependency relations from each attention head and layer using the attention weight matrix.
Apply the Max method by selecting, for each token, the highest-attention parent to form relations.
Apply the Maximum Spanning Tree (MST) method to construct a complete dependency tree via the Chu-Liu-Edmonds algorithm.
Evaluate extracted relations/trees against English Parallel Universal Dependencies (PUD) as gold standard.
Exclude special tokens and merge subword tokens so that the attention matrix aligns with the word-level tokens of the gold annotations.
Compare pretrained BERT/RoBERTa and fine-tuned variants (CoLA-BERT, MNLI-BERT) on relation extraction performance.
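
The subword merging and Max steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' code. Several details are assumptions that the review does not specify: the `word_ids` input format, the merging convention (attention to a split word is summed over its pieces, attention from a split word is averaged over them), skipping self-attention when picking the maximum, and scoring edges without regard to direction. All function names are hypothetical.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Edge = Tuple[int, int]


def merge_subwords(attn: Matrix, word_ids: List[int]) -> Matrix:
    """Collapse a token-level attention matrix to word level.

    word_ids[t] is the index of the word that token t belongs to
    (hypothetical format; special tokens are assumed already removed).
    Assumed convention: sum attention *to* the pieces of a word,
    average attention *from* the pieces of a word.
    """
    n_words = max(word_ids) + 1
    sums = [[0.0] * n_words for _ in range(n_words)]
    counts = [0] * n_words
    for t, row in enumerate(attn):
        counts[word_ids[t]] += 1
        for u, weight in enumerate(row):
            sums[word_ids[t]][word_ids[u]] += weight
    return [[v / counts[w] for v in sums[w]] for w in range(n_words)]


def max_attention_edges(attn: Matrix, skip_self: bool = True) -> List[Edge]:
    """Max method: link each word i to the word j it attends to most."""
    edges = []
    for i, row in enumerate(attn):
        # Assumption: a word is not allowed to pick itself.
        candidates = [j for j in range(len(row)) if not (skip_self and j == i)]
        if candidates:
            edges.append((i, max(candidates, key=lambda j: row[j])))
    return edges


def relation_accuracy(pred_edges: List[Edge], gold_edges: List[Edge]) -> float:
    """Share of gold (dependent, head) pairs recovered, ignoring direction."""
    pred = {frozenset(e) for e in pred_edges}
    hits = sum(1 for e in gold_edges if frozenset(e) in pred)
    return hits / len(gold_edges)


if __name__ == "__main__":
    # Toy head over the tokens ["She", "play", "##ed", "chess"].
    token_attn = [
        [0.1, 0.5, 0.3, 0.1],
        [0.2, 0.1, 0.1, 0.6],
        [0.4, 0.2, 0.2, 0.2],
        [0.1, 0.4, 0.4, 0.1],
    ]
    word_attn = merge_subwords(token_attn, [0, 1, 1, 2])
    edges = max_attention_edges(word_attn)
    print(edges)  # [(0, 1), (1, 2), (2, 1)]
    # Gold pairs as (dependent, head): She -> played, chess -> played.
    print(relation_accuracy(edges, [(0, 1), (2, 1)]))  # 1.0
```

To score a single dependency type such as nsubj, a caller would pass only the gold pairs carrying that label to `relation_accuracy`, once per layer and head.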

Experimental results

Research questions

RQ1: Do individual attention heads in BERT/RoBERTa reliably encode specific syntactic dependency relations?
RQ2: Can two simple, training-free methods (Max and MST) recover meaningful dependency structure from attention weights?
RQ3: Does fine-tuning on syntax-oriented (CoLA) or semantics-oriented (MNLI) tasks alter the syntactic signals captured by attention heads?
RQ4: Is there a generalist attention head that enables holistic parsing better than trivial baselines?

Key findings

Some attention heads specialize in tracking certain dependency types (e.g., nsubj, obj) with significantly higher accuracy than baselines.
Fine-tuning on MNLI improves long-distance clausal dependencies but slightly hurts shorter-distance dependencies; CoLA fine-tuning shows little impact.
MST-based trees from attention weights do not meaningfully outperform baselines, indicating a lack of generalist heads for holistic parsing.
Trained models surpass randomly initialized models and simple baselines on several dependency types, but overall gains in undirected unlabeled attachment score (UUAS) are modest.
Fine-tuning on CoLA or MNLI does not drastically change overall self-attention patterns in the context of their analysis.
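
To make the tree-level comparison concrete, the sketch below scores a predicted tree by the share of gold edges it recovers when direction is ignored, and compares it with a trivial positional baseline. The right-branching baseline (each word linked to its right neighbor) and the example sentence are illustrative assumptions, and the function names are hypothetical. The point is only that a baseline of this kind is already strong on English, so a head has to clear a high bar to count as a generalist parser.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


def uuas(pred_edges: List[Edge], gold_edges: List[Edge]) -> float:
    """Share of gold edges that also appear in the prediction, ignoring direction."""
    gold = {frozenset(e) for e in gold_edges}
    pred = {frozenset(e) for e in pred_edges}
    return len(gold & pred) / len(gold)


def right_branching(n_words: int) -> List[Edge]:
    """Trivial baseline (assumed here): link every word to its right neighbor."""
    return [(i, i + 1) for i in range(n_words - 1)]


if __name__ == "__main__":
    # "The dog chased a cat", gold pairs as (dependent, head); "chased" is the root.
    gold = [(0, 1), (1, 2), (3, 4), (4, 2)]
    print(uuas(right_branching(5), gold))  # 0.75
    # A hypothetical tree extracted from one head that happens to match the gold tree.
    head_tree = [(0, 1), (1, 2), (2, 4), (3, 4)]
    print(uuas(head_tree, gold))  # 1.0
```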

This review was created by AI and reviewed by human editors.