[Paper Review] Are Pre-trained Language Models Aware of Phrases? Simple but Strong Baselines for Grammar Induction
The paper shows that constituency trees can be induced zero-shot from pre-trained Transformer LMs by computing syntactic distances over their hidden representations and attention distributions. The result is a set of strong baselines for English grammar induction, together with an analysis of how a right-branching bias affects the induced trees.
With the recent success and popularity of pre-trained language models (LMs) in natural language processing, there has been a rise in efforts to understand their inner workings. In line with such interest, we propose a novel method that assists us in investigating the extent to which pre-trained LMs capture the syntactic notion of constituency. Our method provides an effective way of extracting constituency trees from the pre-trained LMs without training. In addition, we report intriguing findings in the induced trees, including the fact that pre-trained LMs outperform other approaches in correctly demarcating adverb phrases in sentences.
Motivation & Objective
- Investigate whether pre-trained language models capture constituency-like syntactic structure without training or task-specific modules.
- Extract constituency trees from pre-trained LMs using attention-based syntactic distance.
- Evaluate induced trees as a baseline for English grammar induction on PTB and MNLI.
- Analyze which LM layers and attention heads encode phrase structure information.
- Explore biases (e.g., right-skewness) to understand English syntactic tendencies in induced trees.
Proposed method
- Represent each word by averaging its subword representations to obtain word-level vectors from each LM layer.
- Compute syntactic distance d_i between adjacent words using a chosen distance function f on representations g(w_i) and g(w_{i+1}).
- Construct constituency trees from the distance vector d following Shen et al. (2018a,b) without any training or task-specific modules.
- Use multiple f (COS, L1, L2, JSD, HEL) and g (layer-wise representations, attention distributions) options to compare performance.
- Optionally add a right-skewness bias to each distance, of the form λ · AVG(d) · (1 - linear term in the word position), to explore the right-branching preference of English constituents.
- Evaluate eight LM variants: BERT-base/large, GPT-2/GPT-2-medium, RoBERTa-base/large, and XLNet-base/large.
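
The steps above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the toy vectors stand in for real LM outputs, `jsd` is the standard Jensen-Shannon divergence (the paper's exact definitions of f may differ), and the linear term of the bias is assumed to fall from 1 at the first gap to 0 at the last. The tree builder uses the usual top-down reading of a distance vector: split at the largest distance and recurse on both sides.

```python
import math
from typing import Callable, List, Sequence, Tuple, Union

Vector = Sequence[float]
Tree = Union[str, Tuple["Tree", "Tree"]]


def average_subwords(subword_vecs: Sequence[Vector]) -> List[float]:
    """Average the subword vectors of one word into a single word vector."""
    n = len(subword_vecs)
    return [sum(col) / n for col in zip(*subword_vecs)]


def l2(u: Vector, v: Vector) -> float:
    """Euclidean distance, one possible f for hidden representations."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def jsd(p: Vector, q: Vector) -> float:
    """Standard Jensen-Shannon divergence, one possible f for attention rows."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x: Vector, y: Vector) -> float:
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def syntactic_distances(reps: Sequence[Vector],
                        f: Callable[[Vector, Vector], float]) -> List[float]:
    """d_i = f(g(w_i), g(w_{i+1})) for every pair of adjacent words."""
    return [f(reps[i], reps[i + 1]) for i in range(len(reps) - 1)]


def add_right_skew(dists: Sequence[float], lam: float) -> List[float]:
    """Add lam * AVG(d) * (1 - i / (m - 1)) to d_i (assumed linear term)."""
    m = len(dists)
    if m < 2:
        return list(dists)
    avg = sum(dists) / m
    return [d + lam * avg * (1.0 - i / (m - 1)) for i, d in enumerate(dists)]


def build_tree(words: Sequence[str], dists: Sequence[float]) -> Tree:
    """Split at the largest distance, then recurse on the left and right parts."""
    if len(words) == 1:
        return words[0]
    k = max(range(len(dists)), key=lambda i: dists[i])
    return (build_tree(words[:k + 1], dists[:k]),
            build_tree(words[k + 1:], dists[k + 1:]))


words = ["the", "cat", "sat", "down"]
reps = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.2, 1.0]]  # toy vectors, not LM output
tree = build_tree(words, syntactic_distances(reps, l2))
print(tree)  # (('the', 'cat'), ('sat', 'down'))
```

With equal distances, a positive λ makes the earliest gap the largest, so the tree becomes fully right-branching: `build_tree(list("abcd"), add_right_skew([1.0, 1.0, 1.0], 1.0))` gives `('a', ('b', ('c', 'd')))`.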
Experimental results
Research questions
- RQ1: Can pre-trained LMs yield linguistically plausible constituency trees without fine-tuning or extra components?
- RQ2: Which LM representations (layer, attention head, or their ensemble) best support zero-shot constituency induction?
- RQ3: Do syntactic distance-based trees capture English right-branching tendencies when biases are added?
- RQ4: How do induced parses compare to gold-standard PTB trees and to MNLI-derived parses across domains?
- RQ5: What syntactic knowledge do different LMs especially capture (e.g., SBAR, VP, ADJP, ADVP)?
Key findings
- Pre-trained LMs provide competitive S-F1 scores for English grammar induction without additional training.
- Applying a right-skewness bias to syntactic distances further improves S-F1 by up to about 10 points, especially for SBAR and VP.
- Attention-based distances (G^d) often yield better parsing results than hidden representations (G^v).
- XLNet-based models frequently outperform others across layers, with middle layers commonly most informative for parsing.
- ADJP and ADVP categories are particularly well captured by certain LMs, while NP recall remains strong but not dominant.
- Using bias and larger models generally helps, and ensemble averages of attention distributions (per layer) often outperform individual heads.
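
For readers unfamiliar with the metric, S-F1 (sentence-level F1) compares the unlabeled constituent spans of a predicted tree with those of a gold tree, sentence by sentence, and averages the per-sentence scores. The sketch below is a simplified illustration with hypothetical function names: trees are nested tuples of words, only the whole-sentence span is dropped, and the punctuation handling and other preprocessing of the paper's evaluation are not reproduced.

```python
from typing import List, Set, Tuple, Union

Tree = Union[str, tuple]
Span = Tuple[int, int]


def spans(tree: Tree, start: int = 0) -> Tuple[Set[Span], int]:
    """Return the (start, end) spans of all internal nodes and the word count."""
    if isinstance(tree, str):
        return set(), 1
    found: Set[Span] = set()
    pos = start
    for child in tree:
        child_spans, width = spans(child, pos)
        found |= child_spans
        pos += width
    found.add((start, pos))
    return found, pos - start


def sentence_f1(pred: Tree, gold: Tree) -> float:
    """Unlabeled span F1 for one sentence, ignoring the whole-sentence span."""
    pred_spans, n = spans(pred)
    gold_spans, _ = spans(gold)
    pred_spans.discard((0, n))
    gold_spans.discard((0, n))
    if not pred_spans or not gold_spans:
        # No non-trivial spans to compare: count as a match only if both are empty.
        return 1.0 if pred_spans == gold_spans else 0.0
    overlap = len(pred_spans & gold_spans)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_spans)
    recall = overlap / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


def mean_sentence_f1(preds: List[Tree], golds: List[Tree]) -> float:
    """Average the per-sentence F1 scores over a corpus."""
    scores = [sentence_f1(p, g) for p, g in zip(preds, golds)]
    return sum(scores) / len(scores)


pred = (("the", "cat"), ("sat", "down"))
gold = ("the", ("cat", ("sat", "down")))
print(sentence_f1(pred, gold))  # 0.5
```

In this toy pair, the predicted tree shares one of its two non-trivial spans with the gold tree, so precision and recall are both 0.5.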
This review was created by AI and reviewed by human editors.