[Paper Review] Valence Induction with a Head-Lexicalized PCFG
This paper proposes a head-lexicalized probabilistic context-free grammar (PCFG) combined with the EM algorithm and inside-outside learning to induce subcategorization frames (valences) for verbs and other content words from large corpora. By modeling head-driven syntactic structure and iteratively tuning probability parameters via frequency estimation, the method achieves accurate, domain-sensitive valence acquisition suitable for large-scale NLP applications.
This paper presents an experiment in learning valences (subcategorization frames) from a 50 million word text corpus, based on a lexicalized probabilistic context free grammar. Distributions are estimated using a modified EM algorithm. We evaluate the acquired lexicon both by comparison with a dictionary and by entropy measures. Results show that our model produces highly accurate frame distributions.
Motivation & Objective
- To address the challenge of automatically acquiring subcategorization frames for large lexical resources.
- To model valence patterns that vary across genres and domains, reflecting real linguistic variation.
- To develop a scalable, linguistically interpretable method for learning probabilistic subcategorization frames from raw text.
- To integrate word co-occurrence patterns (e.g., collocations) into syntactic structure for improved parsing and frame estimation.
- To enable iterative, data-driven tuning of grammar parameters using the EM algorithm and inside-outside procedure.
Proposed method
- The approach uses a head-lexicalized PCFG formalism where rules are annotated with head words, enabling lexicalized probability estimation.
- A modified inside-outside algorithm estimates head-lexicalized rule and lexical choice frequencies from a corpus, using the EM algorithm for iterative parameter tuning.
- The grammar employs phrasal-level complementation rules (e.g., vfp → vfc′ np) with head marking to project lexical heads up syntactic structure.
- A state or n-gram rule system enables robust parsing of nearly all sentences (97%) by modeling transitions between phrasal categories as a finite-state machine.
- The model computes sentence and tree probabilities via sum-max parsing: within chunks, the inside algorithm sums probabilities over all analyses, while above the chunk level only the highest-probability tree is selected.
- Word selection is modeled via a head-conditional bigram model threaded through syntactic trees, capturing collocational tendencies.
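The difference between the "sum" and "max" operations in the parsing step can be sketched on a toy grammar. This is a minimal illustration, not the paper's grammar or implementation: the rules, probabilities, and the function name `chart_prob` are all assumptions made for the example, the grammar is an unlexicalized PCFG in Chomsky normal form, and the two operations are applied to the whole sentence separately rather than combined at a chunk boundary as the paper describes.

```python
from collections import defaultdict

# Hypothetical toy grammar in Chomsky normal form (not the paper's grammar).
BINARY = {  # (parent, left child, right child) -> rule probability
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {  # (category, word) -> rule probability
    ("NP", "she"): 0.3,
    ("NP", "stars"): 0.3,
    ("NP", "telescopes"): 0.2,
    ("V", "sees"): 1.0,
    ("P", "with"): 1.0,
}


def chart_prob(words, binary, lexical, combine, root="S"):
    """Fill a CKY chart bottom-up and return the score of `root` over the sentence.

    combine=sum gives the inside probability (total over all parses);
    combine=max gives the probability of the single best parse.
    """
    n = len(words)
    chart = {}  # (start, end, category) -> probability
    for i, w in enumerate(words):
        for (cat, word), p in lexical.items():
            if word == w:
                chart[i, i + 1, cat] = p
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            scores = defaultdict(list)
            for j in range(i + 1, k):  # every split point of the span
                for (parent, left, right), p in binary.items():
                    l = chart.get((i, j, left), 0.0)
                    r = chart.get((j, k, right), 0.0)
                    if l and r:
                        scores[parent].append(p * l * r)
            for parent, vals in scores.items():
                chart[i, k, parent] = combine(vals)
    return chart.get((0, n, root), 0.0)


sentence = "she sees stars with telescopes".split()
inside = chart_prob(sentence, BINARY, LEXICAL, sum)   # both PP attachments
best = chart_prob(sentence, BINARY, LEXICAL, max)     # best attachment only
print(f"inside={inside:.5f} best={best:.5f}")
```

The example sentence has two parses (the prepositional phrase attaches to the verb phrase or to the noun phrase), so the summed score is larger than the best-parse score. In a head-lexicalized version, each rule probability would additionally be conditioned on the head word of the parent category.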
Experimental results
Research questions
- RQ1: Can a head-lexicalized PCFG with EM-based parameter tuning effectively induce subcategorization frames from large, unannotated corpora?
- RQ2: How well does the model capture differences in valence frame usage across different text domains?
- RQ3: To what extent does the inclusion of lexicalized probabilities and word co-occurrence modeling improve frame induction accuracy?
- RQ4: Can the method scale to corpora of 10–100 million words while maintaining linguistic interpretability and computational feasibility?
- RQ5: Does the learned probabilistic frame distribution reflect real linguistic variation, as measured by entropy across domains?
Key findings
- The system achieves better precision and competitive recall compared to other published systems on standard evaluation metrics.
- Entropy measures confirm that frame usage varies significantly across domains, validating the need for domain-sensitive models.
- The model learns accurate probability distributions over subcategorization frames, reflecting actual frequencies in the training data.
- The method supports iterative training and can process approximately 1 million words per day on a single machine.
- The memory footprint for a 5M-word model is about 90MB, and average parsing speed is 10.4 words per second on a Sun Sparc-20.
- The approach enables robust parsing of 97% of sentences using a state-based extension, despite not modeling full clause-level structure.
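The entropy-based comparison of frame distributions can be illustrated with a short sketch. The verb, frame labels, domain names, and counts below are invented for the example and are not results from the paper; the helper names `normalize` and `entropy_bits` are likewise assumptions. The counts stand in for the (possibly fractional) expected frame frequencies that an EM pass would produce.

```python
import math


def normalize(counts):
    """Relative-frequency estimate: turn frame counts into a distribution."""
    total = sum(counts.values())
    return {frame: c / total for frame, c in counts.items()}


def entropy_bits(dist):
    """Shannon entropy of a frame distribution, in bits."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


# Hypothetical expected frame counts for one verb in two domains.
counts_by_domain = {
    "news": {"np": 60.0, "np-pp": 25.0, "intrans": 10.0, "s-comp": 5.0},
    "manuals": {"np": 95.0, "np-pp": 5.0},
}

for domain, counts in counts_by_domain.items():
    dist = normalize(counts)
    print(f"{domain}: H = {entropy_bits(dist):.3f} bits")
```

A verb whose uses concentrate on one frame has entropy near zero, while a verb that spreads its uses over several frames has higher entropy, so differing values across domains indicate differing frame usage.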
This review was created by AI and reviewed by human editors.