QUICK REVIEW

[Paper Review] Valence Induction with a Head-Lexicalized PCFG

Glenn R. Carroll, Mats Rooth | ArXiv.org | May 5, 1998

Natural Language Processing Techniques | 14 references | 122 citations

TL;DR

This paper proposes a head-lexicalized probabilistic context-free grammar (PCFG), trained with a modified inside-outside (EM) algorithm, to induce subcategorization frames (valences) for verbs and other content words from large corpora. By modeling head-driven syntactic structure and iteratively re-estimating rule and lexical probabilities from expected frequencies, the method acquires accurate, domain-sensitive valence distributions suitable for large-scale NLP applications.

ABSTRACT

This paper presents an experiment in learning valences (subcategorization frames) from a 50 million word text corpus, based on a lexicalized probabilistic context free grammar. Distributions are estimated using a modified EM algorithm. We evaluate the acquired lexicon both by comparison with a dictionary and by entropy measures. Results show that our model produces highly accurate frame distributions.

Motivation & Objective

To address the challenge of automatically acquiring subcategorization frames for large lexical resources.
To model valence patterns that vary across genres and domains, reflecting real linguistic variation.
To develop a scalable, linguistically interpretable method for learning probabilistic subcategorization frames from raw text.
To integrate word co-occurrence patterns (e.g., collocations) into syntactic structure for improved parsing and frame estimation.
To enable iterative, data-driven tuning of grammar parameters using the EM algorithm and inside-outside procedure.

Proposed method

The approach uses a head-lexicalized PCFG formalism where rules are annotated with head words, enabling lexicalized probability estimation.
A modified inside-outside algorithm estimates head-lexicalized rule and lexical choice frequencies from a corpus, using the EM algorithm for iterative parameter tuning.
The grammar employs phrasal-level complementation rules (e.g., vfp → vfc′ np) with head marking to project lexical heads up syntactic structure.
A state or n-gram rule system enables robust parsing of nearly all sentences (97%) by modeling transitions between phrasal categories as a finite-state machine.
The model computes sentence and tree probabilities via sum-max parsing: the inside algorithm sums the probabilities of all analyses within each chunk, while above the chunk level only the single highest-probability tree is selected.
Word selection is modeled via a head-conditional bigram model threaded through syntactic trees, capturing collocational tendencies.
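
The difference between the "sum" and "max" operations mentioned above can be illustrated with a minimal sketch. It assumes a toy, unlexicalized grammar in Chomsky normal form; the rules, probabilities, and the function name chart_probability are hypothetical and not taken from the paper, and the sketch applies one operation to the whole chart rather than mixing them at a chunk boundary as the paper's sum-max scheme does.

```python
from collections import defaultdict

# Hypothetical toy PCFG in Chomsky normal form (illustrative values only).
BINARY = {  # (parent, left child, right child) -> rule probability
    ("S", "NP", "VP"): 1.0,
    ("VP", "V", "NP"): 0.7,
    ("VP", "VP", "PP"): 0.3,
    ("NP", "NP", "PP"): 0.2,
    ("PP", "P", "NP"): 1.0,
}
LEXICAL = {  # (category, word) -> rule probability
    ("NP", "she"): 0.3,
    ("NP", "stars"): 0.3,
    ("NP", "telescopes"): 0.2,
    ("V", "saw"): 1.0,
    ("P", "with"): 1.0,
}


def chart_probability(words, root="S", use_max=False):
    """Fill a CKY chart bottom-up.

    With use_max=False each cell holds the inside probability (sum over
    all analyses); with use_max=True it holds the best single analysis.
    """
    n = len(words)
    chart = defaultdict(float)  # (start, end, category) -> probability
    for i, word in enumerate(words):
        for (cat, lex), p in LEXICAL.items():
            if lex == word:
                chart[(i, i + 1, cat)] = p
    for width in range(2, n + 1):
        for start in range(0, n - width + 1):
            end = start + width
            for split in range(start + 1, end):
                for (parent, left, right), p in BINARY.items():
                    score = (p * chart[(start, split, left)]
                             * chart[(split, end, right)])
                    key = (start, end, parent)
                    if use_max:
                        chart[key] = max(chart[key], score)
                    else:
                        chart[key] += score
    return chart[(0, n, root)]


sentence = "she saw stars with telescopes".split()
total = chart_probability(sentence)                # sums both attachments
best = chart_probability(sentence, use_max=True)   # keeps the better one
print(total, best)
```

The example sentence has two analyses (the prepositional phrase attaches to the verb phrase or to the noun phrase), so the summed value is larger than the best-tree value. A head-lexicalized version would additionally condition each rule probability on the head word of the parent category.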

Experimental results

Research questions

RQ1: Can a head-lexicalized PCFG with EM-based parameter tuning effectively induce subcategorization frames from large, unannotated corpora?
RQ2: How well does the model capture differences in valence frame usage across different text domains?
RQ3: To what extent does the inclusion of lexicalized probabilities and word co-occurrence modeling improve frame induction accuracy?
RQ4: Can the method scale to corpora of 10–100 million words while maintaining linguistic interpretability and computational feasibility?
RQ5: Does the learned probabilistic frame distribution reflect real linguistic variation, as measured by entropy across domains?

Key findings

The system achieves better precision and competitive recall compared to other published systems on standard evaluation metrics.
Entropy measures confirm that frame usage varies significantly across domains, validating the need for domain-sensitive models.
The model learns accurate probability distributions over subcategorization frames, reflecting actual frequencies in the training data.
The method supports iterative training and can process approximately 1 million words per day on a single machine.
The memory footprint for a 5M-word model is about 90MB, and average parsing speed is 10.4 words per second on a Sun Sparc-20.
The approach enables robust parsing of 97% of sentences using a state-based extension, despite not modeling full clause-level structure.
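
As a rough illustration of the entropy-based comparison mentioned in the findings, the following sketch computes the entropy of a frame distribution and the relative entropy between two domains. The frame labels, the probabilities, and the choice of relative entropy as the comparison measure are assumptions made for illustration; they are not figures or definitions reported in the paper.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a frame distribution {frame: probability}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def relative_entropy(p, q):
    """Relative entropy D(p || q) in bits; assumes q[f] > 0 wherever p[f] > 0."""
    return sum(pf * math.log2(pf / q[f]) for f, pf in p.items() if pf > 0)


# Hypothetical frame distributions for one verb in two domains.
news = {"np": 0.5, "np_pp": 0.25, "intrans": 0.25}
fiction = {"np": 0.25, "np_pp": 0.25, "intrans": 0.5}

print(entropy(news))                    # spread of frame usage within one domain
print(relative_entropy(news, fiction))  # divergence between the two domains
```

A relative entropy of zero would mean the verb uses its frames identically in both domains; larger values indicate stronger domain dependence.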

This review was created by AI and reviewed by human editors.