QUICK REVIEW

[Paper Review] Valence Induction with a Head-Lexicalized PCFG

Glenn R. Carroll, Mats Rooth | arXiv.org | May 5, 1998
Natural Language Processing Techniques | 14 references | 122 citations
TL;DR

This paper proposes a head-lexicalized probabilistic context-free grammar (PCFG), trained with an EM-based inside-outside procedure, to induce subcategorization frames (valences) for verbs and other content words from large corpora. By modeling head-driven syntactic structure and iteratively re-estimating rule and lexical-choice probabilities from expected frequencies, the method achieves accurate, domain-sensitive valence acquisition suitable for large-scale NLP applications.

ABSTRACT

This paper presents an experiment in learning valences (subcategorization frames) from a 50 million word text corpus, based on a lexicalized probabilistic context free grammar. Distributions are estimated using a modified EM algorithm. We evaluate the acquired lexicon both by comparison with a dictionary and by entropy measures. Results show that our model produces highly accurate frame distributions.

Motivation & Objective

  • To address the challenge of automatically acquiring subcategorization frames for large lexical resources.
  • To model valence patterns that vary across genres and domains, reflecting real linguistic variation.
  • To develop a scalable, linguistically interpretable method for learning probabilistic subcategorization frames from raw text.
  • To integrate word co-occurrence patterns (e.g., collocations) into syntactic structure for improved parsing and frame estimation.
  • To enable iterative, data-driven tuning of grammar parameters using the EM algorithm and inside-outside procedure.

Proposed method

  • The approach uses a head-lexicalized PCFG formalism where rules are annotated with head words, enabling lexicalized probability estimation.
  • A modified inside-outside algorithm estimates head-lexicalized rule and lexical choice frequencies from a corpus, using the EM algorithm for iterative parameter tuning.
  • The grammar employs phrasal-level complementation rules (e.g., vfp → vfc′ np) with head marking to project lexical heads up syntactic structure.
  • A state or n-gram rule system enables robust parsing of nearly all sentences (97%) by modeling transitions between phrasal categories as a finite-state machine.
  • The model computes sentence and tree probabilities via sum-max parsing: the inside algorithm sums probabilities within chunks, while above the chunk level only the highest-probability tree is selected.
  • Word selection is modeled via a head-conditional bigram model threaded through syntactic trees, capturing collocational tendencies.
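
The re-estimation loop described above can be illustrated with a much smaller stand-in. The sketch below is not the paper's modified inside-outside algorithm: it skips the grammar and parse forest entirely and assumes each corpus token has already been reduced to a head word plus the set of frames compatible with it. What it keeps is the EM pattern itself: split each token's count across its candidate analyses in proportion to the current probabilities (E-step), then renormalize the expected counts per head (M-step). The function name, frame labels, and toy observations are all hypothetical.

```python
from collections import defaultdict


def em_frame_distributions(observations, frames, iterations=20):
    """Estimate P(frame | head) from ambiguous observations with EM.

    observations: list of (head, candidate_frames) pairs, where
    candidate_frames lists every frame compatible with one corpus token.
    This is a simplification: the paper computes expected counts over
    parse trees, not over a flat list of candidate frames.
    """
    heads = {head for head, _ in observations}
    # Start from a uniform distribution over frames for every head.
    probs = {h: {f: 1.0 / len(frames) for f in frames} for h in heads}

    for _ in range(iterations):
        # E-step: split each token's unit count across its candidate
        # frames in proportion to the current probabilities.
        expected = {h: defaultdict(float) for h in heads}
        for head, candidates in observations:
            total = sum(probs[head][f] for f in candidates)
            for f in candidates:
                expected[head][f] += probs[head][f] / total

        # M-step: renormalize expected counts into probabilities.
        for h in heads:
            norm = sum(expected[h].values())
            probs[h] = {f: expected[h][f] / norm for f in frames}
    return probs


# Hypothetical frame labels and toy observations (not from the paper).
FRAMES = ["intrans", "np", "np np", "np pp"]
OBSERVATIONS = [
    ("give", ["np np"]),
    ("give", ["np np", "np pp"]),   # ambiguous token
    ("give", ["np pp"]),
    ("sleep", ["intrans"]),
    ("sleep", ["intrans", "np"]),   # ambiguous token
]

probs = em_frame_distributions(OBSERVATIONS, FRAMES)
for head in sorted(probs):
    print(head, {f: round(p, 3) for f, p in probs[head].items()})
```

In this toy setup the unambiguous "sleep" token pulls the ambiguous one toward the intransitive frame over successive iterations, which is the same effect the lexicalized model relies on at corpus scale.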

Experimental results

Research questions

  • RQ1: Can a head-lexicalized PCFG with EM-based parameter tuning effectively induce subcategorization frames from large, unannotated corpora?
  • RQ2: How well does the model capture differences in valence frame usage across different text domains?
  • RQ3: To what extent does the inclusion of lexicalized probabilities and word co-occurrence modeling improve frame induction accuracy?
  • RQ4: Can the method scale to corpora of 10–100 million words while maintaining linguistic interpretability and computational feasibility?
  • RQ5: Does the learned probabilistic frame distribution reflect real linguistic variation, as measured by entropy across domains?

Key findings

  • The system achieves better precision and competitive recall compared to other published systems on standard evaluation metrics.
  • Entropy measures confirm that frame usage varies significantly across domains, validating the need for domain-sensitive models.
  • The model learns accurate probability distributions over subcategorization frames, reflecting actual frequencies in the training data.
  • The method supports iterative training and can process approximately 1 million words per day on a single machine.
  • The memory footprint for a 5M-word model is about 90 MB, and average parsing speed is 10.4 words per second on a Sun Sparc-20.
  • The approach enables robust parsing of 97% of sentences using a state-based extension, despite not modeling full clause-level structure.
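
To make the entropy-based comparison concrete, the sketch below computes the Shannon entropy of a frame distribution and the relative entropy (KL divergence) between two distributions for the same verb. This review does not say which entropy measure the paper uses, so treating relative entropy as the cross-domain comparison is an assumption. The function names, domain labels, and all probabilities are invented for illustration and are not results from the paper.

```python
import math


def entropy(dist):
    """Shannon entropy in bits of a frame distribution {frame: prob}."""
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)


def relative_entropy(p, q, floor=1e-9):
    """KL divergence D(p || q) in bits.

    Frames missing from q are floored at a small constant so the ratio
    stays finite; this smoothing choice is an assumption of the sketch.
    """
    return sum(
        pf * math.log2(pf / max(q.get(frame, 0.0), floor))
        for frame, pf in p.items()
        if pf > 0
    )


# Invented frame distributions for one verb in two hypothetical domains.
newswire = {"np": 0.7, "np pp": 0.2, "intrans": 0.1}
fiction = {"np": 0.4, "np pp": 0.1, "intrans": 0.5}

print("entropy (newswire):", entropy(newswire))
print("entropy (fiction): ", entropy(fiction))
print("D(newswire || fiction):", relative_entropy(newswire, fiction))
```

A relative entropy near zero would mean the verb takes its frames at similar rates in both domains, and larger values indicate the kind of domain sensitivity the findings describe.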

This review was created by AI and reviewed by human editors.