QUICK REVIEW

[Paper Review] Discovery of Linguistic Relations Using Lexical Attraction

Deniz Yüret | arXiv.org | May 27, 1998

Bayesian Modeling and Causal Inference | References: 41 | Cited by: 95

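One-sentence summary

This paper introduces lexical attraction models, an information-theoretic probabilistic framework that represents linguistic relations (such as subject-verb or object-predicate) directly between words; by interleaving learning and processing, the system learns from raw text alone and reaches 60% precision and 50% recall on syntactic relations between content words, outperforming earlier unsupervised approaches, which failed to improve on raw input because they got stuck in local extrema and used inadequate representations.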

ABSTRACT

This work has been motivated by two long term goals: to understand how humans learn language and to build programs that can understand language. Using a representation that makes the relevant features explicit is a prerequisite for successful learning and understanding. Therefore, I chose to represent relations between individual words explicitly in my model. Lexical attraction is defined as the likelihood of such relations. I introduce a new class of probabilistic language models named lexical attraction models which can represent long distance relations between words and I formalize this new class of models using information theory. Within the framework of lexical attraction, I developed an unsupervised language acquisition program that learns to identify linguistic relations in a given sentence. The only explicitly represented linguistic knowledge in the program is lexical attraction. There is no initial grammar or lexicon built in and the only input is raw text. Learning and processing are interdigitated. The processor uses the regularities detected by the learner to impose structure on the input. This structure enables the learner to detect higher level regularities. Using this bootstrapping procedure, the program was trained on 100 million words of Associated Press material and was able to achieve 60% precision and 50% recall in finding relations between content-words. Using knowledge of lexical attraction, the program can identify the correct relations in syntactically ambiguous sentences such as "I saw the Statue of Liberty flying over New York."

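Research motivation and goals

Understand how humans learn language, and build programs that can understand language.
Develop a system that acquires linguistic structure from raw text without an initial grammar or lexicon.
Use information theory to formalize linguistic relations as probabilistic lexical attraction.
Show that explicitly representing relations between words enables bootstrapped learning and syntactic disambiguation.
Overcome the tendency of phrase-structure formalisms to trap learning in local extrema.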

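Proposed method

Lexical attraction is defined as the likelihood of a syntactic relation between two words and is formalized using information theory.
The system assumes a uniform distribution over admissible tree structures, so the focus is on learning word-level relations rather than parse probabilities.
Learning and processing are interleaved: the processor uses the regularities detected so far to impose structure on the input, and that structure in turn lets the learner discover higher-level patterns.
The model avoids premature generalization, which prevents irreversible errors and allows robust learning from raw text.
Word-level representations are used instead of part-of-speech tags, so the system can detect both common and idiosyncratic usage.
The system iteratively refines its lexical attraction estimates using the structural feedback supplied by the processor.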
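
As a rough illustration of the information-theoretic idea above, the following sketch scores word pairs by pointwise mutual information (PMI) estimated from counts. Using PMI over adjacent word pairs as the attraction score is an assumption made here for illustration; this summary only states that the measure is information-theoretic. The toy corpus and the names `pair_counts` and `attraction` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter

def pair_counts(sentences):
    """Count single words and adjacent word pairs in tokenized sentences."""
    words, pairs = Counter(), Counter()
    for sent in sentences:
        words.update(sent)
        pairs.update(zip(sent, sent[1:]))
    return words, pairs

def attraction(x, y, words, pairs):
    """Assumed score: PMI in bits, log2 of P(x, y) / (P(x) * P(y)).

    Returns -inf for a pair that never occurs adjacently.
    """
    if pairs[(x, y)] == 0:
        return float("-inf")
    n_words = sum(words.values())
    n_pairs = sum(pairs.values())
    p_xy = pairs[(x, y)] / n_pairs
    p_x = words[x] / n_words
    p_y = words[y] / n_words
    return math.log2(p_xy / (p_x * p_y))

# Toy corpus (made up for this example).
corpus = [
    ["new", "york", "is", "big"],
    ["new", "york", "is", "far"],
    ["the", "city", "is", "big"],
]
words, pairs = pair_counts(corpus)
print(attraction("new", "york", words, pairs))
print(attraction("is", "big", words, pairs))
```

In this toy data "new" and "york" always occur together, so their score is higher than that of "is" and "big", where "is" also combines with other words. A system like the one summarized here would re-estimate such scores as the processor proposes links between non-adjacent words, which this sketch does not attempt.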

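Experimental results

Research questions

RQ1: Can linguistic relations be learned directly from raw text without an initial grammar or part-of-speech tags?
RQ2: How can lexical attraction be formalized with information theory so that it represents long-distance relations between words?
RQ3: Can interleaved learning and processing bootstrap the acquisition of syntactic structure?
RQ4: Why did earlier unsupervised parsing methods perform poorly on raw text, and how does the choice of representation mitigate the problem?
RQ5: Can the system resolve syntactic ambiguity using knowledge of lexical attraction alone?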

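Key findings

After training on 100 million words of raw Associated Press text, the system reached 60% precision and 50% recall in identifying relations between content words.
Unlike earlier unsupervised parsers, the model shows measurable improvement on raw text and avoids the stagnation common in prior work.
Using word-level representations instead of part-of-speech tags lets the system detect both common and idiosyncratic usage.
Because it does not generalize prematurely, the system does not get stuck in irreversible local extrema.
Using lexical attraction, the model resolves the syntactic ambiguity in sentences such as "I saw the Statue of Liberty flying over New York."
The framework shows that explicitly representing relations between words simplifies learning and supports bootstrapped acquisition.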
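
To make the precision and recall figures concrete, here is a minimal sketch of how such scores can be computed by comparing predicted links against a reference set. The evaluation protocol of the paper is not described in this summary, so treating links as undirected pairs of word positions is an assumption, and the link sets below are invented so that the numbers come out to 0.6 and 0.5. The name `precision_recall` is hypothetical.

```python
def precision_recall(predicted, gold):
    """Score predicted links against gold links, ignoring link direction."""
    pred = {frozenset(link) for link in predicted}
    ref = {frozenset(link) for link in gold}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    return precision, recall

# Invented example: links are pairs of word positions in one sentence.
predicted = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
gold = [(1, 0), (1, 2), (3, 2), (0, 5), (1, 4), (2, 5)]

precision, recall = precision_recall(predicted, gold)
print(precision, recall)  # 0.6 0.5
```

Three of the five predicted links appear in the reference set (precision 3/5), and those three cover half of the six reference links (recall 3/6).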


This review was generated by AI and checked by a human editor.