QUICK REVIEW

[Paper Review] A Divide-and-Conquer Strategy for Parsing

Li-Shiuan Peh, Christopher Ting Hian Ann|ArXiv.org|Jul 16, 1996

Natural Language Processing Techniques | References: 5 | Cited by: 23

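One-Sentence Summary

This paper proposes a divide-and-conquer strategy that improves parsing accuracy by simplifying complex sentences before they are parsed. The method disambiguates link words (such as coordinating conjunctions and punctuation), segments the sentence into sub-sentences and noun phrases, parses each piece separately, and then integrates the results. Applied to a dependency parser, it reduced parsing errors by 21.2% on the IPSM'95 data sets.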

ABSTRACT

In this paper, we propose a novel strategy which is designed to enhance the accuracy of the parser by simplifying complex sentences before parsing. This approach involves the separate parsing of the constituent sub-sentences within a complex sentence. To achieve that, the divide-and-conquer strategy first disambiguates the roles of the link words in the sentence and segments the sentence based on these roles. The separate parse trees of the segmented sub-sentences and the noun phrases within them are then synthesized to form the final parse. To evaluate the effects of this strategy on parsing, we compare the original performance of a dependency parser with the performance when it is enhanced with the divide-and-conquer strategy. When tested on 600 sentences of the IPSM'95 data sets, the enhanced parser saw a considerable error reduction of 21.2% in its accuracy.

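Motivation and Objectives

To address the drop in parsing accuracy that comes with increasing sentence length and complexity.
To reduce parsing complexity by simplifying long, complex sentences before parsing.
To improve the accuracy of a dependency parser through a modular, input-level preprocessing strategy.
To evaluate the effectiveness of a segmentation strategy based on link-word disambiguation and noun-phrase parsing.
To show that parsing accuracy can be improved without modifying the underlying parsing algorithm.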

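Proposed Method

Disambiguate the syntactic roles of the link words in the sentence (coordinating conjunctions, prepositions, punctuation).
Segment the sentence into sub-sentences and extract noun phrases, based on the disambiguated link words.
Parse each sub-sentence and noun phrase independently with the base dependency parser.
Synthesize the separate parse trees by attaching the link words and integrating the subtree structures.
Use a rule-based synthesis engine to combine the subtree results into a complete final parse tree.
Apply the strategy to a dependency parser; by modifying the synthesis step, it can be adapted to constituency parsers.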
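
The pipeline above (disambiguate, segment, parse separately, synthesize) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The link-word inventory `LINK_WORDS`, the verb lexicon `VERBS` (standing in for a part-of-speech tagger), the "verb on both sides" disambiguation rule, the toy base parser, and the attachment rules in `parse` are all hypothetical simplifications chosen for illustration. Noun-phrase extraction is omitted.

```python
# Hypothetical link-word inventory and verb lexicon (assumptions, not from the paper).
LINK_WORDS = {"and", "or", "but", "if", ","}
VERBS = {"opens", "closes", "saves", "click", "appears"}


def link_role(tokens, i):
    """Label the link word at index i as 'clause' or 'phrase'.

    Assumed rule: it is a clause-level link only if the spans on both sides
    (bounded by the nearest other link word) each contain a verb.
    """
    left = []
    for tok in reversed(tokens[:i]):
        if tok in LINK_WORDS:
            break
        left.append(tok)
    right = []
    for tok in tokens[i + 1:]:
        if tok in LINK_WORDS:
            break
        right.append(tok)
    has_verb = lambda span: any(t in VERBS for t in span)
    return "clause" if has_verb(left) and has_verb(right) else "phrase"


def segment(tokens):
    """Split token indices into sub-sentences at clause-level link words."""
    segments, links, current = [], [], []
    for i, tok in enumerate(tokens):
        if tok in LINK_WORDS and link_role(tokens, i) == "clause":
            segments.append(current)
            links.append(i)
            current = []
        else:
            current.append(i)
    segments.append(current)
    return segments, links


def parse_segment(tokens, indices):
    """Toy base parser: the first verb heads the segment, all else attaches to it."""
    head = next((i for i in indices if tokens[i] in VERBS), indices[0])
    return head, {i: head for i in indices if i != head}


def parse(tokens):
    """Parse each segment separately, then synthesize one head map (-1 = root)."""
    segments, links = segment(tokens)
    heads, seg_heads = {}, []
    for seg in segments:
        head, arcs = parse_segment(tokens, seg)
        heads.update(arcs)
        seg_heads.append(head)
    heads[seg_heads[0]] = -1  # head of the first sub-sentence is the root
    # Assumed synthesis rule: each link word attaches to the head of the
    # sub-sentence it introduces, which attaches to the previous head.
    for link, prev_head, next_head in zip(links, seg_heads, seg_heads[1:]):
        heads[link] = next_head
        heads[next_head] = prev_head
    return heads


tokens = "the user opens the file and the system saves the report and the log".split()
segments, links = segment(tokens)
heads = parse(tokens)
```

In this example the first "and" joins two verb-bearing spans and becomes a split point, while the second "and" only coordinates noun phrases and stays inside its sub-sentence. Each sub-sentence is then parsed on its own, so no word in the second sub-sentence can be attached to a head in the first one by the base parser.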

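Experimental Results

Research Questions

RQ1: Can parsing accuracy be improved by simplifying complex input sentences before parsing?
RQ2: How effective is link-word disambiguation at producing accurate sentence segmentation?
RQ3: To what extent does parsing sub-sentences separately reduce parsing errors compared with parsing the whole sentence at once?
RQ4: How does part-of-speech tagger performance affect the disambiguation and segmentation stages?
RQ5: Can the divide-and-conquer strategy generalize across different parser architectures?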

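Key Findings

The divide-and-conquer strategy reduced parsing errors by 21.2% on the IPSM'95 test set, raising word-level accuracy from 81.1% to 85.1%.
By limiting the number of potential heads for each word within the shorter sub-sentences, the strategy significantly reduced statistical perplexity.
Errors in link-word role disambiguation (for example, misclassifying 'or' as a clausal conjunction) propagate directly into segmentation errors and final parse errors.
In complex sentences, the original parser attached 'if' and 'and' to the wrong heads, whereas the enhanced parser correctly identified their syntactic roles and segmented the sentences accordingly.
The method performed well on noun-phrase parsing (97.0% exact match) and link-word disambiguation (93.3% to 96.8% accuracy), which supports its reliability.
Despite the small training corpus (1,812 sentences), the method delivered consistent, measurable improvements across several data sets (Dynix, Lotus, Trados).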
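
The 21.2% figure is a relative reduction in error rate, not an absolute gain in accuracy. Assuming the error rate is one minus the word-level accuracy, the two reported accuracies (81.1% and 85.1%) are consistent with it:

```latex
\[
\text{error reduction}
  = \frac{(1 - 0.811) - (1 - 0.851)}{1 - 0.811}
  = \frac{0.189 - 0.149}{0.189}
  = \frac{0.040}{0.189}
  \approx 0.212
\]
```

The absolute accuracy gain is 4.0 percentage points; expressed relative to the original 18.9% error rate, that is roughly 21.2%.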

This review was generated by AI and reviewed by human editors.