QUICK REVIEW

[Paper Review] Applying Reliability Metrics to Co-Reference Annotation

Rebecca J. Passonneau|arXiv (Cornell University)|Jan 1, 1997

Natural Language Processing Techniques | References: 6 | Cited by: 18

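One-Sentence Summary

This paper proposes Cohen's Kappa as a more dependable measure of the reliability of coreference annotation. It uses the recall and precision scoring scheme of Vilain and collaborators to build contingency tables, and shows that recall and precision can yield misleadingly high scores because they do not discount chance agreement. Its key contribution is establishing Kappa as the preferable measure when no gold-standard annotation exists.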

ABSTRACT

Studies of the contextual and linguistic factors that constrain discourse phenomena such as reference are coming to depend increasingly on annotated language corpora. In preparing the corpora, it is important to evaluate the reliability of the annotation, but methods for doing so have not been readily available. In this report, I present a method for computing reliability of coreference annotation. First I review a method for applying the information retrieval metrics of recall and precision to coreference annotation proposed by Marc Vilain and his collaborators. I show how this method makes it possible to construct contingency tables for computing Cohen's Kappa, a familiar reliability metric. By comparing recall and precision to reliability on the same data sets, I also show that recall and precision can be misleadingly high. Because Kappa factors out chance agreement among coders, it is a preferable measure for developing annotated corpora where no pre-existing target annotation exists.

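Research Motivation and Goals

Address the lack of readily available methods for evaluating the quality of coreference annotation in linguistic corpora.
Assess whether traditional information retrieval metrics such as recall and precision are adequate for measuring annotation reliability.
Show that chance agreement can inflate recall and precision for coreference annotation, making them misleading when no gold standard is available.
Establish Cohen's Kappa as a more suitable reliability measure for coreference annotation when no pre-existing target annotation exists.
Provide a practical method for constructing contingency tables from coreference annotation so that Kappa can be computed.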

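Proposed Method

Reviews the method proposed by Marc Vilain and collaborators for applying recall and precision to coreference annotation, which is defined in terms of correct and predicted coreference links.
Builds contingency tables (true positives, false positives, false negatives, true negatives) from the cases where the annotators agree and disagree.
Computes Cohen's Kappa from the contingency tables; this statistic corrects for chance agreement between annotators.
Applies Kappa to real coreference annotation data so that reliability can be compared with recall and precision.
Validates the approach by comparing the Kappa values with recall and precision on the same data sets.
Shows that Kappa gives a more accurate assessment of inter-annotator agreement than recall and precision alone.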
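
The link-based scoring that the method starts from can be sketched as follows. This is a minimal illustration of the recall formula usually attributed to Vilain and collaborators, in which each key chain S contributes |S| minus the number of pieces the response splits it into, divided by |S| minus 1. The function names and the chain representation (a list of sets of mention IDs) are assumptions made for this sketch, not details given in this review. With no gold standard, one coder's chains would be treated as the key and the other's as the response.

```python
from typing import Hashable, Iterable, List, Set


def link_recall(key: Iterable[Set[Hashable]],
                response: Iterable[Set[Hashable]]) -> float:
    """Link-based recall of `response` against `key`.

    Both arguments are collections of coreference chains; each chain is a
    set of mention identifiers. Response chains are assumed to be disjoint.
    """
    response_chains: List[Set[Hashable]] = [set(chain) for chain in response]
    found_links = 0  # key links that the response also recovers
    total_links = 0  # minimal number of links needed to build the key chains
    for chain in key:
        chain = set(chain)
        covered: Set[Hashable] = set()
        parts = 0
        # Each response chain that overlaps the key chain forms one part.
        for r in response_chains:
            overlap = chain & r
            if overlap:
                parts += 1
                covered |= overlap
        # Mentions missing from the response count as singleton parts.
        parts += len(chain - covered)
        found_links += len(chain) - parts
        total_links += len(chain) - 1
    return found_links / total_links if total_links else 0.0


def link_precision(key, response) -> float:
    """Precision is recall with the roles of key and response swapped."""
    return link_recall(response, key)


# Hypothetical example: the key has one chain, the response splits it in two.
key = [{"A", "B", "C", "D"}]
response = [{"A", "B"}, {"C", "D"}]
recall = link_recall(key, response)        # 2 of the 3 key links are found
precision = link_precision(key, response)  # both response links are correct
```

In this example the response misses one of the three links needed to connect the key chain, so recall is 2/3, while every link the response does assert is consistent with the key, so precision is 1.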

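Experimental Results

Research Questions

RQ1: Can recall and precision serve as reliable indicators of coreference annotation quality?
RQ2: To what extent does chance agreement inflate recall and precision for coreference annotation?
RQ3: Is Cohen's Kappa better suited than recall and precision as a reliability measure for coreference annotation?
RQ4: How can contingency tables be constructed from coreference annotation in order to compute Kappa?
RQ5: Does Kappa provide a more accurate reliability assessment when no gold-standard annotation exists?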

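Key Findings

Because of chance agreement between annotators, recall and precision can be misleadingly high even when annotation quality is low.
Cohen's Kappa corrects for chance agreement and so gives a more accurate measure of inter-annotator agreement.
The method builds contingency tables from coreference annotation, which makes it possible to compute Kappa without a gold-standard annotation.
Kappa proves to be the preferable reliability measure in settings where no pre-existing target annotation exists.
Compared with recall and precision alone, Kappa yields a more conservative and more trustworthy assessment of coreference annotation quality.
The proposed method offers a practical framework for evaluating annotation reliability in discourse annotation projects.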
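
The gap between recall, precision, and Kappa can be illustrated with a small calculation. The sketch below assumes a 2x2 table of two coders' link decisions (both mark a link, only coder A, only coder B, neither) and treats coder A as the reference when computing recall and precision. The counts are hypothetical and are not taken from the paper; the function name and cell layout are likewise assumptions for illustration.

```python
def agreement_scores(both: int, only_a: int, only_b: int, neither: int) -> dict:
    """Recall, precision, and Cohen's Kappa from a 2x2 table of link decisions.

    Assumes every marginal is non-zero, so no division-by-zero guard is needed.
    """
    n = both + only_a + only_b + neither
    # Coder A is treated as the reference ("key"), coder B as the response.
    recall = both / (both + only_a)
    precision = both / (both + only_b)
    # Observed agreement: proportion of cases where the coders decide alike.
    observed = (both + neither) / n
    # Chance agreement from each coder's marginal rate of marking a link.
    a_yes = (both + only_a) / n
    b_yes = (both + only_b) / n
    expected = a_yes * b_yes + (1 - a_yes) * (1 - b_yes)
    kappa = (observed - expected) / (1 - expected)
    return {"recall": recall, "precision": precision, "kappa": kappa}


# Hypothetical counts: both coders mark a link in most cases.
scores = agreement_scores(both=85, only_a=5, only_b=5, neither=5)
print(scores)
```

With these counts recall and precision both come out at about 0.94, while Kappa is about 0.44. Because each coder marks a link in 90% of cases, chance alone would produce 82% agreement, and Kappa credits only the agreement beyond that level.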

This review was generated by AI and reviewed by human editors.