QUICK REVIEW

[Paper Review] A Multi-Axis Annotation Scheme for Event Temporal Relations

Ning Qiang, Hao Wu | arXiv (Cornell University) | Apr 20, 2018
Semantic Web and Ontologies | 24 references | 71 citations
TL;DR

This paper introduces a multi-axis annotation scheme for event temporal relations that compares events by their start-points rather than their end-points. It reports improved inter-annotator agreement and presents MATRES, a crowdsourced dataset of start-point relations.

ABSTRACT

Existing temporal relation (TempRel) annotation schemes often have low inter-annotator agreements (IAA) even between experts, suggesting that the current annotation task needs a better definition. This paper proposes a new multi-axis modeling to better capture the temporal structure of events. In addition, we identify that event end-points are a major source of confusion in annotation, so we also propose to annotate TempRels based on start-points only. A pilot expert annotation using the proposed scheme shows significant improvement in IAA from the conventional 60's to 80's (Cohen's Kappa). This better-defined annotation scheme further enables the use of crowdsourcing to alleviate the labor intensity for each annotator. We hope that this work can foster more interesting studies towards event understanding.

Motivation & Objective

  • Address the low inter-annotator agreement in TempRel annotation schemes by redefining the task on multiple semantic axes.
  • Improve reliability by focusing on start-points rather than end-points due to end-point ambiguity.
  • Enable scalable data collection through crowdsourcing with robust quality control.
  • Provide a new dataset, MATRES, that supports start-point based TempRel annotation, and compare it with TB-Dense.
  • Offer a baseline system to show improved task definability and potential gains for TempRel extraction.

Proposed method

  • Propose a multi-axis modeling where events are anchored to semantic axes (main axis and orthogonal axes) and only same-axis pairs are compared.
  • Adopt interval splitting to decompose time interval relations into comparisons between the intervals' start- and end-points (start-start, start-end, end-start, end-end) to maintain expressivity while simplifying labeling.
  • Focus annotation on start-points (t_start) and treat end-point comparisons as not needed for the current task, citing higher ambiguity and lower reliability.
  • Implement a two-step crowdsourcing workflow: anchorability annotation to determine if an event is anchorable on a given axis, followed by relation annotation between anchorable events.
  • Introduce quality control for crowdsourcing (gold questions, eligibility tests, majority-vote aggregation) and a procedure to handle vague relations via a structured Q1/Q2 ambiguity check for start-point orders.
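
The same-axis restriction, the two-question check, and majority-vote aggregation described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the wording of Q1 and Q2, the mapping from answers to labels, the tie-breaking rule, the axis names, and all function names are hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def same_axis_pairs(events):
    """Return pairs of event ids that share an axis.

    `events` is a list of (event_id, axis) tuples; the axis names are
    placeholders chosen for this example.
    """
    return [
        (id_a, id_b)
        for (id_a, axis_a), (id_b, axis_b) in combinations(events, 2)
        if axis_a == axis_b
    ]


def label_from_answers(q1_possible, q2_possible):
    """Map two yes/no answers about start-point order to a label.

    Assumed question wording (hypothetical):
      Q1: Is it possible that event 1 starts before event 2?
      Q2: Is it possible that event 2 starts before event 1?
    """
    if q1_possible and q2_possible:
        return "vague"   # both orders are possible
    if q1_possible:
        return "before"  # only "event 1 first" is possible
    if q2_possible:
        return "after"   # only "event 2 first" is possible
    return "equal"       # neither order is possible


def majority_label(labels):
    """Aggregate worker labels by majority vote.

    Ties fall back to "vague"; that tie-breaking rule is an assumption.
    """
    if not labels:
        raise ValueError("need at least one label")
    ranked = Counter(labels).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "vague"
    return ranked[0][0]
```

For example, `label_from_answers(True, False)` yields `"before"`, and three workers answering `["before", "before", "after"]` aggregate to `"before"`.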

Experimental results

Research questions

  • RQ1: Can a multi-axis annotation scheme improve IAA for TempRel annotation compared to traditional single-axis schemes?
  • RQ2: Does focusing on start-points reduce cognitive load and annotation errors, enabling reliable crowdsourced TempRel datasets?
  • RQ3: How does the MATRES dataset compare to TB-Dense in terms of annotation quality and cross-dataset agreement?
  • RQ4: What is the impact of the new annotation scheme on TempRel extraction performance using a baseline classifier?

Key findings

Label      Training P   Training R   Training F1   Testing P   Testing R   Testing F1
Before     .74          .91          .82           .71         .80         .75
After      .73          .77          .75           .55         .64         .59
Equal      1            .05          .09           -           -           -
Vague      .75          .28          .41           .29         .13         .18
Overall    .73          .81          .77           .66         .72         .69
Original   .44          .67          .53           .40         .60         .48

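
The F1 columns in the table can be checked against the precision and recall columns, assuming F1 here is the usual harmonic mean of the two. The sketch below uses the training values reported in the table; the helper name `f1` and the dictionary layout are choices made for this example only.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Training-column values as reported: (P, R, reported F1).
reported = {
    "Before": (0.74, 0.91, 0.82),
    "After": (0.73, 0.77, 0.75),
    "Vague": (0.75, 0.28, 0.41),
    "Overall": (0.73, 0.81, 0.77),
    "Original": (0.44, 0.67, 0.53),
}

for label, (p, r, f1_reported) in reported.items():
    # Small gaps are expected because P and R are already rounded.
    print(f"{label}: computed {f1(p, r):.3f}, reported {f1_reported:.2f}")
```

Each recomputed value agrees with the reported one to within the rounding of the inputs.
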
  • Pilot expert annotation yielded an IAA of 0.84 (Cohen’s Kappa) on the main axis, significantly higher than prior ~0.60 IAA.
  • Crowdsourcing with quality controls produced reliable annotations: anchorability and relation steps achieved high accuracy on gold data and strong worker agreement.
  • MATRES is annotated with clearer task definitions and more reliable labels than earlier schemes, using start-point based labeling and separate orthogonal axes.
  • Compared to TB-Dense, MATRES provides better start-point alignment and shows reasonable agreement with the gold standard and crowd consensus.
  • A baseline averaged perceptron system on MATRES achieves an overall F1 of .77 (training) and .69 (testing), against .53 and .48 in the table's "Original" row, suggesting the task is well-defined and learnable under the new scheme.
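
For readers unfamiliar with the agreement statistic cited above, Cohen's Kappa for two annotators can be computed as in the following sketch. The function name and the toy label lists are illustrative assumptions; they are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[lab] * freq_b[lab] for lab in freq_a) / (n * n)
    if expected == 1:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy example (made-up labels): agreement on 3 of 4 items.
annotator_1 = ["before", "before", "after", "vague"]
annotator_2 = ["before", "after", "after", "vague"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

Kappa discounts the agreement expected by chance, so it is lower than raw percent agreement whenever the annotators' label distributions overlap.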

This review was created by AI and reviewed by human editors.