QUICK REVIEW

[Paper Digest] The CL-SciSumm Shared Task 2018: Results and Key Insights

Kokil Jaidka, Michihiro Yasunaga|arXiv (Cornell University)|Sep 2, 2019

Biomedical Text Mining and Ontologies|References: 9|Cited by: 37

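One-Sentence Summary

This paper reports the official results of the CL-SciSumm 2018 Shared Task, a medium-scale benchmark for scientific document summarization in the computational linguistics domain. The benchmark comprises 60 annotated sets of reference papers (RPs) and citing papers (CPs) with three types of summaries, and systems are evaluated on Task 1A, Task 1B, and an optional Task 2.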

ABSTRACT

This overview describes the official results of the CL-SciSumm Shared Task 2018 -- the first medium-scale shared task on scientific document summarization in the computational linguistics (CL) domain. This year, the dataset comprised 60 annotated sets of citing and reference papers from the open access research papers in the CL domain. The Shared Task was organized as a part of the 41st Annual Conference of the Special Interest Group in Information Retrieval (SIGIR), held in Ann Arbor, USA in July 2018. We compare the participating systems in terms of two evaluation metrics. The annotated dataset and evaluation scripts can be accessed and used by the community from: https://github.com/WING-NUS/scisumm-corpus.

Motivation and Objectives

To evaluate automatic summarization of scientific papers in CL via citance-based linking and facet labeling.
To compare systems using sentence overlap, ROUGE, and facet classification metrics.
To generate structured, concise summaries from cited text spans and analyze cross-system performance.
To expand resources and evaluation tools for scholarly summarization in the CL domain.

Proposed Method

Participants built systems for Task 1A (identifying the text spans in the reference paper that each citance refers to) and Task 1B (classifying the discourse facet of each cited text span) using a mix of lexical, statistical, and neural approaches.
Task 2 (structured RP summary) was optional and evaluated against abstract, community, and human summaries using ROUGE measures.
Task 1A was evaluated with sentence-overlap F1 and ROUGE-2/ROUGE-SU4 between system and gold cited text spans; Task 1B was evaluated with multi-label precision/recall/F1 over facet labels. Scores were averaged over three annotation sets.
The corpus comprises training data from ACL Anthology and a test set from ACL Anthology Network, with three independent annotations per RP/CP and per summary.
Optional Task 2 summary generation was constrained to 250 words and evaluated against multiple gold standards.
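
The sentence-overlap measure for Task 1A can be illustrated with a minimal sketch. It assumes cited text spans are represented as sets of reference-paper sentence IDs and that per-annotator F1 scores are averaged with a plain mean; the function names `overlap_prf` and `mean_f1` and the example IDs are hypothetical, and the official evaluation scripts in the scisumm-corpus repository may aggregate differently.

```python
from typing import Iterable, Sequence


def overlap_prf(predicted: Iterable[int], gold: Iterable[int]) -> tuple[float, float, float]:
    """Precision, recall, and F1 over sets of reference-paper sentence IDs."""
    pred_set, gold_set = set(predicted), set(gold)
    true_pos = len(pred_set & gold_set)  # sentences both the system and the annotator selected
    precision = true_pos / len(pred_set) if pred_set else 0.0
    recall = true_pos / len(gold_set) if gold_set else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def mean_f1(predicted: Iterable[int], gold_annotations: Sequence[Iterable[int]]) -> float:
    """Average F1 of one prediction against several annotators (assumed: simple mean)."""
    pred = list(predicted)
    scores = [overlap_prf(pred, gold)[2] for gold in gold_annotations]
    return sum(scores) / len(scores)


# Hypothetical citance: the system picks sentences 12, 13, and 40;
# three annotators picked slightly different spans.
system_span = [12, 13, 40]
annotator_spans = [[12, 13], [13, 14], [40]]
print(round(mean_f1(system_span, annotator_spans), 4))
```

Treating spans as sets of sentence IDs makes the score insensitive to sentence order and to duplicate predictions, which is why ROUGE is reported alongside it to credit near-miss sentences with similar wording.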

Experimental Results

Research Questions

RQ1: What is the best approach to map citances to their referenced text spans in papers (Task 1A)?
RQ2: How accurately can systems assign discourse facets to cited text spans (Task 1B)?
RQ3: Can a structured, short summary of a reference paper be generated from cited text spans (Task 2), and how do such summaries score under ROUGE against human-written references?
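
To make the ROUGE comparison in RQ3 concrete, the following is a simplified sketch of ROUGE-2 recall, together with the 250-word cap on Task 2 summaries. It assumes whitespace tokenization and lowercasing only; the official ROUGE toolkit applies further options (such as stemming), so its scores will differ. The helper names and the example sentences are hypothetical.

```python
from collections import Counter


def truncate_words(text: str, limit: int = 250) -> str:
    """Keep at most `limit` whitespace-separated words (Task 2 length cap)."""
    return " ".join(text.split()[:limit])


def bigrams(text: str) -> Counter:
    """Multiset of adjacent token pairs, lowercased."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2_recall(candidate: str, reference: str) -> float:
    """Fraction of reference bigrams also found in the candidate (clipped counts)."""
    cand, ref = bigrams(candidate), bigrams(reference)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum((cand & ref).values())  # Counter intersection clips to the minimum count
    return overlap / total


# Hypothetical system summary scored against one gold summary.
system_summary = truncate_words("The parser uses a chart")
gold_summary = "the parser uses a beam"
print(rouge2_recall(system_summary, gold_summary))
```

In the shared task setting, the same candidate would be scored separately against each gold summary type (abstract, community, and human), which is why a system can rank first against one type and not another.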

Key Findings

NUDT and CIST achieved top performance on Task 1A under both the sentence-overlap and ROUGE-based metrics.
Klick Labs led the ROUGE-2 based evaluation in Task 1A across certain configurations.
Task 1B results were led by CIST and NJUST across multiple runs, with Klick Labs as a notable runner-up.
For Task 2, TALN-UPF performed best against abstracts and human summaries, while NLP-NITMZ excelled against community summaries.
Overall, the results suggest that lexical and similarity-based features remain strong, and that deep learning approaches may gain from domain-specific embeddings.

This digest was generated by AI and reviewed by a human editor.