QUICK REVIEW

[Paper Summary] Human Label Variation in Implicit Discourse Relation Recognition

Frances Yung, Daniil Ignatev | arXiv (Cornell University) | Feb 26, 2026

ONE-SENTENCE SUMMARY

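Key takeaway: The study compares disagreement-aware, distribution-based, and perspectivist models for English implicit discourse relation recognition on the DiscoGeM dataset. Models trained on label distributions yield more stable predictions, whereas annotator-specific models perform poorly at the fine-grained level because annotators interpret cognitively demanding cases inconsistently.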

ABSTRACT

There is growing recognition that many NLP tasks lack a single ground truth, as human judgments reflect diverse perspectives. To capture this variation, models have been developed to predict full annotation distributions rather than majority labels, while perspectivist models aim to reproduce the interpretations of individual annotators. In this work, we compare these approaches on Implicit Discourse Relation Recognition (IDRR), a highly ambiguous task where disagreement often arises from cognitive complexity rather than ideological bias. Our experiments show that existing annotator-specific models perform poorly in IDRR unless ambiguity is reduced, whereas models trained on label distributions yield more stable predictions. Further analysis indicates that frequent cognitively demanding cases drive inconsistency in human interpretation, posing challenges for perspectivist modeling in IDRR.

MOTIVATION AND OBJECTIVES

Motivate the need to handle disagreement in implicit discourse relation recognition (IDRR) rather than relying on a single ground truth.
Evaluate how different modeling paradigms (disagreement-aware, distribution-focused, and perspectivist) perform on IDRR tasks.
Assess performance across different label granularities (Level-1 with 5 classes and Level-2 with 17 classes).
Analyze sources of annotator disagreement and cognitive demands that affect model learning in IDRR.

PROPOSED METHOD

Utilize the DiscoGeM dataset with multi-annotator labels for English implicit DRs.
Train and compare three task setups: single-label prediction, label-distribution prediction, and annotator-specific label prediction.
Implement RoBERTa-base as the shared backbone across architectures.
Evaluate three soft-label approaches: multi-label prediction with binary cross-entropy (BCE), label-distribution prediction with a KL-divergence loss, and ST-based baselines.
Compare perspectivist models (MT and AE) against non-perspectivist baselines and majority-vote baselines.
Report results using macro-F1, accuracy, cross-entropy, Jensen–Shannon divergence, Manhattan distance, and Euclidean distance metrics.
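
The soft-label targets and the distribution metrics listed above can be illustrated with a minimal sketch in plain Python. The label names, the vote list, and the predicted distribution are hypothetical, and the paper's own preprocessing and metric implementations may differ (for example in logarithm base or smoothing).

```python
import math
from collections import Counter

# Hypothetical coarse label inventory; the real DiscoGeM label set may differ.
LABELS = ["temporal", "contingency", "comparison", "expansion"]

def label_distribution(votes, labels=LABELS):
    """Turn a list of annotator votes into a probability distribution."""
    counts = Counter(votes)
    return [counts[label] / len(votes) for label in labels]

def majority_label(votes):
    """Most frequent label; ties go to the label that was seen first."""
    return Counter(votes).most_common(1)[0][0]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; eps guards against division by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy of prediction q against target p, in nats."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, bounded by ln(2) in nats."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

def manhattan(p, q):
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def euclidean(p, q):
    return math.sqrt(sum((pi - qi) ** 2 for pi, qi in zip(p, q)))

# Ten hypothetical annotators labelling one implicit relation.
votes = ["contingency"] * 6 + ["expansion"] * 3 + ["temporal"]
target = label_distribution(votes)   # soft-label training target
predicted = [0.2, 0.5, 0.1, 0.2]     # hypothetical model output

print("majority:", majority_label(votes))
print("KL:", kl_divergence(target, predicted))
print("JSD:", js_divergence(target, predicted))
print("Manhattan:", manhattan(target, predicted))
print("Euclidean:", euclidean(target, predicted))
```

A single-label setup would train only on the majority label, while a distribution setup would minimize a divergence such as KL between `target` and the model output, so the minority votes still contribute to the training signal.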

EXPERIMENTAL RESULTS

Research questions

RQ1: How do disagreement-aware, distribution-based, and perspectivist models perform on IDRR across coarse (Level-1) and fine (Level-2) granularities?
RQ2: Does learning from label distributions or annotator-specific predictions yield more robust IDRR performance, especially under high label ambiguity?
RQ3: What cognitive or annotation factors drive human disagreement in IDRR, and how do they affect model predictability?
RQ4: Can perspectivist models feasibly predict fine-grained annotator interpretations, or are distribution-focused approaches more reliable in realistic settings?

Key findings

Disagreement-aware approaches improve accuracy for majority-label (single-label) predictions.
Annotator-specific models perform well only at Level-1 but deteriorate at Level-2 as the number of classes increases.
Learning from label distributions (soft-label) provides the strongest performance for distribution prediction, especially when annotators are unknown.
Perspectivist models can predict distributions but lose effectiveness with finer granularity due to inconsistent annotator behavior in cognitively demanding cases.
Annotator consistency and agreement with the majority vary across workers, with high disagreement correlating with reduced perspectivist model performance.
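
One simple way to quantify how often each worker agrees with the majority is sketched below. The annotation table, annotator ids, and labels are invented for illustration, and the paper's own consistency analysis may be defined differently.

```python
from collections import Counter, defaultdict

# Hypothetical annotations: item id -> {annotator id: label}.
annotations = {
    "item1": {"a1": "cause", "a2": "cause", "a3": "conjunction"},
    "item2": {"a1": "contrast", "a2": "conjunction", "a3": "conjunction"},
    "item3": {"a1": "cause", "a2": "cause", "a3": "cause"},
}

def agreement_with_majority(annotations):
    """Fraction of each annotator's labels that match the item's majority label.

    The majority is computed over all votes on the item, including the
    annotator's own vote; ties go to the label that was seen first.
    """
    hits = defaultdict(int)
    seen = defaultdict(int)
    for votes in annotations.values():
        majority = Counter(votes.values()).most_common(1)[0][0]
        for annotator, label in votes.items():
            seen[annotator] += 1
            hits[annotator] += int(label == majority)
    return {annotator: hits[annotator] / seen[annotator] for annotator in seen}

rates = agreement_with_majority(annotations)
for annotator, rate in sorted(rates.items()):
    print(annotator, round(rate, 2))
```

Annotators with low rates under a measure like this are the ones whose labels an annotator-specific model would have to reproduce without much help from the consensus.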

This summary was generated by AI and reviewed by a human editor.