[Paper Review] All-in-one: Multi-task Learning for Rumour Verification
The paper proposes a multi-task learning framework that jointly trains veracity classification with auxiliary tasks (rumour detection and stance classification) to improve rumour verification performance on RumourEval and PHEME datasets, and analyzes data properties that affect multi-task gains.
Automatic resolution of rumours is a challenging task that can be broken down into smaller components that make up a pipeline, including rumour detection, rumour tracking and stance classification, leading to the final outcome of determining the veracity of a rumour. In previous work, these steps in the process of rumour verification have been developed as separate components where the output of one feeds into the next. We propose a multi-task learning approach that allows joint training of the main and auxiliary tasks, improving the performance of rumour verification. We examine the connection between the dataset properties and the outcomes of the multi-task learning models used.
Motivation & Objective
- Motivate and formalize rumour resolution as a multi-task learning problem where veracity is the main task and auxiliary tasks can boost performance.
- Investigate how jointly training veracity with stance and/or detection affects verification accuracy and macro-F scores.
- Assess how dataset properties (entropy, kurtosis, token-type ratio) relate to multi-task learning gains.
- Compare multi-task models against strong baselines, including a state-of-the-art veracity classifier and majority baselines.
- Explore the effect of using different dataset splits (RumourEval and PHEME with leave-one-event-out) on model performance.
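The dataset properties named in the objectives above can be made concrete with a small sketch. The definitions here are assumptions for illustration, not the paper's confirmed formulas: entropy is taken over the empirical label distribution, kurtosis is read as the excess kurtosis of the per-label relative frequencies, and the token-type ratio is computed as distinct tokens over total tokens after whitespace tokenization. The function names are hypothetical.

```python
import math
from collections import Counter


def label_entropy(labels):
    """Shannon entropy (in bits) of the empirical label distribution."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def label_kurtosis(labels):
    """Excess kurtosis of the per-label relative frequencies.

    Assumption: this is one possible reading of "kurtosis of the label
    distribution"; it uses the population formula m4 / m2**2 - 3.
    """
    counts = Counter(labels)
    total = len(labels)
    freqs = [c / total for c in counts.values()]
    mean = sum(freqs) / len(freqs)
    m2 = sum((f - mean) ** 2 for f in freqs) / len(freqs)
    m4 = sum((f - mean) ** 4 for f in freqs) / len(freqs)
    if m2 == 0:
        return float("nan")  # undefined when all labels are equally frequent
    return m4 / m2 ** 2 - 3.0


def type_token_ratio(texts):
    """Distinct lowercased tokens divided by total tokens (whitespace split)."""
    tokens = [tok.lower() for text in texts for tok in text.split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


# Toy data, invented for illustration only.
veracity_labels = ["true", "true", "false", "unverified"]
tweets = ["police confirm the report", "the report is fake"]
print(label_entropy(veracity_labels))
print(label_kurtosis(veracity_labels))
print(type_token_ratio(tweets))
```

Computing the same statistics for the main task and for each auxiliary task gives the quantities that the review's analysis of multi-task gains compares.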
Proposed method
- Use a sequential branch-based LSTM architecture to model rumours as tweet branches.
- Employ hard parameter sharing in a multi-task setting with task-specific output layers for veracity, stance, and detection.
- Train with a combined loss that sums the per-task losses, skipping the loss of any task for which a given instance has no label.
- Evaluate using accuracy and macro-averaged F1, with macro-F being the primary metric for imbalanced data, and perform leave-one-event-out cross-validation on PHEME.
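The combined loss described in the method above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: each task is assumed to use softmax cross-entropy, one label per task per instance is assumed for simplicity, and the class counts, logits, and function names are hypothetical.

```python
import math


def cross_entropy(logits, gold_index):
    """Negative log-likelihood of the gold class under a softmax over logits."""
    peak = max(logits)  # subtract the max for numerical stability
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return log_norm - logits[gold_index]


def multitask_loss(task_logits, task_labels):
    """Sum the per-task losses, skipping tasks with no label for this instance."""
    total = 0.0
    for task, logits in task_logits.items():
        gold = task_labels.get(task)
        if gold is None:
            continue  # unlabeled task: contributes nothing to the loss
        total += cross_entropy(logits, gold)
    return total


# Hypothetical outputs of the three task-specific layers for one instance.
logits = {
    "veracity": [2.0, 0.5, 0.1],
    "stance": [0.1, 0.2, 0.3, 0.4],
    "detection": [1.0, -1.0],
}
# This instance has no stance annotation, so its stance loss is skipped.
labels = {"veracity": 0, "stance": None, "detection": 1}
loss = multitask_loss(logits, labels)
print(loss)
```

With hard parameter sharing, the gradient of this summed loss flows through the shared layers from every labeled task, while each output layer is updated only by its own task's term.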
Experimental results
Research questions
- RQ1: Does multi-task learning that combines veracity with stance and/or detection improve veracity classification over single-task learning?
- RQ2: Which auxiliary task configuration (stance, detection, or both) yields the best veracity performance?
- RQ3: How do dataset properties influence the effectiveness of multi-task learning in rumour verification?
- RQ4: How does performance vary across RumourEval and different PHEME event splits (5 vs 9 events)?
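The leave-one-event-out protocol behind the PHEME event splits can be sketched as follows. This is an illustrative sketch assuming each thread carries an event identifier; the field names and the event names other than Ferguson are placeholders, not taken from the dataset.

```python
def leave_one_event_out(instances):
    """Yield (held_out_event, train, test), holding out one event per fold."""
    events = sorted({inst["event"] for inst in instances})
    for held_out in events:
        train = [inst for inst in instances if inst["event"] != held_out]
        test = [inst for inst in instances if inst["event"] == held_out]
        yield held_out, train, test


# Toy threads, invented for illustration only.
threads = [
    {"event": "ferguson", "veracity": "unverified"},
    {"event": "ferguson", "veracity": "false"},
    {"event": "event_b", "veracity": "true"},
    {"event": "event_c", "veracity": "true"},
    {"event": "event_c", "veracity": "false"},
]

for event, train, test in leave_one_event_out(threads):
    print(event, len(train), len(test))
```

Because every fold tests on an event that was never seen in training, per-event scores can differ widely, which is why results are usually reported both per event and aggregated.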
Key findings
- Multi-task models consistently improve over single-task veracity classifiers on PHEME and RumourEval datasets.
- A three-task setup (veracity, stance, and detection) yields the strongest improvements over single-task baselines.
- MTL2 (Veracity+Stance or Veracity+Detection) outperforms the single-task branchLSTM, with MTL3 (all three tasks) providing further gains.
- Results align with prior work suggesting dataset properties (entropy, kurtosis) influence multi-task benefits, especially when auxiliary tasks have lower kurtosis than the main task.
- On RumourEval, multi-task learning surpasses NileTMRG* and branchLSTM baselines; on PHEME, MTL3 achieves the best overall macro-F and accuracy among the tested configurations.
- Performance varies by event in PHEME, with the Ferguson event being particularly challenging and differences observed in per-class predictions (true/false/unverified).
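The findings above are reported in both accuracy and macro-F, and a short sketch shows why the two can diverge on imbalanced data. The labels below are invented toy data, and the helper is a plain-Python illustration of macro-averaged F1 rather than the evaluation script used in the paper.

```python
def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every class seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# A majority-class baseline on a skewed toy label distribution.
gold = ["true"] * 8 + ["false"] + ["unverified"]
pred = ["true"] * 10

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
print(accuracy)
print(macro_f1(gold, pred))
```

The majority baseline looks strong on accuracy but scores poorly on macro-F because two of the three classes are never predicted, which is the reason macro-F is the more informative metric here.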
This review was created by AI and reviewed by human editors.