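[Paper Walkthrough] Textual misinformation on Reddit
TL;DR: The paper introduces Fakeddit, a large-scale multimodal fake news dataset collected from Reddit with 2-way, 3-way, and 6-way labels, and shows that multimodal text+image models improve fake news detection.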
Fake news has altered society in negative ways in politics and culture. It has adversely affected both online social network systems as well as offline communities and conversations. Using automatic machine learning classification models is an efficient way to combat the widespread dissemination of fake news. However, a lack of effective, comprehensive datasets has been a problem for fake news research and detection model development. Prior fake news datasets do not provide multimodal text and image data, metadata, comment data, and fine-grained fake news categorization at the scale and breadth of our dataset. We present Fakeddit, a novel multimodal dataset consisting of over 1 million samples from multiple categories of fake news. After being processed through several stages of review, the samples are labeled according to 2-way, 3-way, and 6-way classification categories through distant supervision. We construct hybrid text+image models and perform extensive experiments for multiple variations of classification, demonstrating the importance of the novel aspect of multimodality and fine-grained classification unique to Fakeddit.
Research Motivation and Goals
- Address the limitations of existing fake news datasets by providing a large-scale multimodal dataset with fine-grained labels.
- Enable development of robust fake news detectors that utilize text, images, and metadata from social media.
- Assess the impact of multimodality on classification performance across varying label granularities.
- Offer insights for implicit fact-checking and potential downstream applications using comments and metadata.
Proposed Method
- Assemble a large-scale multimodal dataset (text, images, comments, metadata) from Reddit across 22 subreddits with distant supervision labeling.
- Provide 2-way, 3-way, and 6-way fake news labels per sample to support both binary and fine-grained classification.
- Extract text embeddings using InferSent and BERT; extract image features using VGG16, ResNet50, and EfficientNet.
- Combine text and image features through a trainable dense layer and merging strategies (add, concatenate, maximum, average).
- Tune hyperparameters with Hyperband; optimize hidden layer size and learning rate; report results on validation and test splits.
- Evaluate text-only, image-only, and multimodal (text+image) configurations across 2-, 3-, and 6-way classifications.
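The fusion step in the list above can be illustrated with a minimal, dependency-free sketch. It assumes each modality has already been reduced to a fixed-length feature vector, projects both vectors to a shared size with a dense layer, and then merges them with one of the four strategies named above. The helper names (`init_dense`, `dense`, `fuse`), the vector sizes, and the random, untrained weights are assumptions made for illustration; they are not the authors' code, and in the paper the dense layer is trained.

```python
import random


def init_dense(in_dim, out_dim, rng):
    """Create small random weights and zero biases (untrained, illustration only)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def dense(x, weights, bias):
    """Affine projection: one output value per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def fuse(text_vec, image_vec, method):
    """Merge two feature vectors with add, concatenate, maximum, or average."""
    if method == "concatenate":
        return list(text_vec) + list(image_vec)
    # The element-wise strategies need both vectors to have the same length.
    if len(text_vec) != len(image_vec):
        raise ValueError("element-wise fusion requires equal-length vectors")
    if method == "add":
        return [t + i for t, i in zip(text_vec, image_vec)]
    if method == "maximum":
        return [max(t, i) for t, i in zip(text_vec, image_vec)]
    if method == "average":
        return [(t + i) / 2 for t, i in zip(text_vec, image_vec)]
    raise ValueError(f"unknown fusion method: {method}")


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical raw feature sizes: 4 for text, 6 for image.
    text_features = [rng.random() for _ in range(4)]
    image_features = [rng.random() for _ in range(6)]
    # Project both modalities to a shared size of 3 before merging.
    text_w, text_b = init_dense(4, 3, rng)
    image_w, image_b = init_dense(6, 3, rng)
    text_proj = dense(text_features, text_w, text_b)
    image_proj = dense(image_features, image_w, image_b)
    for method in ("add", "concatenate", "maximum", "average"):
        print(method, fuse(text_proj, image_proj, method))
```

Concatenation doubles the fused dimension, while the three element-wise strategies keep it unchanged, which is why both modalities are projected to a common size first.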
Experimental Results
Research Questions
- RQ1: How does multimodal data (text and image) affect fake news detection performance compared to text-only or image-only baselines?
- RQ2: What is the impact of label granularity (2-, 3-, and 6-way) on detection accuracy?
- RQ3: Can distant supervision from Reddit subreddits yield credible labels for large-scale fake news datasets?
- RQ4: How do different image/text feature extractors and fusion strategies compare in multimodal fake news classification?
Main Findings
| Fusion method | 2-way validation | 2-way test | 3-way validation | 3-way test | 6-way validation | 6-way test |
|---|---|---|---|---|---|---|
| Maximum (BERT+ResNet50) | 0.8929 | 0.8909 | 0.8905 | 0.8890 | 0.8600 | 0.8588 |
- Multimodal models (text+image) outperform text-only and image-only baselines across 2-, 3-, and 6-way tasks.
- BERT text features combined with ResNet50 image features using the maximum fusion method achieved the strongest overall performance, with 6-way accuracy of 0.8600 on validation and 0.8588 on test.
- Text features generally yielded stronger signals than image features alone, and combining both yielded the best results.
- The dataset contains 1,063,106 samples with 628,501 fake and 527,049 true samples, including 682,996 multimodal samples.
- Quality assurance and distant supervision introduce noise but provide a scalable approach to labeling in large-scale multimodal data.
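Distant supervision here means that a post inherits its labels from the subreddit it was posted in, rather than being annotated individually. The sketch below shows the idea under stated assumptions: the subreddit names, the integer class ids, and the field names (`label_2way`, `label_3way`, `label_6way`) are hypothetical placeholders, not the mapping or schema used by Fakeddit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Labels:
    two_way: int
    three_way: int
    six_way: int


# Hypothetical subreddit-to-label table, for illustration only.
SUBREDDIT_LABELS = {
    "example_true_sub": Labels(two_way=0, three_way=0, six_way=0),
    "example_fake_sub": Labels(two_way=1, three_way=2, six_way=4),
}


def label_post(post: dict) -> Optional[dict]:
    """Return a copy of the post with inherited labels, or None if the subreddit is unmapped."""
    labels = SUBREDDIT_LABELS.get(post["subreddit"].lower())
    if labels is None:
        return None
    return {
        **post,
        "label_2way": labels.two_way,
        "label_3way": labels.three_way,
        "label_6way": labels.six_way,
    }


if __name__ == "__main__":
    posts = [
        {"id": "a1", "subreddit": "Example_Fake_Sub", "title": "placeholder title"},
        {"id": "a2", "subreddit": "unmapped_sub", "title": "placeholder title"},
    ]
    # Posts from unmapped subreddits are dropped instead of being guessed at.
    labeled = [p for p in map(label_post, posts) if p is not None]
    print(labeled)
```

Because every post in a subreddit receives the same labels, any post that does not fit its subreddit's theme becomes label noise, which is the trade-off the bullet above describes.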
This walkthrough was generated by AI and reviewed by a human editor.