QUICK REVIEW

[Paper Review] BanFakeNews: A Dataset for Detecting Fake News in Bangla

Zobaer Hossain, Ashraful Rahman|arXiv (Cornell University)|Apr 19, 2020

Misinformation and Its Impacts | References: 47 | Citations: 70

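ONE-LINE SUMMARY

Introduces BanFakeNews, an annotated dataset of about 50K Bangla news articles for fake news detection, together with a benchmark built on lexical, syntactic, and semantic features and on transformer models. It also provides a comparative analysis against a human baseline and state-of-the-art NLP techniques.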

ABSTRACT

Observing the damages that can be done by the rapid propagation of fake news in various sectors like politics and finance, automatic identification of fake news using linguistic analysis has drawn the attention of the research community. However, such methods are largely being developed for English where low resource languages remain out of the focus. But the risks spawned by fake and manipulative news are not confined by languages. In this work, we propose an annotated dataset of ~50K news that can be used for building automated fake news detection systems for a low resource language like Bangla. Additionally, we provide an analysis of the dataset and develop a benchmark system with state of the art NLP techniques to identify Bangla fake news. To create this system, we explore traditional linguistic features and neural network based methods. We expect this dataset will be a valuable resource for building technologies to prevent the spreading of fake news and contribute in research with low resource languages.

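RESEARCH MOTIVATION AND OBJECTIVES

Address the shortage of Bangla fake news resources by releasing a public annotated dataset (~50K).
Benchmark Bangla fake news detection using a broad range of linguistic features and neural architectures.
Evaluate a human performance baseline on the dataset and analyze the factors that affect accuracy.
Provide insight into the effectiveness of features in Bangla (lexical, syntactic, semantic, metadata) and of pretrained language models.
Encourage future work to extend the labeled examples to 50K and to include additional features (source metadata).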

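PROPOSED METHOD

Assemble a fake news dataset containing headlines, articles, domains, and metadata, covering authentic, fake, clickbait, satire, and misleading content.
Extract traditional linguistic features: TF-IDF-weighted word and character n-grams, part-of-speech tag frequencies, word embeddings (FastText Bangla and News embeddings), punctuation, and metadata (the site's Alexa rank, article length).
Train traditional machine learning models (SVM, RF, LR) on the linguistic features, with a 70:30 train-test split and 10% held out for validation.
Experiment with neural networks (CNN, Bi-LSTM with attention) and pretrained multilingual BERT for text classification, tuned with Adam, a learning rate of 2e-5, and a batch size of 32. Evaluate with micro-F1 and the class-wise F1 of the fake class.
Use a multilingual BERT model via HuggingFace Transformers for baseline comparison on the Bangla dataset.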
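
As a rough illustration of the character n-gram TF-IDF features mentioned above, the sketch below builds L2-normalized TF-IDF vectors over character 3- to 5-grams using only the Python standard library. The function names (`char_ngrams`, `tfidf_vectors`, `cosine`), the smoothed IDF formula, and the English toy documents are assumptions made for this example, not the authors' implementation; in the paper such features feed a linear classifier such as an SVM, which is not shown here.

```python
import math
from collections import Counter


def char_ngrams(text, n_min=3, n_max=5):
    """Return all character n-grams of length n_min..n_max from text."""
    grams = []
    for n in range(n_min, n_max + 1):
        grams.extend(text[i:i + n] for i in range(len(text) - n + 1))
    return grams


def tfidf_vectors(docs, n_min=3, n_max=5):
    """Map each document to a dict {ngram: weight}, L2-normalized."""
    counts = [Counter(char_ngrams(d, n_min, n_max)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    vectors = []
    for c in counts:
        # Smoothed IDF (an assumption): ln((1 + N) / (1 + df)) + 1
        vec = {g: tf * (math.log((1 + n_docs) / (1 + df[g])) + 1.0)
               for g, tf in c.items()}
        norm = math.sqrt(sum(w * w for w in vec.values()))
        vectors.append({g: w / norm for g, w in vec.items()} if norm else vec)
    return vectors


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors."""
    return sum(w * b.get(g, 0.0) for g, w in a.items())


# Hypothetical toy documents (English, for readability only).
docs = ["breaking news today", "breaking news now", "weather report"]
vecs = tfidf_vectors(docs)
sim_close = cosine(vecs[0], vecs[1])   # documents sharing many n-grams
sim_far = cosine(vecs[0], vecs[2])     # documents sharing no n-grams
```

Python slices strings by Unicode code point, so for Bangla text a "character" here would be a code point rather than a visually rendered grapheme; whether that matches the paper's tokenization is not stated in this review.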

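EXPERIMENTAL RESULTS

Research questions

RQ1: Which Bangla linguistic features are most effective for fake news detection?
RQ2: Can traditional lexical, syntactic, and semantic features outperform neural models for Bangla fake news detection?
RQ3: How does a pretrained multilingual transformer (BERT) compare with linear classifiers for Bangla fake news detection?
RQ4: What is the human performance baseline on this dataset?
RQ5: How does metadata (source, headline-body relation, site popularity) affect detection performance?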

Key findings

Model	P (overall)	R (overall)	F1 (overall)	P (fake)	R (fake)	F1 (fake)
Baseline - Majority	0.97	1.00	0.99	0.00	0.00	0.00
Baseline - Random	0.97	0.50	0.66	0.03	0.50	0.05
Unigram (U)	0.99	0.99	0.99	0.99	0.71	0.83
Bigram (B)	0.98	0.99	0.99	0.97	0.42	0.59
Trigram (T)	0.98	0.99	0.98	0.74	0.31	0.44
C3-gram(C3)	0.99	0.99	0.99	0.98	0.82	0.89
C4-gram(C4)	0.99	0.99	0.99	0.99	0.78	0.87
C5-gram(C5)	0.99	0.99	0.99	1.00	0.74	0.85
C3+C4+C5	0.99	0.99	0.99	1.00	0.77	0.87
All Lexical(L)	0.99	0.99	0.99	1.00	0.76	0.86
POS tag	0.97	1.00	0.98	1.00	0.00	0.01
L+POS	0.99	0.99	0.99	0.99	0.76	0.86
Embedding(F)	0.98	0.99	0.99	0.94	0.33	0.49
Embedding(N)	0.98	0.99	0.99	0.84	0.32	0.46
L+POS+E(F)	0.99	0.99	0.99	0.99	0.77	0.87
L+POS+E(N)	0.99	0.99	0.99	0.98	0.79	0.88
MP	0.97	0.99	0.98	0.94	0.15	0.27
L+POS+E(F)+MP	0.99	0.99	0.99	0.99	0.84	0.91
L+POS+E(N)+MP	0.99	0.99	0.99	0.98	0.84	0.91
All Features	0.99	0.99	0.99	0.98	0.84	0.91
CNN	0.98	1.00	0.99	0.79	0.41	0.59
LSTM	0.99	0.99	0.99	0.69	0.44	0.53
BERT	0.99	1.00	0.99	0.80	0.60	0.68
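
The majority baseline row shows why the fake-class columns matter more than the overall ones on heavily imbalanced data. The sketch below computes per-class precision, recall, and F1, plus micro-F1, for a classifier that labels everything as authentic. The 97:3 class ratio is a hypothetical choice that only mirrors the 0.97 in the table, and the helper names (`prf`, `micro_f1`) are invented for this example.

```python
def prf(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def micro_f1(y_true, y_pred):
    """For single-label classification, micro-F1 equals accuracy."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical imbalanced test set: 97 authentic items, 3 fake items.
y_true = ["authentic"] * 97 + ["fake"] * 3
majority_pred = ["authentic"] * 100   # always predict the majority class

fake_scores = prf(y_true, majority_pred, "fake")
authentic_scores = prf(y_true, majority_pred, "authentic")
majority_micro = micro_f1(y_true, majority_pred)
```

Under this toy setup the micro-F1 is high even though no fake item is ever detected, which is the reason the comparison of models is best read from the fake-class F1 column.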

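The peak fake-class F1 reaches 0.91 with the SVM-based model that uses all linguistic features.
Lexical and character n-gram features generally outperform the other feature sets for Bangla fake news detection.
The neural models (CNN, Bi-LSTM, BERT) improve on the baselines but fall short of the best linear classifier (SVM) with all linguistic features (BERT reaches a fake-class F1 of about 0.68, versus 0.91 for All Features with SVM).
The human baseline yields a fake-class F1 of roughly 0.58 to 0.70 depending on the annotator, with inter-annotator agreement (Fleiss' kappa) of about 0.389.
Character-level features are highlighted as particularly effective, which suggests integrating them into future neural architectures.
The dataset includes predictive features such as annotated source metadata, the headline-body relation, and site popularity.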
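
For readers unfamiliar with the agreement statistic cited above, the following is a minimal sketch of Fleiss' kappa computed from a table of per-item category counts. The toy rating tables (three raters, two categories) are hypothetical and are not the paper's annotation data; the number of annotators used in the paper is not stated in this review.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a list of rows of per-category rater counts.

    Each row is one item; every row must sum to the same number of raters.
    Undefined (division by zero) when all ratings fall in a single category.
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Share of all assignments that went to each category.
    p_cat = [sum(row[j] for row in ratings) / (n_items * n_raters)
             for j in range(n_cats)]
    # Observed agreement per item, then averaged over items.
    p_item = [(sum(c * c for c in row) - n_raters)
              / (n_raters * (n_raters - 1)) for row in ratings]
    p_obs = sum(p_item) / n_items
    # Agreement expected by chance.
    p_exp = sum(p * p for p in p_cat)
    return (p_obs - p_exp) / (1 - p_exp)


# Hypothetical tables: columns are [authentic, fake], 3 raters per item.
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]   # raters always agree
mixed = [[2, 1], [1, 2], [2, 1], [1, 2]]     # raters always split 2 vs 1

kappa_perfect = fleiss_kappa(perfect)
kappa_mixed = fleiss_kappa(mixed)
```

A value of 1 indicates complete agreement and values at or below 0 indicate agreement no better than chance, so the reported value of about 0.389 sits well short of complete agreement among the annotators.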

This review was generated by AI and checked by a human editor.