QUICK REVIEW

[Paper Review] TabFact: A Large-scale Dataset for Table-based Fact Verification

Wenhu Chen, Hongmin Wang|arXiv (Cornell University)|Sep 5, 2019

Advanced Text Analysis Techniques | References: 40 | Citations: 181

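One-line summary

TabFact introduces a large-scale dataset for table-based fact verification (118k statements over 16k Wikipedia tables) and presents two baseline models, Table-BERT and the Latent Program Algorithm (LPA), which address linguistic and symbolic reasoning over semi-structured evidence.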

ABSTRACT

The problem of verifying whether a textual hypothesis holds based on the given evidence, also known as fact verification, plays an important role in the study of natural language understanding and semantic representation. However, existing studies are mainly restricted to dealing with unstructured evidence (e.g., natural language sentences and documents, news, etc.), while verification under structured evidence, such as tables, graphs, and databases, remains under-explored. This paper specifically aims to study the fact verification given semi-structured data as evidence. To this end, we construct a large-scale dataset called TabFact with 16k Wikipedia tables as the evidence for 118k human-annotated natural language statements, which are labeled as either ENTAILED or REFUTED. TabFact is challenging since it involves both soft linguistic reasoning and hard symbolic reasoning. To address these reasoning challenges, we design two different models: Table-BERT and Latent Program Algorithm (LPA). Table-BERT leverages the state-of-the-art pre-trained language model to encode the linearized tables and statements into continuous vectors for verification. LPA parses statements into programs and executes them against the tables to obtain the returned binary value for verification. Both methods achieve similar accuracy but still lag far behind human performance. We also perform a comprehensive analysis to demonstrate great future opportunities. The data and code of the dataset are provided in https://github.com/wenhuchen/Table-Fact-Checking.

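Motivation and objectives

Study fact verification over semi-structured evidence (tables), a setting that remains under-explored compared with unstructured text.
Build a large-scale, high-quality dataset of table-backed statements labeled ENTAILED or REFUTED.
Develop and compare models capable of linguistic reasoning and symbolic reasoning over tables.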

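Proposed method

Build TabFact from WikiTables, labeling 118k human-annotated statements over 16k tables as ENTAILED or REFUTED.
Annotate through two collection channels and a negative rewriting strategy to mitigate annotation artifacts.
Propose Table-BERT, which linearizes tables and performs NLI-style verification with a pre-trained language model.
Propose the Latent Program Algorithm (LPA), which performs latent program search and uses a discriminator to rank candidate programs.
Evaluate both approaches on the simple and complex test splits and against human performance.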
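As a rough illustration of the table linearization step, the sketch below scans a table row by row and renders each cell with a natural-language template before pairing the result with the statement. The template wording, the function names (`linearize_horizontal`, `build_input`), and the toy table are assumptions made for this example, not the authors' implementation. The `table_first` flag is meant to mirror the T+F / F+T ordering in the results table, on the assumption that T denotes the table and F the statement.

```python
from typing import List

ORDINALS = ["one", "two", "three", "four", "five",
            "six", "seven", "eight", "nine", "ten"]


def linearize_horizontal(header: List[str], rows: List[List[str]]) -> str:
    """Scan the table row by row and render each cell with a simple template."""
    sentences = []
    for i, row in enumerate(rows):
        # Fall back to a numeric index when the table has more rows than ORDINALS.
        name = ORDINALS[i] if i < len(ORDINALS) else str(i + 1)
        cells = "; ".join(f"{col} is {val}" for col, val in zip(header, row))
        sentences.append(f"row {name}: {cells}.")
    return " ".join(sentences)


def build_input(statement: str, table_text: str, table_first: bool = True) -> str:
    """Join table text and statement into one BERT-style sequence pair (as plain text)."""
    first, second = (table_text, statement) if table_first else (statement, table_text)
    return f"[CLS] {first} [SEP] {second} [SEP]"


# Hypothetical toy table, not taken from TabFact.
header = ["player", "team", "points"]
rows = [["alice", "reds", "31"], ["bob", "blues", "27"]]

table_text = linearize_horizontal(header, rows)
print(build_input("alice scored more points than bob", table_text))
```

In a real pipeline the resulting string pair would be passed to a BERT tokenizer and a binary classification head; that part is omitted here to keep the sketch dependency-free.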

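Experimental results

Research questions

RQ1: Can fact verification be carried out effectively over semi-structured table evidence?
RQ2: How do linguistic reasoning and symbolic reasoning interact in table-based verification?
RQ3: What are the strengths and limitations of neural and program-synthesis-based approaches on TabFact?
RQ4: How close do Table-BERT and LPA come to human-level accuracy on TabFact?
RQ5: What insights do error analysis and human evaluation of the linking, search, and reasoning steps provide?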

Key findings

Model	Val	Test	Test (simple)	Test (complex)	Small Test
BERT classifier w/o Table	50.9	50.5	51.0	50.1	50.4
Table-BERT-Horizontal-F+T-Concatenate	50.7	50.4	50.8	50.0	50.3
Table-BERT-Vertical-F+T-Template	56.7	56.2	59.8	55.0	56.2
Table-BERT-Vertical-T+F-Template	56.7	57.0	60.6	54.3	55.5
Table-BERT-Horizontal-F+T-Template	66.0	65.1	79.0	58.1	67.9
Table-BERT-Horizontal-T+F-Template	66.1	65.1	79.1	58.2	68.1
NSM w/ RL (Binary Reward)	54.1	54.1	55.4	53.1	55.8
NSM w/ LPA-guided ML + RL	63.2	63.5	77.4	56.1	66.9
LPA-Voting w/o Discriminator	57.7	58.2	68.5	53.2	61.5
LPA-Weighted-Voting	62.5	63.1	74.6	57.3	66.8
LPA-Ranking w/ Discriminator	65.2	65.0	78.4	58.5	68.6
LPA-Ranking w/ Discriminator (Caption)	65.1	65.3	78.7	58.5	68.9
Human Performance	-	-	-	-	92.1
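The two gaps that matter most in this table can be computed directly from its rows. The short sketch below copies the accuracy values for the best Table-BERT and LPA variants and reports, in percentage points, the distance to human performance on the Small Test column and the drop from the simple to the complex split. The dictionary layout and the helper name `gaps` are assumptions for illustration.

```python
from typing import Dict, Tuple

# Accuracy values copied from the results table above.
results: Dict[str, Dict[str, float]] = {
    "Table-BERT-Horizontal-T+F-Template": {
        "test": 65.1, "simple": 79.1, "complex": 58.2, "small": 68.1,
    },
    "LPA-Ranking w/ Discriminator (Caption)": {
        "test": 65.3, "simple": 78.7, "complex": 58.5, "small": 68.9,
    },
}
HUMAN_SMALL = 92.1  # Human Performance row, Small Test column


def gaps(acc: Dict[str, float]) -> Tuple[float, float]:
    """Return (gap to human on Small Test, simple-minus-complex gap)."""
    to_human = round(HUMAN_SMALL - acc["small"], 1)
    simple_vs_complex = round(acc["simple"] - acc["complex"], 1)
    return to_human, simple_vs_complex


for name, acc in results.items():
    to_human, simple_vs_complex = gaps(acc)
    print(f"{name}: {to_human} below human, {simple_vs_complex} simple-complex gap")
```

Both models sit more than 20 points below human performance on the small test set, and both lose roughly 20 points when moving from simple to complex statements.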

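TabFact contains 118,275 annotated statements over 16,573 tables and shows high inter-annotator agreement (Fleiss κ = 0.75).
The two baselines reach similar accuracy (about 65% on the test set) but remain far below human performance (92.1% on the small test set); the complex split is the hardest, at about 58%.
Table-BERT benefits from natural-language templates and horizontal scanning: the best variant (Horizontal-T+F-Template, 65.1% on the test set) far exceeds the concatenation variant and the no-table baseline (both about 50%), while vertical scanning reaches only about 56 to 57%.
LPA converts statements into programs that are executable against the table and uses a discriminator to select consistent program traces, achieving competitive results.
Human evaluation exposes limits in entity linking and program search (about 58% of statements correctly linked, about 51% recall of the true program) and highlights spurious reasoning as a major challenge.
Overall, both approaches demonstrate the feasibility of table-based fact verification while leaving substantial room for improvement.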
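To make the program-execution idea behind LPA more concrete, here is a toy sketch: a few hand-written candidate programs for one statement are executed against a small table, and their boolean results are combined by (optionally weighted) voting. The helper names (`filter_eq`, `hop`, `verify`), the candidate list, and the `score` hook are all assumptions for illustration. The paper's LPA searches for programs automatically and ranks them with a trained discriminator, neither of which is reproduced here.

```python
from typing import Callable, Dict, List, Tuple

Table = List[Dict[str, str]]
Program = Tuple[str, Callable[[Table], bool]]


def filter_eq(table: Table, col: str, val: str) -> Table:
    """Keep rows whose column `col` equals `val`."""
    return [row for row in table if row[col] == val]


def hop(rows: Table, col: str) -> float:
    """Read a numeric cell from the first row of a non-empty row set."""
    return float(rows[0][col])


def candidate_programs() -> List[Program]:
    """Hand-written candidates for 'alice scored more points than bob'."""
    return [
        ("greater(hop(alice, points), hop(bob, points))",
         lambda t: hop(filter_eq(t, "player", "alice"), "points")
         > hop(filter_eq(t, "player", "bob"), "points")),
        ("less(hop(alice, points), hop(bob, points))",
         lambda t: hop(filter_eq(t, "player", "alice"), "points")
         < hop(filter_eq(t, "player", "bob"), "points")),
        # Spurious candidate: it evaluates to True without testing the claim itself.
        ("eq(count(filter_eq(player, alice)), 1)",
         lambda t: len(filter_eq(t, "player", "alice")) == 1),
    ]


def verify(table: Table, programs: List[Program],
           score: Callable[[str], float] = lambda text: 1.0) -> str:
    """Execute every program and take a score-weighted vote over the results."""
    votes = {True: 0.0, False: 0.0}
    for text, run in programs:
        votes[run(table)] += score(text)
    return "ENTAILED" if votes[True] > votes[False] else "REFUTED"


# Hypothetical toy table, not taken from TabFact.
table = [
    {"player": "alice", "team": "reds", "points": "31"},
    {"player": "bob", "team": "blues", "points": "27"},
]
print(verify(table, candidate_programs()))
```

With uniform scores this behaves like plain voting. The third candidate shows why spurious programs are a problem: it votes for the right label here without checking the comparison the statement actually makes, which is the kind of case a learned ranking model is meant to down-weight.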


This review was generated by AI and checked by a human editor.