QUICK REVIEW

[Paper Review] GenericsKB: A Knowledge Base of Generic Statements

Sumithra Bhakthavatsalam, Chloe Anastasiades|arXiv (Cornell University)|May 2, 2020

Natural Language Processing Techniques | References: 19 | Citations: 44

One-Sentence Summary

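GenericsKB is a large collection (3.5M+) of naturally occurring generic sentences, each annotated with topic metadata and a learned confidence. Together with GenericsKB-Best (1M+ sentences), a high-quality subset augmented with synthesized generics, it can improve QA explanations and raise scores on reasoning tasks compared with using a much larger corpus.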

ABSTRACT

We present a new resource for the NLP community, namely a large (3.5M+ sentence) knowledge base of *generic statements*, e.g., "Trees remove carbon dioxide from the atmosphere", collected from multiple corpora. This is the first large resource to contain *naturally occurring* generic sentences, as opposed to extracted or crowdsourced triples, and thus is rich in high-quality, general, semantically complete statements. All GenericsKB sentences are annotated with their topical term, surrounding context (sentences), and a (learned) confidence. We also release GenericsKB-Best (1M+ sentences), containing the best-quality generics in GenericsKB augmented with selected, synthesized generics from WordNet and ConceptNet. In tests on two existing datasets requiring multihop reasoning (OBQA and QASC), we find using GenericsKB can result in higher scores and better explanations than using a much larger corpus. This demonstrates that GenericsKB can be a useful resource for NLP applications, as well as providing data for linguistic studies of generics and their semantics. GenericsKB is available at https://allenai.org/data/genericskb.

Motivation and Objectives

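Provide a corpus of naturally occurring generic sentences for NLP and linguistics.
Annotate each generic sentence with topic metadata, its surrounding context, and a learned confidence.
Release a high-quality subset (GenericsKB-Best) augmented with synthesized generics from WordNet and ConceptNet.
Demonstrate the usefulness of GenericsKB on downstream tasks such as question answering and explanation generation.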

Proposed Method

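Collect about 1.7B sentences in total from three corpora (Waterloo, SimpleWikipedia, and ARC).
Clean and filter out noise using regular expressions, length heuristics, and language detection.
Identify standalone generic sentences with 27 hand-authored lexico-syntactic rules.
Score the candidate generics with a BERT classifier trained on crowdsourced judgments of how useful each sentence is as a general truth.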
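The filtering and scoring steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the regular expressions, the length bounds, the naive topic guess, and every function name below are hypothetical stand-ins, not the paper's 27 rules, and the `scorer` argument is a placeholder for the trained BERT classifier.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical noise pattern: URLs and markup-like characters.
NOISE_RE = re.compile(r"(https?://|www\.|[<>{}\[\]|])")
# Hypothetical rule: a standalone generic should not open with a pronoun,
# demonstrative, or connective that points outside the sentence.
BAD_START_RE = re.compile(
    r"^(this|that|these|those|he|she|it|they|we|i|you|however|but|and)\b",
    re.IGNORECASE,
)
MIN_WORDS, MAX_WORDS = 3, 20  # assumed length bounds, not from the paper


@dataclass
class Generic:
    sentence: str
    topic: str    # naive guess: the first word, lowercased
    score: float  # confidence returned by the supplied scorer


def is_clean(sentence: str) -> bool:
    """Length and noise heuristics (illustrative only)."""
    words = sentence.split()
    return (
        MIN_WORDS <= len(words) <= MAX_WORDS
        and NOISE_RE.search(sentence) is None
        and sentence[0].isupper()
        and sentence.endswith(".")
    )


def looks_standalone(sentence: str) -> bool:
    """Reject sentences whose opening word depends on prior context."""
    return BAD_START_RE.match(sentence) is None


def build_kb(
    sentences: Iterable[str],
    scorer: Callable[[str], float],
    threshold: float = 0.5,
) -> List[Generic]:
    """Filter candidates, score them, and keep those at or above threshold."""
    kb = []
    for raw in sentences:
        s = raw.strip()
        if not (is_clean(s) and looks_standalone(s)):
            continue
        score = scorer(s)
        if score >= threshold:
            kb.append(Generic(s, s.split()[0].lower(), score))
    return kb
```

Passing the classifier in as a callable keeps the sketch independent of any particular model library; in practice `scorer` would wrap whatever model produces the usefulness score.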

Experimental Results

Research Questions

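RQ1: Can a corpus of naturally occurring generic sentences be built that is both high quality and semantically complete without its surrounding context?
RQ2: Can GenericsKB improve performance or explanations on existing multihop reasoning tasks compared with a much larger corpus?
RQ3: How good is GenericsKB, and how useful is it, for QA and explanation tasks?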

Key Findings

Corpus	Size	OBQA score (test)
QASC-17M	17M	0.660
GenericsKB	3.4M	0.632
GenericsKB-Best	1M	0.678

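The final GenericsKB contains 3,433,000 sentences, each with topic metadata, context, and a confidence score.
GenericsKB-Best contains 1,020,868 generics: 774,621 drawn from GenericsKB and 246,247 synthesized from WordNet and ConceptNet data.
On OpenBookQA, GenericsKB-Best achieves higher QA performance (0.678) than both GenericsKB (0.632) and the QASC-17M baseline (0.660).
GenericsKB-Best substantially improves two-hop reasoning explanations on QASC (0.61 vs. 0.44 for QASC-17M; 0.79 vs. 0.66 on another metric).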
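Annotation and quality checks on a GenericsKB-Best sample show roughly 85% agreement with the usefulness criterion, suggesting that contextual leakage and ambiguity are relatively low.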
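As a quick sanity check on the figures reported above, the following sketch recomputes the GenericsKB-Best composition and the OBQA score gaps. It uses only numbers quoted in this review; the variable and function names are illustrative.

```python
# Composition of GenericsKB-Best, as reported in this review.
BEST_FROM_GENERICSKB = 774_621
BEST_SYNTHESIZED = 246_247
BEST_TOTAL = 1_020_868

# OBQA test scores and approximate corpus sizes from the table above.
OBQA = {
    "QASC-17M": (17_000_000, 0.660),
    "GenericsKB": (3_400_000, 0.632),
    "GenericsKB-Best": (1_000_000, 0.678),
}


def synthesized_share() -> float:
    """Fraction of GenericsKB-Best that is synthesized."""
    return BEST_SYNTHESIZED / BEST_TOTAL


def score_gain(over: str, corpus: str = "GenericsKB-Best") -> float:
    """OBQA score difference between `corpus` and `over`, rounded to 3 places."""
    return round(OBQA[corpus][1] - OBQA[over][1], 3)


def size_ratio(larger: str, smaller: str = "GenericsKB-Best") -> float:
    """How many times larger one corpus is than another."""
    return OBQA[larger][0] / OBQA[smaller][0]


# The two parts should add up to the reported total.
assert BEST_FROM_GENERICSKB + BEST_SYNTHESIZED == BEST_TOTAL
```

Under these numbers, roughly a quarter of GenericsKB-Best is synthesized, and the corpus scores slightly higher on OBQA than QASC-17M while being about 17 times smaller.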


This review was generated by AI and checked by a human editor.